Fwd: Re: Text Extraction Again (from Taxonomic e-text)

P. Bryan Heidorn heidorn at ALEXIA.LIS.UIUC.EDU
Sun Jan 25 21:20:36 CST 2004

Hello Mary,

I agree with your assessment and would like to add a comment or two.

The information within a taxonomic description or treatment varies greatly
in the level of structure. For example, the name section including taxonomy
and synonymy is relatively structured in most modern publications. It seems
reasonable to assume that computer programs will be able to extract useful
information from that section. Likewise, the political distribution
information is fairly structured and therefore amenable to information
extraction. Even the reference section, which can have great variability
across publications, can be mined successfully, partly because of the effort
that has gone into solving this problem in the general digital library
context. Citeseer (http://citeseer.org/) is a good example.

Of course the difficulty comes with morphological descriptions and the
desire to make interactive keys on the cheap. This section of the treatment
is much more variable and complex.  In many cases it is not only complex
but incomplete. Authors leave out information sometimes intentionally,
sometimes unintentionally. Space and editorial restrictions sometimes are
the justifiable cause. Sometimes it is just error.

Even with the variability, it is possible to get some useful information
out of a flora. For example, it is not difficult to extract numeric data
such as the plant height range, the size range of leaves and leaflets, and
chromosome number. It is also possible but not as easy to recognize that
the word "round" is used in the description of the leaves. There will
certainly be information in the description that a computer will not be
able to extract. There will also be information that would be needed for a
key but is missing from an individual description. Even the knowledge that
information is missing is potentially useful since it can lead authors and
editors to reevaluate a description.
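
To make the numeric case concrete, here is a rough Python sketch of the kind
of pattern matching I have in mind. The patterns, the character names and the
sample sentence are all made up for illustration; real floras vary far more
in wording and units, so treat this as a sketch under those assumptions and
not as a working extractor.

```python
import re

# Hypothetical patterns for illustration only; not taken from any real system.
NUM = r"(\d+(?:\.\d+)?)"
RANGE = NUM + r"\s*[-\u2013]\s*" + NUM   # e.g. "30-80" or "1.5-2" (hyphen or en dash)

PATTERNS = {
    "plant_height": re.compile(
        r"(?:plants?|stems?|culms?)\s+" + RANGE + r"\s*(cm|dm|m)\b", re.I),
    "leaf_length": re.compile(
        r"(?:leaves|blades)\s+" + RANGE + r"\s*(mm|cm)\b", re.I),
    "chromosome_number": re.compile(r"2n\s*=\s*(\d+)"),
}


def extract(description):
    """Return one entry per character; None marks information not found."""
    found = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(description)
        if m is None:
            found[name] = None          # missing (or phrased in a way we miss)
        elif name == "chromosome_number":
            found[name] = int(m.group(1))
        else:
            found[name] = (float(m.group(1)), float(m.group(2)), m.group(3))
    return found


def mentions_shape(description, organ, shape):
    """Crude check: do the organ word and the shape word share a clause?"""
    # Split on ';' or on a period followed by whitespace, so decimals survive.
    for clause in re.split(r";|\.\s", description):
        words = set(re.findall(r"[a-z]+", clause.lower()))
        if organ in words and shape in words:
            return True
    return False


sample = "Plants 30-80 cm tall; leaves 5-12 cm long, round at the base; 2n = 28."
result = extract(sample)
```

The None entries are the point of the last remark above: a description that
yields None for a character either omits it or words it in a way the patterns
do not cover, and either case is worth flagging for a human to look at.
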

Mike Dallwitz pointed out the Terminator program for extracting information
about nematodes (http://math.ucdavis.edu/~milton/genisys.html). It is
useful for that purpose. The difficulty is getting the rules used in that
system to apply to new taxonomic descriptions. One example of
this can be found in Hong Cui's preliminary dissertation work

Since there are many thousands of descriptions already written, I believe
it is useful to mine some information from the descriptions even if the
application to some interactive keys is incomplete or questionable.

Happy parsing.

Bryan Heidorn

At 05:53 PM 1/25/2004 -0700, Mary Barkworth wrote:
>I thank Mike for the references and echo his skepticism about the value
>of what is being attempted although I would be somewhat kinder about the
>quality of descriptions in existing treatments, probably because I have
>written  some of them. I agree that existing descriptions (including my
>own) are usually not parallel.  Humans do not need complete parallelism;
>they have a brain. I had a class use an interactive key to Utah plants.
>The success rate was no higher than using ordinary keys.  Moreover, at
>the end of the exercise, the students had learned little about the
>plants they were identifying (the program is really irrelevant; it was
>not Delta).
>Granted interactive keys, which require much greater parallelism because
>they are being interpreted by a computer not a human, have the potential
>of permitting identification of fragmentary material. But we need to
>invest in developing better, more complete descriptions, not extracting
>information from descriptions that were never intended for use by a
>machine.  This means funding basic taxonomic research, complete with
>provision of technical support, not extracting information from old
>descriptions. Even just marking up existing descriptions for rather
>general characters requires, in my admittedly limited experience,
>careful reading of an existing description to get all the bits in the
>right place.  Of course, all Jim mentioned were names and "other taxon
>attributes".  Perhaps he is not even thinking of interactive keys.
>
>I have used the interactive key for Utah plants - it is helpful in
>narrowing down the possibilities when someone brings in fragmentary
>material, but I do not consider it particularly useful for beginners. It
>is certainly not within the budget of most individuals, this despite
>being based on information in printed volumes with, I suspect, no fee
>being paid for use of the information brought together by taxonomists.

P. Bryan Heidorn    Graduate School of Library and Information Science
pheidorn at uiuc.edu   University of Illinois at Urbana-Champaign
(V)217/ 244-7792    501 East Daniel St., Champaign, IL  61820-6212
(F)217/ 244-3302    http://alexia.lis.uiuc.edu/~heidorn

More information about the Taxacom mailing list