[Taxacom] Chameleons, GBIF, and the Red List

David Campbell pleuronaia at gmail.com
Thu Aug 28 15:45:32 CDT 2014

> 3) Over-reliance on non-automated, human-mitigated processes = DEATH.
> This is because everyone is already too busy.

Better automation is highly desirable (I've been grumbling a good deal
lately about the fact that our copier requires me to push the same 32
buttons every time I scan a page out of the 344-page 1890s reference that
I have on interlibrary loan, because the copier doesn't seem to believe in
saving settings).  However, overreliance on automated, unmonitored
processes = garbage with a misleading impression of life.

There are two major challenges to automation.  The first is determining
how to get from the input that you're trying to work with to the output
that you want.  Second, once you have a system set up,
it is necessary to determine whether particular input is valid or

We're talking about (and sometimes confounding) both issues here.  Are the
data-mining programs generating the taxonomic databases working properly?
In some cases, no.  Databases have fictional entries due to incorrect
synthesis or processing of input.  Current optical character recognition
technology doesn't seem to be good enough to be trustworthy in detailed
applications (for example, BHL's automated generation of names found on a
page has a very high false positive and false negative rate in many
publications).  Just as Microsoft routinely has bright ideas about what I
want to happen that have to be laboriously undone, so the database program
sometimes is at fault.

But there's also the problem of inappropriate input.  A flawless
database-generating program is going to give lousy results when the input
data are of poor quality.  And a meta-analysis based on the assumption that
a database has comprehensive and authoritative information will give
impressive-looking but unreliable results if the database is actually very
much a work in progress.  The latter is a variant of the general black box
syndrome - because I fed my data into the computer and got this back out,
it must be true and overthrows all previous work on the topic.  As long as
the data are appropriately formatted, a computer will process them, whether
they are complete nonsense or reasonably valid.  Some checking is necessary to
see if the results are in fact valid.  In part, this can be addressed by
more sophisticated programming.  For example, locating scientific names in
text could be improved if someone figured out how to take advantage of the
index.  Often BHL reports that a taxon is mentioned in the index but misses
the text citation; I would guess that reflects the greater ease of
recognizing an isolated bit of text than one buried in the midst of a
paragraph.
Similarly, adding more checks of higher taxonomic affinities or other
additional information could be useful in several contexts (if the source
identifies itself as dealing with boreal montane dicots, then a purported
record of hermatypic corals is probably wrong).  Nevertheless, having some
sort of check on the data and results is necessary.
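
To make the boreal-dicot example concrete, here is a minimal Python sketch
of that kind of scope check.  Everything in it is hypothetical: the GROUP_OF
lookup table, the informal group labels, and the tag names are assumptions
for illustration, not anything BHL or GBIF actually provides.  A real check
would need to draw the higher classification from a curated source.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

# Hypothetical lookup from a name to an informal higher group.
# In practice this would come from a curated taxonomic source.
GROUP_OF: Dict[str, str] = {
    "Saxifraga oppositifolia": "dicot",
    "Acropora palmata": "scleractinian coral",
}


@dataclass(frozen=True)
class NameRecord:
    """A name that an automated process claims to have found in a source."""
    name: str
    source_scope: FrozenSet[str]  # groups the source says it deals with


def scope_tag(record: NameRecord, group_of: Dict[str, str] = GROUP_OF) -> str:
    """Return a verification tag for one extracted name.

    "consistent" - the name's group is within the source's stated scope
    "suspect"    - the group is known but falls outside the stated scope
    "unverified" - the name is not in the lookup, so nothing was checked
    """
    group = group_of.get(record.name)
    if group is None:
        return "unverified"
    if group in record.source_scope:
        return "consistent"
    return "suspect"


# A source that identifies itself as dealing with boreal montane dicots.
scope = frozenset({"dicot"})
for name in ("Saxifraga oppositifolia", "Acropora palmata", "Fictus nomen"):
    print(name, "->", scope_tag(NameRecord(name, scope)))
```

A "suspect" tag does not prove the record wrong; it only flags it for a
human to look at.  "unverified" is kept distinct so that a name missing from
the lookup is not silently treated as having passed a check.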

Of course, "unverified computer-generated data - use at your own risk" has
its usefulness.  Perhaps again this is a place where we need some sort of
tag, in this case automatic, showing what steps have been taken towards
data verification.  However, caution is necessary to confirm that the
result is not merely being checked back against the same poor-quality
