names vs. "names"

Charles Hussey c.hussey at NHM.AC.UK
Wed Feb 9 11:00:01 CST 2005

In reply to Paul van Rijckevorsel:

"Actually, this very strongly reminds me of an attitude all too often found
among the compilers of databases.
"Yes, we know this name will be erroneous. Yes, we realize that including it
in the database will perpetuate and propagate the error. However, if it was
once used in a book (even if in clear error) we are including it. We realize
this means the world will go to hell in a handbasket, but we don't care. As
long as our database is 'complete' we are happy."

This "first layer" is the step where relevant information is excluded.
If too much information is excluded (i.e. just the minimum is recorded) the
database will have to be thrown out as error-riddled and unusable once it is
complete (but it won't be thrown out, it will be there forever as an excuse
never to do it right). I am beginning to despair of databases (unless built
from a solid taxonomic basis)."

We are beginning (just beginning) to enjoy the possibility of tapping into a
huge information resource as primary published sources, observation records
and specimen records become accessible on-line, through the results of
digitisation programmes.

To make searches effective (rather than frustrating) requires effective
resource discovery systems and they need to take account of "real-world"
data which, unfortunately (as David Remsen has, of course, already
discovered) have a certain messiness to them. The problem is that
computerised searches are literal and the solution lies in both mapping
between equivalences (names or taxonomic concepts) and in flagging
"approved" terms (names or taxonomic concepts).

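A minimal sketch of the mapping-and-flagging idea, in Python. Everything in it
is an assumption made for illustration: the names are placeholders rather than
real taxa, and the THESAURUS structure and resolve() function are hypothetical,
not part of any existing name-server.

```python
# Hypothetical thesaurus: each known name string maps to a preferred name
# and carries a flag saying whether that string is itself an "approved" term.
# The names are placeholders, not real taxa.
THESAURUS = {
    "Aus bus": {"preferred": "Aus bus", "approved": True},
    "Aus buss": {"preferred": "Aus bus", "approved": False},  # mis-spelling
    "Xus bus": {"preferred": "Aus bus", "approved": False},   # older synonym
}


def resolve(name):
    """Return (preferred_name, is_approved) for a name string.

    Unknown names return (None, False). Runs of whitespace are collapsed
    first, because a literal lookup would otherwise miss "Aus  bus".
    """
    key = " ".join(name.split())
    entry = THESAURUS.get(key)
    if entry is None:
        return None, False
    return entry["preferred"], entry["approved"]
```

Calling resolve("Aus buss") gives the preferred name together with a flag
showing that the string searched for was not itself an approved term.
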
As Arthur Chapman likes to remind us, there are errors in every database (a
careful survey of a large botanical database in the UK revealed 2 errors in
every record!). If you want a result set (say from a search using a portal
across distributed databases) to return all relevant records, then there has
to be somewhere (embedded in the central access system or using an
independently maintained name-server) a thesaurus that includes all the
names in the contributing sources and maps non-preferred to preferred terms.
Cleaning up large datasets is a lengthy business (but do see Arthur
Chapman's papers on the GBIF website on Data Quality and Data Cleansing, as
there are some automated methods available), and is compounded by the
fact that Natural Science curators like to maintain verbatim records of
their label data - so I think that we have to accept that odd names will be
around on-line for ever. Try a search on Google for known mis-spellings of
taxonomic names - they are out there, and will be difficult to eradicate. If
you wish to get at all available information relating to an organism, then
you have to know about synonyms and (for vernacular names) archaic names.
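
To show why the thesaurus has to hold every name found in the contributing
sources, here is a rough Python sketch of a search that expands the query to
all known variants before matching against verbatim records. The names, the
record fields and the functions variants_of() and search() are all invented
for the example; no real portal or name-server is being described.

```python
# Hypothetical map from every known name string to its preferred name.
# Placeholder names only, not real taxa.
TO_PREFERRED = {
    "Aus bus": "Aus bus",
    "Aus buss": "Aus bus",  # mis-spelling kept verbatim on a label
    "Xus bus": "Aus bus",   # older synonym
}

# Hypothetical records from two contributing sources, names kept verbatim.
RECORDS = [
    {"source": "museum_a", "verbatim_name": "Aus buss", "locality": "Kent"},
    {"source": "museum_b", "verbatim_name": "Xus bus", "locality": "Devon"},
    {"source": "museum_b", "verbatim_name": "Cus dus", "locality": "Fife"},
]


def variants_of(name):
    """Return every known name string that shares a preferred name with `name`.

    A name missing from the thesaurus can only be matched literally.
    """
    preferred = TO_PREFERRED.get(name)
    if preferred is None:
        return {name}
    return {v for v, p in TO_PREFERRED.items() if p == preferred}


def search(name, records):
    """Return records whose verbatim name is any known variant of `name`."""
    wanted = variants_of(name)
    return [r for r in records if r["verbatim_name"] in wanted]
```

A purely literal search for "Aus bus" would find nothing in these records,
whereas search("Aus bus", RECORDS) returns both the mis-spelt and the
synonymous record, while the verbatim label data stay untouched.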

