[Taxacom] the hurdle for all biodiv informatics initiatives

Stephen Thorpe s.thorpe at auckland.ac.nz
Fri Feb 19 17:14:33 CST 2010

Hi Paul,
A frequent complaint you seem to voice about all biodiversity databases, including Wikispecies, is that they don't, by your estimation, contain much in the way of "useful" information. To my mind, however, they function to organise vast numbers of references (preferably with links of some kind to those references) in a taxonomic way. The "useful information" is contained in the references, and not in the database per se. This is certainly how I view Wikispecies - a vast taxonomically organised library/bibliography, supplemented where possible with images...

From: taxacom-bounces at mailman.nhm.ku.edu [taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of dipteryx at freeler.nl [dipteryx at freeler.nl]
Sent: Friday, 19 February 2010 10:52 p.m.
To: taxacom at mailman.nhm.ku.edu
Subject: Re: [Taxacom] the hurdle for all biodiv informatics initiatives

From: Richard Pyle [mailto:deepreef at bishopmuseum.org]
Sent: Thu 18-2-2010 19:39

>> ***
>> OK, "biodiversity informatics" it is (BTW, when I took a
>> course in bioinformatics it meant something different
>> entirely from what is sketched above. There are a lot of
>> terms that are confusing!)
>> * * *

> Yes -- same here.  The term "bioinformatics" existed in our sense
> long before PCR was a mature technology.  But if you Google that
> term, you'll see that it's almost always used sensu stricto for
> DNA stuff.

Well, the course I took dealt with information processing by
biotic systems.
* * *

> Without launching into another HUGE email, I'll just say that this
> is how my own bio[diversity]informatics efforts began.  I spent
> perhaps a decade staunchly avoiding any sort of "surrogate primary
> key" (i.e., arbitrary number) in my data tables.  In fact, the first
> robust database I created -- the specimen database for Ichthyology
> at Bishop Museum -- *still* uses the set of three fields "Genus",
> "Species", "Subspecies" as the compound primary key for the taxonomy
> table.  But after a decade of beating my head against that
> ideological wall, I finally had to re-evaluate my position and
> embrace surrogate primary key fields (locally unique identifiers)
> in my databases. Now that I have a better understanding of what is
> needed to allow taxonomic data to flow across the internet, I have
> come to embrace GUIDs.  Basically, over the years of dealing with
> real-world taxonomy database issues, as I became both experienced
> and educated, I transitioned from a staunch opposition to any kind
> of identifier, to one of the loudest advocates of them.
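
To illustrate the shift described above, here is a minimal sketch using Python's built-in sqlite3 and uuid modules. It is not the Bishop Museum schema: every table and column name is hypothetical, and the misspelled epithet is invented for the example. It contrasts a compound key made of the name fields with a surrogate integer key plus a GUID.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")

# Earlier style: the name fields themselves form the compound primary key.
conn.execute("""
    CREATE TABLE taxon_by_name (
        genus      TEXT NOT NULL,
        species    TEXT NOT NULL,
        subspecies TEXT NOT NULL DEFAULT '',
        PRIMARY KEY (genus, species, subspecies)
    )
""")

# Later style: a surrogate integer key (locally unique) plus a GUID
# intended for exchange between databases. Names are hypothetical.
conn.execute("""
    CREATE TABLE taxon (
        taxon_id   INTEGER PRIMARY KEY,
        guid       TEXT NOT NULL UNIQUE,
        genus      TEXT NOT NULL,
        species    TEXT NOT NULL,
        subspecies TEXT NOT NULL DEFAULT ''
    )
""")
conn.execute("""
    CREATE TABLE specimen (
        specimen_id INTEGER PRIMARY KEY,
        taxon_id    INTEGER NOT NULL REFERENCES taxon (taxon_id)
    )
""")

# Insert a record with a deliberately misspelled epithet.
guid = str(uuid.uuid4())
cur = conn.execute(
    "INSERT INTO taxon (guid, genus, species) VALUES (?, ?, ?)",
    (guid, "Fantasia", "imaginesis"),
)
taxon_id = cur.lastrowid
conn.execute("INSERT INTO specimen (taxon_id) VALUES (?)", (taxon_id,))

# Correcting the spelling touches one row; the specimen link and the
# GUID stay exactly as they were.
conn.execute(
    "UPDATE taxon SET species = ? WHERE taxon_id = ?",
    ("imaginensis", taxon_id),
)

row = conn.execute("""
    SELECT t.genus, t.species, t.guid
    FROM specimen AS s JOIN taxon AS t ON t.taxon_id = s.taxon_id
""").fetchone()
print(row)
```

With the compound-key table, the same correction would change the key itself, so every table referring to the taxon by name would need updating as well.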

Although I realize that using artificial identifiers (in the form
of some kind of alphanumerical key/string) looks awkward and thus may
raise instinctive objections with many who encounter them, I do not
see any real problems with them, provided computers can read them
unambiguously and as long as humans are not obliged to deal with them
(although for safety's sake they should have access!).

The issue that I am focussing on is the information content that is
accessed: which label unlocks what information? Just making a stack
of labels (which is what any "name" is, a label) only results in a
stack of labels, which is not necessarily of any use whatsoever.

From a nomenclatural point of view there are two (and only two,
no more!) items of prime importance:
1) the scientific name (in its one correct spelling)
2) the type
In a database this (name + type) can be captured by a single
alphanumerical key: it is a unique "entity", a nomenclatural entity.
Once this nomenclatural entity is included in a database, it is
possible to attach a whole slew of nomenclatural information
(what is its rank, who published it where, etc), which is very nice
for completeness, and for quality control, but immaterial for
information-access purposes. This is the easy part, but absolutely

Nomenclatural information, by itself, is not of practical value.
Some names can be dealt with entirely from a nomenclatural
perspective, by application of nomenclatural rules (Caryophyllus
Mill. is an illegitimate name, a homotypic synonym of Dianthus L.)
or by the Act of a higher nomenclatural authority. The difficult
part in building a database is how to access actual information.
Often it will be necessary to break things down to separate
documented usages of a name; these then each to be linked to a
particular taxon concept (like: Fantasia imaginensis as treated
in the Flora Utopica), that is, a particular circumscription
(which may correspond to a chresonym, or a collection of chresonyms,
if you like). A database had better have a unique alphanumerical
key for each circumscription, or it becomes a meaningless mess.
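
The structure described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class names, the key formats ("N-0001", "C-0001"), the second treatment ("Flora Arcadica") and the three sources are all hypothetical. Only Fantasia imaginensis and the Flora Utopica come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NomenclaturalEntity:
    key: str        # opaque key standing for (name + type)
    name: str       # the scientific name in its one correct spelling
    type_ref: str   # pointer to the type (placeholder value here)


@dataclass(frozen=True)
class Circumscription:
    key: str        # one key per taxon concept
    nomen_key: str  # the nomenclatural entity applied to this concept
    treatment: str  # the work in which the concept is defined


@dataclass(frozen=True)
class Usage:
    verbatim: str              # the string as it appears in the source
    source: str                # where the name was used
    circumscription_key: str   # which concept that usage refers to


nomen = NomenclaturalEntity("N-0001", "Fantasia imaginensis", "type-placeholder")

# One name, two circumscriptions (the second treatment is invented).
concepts = {
    "C-0001": Circumscription("C-0001", nomen.key, "Flora Utopica"),
    "C-0002": Circumscription("C-0002", nomen.key, "Flora Arcadica"),
}

# Documented usages, each linked to exactly one circumscription.
usages = [
    Usage("Fantasia imaginensis", "checklist 1", "C-0001"),
    Usage("F. imaginensis", "field guide 2", "C-0001"),
    Usage("Fantasia imaginensis", "monograph 3", "C-0002"),
]


def usages_for(circumscription_key: str) -> list[Usage]:
    """Return every documented usage linked to one circumscription."""
    return [u for u in usages if u.circumscription_key == circumscription_key]


for key, concept in concepts.items():
    sources = [u.source for u in usages_for(key)]
    print(f"{nomen.name} sensu {concept.treatment}: {sources}")
```

The name string is identical in two of the usages, yet they unlock different information; only the circumscription key tells them apart.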
* * *

>> Text strings will never become consistent, nor stop multiplying.
>> Not unless humans are excluded from the entire process.

> I think you misunderstood my point.  By "consistent" I didn't mean
> that we'd all share the same taxonomy and nomenclature.  I meant
> that we wouldn't have to continue to deal with variations like this:

Actually, that is exactly how I understood it. For practical purposes,
there is an endless number of variations. Not only won't this ever
stabilize, but such a variation of a text string does not unlock and
access useful information (for example, of the occurrences of spelling
variation umpteen-and-one there may be three that refer to
circumscription A, two to circumscription B and five to circumscription
C, while of the occurrences of spelling variation umpteen-and-two
there are six that refer to circumscription A, one to circumscription
B and two to circumscription C, and on-and-on for all the other
spelling variations). Documenting variations of text strings gets you
nothing, except lots and lots of such text strings.
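
The counts in the example above can be tallied in a few lines. This is a sketch only: the variant labels "variant-1" and "variant-2" stand in for the two spelling variations, and the data are the illustrative numbers from the paragraph, not real occurrences.

```python
from collections import Counter, defaultdict

# (spelling variant, circumscription) pairs, using the counts given above.
occurrences = (
    [("variant-1", "A")] * 3 + [("variant-1", "B")] * 2 + [("variant-1", "C")] * 5
    + [("variant-2", "A")] * 6 + [("variant-2", "B")] * 1 + [("variant-2", "C")] * 2
)

# Indexing by text string: each variant is spread over several circumscriptions.
by_variant = defaultdict(Counter)
for variant, circumscription in occurrences:
    by_variant[variant][circumscription] += 1

# Variants that do not point to a single circumscription.
ambiguous = [v for v, counts in by_variant.items() if len(counts) > 1]

# Indexing by circumscription: the spelling used no longer matters.
by_circumscription = Counter(c for _, c in occurrences)

print(dict(by_variant))
print(ambiguous)
print(by_circumscription)
```

Both variants turn out to be ambiguous, so knowing which string was used says nothing about which circumscription was meant.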
* * *

>> ***
>> Actually, I do not see that the "myriad text strings... are
>> our only link ... to important information about
>> biodiversity" nor that they would be sufficient to access all
>> the information. They are just what the "biodiversity
>> informatics" people are dealing with.
>> * * *

> OK, I don't follow.  Can you elaborate on how else we index
> information about biodiversity in published (and unpublished) forms?
> It seems to me that the entire REASON we use taxon names is to
> abstract the notion of a taxon concept in the form of a series
> of text characters, and that we use those text-character strings
> to label other information (specimens, images, DNA sequences,
> ecological datasets, taxonomic revisions, etc., etc.).

Well, my point is a very general one. There is a lot of information
"out there" and our link to it is the names that are used; these names
can roughly be subdivided into scientific, purporting-to-be-scientific,
common and vernacular names. All of which may well have spelling
variations, without these necessarily meaning much (or anything at all).
* * *

> And it wasn't the "biodiversity informatics people" who created the mess.

Oh no, it is probably safe to say that biodiversity is messy to begin with,
and as for the myriad people who have been dealing with it ...
* * *

> You can blame the "biodiversity informatics people" for a lot of
> things, but the mess of text-strings purported to represent scientific
> names that we now have to deal with is definitely not among those
> things.

That depends on how you phrase it ...


Taxacom Mailing List
Taxacom at mailman.nhm.ku.edu

The Taxacom archive going back to 1992 may be searched with either of these methods:

(1) http://taxacom.markmail.org

Or (2) a Google search specified as:  site:mailman.nhm.ku.edu/pipermail/taxacom  your search terms here
