deepreef at bishopmuseum.org
Wed Sep 16 15:15:16 CDT 2009
> Two major issues with this:
> 1) there are not 18 million names in GNI, but 18 million text
> strings, with chance determining what exactly is in such a
> text string.
Of all the things that have been frustrating to me operating on the border
between taxonomy and informatics, by FAR the most frustrating has been the
diversity and disparity of what people mean when they use the word "name" in
the context of taxonomy. One of the definitions often used by many people
(not necessarily taxonomists) is the unit of information indexed by GNI.
That is, "a text string purported to represent an organism". We coined the
term "NameString" (discussed recently on this list in another thread)
*specifically* to disambiguate "a text string purported to represent an
organism" from what most taxonomists think of as a "name" (although
taxonomists have a lot of variation for the meaning of a "name" all on their
When I think of "name" in the context of taxonomy, this is *not* what I
think of. What I think of is probably very similar to what you think of.
If you read my post, you'll see that when I said "18 million of them", I was
referring specifically to "variations in how scientific names have been
structured". That is, not 18 million "names", but 18 million *variations*
in the way *we* represent taxon names (with and without authors,
abbreviations, alternate spellings, alternate combinations, alternate
formatting conventions, etc., etc., etc.).
> Likely there are 2 to 3 million names, at most?
By what you and I mean by "name" (i.e., individual nomenclatural units
established under rules of the various Codes) -- yes, that's about what I
> 2) the structure of scientific names is not directly
> determined by existing variation, but by the Code that
> applies in that particular case, retroactively if needs be.
> So, looking at variations does not necessarily lead to
> learning anything.
It depends on what your goal is. The goal of GNI is to build links between
the god-awful mess we have created in *representing* scientific names in
digital form, and the "real" names as you and I would think of it. Hence
the perfect marriage between GNI and GNUB. Whereas GNI is focused on
text-strings "in the wild" (as Dave Remsen describes it), GNUB is focused
more on the "Curated Chresonym Index" (a term I coined, but do not
particularly like, and would rather not perpetuate -- see:
http://en.wikipedia.org/wiki/Chresonym). I prefer "Taxon Name Usage" (TNU).
Emerging from the TNU instances in GNUB are the kind of highly-structured,
robustly-metadata'd, carefully scrutinized "name-objects" (to disambiguate
"name") that practicing taxonomists need access to. Under the GNA umbrella,
my vision is that GNI will serve to "wrangle in" all those messy text
strings, and provide services to get them anchored into GNUB, and thereby
anchored into proper "names" as you and I think of that word.
> By the look of things many came from the data aggregators?
True...but where did they get them? How many variants did they actually
*create*, as opposed to simply aggregating what the taxonomic community
created? Perhaps some of the former (in terms of database errors, bad
parsing and re-concatenating, etc.). But the vast majority of the mess
ultimately came from us, and our 250 years of literature, specimen
collections, and other sources.
> Well, I am not particularly convinced as to who these 'us' are ...
As I said in my post: "the taxonomists, museum specimen curators, and
various taxonomic data managers of the world (among others)."
> As to the algorithms, these may be robust, but so far this
> does not show at all.
I'll let Dmitry address that. But I believe it has already been mentioned
that this effort is not intended for human eyes, and therefore the real
"meat" of the operation happens out of view. Despite this, however, the
information is there.
Search for any entry in the index.
Click on any of the names displayed in the column at left.
On the right, you'll see a link for "Parsed information (show)" Click on
Grabbing a random example from the first page under "AAA":
"Aaadonta constricta komakanensis Solem 1976"
gets parsed to:
normalized: Aaadonta constricta komakanensis Solem 1976
authorship: Solem 1976
canonical: Aaadonta constricta komakanensis
verbatim: Aaadonta constricta komakanensis Solem 1976
This wasn't done by a human; it was done by an algorithm. It's pretty damn
hard to find one that was done wrong.
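To give a rough feel for the kind of parsing shown above, here is a minimal
sketch in Python. It is not GNI's parser: the pattern, the function name
parse_name_string, and the choice to handle only the simple
"Genus species [infraspecies] Author Year" shape are all assumptions made
for illustration. A real parser has to cope with far more variation.

```python
import re

# Hypothetical pattern, for illustration only. It covers just one simple
# shape: "Genus species [infraspecies] [Author[,] Year]".
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+"
    r"(?P<species>[a-z]+)"
    r"(?:\s+(?P<infraspecies>[a-z]+))?"
    r"(?:\s+(?P<author>[A-Z][^\d,]*?),?\s+(?P<year>\d{4}))?$"
)


def parse_name_string(verbatim):
    """Split a name-string into the four fields shown in the example above.

    Returns None when the string does not fit the simple pattern.
    """
    match = NAME_PATTERN.match(verbatim.strip())
    if match is None:
        return None
    parts = match.groupdict()
    # Canonical form: the name elements only, without authorship.
    canonical = " ".join(
        p for p in (parts["genus"], parts["species"], parts["infraspecies"]) if p
    )
    # Authorship: author and year, if present.
    authorship = " ".join(p for p in (parts["author"], parts["year"]) if p)
    # Normalized form: canonical plus authorship in one consistent layout.
    normalized = " ".join(p for p in (canonical, authorship) if p)
    return {
        "verbatim": verbatim,
        "normalized": normalized,
        "canonical": canonical,
        "authorship": authorship,
    }
```

Calling parse_name_string("Aaadonta constricta komakanensis Solem 1976")
returns the same four values listed above. A string written with a comma
before the year ("... Solem, 1976") normalizes to the same form, which is
the sort of variation the parsing step is meant to absorb.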
But the point is, once the text string is parsed, it becomes *enormously*
easier to cross-link it to a properly curated "name-object" in a
It already does some preliminary analysis of this sort, as included in the
33 author_word, 38
0 genus, 8
39 year, 43
20 infraspecies, 32
9 species, 19
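The numbers in that listing read as character offsets into the name-string
(start position, element type, end position). A small sketch, assuming the
element labels are already known from a parsing step, reproduces them. The
function name token_offsets is hypothetical, and this is not claimed to be
how GNI computes the positions.

```python
def token_offsets(name_string, labels):
    """Return (start, label, end) for each whitespace-separated token.

    Offsets are 0-based and the end offset is exclusive. `labels` must be
    given in the same order as the tokens appear in the string.
    """
    offsets = []
    cursor = 0
    for label, token in zip(labels, name_string.split()):
        # Search from the cursor so repeated tokens map to the right spot.
        start = name_string.index(token, cursor)
        end = start + len(token)
        offsets.append((start, label, end))
        cursor = end
    return offsets


positions = token_offsets(
    "Aaadonta constricta komakanensis Solem 1976",
    ["genus", "species", "infraspecies", "author_word", "year"],
)
for start, label, end in positions:
    print(start, label, end)
```

This prints 0 genus 8, 9 species 19, 20 infraspecies 32, 33 author_word 38
and 39 year 43, matching the values above (which are listed in a different
order).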
It also gets combined with other name-strings in the same "Lexical group",
and shows you which content providers have each particular variant.
All of this makes it much, much easier to build links between the mess of
taxonomic data that exists out there "in the wild".
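As a sketch of the grouping idea, the following assumes (as a
simplification, not necessarily GNI's definition) that a "Lexical group" is
the set of name-strings sharing one canonical form. The function names, the
variant strings, and the provider labels are all placeholders invented for
the example.

```python
from collections import defaultdict


def canonical_form(name_string):
    """Keep the leading name elements; stop at the first token that is not
    all lowercase (an author word or a year)."""
    tokens = name_string.split()
    kept = tokens[:1]  # the genus is capitalized, so keep it unconditionally
    for token in tokens[1:]:
        if not token.islower():
            break
        kept.append(token)
    return " ".join(kept)


def lexical_groups(records):
    """Group (name_string, provider) pairs by canonical form.

    Returns {canonical: {name_string: sorted list of providers}}.
    """
    groups = defaultdict(lambda: defaultdict(set))
    for name_string, provider in records:
        groups[canonical_form(name_string)][name_string].add(provider)
    return {
        canonical: {variant: sorted(providers) for variant, providers in variants.items()}
        for canonical, variants in groups.items()
    }


# Placeholder records: the provider names are made up.
records = [
    ("Aaadonta constricta komakanensis Solem 1976", "provider_a"),
    ("Aaadonta constricta komakanensis Solem, 1976", "provider_b"),
    ("Aaadonta constricta komakanensis", "provider_a"),
    ("Aaadonta constricta komakanensis", "provider_c"),
]
groups = lexical_groups(records)
```

Here all three variants land under the single key
"Aaadonta constricta komakanensis", each with the list of providers that
supplied it.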
> What gets to me is that there are two different 'universes':
> 1) scientific names
> 2) instances of these being used (what GNI calls a "record")
> (there is a third one, of taxa, but that is out of
> consideration here).
I agree, but would rephrase as:
1) scientific names treated as "objects" with rich metadata, as used by a
2) Name-string instances purported to represent a "name-object" as in #1.
[Note: these sound like bioinformatics-ish definitions, but in fact they
exist independent of anything electronic -- they apply just as well to
> So, the logical thing to do would be either
> 1) to just list all the "records", which after all are
> individually distinct, each with their own history, their own
> data and selection of what they include and do not include; or
> 2) organise these records, by the logical criterium, that is
> by scientific name (OK, with perhaps separate entries for
> each orthographical variant, at least initially).
> (the thing that everybody would really want is to organise
> them by taxon, but that is out of consideration here).
Because they have already done the parsing, it would be quite easy to
organize it that way -- and I'm sure that eventually it will. But as
stated, the whole process is about getting machines to talk to machines, to
cross-link data. This only offers a "window" on that process. At the first
GNI brainstorming session, there was even a small debate about whether there
should be *any* human-accessible interface. Those opposed to having one
basically made the point that people would see the mess of name-strings, and
get the wrong impression about what the function of GNI really is. However,
I think the user interface is far, far better than I ever imagined it would
be, and so I'm very happy to see that they have it working the way they do.
I hope that clears, rather than muddies, the waters a bit. Speaking as
someone who has played the biodiversity informatics game (particularly in
terms of taxon names) for about 20 years now, the emerging GNA (GNI, GNUB,
and associated services) is by far the most exciting thing I've seen in this
field, and the one that offers me the most hope that we might actually be
approaching a solution to this age-old problem of integrating biodiversity
Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org