David Remsen (GBIF)
dremsen at gbif.org
Wed Sep 16 14:52:47 CDT 2009
You might try out the name parsing algorithm Dmitry put together and
also review the grammar that it uses. Or pass some other list of
names through the parser service at http://globalnames.org/parsers/new.
I think it does a pretty good job.
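To see why a grammar matters, here is a minimal sketch of the naive
"first bit, second bit, rest" split that Paul suggests in the quoted
message below. The function name and the example strings are my own
assumptions for illustration; this is not the GNI parser or its grammar.

```python
def naive_split(name_string):
    """Naive split: first token = genus, second = epithet, rest = authorship.

    Hypothetical helper for illustration only; not the GNI parser.
    """
    tokens = name_string.split()
    return {
        "genus": tokens[0] if tokens else "",
        "epithet": tokens[1] if len(tokens) > 1 else "",
        "authorship": " ".join(tokens[2:]),
    }


# A plain binomial with author and year splits as hoped.
simple = naive_split("Homo sapiens Linnaeus, 1758")

# A subgenus in parentheses lands in the epithet slot, and the real
# epithet is pushed into the authorship.
subgenus = naive_split("Aedes (Stegomyia) aegypti (Linnaeus, 1762)")
```

The second call labels "(Stegomyia)" as the epithet, which is the kind
of case a Code-aware grammar is meant to handle.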
The grammar was developed by an international group that spent a lot
of time consulting the published Codes and included zoologists and
botanists. Accurate parsing is a component of being able to better
tie a particular name usage to an intended taxon concept, which
ultimately is what we are working for, and thus is not out of
consideration. It may, however, be out of the scope for this
particular index. Of course this will be facilitated by having
infrastructure in place that supports the publication, identification
and resolution of these concepts. We are doing our bit to facilitate
this at GBIF and use the parsing rules to try to clear up the mess in
our own data index. The GBIF data portal, from which this index is
derived, must fall into this category of 'data aggregator', but I'm not
sure why that should make too big a difference. This index is the
distinct summary of orthographies that are published to the GBIF
network from museums and other institutions and these, by and large,
reflect what is in their local databases. It's a copy.
You said there are likely 2-3 million names, at most. In what sense
of the word "name"? I get different answers from botanists than from
zoologists as to what they mean, and it affects the cardinality of the
estimate. I posted a question some time back where I asked how many
name-bearing types might exist. I fear (as I often do in this area)
that I didn't have the term quite right but really was asking how many
original species descriptions (which I assume is tied to a type)
exist. Clearly there are more of these than there are species. From
these how many have been moved to new genera, replaced, etc. to create
more distinct names? I would have thought the number to be higher
than 2-3 million.
As to the interface and ordering of the GNI index I am not sure how
they should be organised and I suppose it's based on what the index is
for. I want to use it to access links to things that I want
dynamically updated. I want the orthographic matching to ensure that
a link is made regardless of the specific orthography. So I don't
intend to look at the interface or even many of the records that are
in there. Just the ones that point to the resources I want quick
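As a rough illustration of the orthographic matching I have in mind,
here is a minimal sketch that reduces each text string to a crude key
so that trivially different orthographies link to the same entry. The
function names and the normalisation rules (drop accents, case,
punctuation and extra whitespace) are assumptions made for this
example, not how GNI actually does it.

```python
import re
import unicodedata


def match_key(name_string):
    """Reduce a name string to a crude matching key.

    Assumption: dropping accents, case, punctuation and extra
    whitespace is enough to group trivial orthographic variants.
    """
    decomposed = unicodedata.normalize("NFKD", name_string)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    no_punct = re.sub(r"[^a-z0-9 ]+", " ", ascii_only.lower())
    return " ".join(no_punct.split())


def build_index(name_strings):
    """Group the original text strings under their matching key."""
    index = {}
    for text in name_strings:
        index.setdefault(match_key(text), []).append(text)
    return index


variants = [
    "Homo sapiens Linnaeus, 1758",
    "Homo sapiens Linnaeus 1758",
    "HOMO SAPIENS  LINNAEUS,1758",
]
index = build_index(variants)
```

All three variants end up under one key here. A key this crude will not
reconcile abbreviated authors (for example "L." against "Linnaeus");
that takes real parsing.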
On Sep 16, 2009, at 9:05 PM, <dipteryx at freeler.nl> wrote:
> From: Richard Pyle [mailto:deepreef at bishopmuseum.org]
> Sent: Wed 16-9-2009 20:01
>> Hi Paul,
>>> For a project that is presented as "GNI is meant to collect
>>> all Scientific Names" it pays remarkably little attention to
>>> scientific names (and their structure).
>> This is precisely the opposite of correct.
>> In my estimation, the people developing services behind GNI (I am not
>> among them) have paid more attention to the structure of scientific
>> names than probably any taxonomist in history.
>> How can I say such a seemingly outlandish thing?
>> Well, for one thing, they have access to more variations in how
>> scientific names have been structured than any other taxonomist has.
>> About 18 million of them, in fact.
> Two major issues with this:
> 1) there are not 18 million names in GNI, but 18 million text
> strings, with chance determining what exactly is in such a text
> string.
> Likely there are 2 to 3 million names, at most?
> 2) the structure of scientific names is not directly determined by
> existing variation, but by the Code that applies in that particular
> case,
> retroactively if needs be. So, looking at variations does not
> lead to learning anything.
> * * *
>> Where did these names come from? They came from *us* --
>> the taxonomists, museum specimen curators, and various taxonomic
>> data managers of the world (among others).
> By the look of things many came from the data aggregators?
> * * *
>> GNI did not create *any* of these text strings -- they simply hold a
>> mirror up to us and show us the mess we've created.
>> But more directly on the issue of paying attention to the structure
>> of scientific names, the key services behind GNI include some of the
>> most robust algorithms ever developed for parsing these text strings
>> into atomized bits of name "data". The entire point of these
>> algorithms
>> is to help clean up *our* mess. And frankly, I'm *amazed* at how
>> those algorithms are. Are they perfect? Well....is anything ever
>> perfect?
>> Of course not. But they are far, far closer to perfect than we, the
>> taxonomic community, have ever been (as evidenced by 18 million text
>> strings purported by us to represent a fraction of that number of
>> scientific names).
> Well, I am not particularly convinced as to who these 'us' are ...
> As to the algorithms, these may be robust, but so far this does not
> at all. By the results so far, it would have been better to just
> label the first bit of text string "generic name"; the second bit
> "specific epithet" (or "specific name") and the rest "author
> citation", and take it from there (horribly inaccurate as that
> would be).
> What gets to me is that there are two different 'universes':
> 1) scientific names
> 2) instances of these being used (what GNI calls a "record")
> (there is a third one, of taxa, but that is out of consideration
> here).
> So, the logical thing to do would be either
> 1) to just list all the "records", which after all are individually
> distinct, each with their own history, their own data and selection
> of what they include and do not include; or
> 2) organise these records, by the logical criterion, that is by
> name (OK, with perhaps separate entries for each orthographical
> variant, at least initially).
> (the thing that everybody would really want is to organise them by
> taxon, but that is out of consideration here).
> The choice to organise them by text string looks terribly random, as
> on the one hand any number of very disparate items (belonging to
> wildly different taxa) can be linked to the same text string (in the
> case of homonyms) and on the other hand two or three copies of the
> exact same record (but one or more of them slightly edited for
> style) end up under different
> Also, it is very likely that somebody who has a text string will not
> an exact match, even when he has a very common variation. The whole
> looks designed to maximize confusion.