dipteryx at freeler.nl
Wed Sep 16 14:05:04 CDT 2009
From: Richard Pyle [mailto:deepreef at bishopmuseum.org]
Sent: Wed 16-9-2009 20:01
>> For a project that is presented as "GNI is meant to collect
>> all Scientific Names" it pays remarkably little attention to
>> scientific names (and their structure).
>This is precisely the opposite of correct.
>In my estimation, the people developing services behind GNI (I am not
>among them) have paid more attention to the structure of scientific
>names than probably any taxonomist in history.
>How can I say such a seemingly outlandish thing?
>Well, for one thing, they have access to more variations in how
>scientific names have been structured than any other taxonomist has.
>About 18 million of them, in fact.
Two major issues with this:
1) there are not 18 million names in GNI, but 18 million text
strings, with chance determining what exactly is in such a text string.
Likely there are 2 to 3 million names, at most?
2) the structure of scientific names is not directly determined by
existing variation, but by the Code that applies in that particular case,
retroactively if needs be. So, looking at variations does not necessarily
lead to learning anything.
* * *
>Where did these names come from? They came from *us* --
>the taxonomists, museum specimen curators, and various taxonomic
>data managers of the world (among others).
By the look of things many came from the data aggregators?
* * *
>GNI did not create *any* of these text strings -- they simply hold a
>mirror up to us and show us the mess we've created.
>But more directly on the issue of paying attention to the structure
>of scientific names, the key services behind GNI include some of the
>most robust algorithms ever developed for parsing these text strings
>into atomized bits of name "data". The entire point of these algorithms
>is to help clean up *our* mess. And frankly, I'm *amazed* at how accurate
>those algorithms are. Are they perfect? Well....is anything ever perfect?
>Of course not. But they are far, far closer to perfect than we, the
>taxonomic community, have ever been (as evidenced by 18 million text strings
>purported by us to represent a fraction of that number of scientific names).
Well, I am not particularly convinced as to who these 'us' are ...
As to the algorithms, these may be robust, but so far this does not show
at all. By the results so far, it would have been better to just label the
first bit of text string "generic name"; the second bit "specific epithet"
(or "specific name") and the rest "author citation", and take it from there
(horribly inaccurate as that would be).
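To make that naive labelling concrete, here is a minimal sketch in Python.
The function name naive_split is made up for illustration and has nothing
to do with the actual GNI code; "Aus bus var. cus Smith" is an invented
placeholder name, used only to show where the approach breaks down.

```python
def naive_split(text):
    """Label the first word "generic name", the second "specific epithet"
    and everything after that "author citation" (hypothetical helper)."""
    parts = text.split(None, 2)        # split on whitespace, at most 3 pieces
    parts += [""] * (3 - len(parts))   # pad when the string is short
    return tuple(parts)


# A simple binomial with author and year comes out as intended:
print(naive_split("Homo sapiens Linnaeus, 1758"))

# An infraspecific name (invented placeholder) shows the inaccuracy:
# the rank marker and the infraspecific epithet land in the "author citation".
print(naive_split("Aus bus var. cus Smith"))
```

Anything that is not a plain binomial followed by an author (infraspecific
names, hybrid formulas, names above the rank of species) ends up mislabelled,
which is why this can only be a starting point.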
What gets to me is that there are two different 'universes':
1) scientific names
2) instances of these being used (what GNI calls a "record")
(there is a third one, of taxa, but that is out of consideration here).
So, the logical thing to do would be either
1) to just list all the "records", which after all are individually
distinct, each with their own history, their own data and selection
of what they include and do not include; or
2) organise these records by the logical criterion, that is, by scientific
name (OK, with perhaps separate entries for each orthographical variant,
at least initially).
(the thing that everybody would really want is to organise them by taxon,
but that is out of consideration here).
The choice to organise them by text string looks terribly random, as on the
one hand any number of very disparate items (belonging to wildly different
taxa) can be linked to the same text string (in the case of homonyms) and
on the other hand two or three copies of the exact same record (but one or
more of them slightly edited for style) end up under different entries.
Also, it is very likely that somebody who has a text string will not find
an exact match, even when he has a very common variation. The whole thing
looks designed to maximize confusion.
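The difference between the two ways of organising can be sketched as
follows. This is only an illustration under stated assumptions: the
function names are made up, and the "name key" (the first two words,
lower-cased) is a crude stand-in for a properly determined scientific name,
not what GNI or any Code prescribes.

```python
from collections import defaultdict


def group_by_string(records):
    """One entry per exact text string."""
    groups = defaultdict(list)
    for record in records:
        groups[record].append(record)
    return dict(groups)


def group_by_name(records):
    """One entry per (crudely approximated) scientific name.
    Assumption: the first two words, lower-cased, identify the name."""
    groups = defaultdict(list)
    for record in records:
        key = " ".join(record.split()[:2]).lower()
        groups[key].append(record)
    return dict(groups)


# Three stylistic variants of the same record:
records = [
    "Homo sapiens Linnaeus, 1758",
    "Homo sapiens Linnaeus 1758",
    "Homo sapiens L.",
]

print(len(group_by_string(records)))  # number of entries when keyed by text string
print(len(group_by_name(records)))    # number of entries when keyed by name
```

Note that neither grouping separates homonyms; that needs information beyond
the text string itself.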