I have not followed this thread closely, but it seems to me that the main problems people complain about regarding data harvested by aggregators like GBIF fall into two broad categories:
1) The indicated geographic location is bad
2) The indicated taxon is bad

Bad geography comes in two basic forms:
a) The stated geographic place is not correct.  This could be due to bad original data or bad digitization, but there is generally no way to fix this other than fixing it at the source.

b) The stated geographic place is correct, but the associated lat/long coordinates are either missing or wrong.  This one could be improved through various georeferencing algorithms and tools and/or crowd-sourcing.

Bad taxonomy also comes in two basic forms:
a) The organism was misidentified. Again, there is no real way to fix this other than to fix it at the source.  Sometimes a reasonable inference can be made by a good taxonomist, but that always comes with risks.

b) The name used to represent the organism was "correct" in the context in which the organism was identified, but the name is not consistent with "modern" representations of "accepted" taxonomy.  There are many reasons for this, such as abbreviated or misspelled names, names that are objectively unavailable via the relevant Code (e.g., not validly published), names that are now widely regarded as heterotypic synonyms of other names, names that are classified in a different genus from what modern taxonomists follow, and text-strings that are really not representative of Linnean-style scientific names at all.

Of these various categories of problems, I suspect it's the last one that represents the largest portion of the "mess".  The good news is that help is on the way.

If you've got some time, and have an interest in this sort of thing, grab a cup of coffee and read on. Otherwise, hit "delete" now.


Still here?  Cool.

OK, so one of the prototype services Rob Whitton and I developed through NSF funding of the Global Names Architecture is a service we call "real-time taxonomic translation".  Basically, this is a service that "translates" taxon names into the "modern" equivalent. The best way to demonstrate the power of this service is through a specific example.

When Rob and I are wearing our fish-nerd hats instead of our database-nerd hats, we are collaborating with colleagues at NOAA to develop a comprehensive checklist of the fishes of the Northwestern Hawaiian Islands that is "evidence-based" (i.e., occurrence-based with explicit evidence supporting each occurrence).  When this is published later this year (or possibly early next year), I think it will represent a very cool model for how all regional organism checklists should be done in the future.  But for this Taxacom post, I want to focus on just one small component of it:  how real-time taxonomic translation works.

So, the "evidence" behind the occurrences we are using to develop this checklist comes from various sources: museum specimens, recorded observations, photos and videos, and, of course, historical literature reports.  On the literature reports, so far we have captured 2,856 occurrence records based on reports in 24 publications going back 114 years.  If we only look at the raw taxon names as they appeared in these 24 publications, we get a list of 675 distinct scientific names.  Obviously, the prevailing taxonomy has changed over these 114 years, so many of those names are not consistent with the "modern" interpretation of the relevant taxonomy.  It would take many hours of time from multiple experts to review all of those 675 names and figure out all the corrected spellings, synonymies, etc.  However, using the real-time taxonomic translation service Rob and I developed, we can convert these 675 historical names into the 506 "accepted" names as we would use them today.  And it does so in a few seconds (i.e., in "real time").

A short explanation of how it works is as follows:

All 2,856 literature-based occurrence records are tied to a "Taxon Name Usage" (TNU) instance (i.e., the usage of a taxon name within a publication). These represent how the original publication recorded the name.  For example, what we now call Acanthurus triostegus had been variously recorded in these literature citations by the following names:
Acanthurus triostegus (Linnaeus, 1758)
Hepatus triostegus (Linnaeus, 1758)
Acanthurus triostegus sandvicensis Streets, 1877
Hepatus sandvicensis (Streets, 1877)
Teuthis sandvicensis (Streets, 1877)

Similarly, what we now call Coris flavovittata has been recorded variously as:
Coris flavovittata (Bennett, 1828)
Coris lepomis Jenkins, 1901
Julis eydouxii Valenciennes in Cuvier & Valenciennes, 1839
Julis flavovittata Bennett, 1828

...and so on for all the different names.

Every TNU is linked to what we call the "Protonym" of the name.  This is roughly equivalent to the botanical "basionym", and represents the original description of the name. Taking the second example above, there are three distinct Protonyms represented among the four names used for Coris flavovittata:
flavovittata Bennett, 1828
eydouxii Valenciennes in Cuvier & Valenciennes, 1839
lepomis Jenkins, 1901
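
For the database-minded, here is a minimal Python sketch of the TNU-to-Protonym link described above. It is only an illustration, not the actual GNUB schema: the class names, field names, and the group_by_protonym helper are all assumptions made up for this example. The data are the four Coris flavovittata names listed above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Protonym:
    """Original description of a name (hypothetical structure)."""
    epithet: str
    authorship: str


@dataclass(frozen=True)
class TNU:
    """A name as it was used in a publication (hypothetical structure)."""
    name_as_published: str
    protonym: Protonym


flavovittata = Protonym("flavovittata", "Bennett, 1828")
eydouxii = Protonym("eydouxii", "Valenciennes in Cuvier & Valenciennes, 1839")
lepomis = Protonym("lepomis", "Jenkins, 1901")

tnus = [
    TNU("Coris flavovittata (Bennett, 1828)", flavovittata),
    TNU("Coris lepomis Jenkins, 1901", lepomis),
    TNU("Julis eydouxii Valenciennes in Cuvier & Valenciennes, 1839", eydouxii),
    TNU("Julis flavovittata Bennett, 1828", flavovittata),
]


def group_by_protonym(usages):
    """Collect the published names that share each Protonym."""
    groups = defaultdict(list)
    for usage in usages:
        groups[usage.protonym].append(usage.name_as_published)
    return dict(groups)


groups = group_by_protonym(tnus)
for protonym, names in groups.items():
    print(protonym.epithet, protonym.authorship, "<-", names)
```

Four usages reduce to three Protonyms here, because the two "flavovittata" combinations (in Coris and in Julis) trace back to the same original description.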

The taxonomic translation service is built around the "Meta-Authority" (Authority of Authorities) concept.  A Meta-Authority is any organization or individual that wants to assert an "accepted" taxonomy.  For example, ITIS, CoL, WoRMS, etc. are all Meta-Authorities, because they assert an "accepted" usage for each taxon name.  For our checklist paper, we have established our own Meta-Authority (technically now recorded as the "Rob Whitton Meta-Authority", but functionally it is the Bishop Museum Meta-Authority). Each Meta-Authority has a specific scope of interest -- which might be very large (ITIS, CoL, WoRMS, etc.), or might be very small (e.g., a single family or geographic region).

In any case, what a Meta-Authority does is, for each name within its scope of interest, make a statement along the lines of "For Protonym A, I/We follow the Treatment of Reference X".

In this case, the Rob Whitton/Bishop Museum Meta-Authority has made these assertions:
- For the protonym "flavovittata Bennett, 1828", we follow the treatment of Randall 2007 [who treats it as a valid species within the genus Coris].
- For the protonym "eydouxii Valenciennes in Cuvier & Valenciennes, 1839", we follow the treatment of Eschmeyer 2004 [who treats it as a junior synonym of flavovittata Bennett, 1828].
- For the protonym "lepomis Jenkins, 1901", we follow the treatment of Eschmeyer 2004 [who treats it as a junior synonym of flavovittata Bennett, 1828].
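
To make the translation step concrete, here is a rough Python sketch of how assertions like the three above could be resolved. This is not the code behind the actual service; the Treatment class, the two lookup tables, and the translate function are hypothetical names invented for illustration. The idea is simply: look up the Protonym behind a published name, follow any "junior synonym of" pointers, and return the valid name given by the treatment the Meta-Authority follows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Treatment:
    """What the followed reference says about a Protonym (hypothetical)."""
    reference: str
    valid_name: Optional[str] = None   # set when treated as a valid species
    synonym_of: Optional[str] = None   # set when treated as a junior synonym


FLAVOVITTATA = "flavovittata Bennett, 1828"
EYDOUXII = "eydouxii Valenciennes in Cuvier & Valenciennes, 1839"
LEPOMIS = "lepomis Jenkins, 1901"

# Meta-Authority assertions: Protonym -> Treatment of the followed reference.
ASSERTIONS = {
    FLAVOVITTATA: Treatment("Randall 2007", valid_name="Coris flavovittata"),
    EYDOUXII: Treatment("Eschmeyer 2004", synonym_of=FLAVOVITTATA),
    LEPOMIS: Treatment("Eschmeyer 2004", synonym_of=FLAVOVITTATA),
}

# TNUs: name as published -> Protonym.
TNU_TO_PROTONYM = {
    "Coris flavovittata (Bennett, 1828)": FLAVOVITTATA,
    "Coris lepomis Jenkins, 1901": LEPOMIS,
    "Julis eydouxii Valenciennes in Cuvier & Valenciennes, 1839": EYDOUXII,
    "Julis flavovittata Bennett, 1828": FLAVOVITTATA,
}


def translate(name_as_published: str) -> str:
    """Return the accepted name for a published name under ASSERTIONS."""
    protonym = TNU_TO_PROTONYM[name_as_published]
    seen = set()
    while True:
        if protonym in seen:
            raise ValueError(f"circular synonymy at {protonym!r}")
        seen.add(protonym)
        treatment = ASSERTIONS[protonym]
        if treatment.valid_name is not None:
            return treatment.valid_name
        if treatment.synonym_of is None:
            raise ValueError(f"no usable treatment for {protonym!r}")
        protonym = treatment.synonym_of


accepted = {name: translate(name) for name in TNU_TO_PROTONYM}
for name, result in accepted.items():
    print(f"{name} -> {result}")
```

All four historical names come out as Coris flavovittata. Because the lookups are plain dictionary hits, running this over a few hundred names is effectively instantaneous.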

This is how we are able to collapse those messy 675 names spanning 114 years of taxonomic history into the 506 names that we (the experts of the fishes of the Northwestern Hawaiian Islands) regard as "accepted" in a few seconds.

If anyone wants more details on how it works, I'd be happy to explain further.

The main limitations of this service are:
1) It's limited to the names within the Global Names Usage Bank (GNUB; currently 543,989 TNUs linked to 195,369 Protonyms); and
2) There is currently only one Meta-Authority implemented

We already have funding from NSF to address limitation #1, by developing a workflow to capture millions of protonyms and tens of millions of TNUs through integrating GNUB, GNI, BHL, and multiple other taxonomic data sources.  We also plan to expand the Meta-Authority list to include the "big" ones (e.g., ITIS/CoL, WoRMS, NCBI), and develop tools to make it easy for any individual or organization to create their own personal Meta-Authority.  And, we just submitted a proposal to NSF to (among other things) develop this real-time taxonomic translation service into a set of tools that can be very easily applied to any list of taxon names.

If we are successful, users of GBIF data will have the option of selecting any Meta-Authority they want (one of the big ones, or their own), and then be able to translate (in real time) all the taxon names as they appear in the GBIF dataset into the "accepted" modern/clean equivalent names according to the selected Meta-Authority.  And the Meta-Authorities aren't just for species-level names -- they also provide full "accepted" classifications all the way up to Kingdom.
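
As a rough sketch of what "selecting a Meta-Authority" might look like from the user's side, here is a small Python example. Everything in it is hypothetical: the two authority labels, the translate_names function, and the data, which use the conventional placeholder names "Aus bus" and "Cus dus" rather than real taxa. The point is only that the same input list can yield different "accepted" names depending on which Meta-Authority is chosen.

```python
from typing import Dict, Iterable, List, Optional

# Each hypothetical Meta-Authority maps a name as it appears in a dataset
# to the name that authority accepts. Placeholder names, not real taxa.
META_AUTHORITIES: Dict[str, Dict[str, str]] = {
    # Treats "Cus dus" as a synonym of "Aus bus".
    "authority_one": {"Aus bus": "Aus bus", "Cus dus": "Aus bus"},
    # Treats both names as valid.
    "authority_two": {"Aus bus": "Aus bus", "Cus dus": "Cus dus"},
}


def translate_names(names: Iterable[str], authority: str) -> List[Optional[str]]:
    """Translate each name under the chosen Meta-Authority.

    Names outside the authority's scope come back as None rather than
    being guessed at.
    """
    mapping = META_AUTHORITIES[authority]
    return [mapping.get(name) for name in names]


dataset_names = ["Aus bus", "Cus dus", "Eus fus"]
for authority in META_AUTHORITIES:
    print(authority, translate_names(dataset_names, authority))
```

Returning None for out-of-scope names is a deliberate choice in this sketch: it keeps the limits of each Meta-Authority's scope visible instead of silently passing the original string through.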

Obviously this won't solve all the problems with aggregated data, but it will help solve a lot of them.

OK, enough for now....


