[Taxacom] Chameleons, GBIF, and the Red List

Doug Yanega dyanega at ucr.edu
Thu Aug 21 17:08:22 CDT 2014

On 8/21/14 11:52 AM, James Macklin wrote:
> I would say that in general there
> are now reasonable solutions for achieving distributed annotation at
> various levels of complexity but there is still a challenge/bottleneck in
> pushing these annotations back to the source and into their collection
> management system. The bottleneck is potentially at the source that must
> process the annotations. If we automate (or even semi-auto) the annotation
> process through curation workflows, something my colleagues and I are now
> focusing on, we could potentially flood the "curators" of the
> specimens/data. Then the question becomes how much the owners are committed
> to processing potentially valuable modifications/additions and adding them
> to their database. Certainly data curation and positions to support it are
> in their infancy. The annotations that are not processed by the source
> still have value and can inform the aggregators but have to be dealt with
> in a slightly different manner.
Let me jump in here with an observation. In terms of the sheer volume of 
specimens, arthropods (mostly insects) represent the bulk of all curated 
natural history objects. By the nature of insect curation, however, the 
labels that accompany insect specimens are poor compared to those 
associated with, say, vertebrates or herbarium specimens - mostly due to 
the absurdly small size of the labels themselves, but also due to the 
fact that unlike most other types of museum specimens, a 
disproportionately large number were collected *and labeled by* 
untrained collectors. I have worked extensively in some 20 major insect 
collections, in several countries, and ALL of them have very serious 
issues with the quality of original data labels. Averaging across all 
institutions, across all taxa, and across all ages of specimens, roughly 
20% of all insect specimens, everywhere, suffer from either missing 
information (ranging from trivial to catastrophic), misspellings, or 
actual erroneous data, with the odds of such tending to be higher as 
specimens get older. Given that the standard approach to databasing 
insect specimens is to simply capture the existing data (occasionally 
with automated georeferencing, rarely with direct human screening of 
records), a lot of those same errors, ambiguities, and omissions are 
being carried over, wholesale, to various online data sources. There is 
a huge cost to imposing standards of quality control on label data 
capture, a cost which almost no one budgets for, and if the cost is not 
paid up front, then what we have to contend with is - as James notes - 
the necessity of data USERS trying to correct errors one by one and 
propagate those corrections back to the data providers. This might not 
be so bad if it were a small percentage of records, comprising a small 
number of specimens - but that is not the reality of the situation. The 
scope can run into the tens or hundreds of thousands of records, given 
that collections can have millions of specimens.

As James notes, "data curation and positions to support it are in their 
infancy" - and this also applies to grant-based databasing initiatives, 
where the priority is on the number of records put into the system, 
often with little genuine attempt to ensure that all those records are 
trustworthy (and that leaves aside the issue of whether specimens are 
identified correctly, which requires rare and costly expertise to ensure 
quality control*).

If providers don't clean data before making it available online, because 
the funding agencies don't insist upon it, then we're just deferring the 
problem in a manner that makes the solution LESS likely to be supported 
(*no one* is going to fund a grant simply to hire someone to fix bad 
records in an existing database), and MORE likely to compromise the 
accuracy and quality of analyses that are being *derived from* these 
suboptimal data sources. Are we really better off spending X number of 
dollars to create a database of 100,000 records, of which only a random 
80% are accurate, than a database of 50,000 records, 99% of which are 
accurate? It might be more cost-effective, on paper, but do we truly not 
care whether the data users are being fed erroneous data? We don't seem 
to have our collective priorities straight - we shouldn't be discussing 
proverbial pounds of cure, while ignoring the need for ounces of prevention.
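To make that trade-off concrete, here is a back-of-the-envelope sketch using the same hypothetical figures as above (the numbers are illustrative, not measurements from any real database):

```python
# Back-of-the-envelope comparison of the two hypothetical databases
# described above: raw volume vs. accuracy. All figures are illustrative.

def summarize(total_records, accuracy):
    """Return (accurate, erroneous) record counts for a database."""
    accurate = int(total_records * accuracy)
    return accurate, total_records - accurate

big_sloppy = summarize(100_000, 0.80)
small_clean = summarize(50_000, 0.99)

print(f"large/80%: {big_sloppy[0]} accurate, {big_sloppy[1]} erroneous")
print(f"small/99%: {small_clean[0]} accurate, {small_clean[1]} erroneous")
```

The larger database does deliver more accurate records in absolute terms (80,000 vs. 49,500), but it also delivers forty times as many erroneous ones (20,000 vs. 500) - and since the errors are not flagged, any downstream analysis inherits all of them.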

*PS - I don't entirely agree that expertise in data capture and 
processing (e.g., georeferencing) is, or should be, mutually exclusive with 
taxonomic expertise; I could list example after example of errors in 
collections where (1) accurate georeferencing pointed out mistakes in 
taxonomy, or (2) accurate taxonomy pointed out mistakes in 
georeferencing - examples where the mistakes could ONLY have been caught 
by a person who was competent in both aspects (i.e., to know, 
immediately, that something was suspicious, when it would be overlooked 
by a person familiar only with one or the other). Frankly, we're having 
a tough enough time keeping qualified taxonomists employed these days, 
so people should not be afraid to budget to hire appropriately-qualified 
taxonomists when they write grants to database their collections. It 
takes less effort to train a taxonomist to manage data than it does to 
train a data manager to do taxonomy, and it is WELL worth it, even if it means 
that fewer total specimens are databased per unit time. And, let's be 
honest, capturing data on specimens whose IDs have not been confirmed by 
a taxonomic specialist is just asking for trouble.
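The kind of dual-competence check described in the PS can be sketched mechanically: flag any record whose coordinates fall outside the plausible range of its determined taxon. The flag alone cannot say whether the georeference or the identification is wrong - that judgment still requires a person competent in both. This is a minimal sketch; the taxon name, range box, and records are entirely hypothetical:

```python
# Minimal sketch of the cross-check described above: a record whose
# coordinates fall outside the known range of its determined taxon is
# suspicious, and only a reviewer competent in BOTH taxonomy and
# georeferencing can tell which side is wrong. The taxon name, range,
# and records below are hypothetical placeholders.

# Known plausible range per taxon: (min_lat, max_lat, min_lon, max_lon)
KNOWN_RANGES = {
    "Bombus examplus": (25.0, 50.0, -125.0, -65.0),  # hypothetical
}

def flag_suspect(record):
    """Return True if the record's coordinates lie outside the known
    range for its determined taxon, i.e. it needs human review."""
    rng = KNOWN_RANGES.get(record["taxon"])
    if rng is None:
        return True  # no range on file: cannot validate, so flag it
    min_lat, max_lat, min_lon, max_lon = rng
    return not (min_lat <= record["lat"] <= max_lat
                and min_lon <= record["lon"] <= max_lon)

rec_ok = {"taxon": "Bombus examplus", "lat": 33.97, "lon": -117.33}
rec_bad = {"taxon": "Bombus examplus", "lat": -33.97, "lon": -117.33}
print(flag_suspect(rec_ok), flag_suspect(rec_bad))  # False True
```

A check like this only surfaces candidates for review; resolving each flag - fixing the coordinates, or correcting the determination - is exactly the work that calls for the appropriately-qualified staff argued for above.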


Doug Yanega      Dept. of Entomology       Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314     skype: dyanega
phone: (951) 827-4315 (disclaimer: opinions are mine, not UCR's)
   "There are some enterprises in which a careful disorderliness
         is the true method" - Herman Melville, Moby Dick, Chap. 82
