Chris Thompson
Fri Feb 18 14:52:39 CST 2011

Rich is exactly right.

The critical aspect is the "collection EVENT." The information about the 
collector, the time, and what basic information was reported, that is what 
is needed to get the best geographic coordinates.

For my students, the example I give is the information that Fabricius (1775) 
provided for new species of insects described from AMERICA.

For example, for Syrphus obesus, Fabricius (1775: 763) wrote only "Habitat 
in America." Later he wrote (Fabricius 1805: 227) "Americae Insulis" for 
these species and in some cases provided the collector's name, von Rohr.

The critical factor here is knowing the source of the material that 
Fabricius got from the "American Islands," which back in 1775 were owned by 
Denmark, and that the Chief Engineer (von Rohr) for the fortification of 
St. Croix collected and sent insects back to Copenhagen, where Fabricius 
studied them. So, while the specimens today remain in Copenhagen and have no 
labels, etc., we know that they were collected by von Rohr and the 
type-locality of these species should be geo-coded as the now American 
Virgin Islands, and probably St. Croix.

So, one should preserve and report all original label data. AND also dig up 
and report information as available about the collecting event, the 
collector, etc. AND then take all the information to properly geo-code the 

AND what the community really needs now is an online collection event based 
gazetteer. So, one can search for "Americae Insulis" and "von Rohr" and get 
the American Virgin Islands.
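
A minimal sketch of what such a lookup could look like, in Python. Nothing
like this exists in the thread: the table, the function names, and the
normalization rule are all hypothetical, and the single entry only restates
the Fabricius / von Rohr example above.

```python
from typing import Optional

# Hypothetical gazetteer keyed on (verbatim locality, collector).
# The one entry restates the example given in the message above.
GAZETTEER = {
    ("americae insulis", "von rohr"): "American Virgin Islands (probably St. Croix)",
}


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so minor typing variants still match."""
    return " ".join(text.lower().split())


def lookup(verbatim_locality: str, collector: str) -> Optional[str]:
    """Return the interpreted place for a collecting event, or None if unknown."""
    key = (normalize(verbatim_locality), normalize(collector))
    return GAZETTEER.get(key)


print(lookup("Americae Insulis", "von Rohr"))
```

Keying on the locality string and the collector together is the point of the
sketch: "Americae Insulis" alone is ambiguous, but the pair identifies one
collecting event.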



Hi Wolfgang,

The problem you describe is not caused by GBIF.  Rather, GBIF is simply
holding a mirror up to the content holders.  GBIF is the messenger in this

The problem you describe is also not unique to digitized data.  I've spent
the past 25 years dealing with Museum data; for the past 20 of those years
this has spanned seven major collections.  Long before there were computers,
collectors hastily wrote information on slips of paper and in little field
notebooks.  Sometimes the information on the slips of paper and the field
notes are not in agreement with each other.  Collections managers then
process and curate the specimens, sometimes generate new labels in a
standard format (sometimes hand-written, sometimes typed), and also
sometimes transcribe the information into catalog books (by hand). And, of
course, sometimes the information on the labels and in the catalog books are
not in agreement with each other.  Other researchers later examine the
specimens and add new identifications, and other new information - mostly
using paper.  Eventually, someone published the information, and it appears
in yet another form, which sometimes doesn't match exactly the information
captured in other forms. The point is, inconsistent, non-"original" data has
been the norm for Museum collections and taxonomic practice for centuries.
The difference with digital representations of these data (in part thanks to
GBIF) is that those inconsistencies are now much more obvious and visible.
Sure, the digitization process has its own set of inconsistencies and errors.
But in my experience, no more so than any other link in the
chain of information flow from the brain of the collector/identifier through
the full specimen curation process.

I definitely agree with you that the current error rate in information in
data accessible through GBIF, EOL, etc. should be cause for alarm, and make
us re-consider some procedures.  But the vast majority of those procedures
in need of re-consideration have absolutely nothing to do with GBIF or EOL
or other aggregators, and not so much with the entire foundation of "digital
data".  The procedures we need to re-consider reach all the way back through
the curation process, starting with the collector in the field.

Addressing your point about "original data" more directly, as far as I'm
concerned the "original" data is in the collector's field notes.  The Museum
label is usually only one step removed, and is often not hand-written.  To
me, if you want original data, what you want is the scanned page images of
the field notebooks.  In many modern databases, there are fields for
"original data" or "verbatim XXXX" (where "XXXX" is some piece of
information) so that the non-atomized information is captured (at least to
the extent that hand-writing can be transcribed to machine form, as has been
done for decades with typewriters), and the atomized information is used and transformed
for purposes of consistency and indexing.

GBIF's role is to index and summarize information, and provide links back to
the source.  The onus is on the source to provide the "original" data, in
whatever form is deemed appropriate.



The problem, from my point of view, and maybe that's what Bob was saying, is
the focus on making data fit for the machine while the users' need to look
behind the data output is more or less neglected. My impression is that the
current TDWG recommendations for data standards are weak in representing the
authentic original information in the databases. More or less, we must trust
in what databasers (often not those who have created the information) are
digitizing. For example, in most insect occurrence data accessible through
GBIF and marked as "specimen"-based, I cannot make out what's on the
original specimen labels. Both the geographic information and the original
taxon identification are already atomized and transformed into
machine-digestible text strings in the first steps of digitization.

The current, really incredible error load in the data accessible through
GBIF, EOL, etc. should be alarming enough to make us re-consider some
procedures, IMHO.

As for occurrence data existing in literature, doesn't BHL already offer a
better alternative?
Just link to a journal page, e.g.
There you pick up the chresonym "Tachys bisulcatus (Nicolai)" and the
verbatim locality name "Marquartstein". Then, in separate data fields,
georeference the place as 47.755/12.464, and get 'Porotachys bisulcatus' as
a better name for the species if you prefer a current classification. With
these data elements and the link to BHL we have a vettable occurrence
record. What else do we need?
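
As a rough sketch, such a record could be held like this in Python. The
class and field names are hypothetical (they are not taken from any TDWG
standard), and the page link is left unset because none is given above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LiteratureOccurrence:
    # Verbatim fields: copied from the printed page and never edited.
    verbatim_name: str
    verbatim_locality: str
    source_page: Optional[str] = None  # link to the journal page, if known
    # Interpreted fields: added later, kept apart from the verbatim text.
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    current_name: Optional[str] = None

    def verbatim(self) -> dict:
        """Only what the source itself says."""
        return {
            "name": self.verbatim_name,
            "locality": self.verbatim_locality,
            "source_page": self.source_page,
        }

    def interpreted(self) -> dict:
        """Only what a later worker has added."""
        return {
            "latitude": self.latitude,
            "longitude": self.longitude,
            "current_name": self.current_name,
        }


record = LiteratureOccurrence(
    verbatim_name="Tachys bisulcatus (Nicolai)",
    verbatim_locality="Marquartstein",
    latitude=47.755,
    longitude=12.464,
    current_name="Porotachys bisulcatus",
)
print(record.verbatim())
print(record.interpreted())
```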

For specimen based records, we can do this: With a simple set of controlled
vocabulary (e.g., "\" for the beginning of new text lines) we can enter
verbatim text taken from locality and ID-labels in a first step (rapid data
entry) and add, in separate fields, interpretations like lat/lon &
standardized name strings.
Always let the user know what is original and what is interpretation in the
process of digitization.
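
A small sketch of the "\" convention for label lines, in Python. The label
text and the function names are made up for illustration; only the use of
"\" as the line marker comes from the paragraph above.

```python
from typing import List

# Hypothetical label text, typed in one pass with "\" marking each new line.
LABEL = r"Bavaria\Marquartstein\leg. Example"


def split_label(verbatim: str) -> List[str]:
    """Recover the physical label lines from the rapid-entry string."""
    return [line.strip() for line in verbatim.split("\\")]


def join_label(lines: List[str]) -> str:
    """Rebuild the rapid-entry string from the label lines."""
    return "\\".join(lines)


lines = split_label(LABEL)
print(lines)
```

The verbatim string stays the stored value; the split lines, like lat/lon or
a standardized name, are derived from it and can be regenerated at any time.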



But don't you have to know that an original source exists?  If you know
exactly what you're looking for, and you know exactly where to go to get it,
then there's no problem.  The role of aggregators is to provide a single
portal that INDEXES all the original sources, and provides tools to FIND the
stuff you're interested in, and then provide links BACK to the original
source for more detailed information.  They also can do some value-added
stuff like show aggregated points on a map.  Sure, a lot of that is bogus
data; but it's more likely to be obviously so when placed in the context of
good data.  If a collection has two specimens of "Aus bus", one from
California, and one from Florida, then a non-specialist will assume that the
species has a broad distribution.  If those two data points are put on a map
alongside 1000 other datapoints, 990 of which are clustered in California,
and the remaining ones are scattered hither and yon, you might be inclined
to re-examine the scattered ones to see if they may be mis-labeled,
mis-digitized, or mis-identified.
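
To make the idea concrete, here is a minimal Python sketch that flags the
scattered records for a second look. It is not how GBIF or any aggregator
works: the function name, the region labels, and the 5% cut-off are all
assumptions chosen for illustration.

```python
from collections import Counter
from typing import List, Tuple

Record = Tuple[str, str]  # (record id, region); both values are hypothetical


def flag_scattered(records: List[Record], min_share: float = 0.05) -> List[Record]:
    """Return records from regions that hold less than min_share of all records."""
    counts = Counter(region for _, region in records)
    total = len(records)
    return [rec for rec in records if counts[rec[1]] / total < min_share]


# 990 records clustered in California, 10 scattered elsewhere.
scattered_regions = [
    "Florida", "Maine", "Texas", "Ohio", "Utah",
    "Iowa", "Idaho", "Kansas", "Nevada", "Oregon",
]
records = [(f"CA-{i}", "California") for i in range(990)]
records += [(f"X-{i}", region) for i, region in enumerate(scattered_regions)]

flagged = flag_scattered(records)
print(len(flagged), "records to re-examine")
```

A flagged record is not necessarily wrong; it is only a candidate for
checking whether it was mis-labeled, mis-digitized, or mis-identified.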

So, while you look at GBIF and see a godawful mess, I look at GBIF and see
it doing what an aggregator does best -- providing a single portal to
simultaneously access large amounts of data from distributed sources, with
useful services to see these data in aggregate form (which includes drawing
attention to erroneous data in the source databases).

Having access to source data, *and* accessing aggregators that provide
relevant services, are not mutually exclusive things.  And, furthermore, as
you said very well in your later post: "on the Web, those don't have to be
the only 2 choices".


