[Taxacom] Occurrence data...
deepreef at bishopmuseum.org
Fri Feb 18 14:20:27 CST 2011
The problem you describe is not caused by GBIF. Rather, GBIF is simply
holding a mirror up to the content holders. GBIF is the messenger in this
The problem you describe is also not unique to digitized data. I've spent
the past 25 years dealing with Museum data; for the past 20 of those years
this has spanned seven major collections. Long before there were computers,
collectors hastily wrote information on slips of paper and in little field
notebooks. Sometimes the information on the slips of paper and the field
notes are not in agreement with each other. Collections managers then
process and curate the specimens, sometimes generate new labels in a
standard format (sometimes hand-written, sometimes typed), and also
sometimes transcribe the information into catalog books (by hand). And, of
course, sometimes the information on the labels and in the catalog books are
not in agreement with each other. Other researchers later examine the
specimens and add new identifications, and other new information - mostly
using paper. Eventually, someone publishes the information, and it appears
in yet another form, which sometimes doesn't match exactly the information
captured in other forms. The point is, inconsistent, non-"original" data has
been the norm for Museum collections and taxonomic practice for centuries.
The difference with digital representations of these data (in part thanks to
GBIF), is that those inconsistencies are now much more obvious and visible.
Sure, the digitization process has its own set of inconsistencies and errors.
But in my experience, no more so than any other link in the
chain of information flow from the brain of the collector/identifier through
the full specimen curation process.
I definitely agree with you that the current error rate in information in
data accessible through GBIF, EOL, etc. should be cause for alarm, and make
us re-consider some procedures. But the vast majority of those procedures
in need of re-consideration have absolutely nothing to do with GBIF or EOL
or other aggregators, and not so much with the entire foundation of "digital
data". The procedures we need to re-consider reach all the way back through
the curation process, starting with the collector in the field.
Addressing your point about "original data" more directly, as far as I'm
concerned the "original" data is in the collector's field notes. The Museum
label is usually only one step removed, and is often not hand-written. To
me, if you want original data, what you want is the scanned page images of
the field notebooks. In many modern databases, there are fields for
"original data" or "verbatim XXXX" (where "XXXX" is some piece of
information) so that the non-atomized information is captured (at least to
the extent that hand-writing can be transcribed to machine form, as has been
done for decades with typewriters), and the atomized is used and transformed
for purposes of consistency and indexing.
GBIF's role is to index and summarize information, and provide links back to
the source. The onus is on the source to provide the "original" data, in
whatever form is deemed appropriate.
From: Wolfgang Lorenz [mailto:faunaplan at googlemail.com]
Sent: Friday, February 18, 2011 8:39 AM
To: Richard Pyle
Cc: taxacom at mailman.nhm.ku.edu
Subject: Re: [Taxacom] Occurrence data...
the problem, in my point of view, and maybe that's what Bob was saying, is
the focus on making data fit for the machine while the users' need to look
behind the data output is more or less neglected. My impression is that the
current TDWG recommendations for data standards are weak in representing the
authentic original information in the databases. More or less, we must trust
in what databasers (often not those who have created the information) are
digitizing. For example, in most insect occurrence data accessible through
GBIF and marked as "specimen"-based, I cannot make out what's on the
original specimen labels. Both the geographic information and the original
taxon identification are already atomized and transformed into
machine-digestible text strings in the first steps of digitization.
The current, really incredible error load in the data accessible through
GBIF, EOL, etc. should be alarming enough to make us re-consider some
As for occurrence data existing in literature, doesn't BHL already offer a
Just link to a journal page, e.g.
There you pick up the chresonym "Tachys bisulcatus (Nicolai)" and the
verbatim locality name "Marquartstein". Then, in separate data fields,
georeference the place as 47.755/12.464, and get 'Porotachys bisulcatus' as
a better name for the species if you prefer a current classification. With
these data elements and the link to BHL we have a vettable occurrence
record. What else do we need?
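
As a rough sketch only, the record described above could be held in a structure like the following. The class name, field names and the `is_vettable` check are hypothetical illustrations and not part of any TDWG or GBIF standard, and the source link is a placeholder because the journal-page URL is not reproduced in this message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LiteratureOccurrence:
    """Hypothetical record layout: verbatim data and interpretations kept apart."""

    # Verbatim elements, copied exactly as printed on the journal page.
    verbatim_name: str
    verbatim_locality: str
    source_link: str  # placeholder for the link to the BHL page

    # Interpretations, added later and stored in separate fields.
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    current_name: Optional[str] = None

    def is_vettable(self) -> bool:
        # A record can be checked against its source only if the verbatim
        # elements and the link to the page are all present.
        return bool(self.verbatim_name and self.verbatim_locality and self.source_link)


record = LiteratureOccurrence(
    verbatim_name="Tachys bisulcatus (Nicolai)",
    verbatim_locality="Marquartstein",
    source_link="<link to BHL journal page>",  # placeholder, not a real URL
    latitude=47.755,
    longitude=12.464,
    current_name="Porotachys bisulcatus",
)
```

The interpreted fields are optional on purpose, so a record can be entered from the page first and georeferenced or re-identified afterwards.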
For specimen based records, we can do this: With a simple set of controlled
vocabulary (e.g., "\" for the beginning of new text lines) we can enter
verbatim text taken from locality and ID-labels in a first step (rapid data
entry) and add, in separate fields, interpretations like lat/lon &
standardized name strings.
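
A minimal sketch of that two-step entry, assuming Python and hypothetical names throughout (`SpecimenRecord`, `split_label` and the field names are illustrations, not an existing schema). The label text in the example is invented to show the "\" convention and is not taken from a real specimen.

```python
from dataclasses import dataclass
from typing import List, Optional

LINE_BREAK = "\\"  # the "\" marker for the beginning of a new label line


def split_label(verbatim: str) -> List[str]:
    """Recover the individual label lines from a verbatim transcription."""
    return [part.strip() for part in verbatim.split(LINE_BREAK) if part.strip()]


@dataclass
class SpecimenRecord:
    # Step 1 (rapid data entry): verbatim label text, untouched.
    verbatim_locality_label: str
    verbatim_id_label: str
    # Step 2 (interpretation): added later, never overwriting step 1.
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    standardized_name: Optional[str] = None


# Invented label text, for illustration only.
rec = SpecimenRecord(
    verbatim_locality_label="Bavaria \\ Marquartstein \\ leg. Example",
    verbatim_id_label="Tachys bisulcatus \\ det. Example",
)
locality_lines = split_label(rec.verbatim_locality_label)

# Interpretation happens afterwards, in separate fields.
rec.latitude, rec.longitude = 47.755, 12.464
rec.standardized_name = "Porotachys bisulcatus"
```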
Always let the user know what is original and what is interpretation in the
process of digitization.
Wolfgang Lorenz, Tutzing, Germany
2011/2/18 Richard Pyle <deepreef at bishopmuseum.org>
But don't you have to know that an original source exists? If you know
exactly what you're looking for, and you know exactly where to go to get it,
then there's no problem. The role of aggregators is to provide a single
portal that INDEXES all the original sources, and provides tools to FIND the
stuff you're interested in, and then provide links BACK to the original
source for more detailed information. They also can do some value-added
stuff like show aggregated points on a map. Sure, a lot of that is bogus
data; but it's more likely to be obviously so when placed in the context of
good data. If a collection has two specimens of "Aus bus"; one from
California, and one from Florida, then a non-specialist will assume that the
species has a broad distribution. If those two data points are put on a map
alongside 1000 other datapoints, 990 of which are clustered in California,
and the remaining ones are scattered hither and yon, you might be inclined
to re-examine the scattered ones to see if they may be mis-labeled,
mis-digitized, or mis-identified.
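
The map example above can be approximated in a few lines. This is only a sketch under stated assumptions: it is not how GBIF or any aggregator actually flags records, the function names and the 1000 km threshold are arbitrary illustrations, and the coordinates are synthetic. It takes the median latitude and longitude as the centre of the main cluster, which would not work for points straddling the 180-degree meridian.

```python
import math
from statistics import median
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def flag_scattered(points: List[Point], max_km: float = 1000.0) -> List[Point]:
    """Return points farther than max_km from the median centre of all points."""
    center = (median(p[0] for p in points), median(p[1] for p in points))
    return [p for p in points if haversine_km(p, center) > max_km]


# Synthetic data: 990 points clustered in California, 10 in Florida.
california = [(36.5 + 0.001 * i, -119.5) for i in range(990)]
florida = [(27.8, -81.7)] * 10
flagged = flag_scattered(california + florida)
```

The flagged points are candidates for re-examination, not confirmed errors; a scattered record may turn out to be perfectly good.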
So, while you look at GBIF and see a godawful mess, I look at GBIF and see
it doing what an aggregator does best -- providing a single portal to
simultaneously access large amounts of data from distributed sources, with
useful services to see these data in aggregate form (which includes drawing
attention to erroneous data in the source databases).
Having access to source data, *and* accessing aggregators that provide
relevant services, are not mutually exclusive things. And, furthermore, as
you said very well in your later post: "on the Web, those don't have to be
the only 2 choices".
> -----Original Message-----
> From: taxacom-bounces at mailman.nhm.ku.edu [mailto:taxacom-
> bounces at mailman.nhm.ku.edu] On Behalf Of Bob Mesibov
> Sent: Thursday, February 17, 2011 8:28 PM
> To: TAXACOM
> Subject: Re: [Taxacom] Occurrence data...
> Hi, Ken.
> I think you're missing my point. I can think of a lot of uses for data, but I want
> to get those data directly from their sources, not from the godawful mess
> that GBIF has created. That's me as user. Now who is going to want to go the
> GBIF route, and why?
> Dr Robert Mesibov
> Honorary Research Associate
> Queen Victoria Museum and Art Gallery, and School of Zoology, University of
> Tasmania Home contact: PO Box 101, Penguin, Tasmania, Australia 7316
> Ph: (03) 64371195; 61 3 64371195
> Webpage: http://www.qvmag.tas.gov.au/?articleID=570
> Taxacom Mailing List
> Taxacom at mailman.nhm.ku.edu
> The Taxacom archive going back to 1992 may be searched with either of
> these methods:
> (1) http://taxacom.markmail.org
> Or (2) a Google search specified as:
> site:mailman.nhm.ku.edu/pipermail/taxacom your search terms here