[Taxacom] new GBIF dataportal
WEITZMAN at si.edu
Mon Aug 6 19:41:08 CDT 2007
While I totally agree with your points on data quality (here and in your excellent white paper on the topic), and agree that this would be a wonderful (and VERY important) thing to do ... this is yet another thing that institutions have to weigh in our quest for the perfect balance of staff time. So, to go back to my earlier email in this thread (or the one that I renamed to better reflect the discussion--data quality vs. accessibility vs. quantity), we now have a quandary about:
Data Quality vs. Metadata about Data Quality vs. Data Accessibility vs. Data Quantity.
So, we have people requesting that we do better on all fronts. Given our extremely limited resources, how do we balance those things? In a perfect world, no institution would have to, but I dare say, few live in that world. In the case of NMNH, with the number of collections that we have and the decreasing resources (in our case thanks to US Government priorities...which I will not comment on) what do you suggest?
If NMNH posted a message saying that 70% (probably more) of NMNH data records have not been quality checked in any way in the last 10 years--how would that help on the GBIF site? We have a number of really excellent records that have been vetted multiple times by specialists, but we have others that have never been looked at since they were entered--perhaps 30 years ago. How would a user know how to weed out the good from the bad?
Yes, we now have the fields for each record in the database that allow a user to record what has been verified and/or updated, but it takes time to fill in those fields. We ask people to do so, but.... We certainly don't have the staff to go through and update the records that have not been updated.
We have a list of caveats about our data in general (in our data access policy) that should be taken into account when using them, but how many users will do so?
If there is one thing that I have said repeatedly since I started working on the NMNH databases: In a perfect world, we should throw away nearly all of our records and start again with better data standards. However, while the costs of hardware (and even software--to some extent) have decreased, the cost of data input has increased and continues to do so. Until we can find a way to solve the problem that everything is dependent on the per hour costs for a human, we have serious problems.
Sorry to sound so negative. We (as a community) are going in the correct direction with GBIF, accessibility and better standards, I know that. But we also have to be aware of the constraints on institutions.
Now I'll go back to data standards for the perfect world and optimism about our (collective) future (and be glad that NMNH databases on a grand scale are no longer my responsibility)!
Anna L. Weitzman, PhD
Botanical and Biodiversity Informatics Research
National Museum of Natural History
weitzman at si.edu
From: taxacom-bounces at mailman.nhm.ku.edu on behalf of Arthur Chapman
Sent: Mon 06-Aug-07 8:04 PM
Cc: taxacom at mailman.nhm.ku.edu
Subject: Re: [Taxacom] new GBIF dataportal
I think one of the key points that has been missed here is the need for
documentation of data quality.
Perhaps the key issue is the recording of uncertainty - especially with
the georeferencing attributes, but also other attributes where possible.
This should be done at the record level and should be additional to
metadata at the collection level.
Documentation (metadata) should be an essential element for all data
providers (to GBIF or anywhere else). Such documentation should include
such issues as:
* Checks that have been carried out on quality
* Validation tests that have been conducted on the data
* Ideally (I know, a pipe-dream) information such that:
  o 20% of records have had their taxonomy verified in the past
  o 50% of records have had their taxonomy verified in the past
  o 10% of records have unverified taxonomy.
* Also ideally:
  o 20% of records are accurate to within 2 km
  o 50% of records are accurate to within 5 km
  o 90% of records are accurate to within 10 km, etc.
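The collection-level summaries in the list above fall out naturally once quality is recorded per record. A minimal sketch in Python (the field names here are hypothetical illustrations, not a GBIF or Darwin Core schema): each record carries its own taxonomic-verification flag and georeference uncertainty, and the percentage statements are a simple aggregation over them.

```python
# Hypothetical record-level quality fields; real systems would use a
# standard vocabulary (e.g. Darwin Core's coordinateUncertaintyInMeters).
records = [
    {"id": 1, "taxonomy_verified": True,  "coord_uncertainty_km": 1.5},
    {"id": 2, "taxonomy_verified": True,  "coord_uncertainty_km": 4.0},
    {"id": 3, "taxonomy_verified": False, "coord_uncertainty_km": 8.0},
    {"id": 4, "taxonomy_verified": False, "coord_uncertainty_km": None},  # ungeoreferenced
]

def pct(count, total):
    """Percentage of records, rounded to the nearest whole number."""
    return round(100 * count / total)

total = len(records)
verified = sum(r["taxonomy_verified"] for r in records)
within_5km = sum(
    1 for r in records
    if r["coord_uncertainty_km"] is not None and r["coord_uncertainty_km"] <= 5
)

# Collection-level metadata derived from the record-level attributes
summary = {
    "taxonomy_verified_pct": pct(verified, total),
    "accurate_within_5km_pct": pct(within_5km, total),
}
print(summary)  # {'taxonomy_verified_pct': 50, 'accurate_within_5km_pct': 50}
```

The point of the sketch is only that record-level documentation makes the collection-level statements cheap to produce and keep current.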
A large percentage of the data made available via GBIF will contain
errors; what is important for users is that the quality be documented,
so that users can make informed decisions on the value of those data for
their particular use. Secondly, a good feedback mechanism needs to be
available for users to be able to feed back information on errors or
suspected errors to the providers.
I refer readers who have not already done so to the documents on
data quality on the GBIF Web site at
http://www.gbif.org/prog/digit/data_quality - including the BioGeomancer
Guide to Georeferencing.
There are many automated and semi-automated techniques that institutions
can use to improve the quality of their data. Training documents
for some of these can also be obtained from the GBIF site; the
development of a self-training CD to cover many of these topics is also
currently under discussion.
Richard Pyle wrote:
> David has captured most of my thoughts on this issue -- which I've spent a
> lot of time thinking about lately. The basic issue is a trade-off between
> data accuracy and data accessibility. These two things tend to work
> against each other: Do I expose all my data now, so that people can access
> the parts they need? Or do I wait until I've verified all the content? If
> it's too messy when released, it quickly gets branded as garbage and
> useless, and it's very difficult to overcome such a branding once applied,
> even when the data are later cleaned up. If you wait to make sure it's all
> clean before releasing, then it can remain inaccessible for years.
> One of the most promising solutions to the problem is what David outlined
> below, which is to provide robust feedback tools so that consumers of the
> data can very easily report suspected errors back to the providers, and the
> providers can easily make corrections. An extension of this approach is to
> develop login procedures to allow consumers to get in and correct the data
> themselves (rather than just report the error and wait for over-worked data
> managers to get around to it eventually). With a reliable and comprehensive
> logging/auditing system, I think this approach has great promise.
> As we continue to develop a new prototype implementation of ZooBank, my
> vision is to provide the best of both worlds. The idea would be to expose
> all the available data, and clearly distinguish the "verified" from the
> "unverified" in an unambiguous, dichotomous way. Obviously, careful thought
> would need to be put into the criteria that constitute "verified", but I
> believe this can be solved to the satisfaction of most. But the key is to
> provide consumers with an easy mechanism for contributing content and
> corrections in a way that helps move data records from the "unverified" bin
> to the "verified" bin.
>> -----Original Message-----
>> From: taxacom-bounces at mailman.nhm.ku.edu
>> [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of
>> Shorthouse, David
>> Sent: Saturday, August 04, 2007 10:56 AM
>> To: taxacom at mailman.nhm.ku.edu
>> Subject: Re: [Taxacom] new GBIF dataportal
>>> But a more serious question:
>>> GBIF is heading towards full operation. The new data portal
>>> is a wonderful piece of programming work which gives us a vision of what we
>>> can expect.
>>> If only that massive lot of errors and wrong interpretations could be
>>> Wouldn't it be better if someone looked at the data before they are put
>>> in the public domain?
>> I feel your dismay & I suspect many share the same opinion.
>> But, these and other similar errors also appear in
>> peer-reviewed publications. The only difference now is the
>> errors are not closed-access & all the dirty laundry is being
>> aired. Having more individuals like yourself critically
>> examine the data necessarily leads to better data &
>> consequently, better use of the data. This means you and
>> others like you have to be actively involved in the process &
>> assume some of the responsibility. GBIF has options available
>> for users to directly contact the provider who may not
>> realize they are serving erroneous data. To provide feedback
>> to the University of Navarra's Museum of Zoology, see one
>> such erroneous record here:
>> http://data.gbif.org/occurrences/21924/. They also have
>> publicly available event logs on providers such that anyone &
>> everyone can see nomenclatural, geocoding, etc. issues. Logs
>> for MZNA are accessible at:
>> That being said, GBIF could improve its portal by:
>> 1. Facilitating communal vetting of data & making it abundantly
>> obvious that it is the providers & not GBIF itself who are
>> responsible for some of the garbage that slips through.
>> 2. Making these event logs for providers very apparent (email reports
>> perhaps?) & making providers accountable. These event logs
>> really should cause providers to sit up & take notice because
>> the dirty laundry is now being pointed out.
>> 3. Adding flashy disclaimers to catch one's eye such that the end
>> user takes some responsibility prior to using acquired data
>> in occurrence algorithms, etc.
>> My two cents,
>> David P. Shorthouse
>> Department of Biological Sciences
>> CW-403, Biological Sciences Centre
>> University of Alberta
>> Edmonton, AB T6G 2E9
>> mailto:dps1 at ualberta.ca
>> http://canadianarachnology.webhop.net
>> http://arachnidforum.webhop.net
>> http://www.spiderwebwatch.org
>> Taxacom mailing list
>> Taxacom at mailman.nhm.ku.edu