[Taxacom] new GBIF dataportal

Arthur Chapman taxacom3 at achapman.org
Mon Aug 6 19:04:35 CDT 2007

I think one of the key points that has been missed here is the need to 
document data quality.

Perhaps the key issue is the recording of uncertainty - especially with 
the georeferencing attributes, but also other attributes where possible. 
This should be done at the record level and should be additional to 
metadata at the collection level.

Documentation (metadata) should be an essential element for all data 
providers (to GBIF or anywhere else).  Such documentation should cover 
issues such as

    * Checks that have been carried out on quality
    * Validation tests that have been conducted on the data
    * Ideally (I know a pipe-dream) information such that
          o 20% of records have had their taxonomy verified in the past
            5 years
          o 50% of records have had their taxonomy verified in the past
            10 years
          o 10% of records have unverified taxonomy.
    * Also ideally
          o 20% of records are accurate to within 2 km
          o 50% of records are accurate to within 5 km
          o 90% of records are accurate to within 10 km, etc.
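
As an illustration of how summary figures like these could be derived from 
uncertainty recorded at the record level, here is a minimal Python sketch. 
The record layout (OccurrenceRecord, uncertainty_km) and the function name 
accuracy_summary are assumptions made for this example only; they are not 
part of any GBIF schema or tool. Records with no recorded uncertainty are 
counted in the total but never as "accurate to within" any distance.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Sequence


@dataclass
class OccurrenceRecord:
    # Hypothetical record layout; field names are illustrative only.
    record_id: str
    # Georeferencing uncertainty in km, or None if it was never recorded.
    uncertainty_km: Optional[float]


def accuracy_summary(
    records: Iterable[OccurrenceRecord],
    thresholds_km: Sequence[float] = (2.0, 5.0, 10.0),
) -> Dict[float, float]:
    """Return, for each threshold, the percentage of all records whose
    recorded uncertainty is at or below that threshold."""
    records = list(records)
    total = len(records)
    if total == 0:
        return {}
    summary = {}
    for limit in thresholds_km:
        within = sum(
            1
            for r in records
            if r.uncertainty_km is not None and r.uncertainty_km <= limit
        )
        summary[limit] = 100.0 * within / total
    return summary


# Example: ten made-up records, one of which has no uncertainty recorded.
sample = [
    OccurrenceRecord(str(i), u)
    for i, u in enumerate([1.0, 1.5, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, None])
]
for limit, pct in accuracy_summary(sample).items():
    print(f"{pct:.0f}% of records are accurate to within {limit:g} km")
```

The same approach would apply to the taxonomy figures above, given a 
per-record date of last verification.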

A large percentage of the data made available via GBIF will contain 
errors. What is important for users is that the quality be documented, 
so that they can make informed decisions on the value of those data for 
their particular use.  Secondly, a good feedback mechanism needs to be 
available so that users can report errors or suspect records back to 
the providers.

I refer readers who have not already done so to the documents on 
data quality on the GBIF Web site at
http://www.gbif.org/prog/digit/data_quality - including the BioGeomancer 
Guide to Georeferencing.

There are many automated and semi-automated techniques that institutions 
can use for improving the quality of their data. Training documents 
on some of these can also be obtained from the GBIF site at 
http://www.gbif.org/prog/digit/workshop_data_quality_and_cleaning.  The 
development of a self-training CD to cover many of these topics is also 
currently under discussion.

Arthur Chapman

Richard Pyle wrote:
> David has captured most of my thoughts on this issue -- which I've spent a
> lot of time thinking about lately.  The basic issue is a trade-off between
> data accuracy, and data accessibility.  These two things tend to work
> against each other:  Do I expose all my data now, so that people can access
> the parts they need?  Or do I wait until I've verified all the content?  If
> it's too messy when released, it quickly gets branded as garbage and
> useless, and it's very difficult to overcome such a branding once applied,
> even when the data are later cleaned up.  If you wait to make sure it's all
> clean before releasing, then it can remain inaccessible for years.
> One of the most promising solutions to the problem is what David outlined
> below, which is to provide robust feedback tools so that consumers of the
> data can very easily report suspected errors back to the providers, and the
> providers can easily make corrections.  An extension of this approach is to
> develop login procedures to allow consumers to get in and correct the data
> themselves (rather than just report the error and wait for over-worked data
> managers to get around to it eventually).  With a reliable and comprehensive
> logging/auditing system, I think this approach has great promise.
> As we continue to develop a new prototype implementation of ZooBank, my
> vision is to provide the best of both worlds.  The idea would be to expose
> all the available data, and clearly distinguish the "verified" from the
> "unverified" in an unambiguous, dichotomous way.  Obviously, careful thought
> would need to be put into the criteria that constitute "verified", but I
> believe this can be solved to the satisfaction of most.  But the key is to
> provide consumers with an easy mechanism for contributing content and
> corrections in a way that helps move data records from the "unverified" bin
> to the "verified" bin.
> Aloha,
> Rich
>> -----Original Message-----
>> From: taxacom-bounces at mailman.nhm.ku.edu 
>> [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of 
>> Shorthouse, David
>> Sent: Saturday, August 04, 2007 10:56 AM
>> To: taxacom at mailman.nhm.ku.edu
>> Subject: Re: [Taxacom] new GBIF dataportal
>>> But a more serious question:
>>> GBIF is heading towards full operation. The new data portal launched
>>> July 2nd is a wonderful piece of programming work which gives us a
>>> vision of what we can expect.
>>> If only that massive lot of errors and wrong interpretations could be
>>> avoided!
>>> Wouldn't it be better if someone looked at the data before they are put
>>> in the public domain?
>> I feel your dismay & I suspect many share the same opinion. 
>> But, these and other similar errors also appear in 
>> peer-reviewed publications. The only difference now is the 
>> errors are not closed-access & all the dirty laundry is being 
>> aired. More individuals like yourself who can critically 
>> examine the data necessarily leads to better data & 
>> consequently, better use of the data. This means you and 
>> others like you have to be actively involved in the process & 
>> assume some of the responsibility. GBIF has options available 
>> for users to directly contact the provider who may not 
>> realize they are serving erroneous data. To provide feedback 
>> to the University of Navarra's Museum of Zoology, see one 
>> such erroneous record here:
>> http://data.gbif.org/occurrences/21924/. They also have 
>> publicly available event logs on providers such that anyone & 
>> everyone can see nomenclatural, geocoding, etc. issues. Logs 
>> for MZNA are accessible at:
>> http://data.gbif.org/datasets/provider/185/logs/.
>> That being said, GBIF could improve its portal by:
>> 1. Facilitating communal vetting of data & making it abundantly 
>> obvious that it is the providers, not GBIF itself, who are 
>> responsible for some of the garbage that slips through. 
>> 2. Making these event logs for providers very apparent (email reports
>> perhaps?) & making providers accountable. These event logs 
>> really should cause providers to sit up & take notice because 
>> the dirty laundry is now being pointed out.
>> 3. Flashy disclaimers to catch one's eye such that the end 
>> user takes some responsibility prior to using acquired data 
>> in occurrence algorithms, etc.
>> My two cents,
>> David P. Shorthouse
>> ------------------------------------------------------
>> Department of Biological Sciences
>> CW-403, Biological Sciences Centre
>> University of Alberta
>> Edmonton, AB   T6G 2E9
>> mailto:dps1 at ualberta.ca
>> http://canadianarachnology.webhop.net
>> http://arachnidforum.webhop.net
>> http://www.spiderwebwatch.org
>> ------------------------------------------------------
>> _______________________________________________
>> Taxacom mailing list
>> Taxacom at mailman.nhm.ku.edu
>> http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
