[Taxacom] data quality vs. accessibility vs. quantity

Weitzman, Anna WEITZMAN at si.edu
Sat Aug 4 18:49:40 CDT 2007

Dear Wolfgang et al.
I also agree with the points made by both Rich & David, and I want to add some additional perspectives to the discussion:
First, let us applaud GBIF for getting such well designed and functioning tools in place so quickly that can handle so much data.  And perhaps more importantly, let us applaud GBIF for creating an atmosphere in which so many institutions (finally) put a priority on making their data publicly accessible.  The combination of these tools and so much data means that we can see both the strengths and the shortcomings of the data.  I know that GBIF (collectively) and each of the providers have thought about data quality and feedback mechanisms, and I know, especially now that the second version of the portal is in place, that comments like these about next steps are welcome and will be considered seriously.
Second, I want to second Rich's points about the yin/yang of data quality vs. data accessibility, using the ca. 83 million recent biological specimens at Smithsonian's NMNH as an example.  When I first undertook the task of leading the migration of all of NMNH's data to a new system, partially in order to make the data uniformly accessible, I got a lot of pushback about the quality of NMNH data from a number of research and collections staff.  Other staff pointed out that there is nothing like making data available to bring the much-needed critics out of the 'woodwork', and that we should consider this a help.  By the end of the years of migration (at least of the major databases--we still have an unknown number of research databases that also contain collection data to add as we can), I was delighted that strong voices from every part of the recent biology departments' staff were pushing to make the data accessible faster--largely because of the community pressure that arose from GBIF and other community projects.  This too shows how much progress GBIF has brought in a relatively short time!
As for data quality, NMNH's data were entered in digital databases over the last ca. 30 years (about 4 million recent biological records of nearly 6 million total, representing 10-15% of NMNH's total collections).  By definition, that means that the data have been entered for different purposes (because Congress mandated a 'complete' inventory; because scientists wanted data about a particular group for a monograph, flora, or fauna; because collections staff needed better control over certain collections; etc.).  They have also been entered into different systems that had different capabilities and designs, first when storage space was at a premium and now when space is relatively cheap.  As databases have matured and we, as a community, have learned more about how museum data can be used, our ideas of database design, atomization, and inclusion have changed--this, along with changing data-entry capabilities and priorities, has led to a great deal of inconsistency in the records that we do have.  Further, the data that were entered into the earliest systems have been migrated one to several times--and each migration can itself add errors.
That raises the question:  do we put our extremely limited data management staff time into trying to enter the remaining 85-90% of our collections or into the laborious task of data cleansing/verifying?  That is a very difficult question for any institution.  Our types get more attention and cleanup than the rest, for obvious reasons.  Nevertheless, I can only say that NMNH recognizes the importance of adding data records, data cleanup, and accessibility (whether those in Washington who decide our budget do or not) and we do our best to balance those issues given our resources.  
Finally, the majority of NMNH staff are grateful for whatever assistance we receive from those who spot errors and we do our best to correct them as we learn of them.  We are also grateful to those who help to educate those in Washington about the importance of museums and their data. 
Oh well.  Let us hope that we ALL find ways to do this better, and faster ...SOON!
Anna L. Weitzman, PhD
Botanical and Biodiversity Informatics Research
National Museum of Natural History
Smithsonian Institution
office: 202.633.0846
mobile: 202.415.4684
weitzman at si.edu


From: taxacom-bounces at mailman.nhm.ku.edu on behalf of Richard Pyle
Sent: Sat 04-Aug-07 5:38 PM
To: taxacom at mailman.nhm.ku.edu
Subject: Re: [Taxacom] new GBIF dataportal

David has captured most of my thoughts on this issue -- which I've spent a
lot of time thinking about lately.  The basic issue is a trade-off between
data accuracy and data accessibility.  These two things tend to work
against each other:  Do I expose all my data now, so that people can access
the parts they need?  Or do I wait until I've verified all the content?  If
it's too messy when released, it quickly gets branded as garbage and
useless, and it's very difficult to overcome such a branding once applied,
even when the data are later cleaned up.  If you wait to make sure it's all
clean before releasing, then it can remain inaccessible for years.

One of the most promising solutions to the problem is what David outlined
below, which is to provide robust feedback tools so that consumers of the
data can very easily report suspected errors back to the providers, and the
providers can easily make corrections.  An extension of this approach is to
develop login procedures to allow consumers to get in and correct the data
themselves (rather than just report the error and wait for over-worked data
managers to get around to it eventually).  With a reliable and comprehensive
logging/auditing system, I think this approach has great promise.
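A minimal sketch of such a logging/auditing system might look like the following. This is purely illustrative: the class and field names are hypothetical, and no real GBIF or provider API is assumed. The idea is simply that every consumer-submitted correction records who changed which field, with the before/after values, so an over-worked data manager can review or revert it later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEntry:
    # One logged change: who corrected which field, and the before/after values.
    user: str
    field_name: str
    old_value: str
    new_value: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class SpecimenRecord:
    """A record that consumers may correct directly, with every edit audited."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self.audit_log = []

    def correct(self, user, field_name, new_value):
        # Apply a consumer-submitted correction and append it to the audit trail.
        entry = AuditEntry(user, field_name, self.fields.get(field_name, ""), new_value)
        self.fields[field_name] = new_value
        self.audit_log.append(entry)

    def revert_last(self):
        # A curator can undo the most recent correction if it proves wrong.
        entry = self.audit_log.pop()
        self.fields[entry.field_name] = entry.old_value

# Example: a consumer fixes a misspelled genus in a (made-up) record.
rec = SpecimenRecord(scientific_name="Carabis auratus")
rec.correct("some_user", "scientific_name", "Carabus auratus")
```

Because the log is comprehensive, a bad correction is never destructive: the original value survives in the audit trail and can be restored.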

As we continue to develop a new prototype implementation of ZooBank, my
vision is to provide the best of both worlds.  The idea would be to expose
all the available data, and clearly distinguish the "verified" from the
"unverified" in an unambiguous, dichotomous way.  Obviously, careful thought
would need to be put into the criteria that constitute "verified", but I
believe this can be solved to the satisfaction of most.  But the key is to
provide consumers with an easy mechanism for contributing content and
corrections in a way that helps move data records from the "unverified" bin
to the "verified" bin.
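The dichotomous binning could be sketched as below. The criteria shown (a determiner on record, a plausible georeference) are invented placeholders, not ZooBank's actual rules; choosing the real criteria is exactly the careful-thought problem noted above.

```python
from enum import Enum

class Status(Enum):
    UNVERIFIED = "unverified"
    VERIFIED = "verified"

# Hypothetical criteria; each check takes a record dict and returns True/False.
def has_determiner(rec):
    # Someone qualified has checked the identification.
    return bool(rec.get("determined_by"))

def has_georeference(rec):
    # Coordinates are present and within valid ranges.
    lat, lon = rec.get("latitude"), rec.get("longitude")
    return (lat is not None and lon is not None
            and -90 <= lat <= 90 and -180 <= lon <= 180)

CRITERIA = [has_determiner, has_georeference]

def classify(rec):
    # Dichotomous binning: VERIFIED only when every criterion passes; no middle ground.
    return Status.VERIFIED if all(check(rec) for check in CRITERIA) else Status.UNVERIFIED
```

A consumer contribution (say, adding the missing `determined_by`) can then move a record from one bin to the other simply by making another criterion pass.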


> -----Original Message-----
> From: taxacom-bounces at mailman.nhm.ku.edu
> [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of
> Shorthouse, David
> Sent: Saturday, August 04, 2007 10:56 AM
> To: taxacom at mailman.nhm.ku.edu
> Subject: Re: [Taxacom] new GBIF dataportal
> > But a more serious question:
> > GBIF is heading towards full operation. The new data portal launched
> > July 2nd is a wonderful piece of programming work which gives us a
> > vision of what we can expect.
> > If only that massive lot of errors and wrong interpretations could be
> > avoided!
> > Wouldn't it be better if someone looked at the data before they are
> > put in the public domain?
> I feel your dismay & I suspect many share the same opinion.
> But, these and other similar errors also appear in
> peer-reviewed publications. The only difference now is the
> errors are not closed-access & all the dirty laundry is being
> aired. More individuals like yourself who can critically
> examine the data necessarily leads to better data &
> consequently, better use of the data. This means you and
> others like you have to be actively involved in the process &
> assume some of the responsibility. GBIF has options available
> for users to directly contact the provider who may not
> realize they are serving erroneous data. To provide feedback
> to the University of Navarra's Museum of Zoology, see one
> such erroneous record here:
> http://data.gbif.org/occurrences/21924/. They also have
> publicly available event logs on providers such that anyone &
> everyone can see nomenclatural, geocoding, etc. issues. Logs
> for MZNA are accessible at:
> http://data.gbif.org/datasets/provider/185/logs/.
> That being said, GBIF could improve its portal by:
> 1. Facilitating communal vetting of data & making it abundantly
> obvious that it is providers & not GBIF itself who are
> responsible for some of the garbage that slips through.
> 2. Making these event logs for providers very apparent (email reports
> perhaps?) & making providers accountable. These event logs
> really should cause providers to sit up & take notice because
> the dirty laundry is now being pointed out.
> 3. Adding flashy disclaimers to catch one's eye such that the end
> user takes some responsibility prior to using acquired data
> in occurrence algorithms, etc.
> My two cents,
> David P. Shorthouse
> ------------------------------------------------------
> Department of Biological Sciences
> CW-403, Biological Sciences Centre
> University of Alberta
> Edmonton, AB   T6G 2E9
> mailto:dps1 at ualberta.ca
> http://canadianarachnology.webhop.net
> http://arachnidforum.webhop.net
> http://www.spiderwebwatch.org
> ------------------------------------------------------
> Best wishes,
> Wolfgang Lorenz
> Faunistics & Environmental Planning
> Hoermannstr. 4
> D-82327 Tutzing
