[Taxacom] Data quality in aggregated datasets

Stephen Thorpe stephen_thorpe at yahoo.co.nz
Fri Apr 19 20:10:25 CDT 2013

I would say that quality control is the single biggest problem in the whole biodata enterprise. There is a school of thought, which I shall dub "underminism", which says that a few (noticed) errors undermine confidence in the whole data set! I'm not sure that I agree entirely, but it seems a valid point of view. The only answer to the problem that I can see is to make sure that everything is as verifiable as possible, by the user, should they choose to do so. There are limits, obviously, to what is possible here, but it seems to me that most aggregators have a way to go before they come close to those limits!

From: Doug Yanega <dyanega at ucr.edu>
To: "taxacom at mailman.nhm.ku.edu" <taxacom at mailman.nhm.ku.edu> 
Sent: Saturday, 20 April 2013 12:23 PM
Subject: Re: [Taxacom] Data quality in aggregated datasets

On 4/19/13 3:04 PM, Robert Mesibov wrote:
> In this particular case, an interested third party (me) finds problems and alerts the data provider directly. The data provider fixes the errors and in the fullness of time sends corrected records to the aggregator. (Although I found evidence that erroneous records can persist through an update.)
> What about aggregated datasets in general? What mechanisms are there for detecting and fixing errors besides (interested third party) -> (data provider) -> (aggregator)?
I'm not sure that *fixing* can ever work any other way, since the
original data source is generally where something needs to be fixed. I
don't know of any data aggregator that will ignore new input from a
provider just because that input contains an error that would overwrite
a correct record already in place (this is not the same as an aggregator
that flags or excludes suspicious records, a safeguard many already
have). That is, if I *were* able to go into an aggregator and correct an
error, then the next time data was uploaded from that provider, the
correction would be overwritten by the erroneous original.

In that respect, if aggregators made an "external comments" field that 
was linked to records, and the contents of that field were maintained 
regardless of any alterations (or lack thereof) made by the provider, it 
would be helpful, but it would still not be a true "fix" because one 
would have to read the external comments field every time one tried to 
use data, and manually make corrections whenever those comments said to 
do so (and that also presupposes that whoever made those external 
comments knew what they were doing - they could be mistaken, or they 
could even be vandals).
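The overwrite problem, and why a separately stored comments field would
survive it, can be sketched in a few lines of Python. The record IDs,
field names, and the two stores below are purely hypothetical, for
illustration only, and not any aggregator's actual design:

```python
# Sketch: an aggregator's record store vs. a separate external-comments
# store. All names here are hypothetical.

records = {}            # aggregator's copy, keyed by record ID
external_comments = {}  # side store, NOT touched by provider refreshes

def provider_refresh(upload):
    """Re-ingest a provider's dataset: each uploaded record replaces
    the aggregator's copy wholesale, erasing any in-place edits."""
    for rec_id, rec in upload.items():
        records[rec_id] = dict(rec)

# Provider uploads a record with a misspelled country name.
provider_refresh({"R1": {"country": "Columbia", "lat": 4.6}})

# A third party "fixes" it in the aggregator...
records["R1"]["country"] = "Colombia"
# ...and also leaves a persistent external comment.
external_comments["R1"] = 'country misspelled; should be "Colombia"'

# The next refresh from the (still uncorrected) provider quietly
# reinstates the error:
provider_refresh({"R1": {"country": "Columbia", "lat": 4.6}})

print(records["R1"]["country"])      # the in-place correction is gone
print(external_comments.get("R1"))   # the comment survives the refresh
```

This is also why the comments field is not a true fix: the error is back
in the record itself, and a user must read the side store to know it.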

Ultimately, I can't see any alternatives other than having the data 
provider make the corrections, so the corrections propagate downstream. 
That means that being a data provider is a far longer-term commitment 
than most institutions/individuals are generally prepared for. After 
all, if you hired a data entry technician on soft money, created a few 
thousand records and put them online, and corrections need to be made 
three years after that technician (and the soft money) is gone, you may 
not be able to accommodate.

As for detecting errors, I've seen examples of automated protocols, and 
I'm not impressed; the classes of errors they catch are a tiny fraction 
of the actual errors present, and all of them are things the data 
provider should have *easily* detected before uploading (e.g., 
misspelled country names [like "Columbia"], terrestrial records plotting 
in oceans, points in the wrong hemisphere, lat/long values that are 
impossible, etc.). <rant>Maybe I'm in a minority on this issue, but I 
consider it a dereliction of scientific responsibility when a provider 
uploads data that have not been absolutely scrubbed clean of errors, 
simply because they only budgeted for data entry, and nothing for 
human-provided quality control. It should never be necessary for an 
"interested third party" to make corrections to someone else's data set; 
if errors can be found after uploading, then they COULD have been found 
prior to uploading, e.g., if that same third party had been hired to 
check the data set. In effect, what is happening is that people are 
saving money by skimping on quality control and leaving it to 
"interested third parties" that will do it for free. I'm not claiming 
that it's a devious and deliberate plan to cheat the system (and 
goodness knows that in many cases, data entry itself is not funded), but 
third-party intervention is not the way that quality control should be 
accomplished, even if it's by accident rather than design. When funding 
agencies don't rate data quality as a primary concern, then it's not 
really surprising when all anyone budgets for is quantity.</rant>
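The classes of error mentioned above are exactly the kind a few lines of
validation code could catch before upload. A minimal sketch, in which
the country list, field names, and the single hemisphere rule are
illustrative assumptions rather than any aggregator's real checks:

```python
# Sketch of the shallow automated checks described above; the country
# list and field names are assumptions made for illustration.

KNOWN_COUNTRIES = {"Colombia", "New Zealand", "Australia", "United States"}

def check_record(rec):
    """Return a list of problems an automated validator could flag --
    all things a provider could have caught before uploading."""
    problems = []
    if rec.get("country") not in KNOWN_COUNTRIES:
        problems.append("unrecognized country: %r" % rec.get("country"))
    lat, lon = rec.get("lat"), rec.get("lon")
    if lat is None or not -90 <= lat <= 90:
        problems.append("impossible latitude: %r" % lat)
    if lon is None or not -180 <= lon <= 180:
        problems.append("impossible longitude: %r" % lon)
    # Crude hemisphere sanity check: a New Zealand record with a
    # northern latitude has probably lost a minus sign.
    if (rec.get("country") == "New Zealand"
            and isinstance(lat, (int, float)) and lat > 0):
        problems.append("wrong hemisphere for New Zealand: lat %r" % lat)
    return problems

print(check_record({"country": "Columbia", "lat": 4.6, "lon": -74.1}))
print(check_record({"country": "New Zealand", "lat": 36.85, "lon": 174.76}))
```

Checks of this sort only scratch the surface (a plausible but wrong
species determination sails straight through), which is the point: the
errors they do catch are the trivial ones.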


Doug Yanega      Dept. of Entomology      Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314    skype: dyanega
phone: (951) 827-4315 (disclaimer: opinions are mine, not UCR's)
  "There are some enterprises in which a careful disorderliness
        is the true method" - Herman Melville, Moby Dick, Chap. 82

Taxacom Mailing List
Taxacom at mailman.nhm.ku.edu

The Taxacom Archive back to 1992 may be searched with either of these methods:

(1) by visiting http://taxacom.markmail.org/

(2) a Google search specified as:  site:mailman.nhm.ku.edu/pipermail/taxacom  your search terms here

Celebrating 26 years of Taxacom in 2013.

More information about the Taxacom mailing list