[Taxacom] Data quality in aggregated datasets

Robert Mesibov mesibov at southcom.com.au
Wed Apr 24 00:58:00 CDT 2013


And like the worthy Rich Pyle, I'm inevitably going to comment on Lee's comment.

(1) If we're all aggregators, then my mother was a retail grocery stock manager, because just like the manager, she made up lists of groceries (shopping lists).

Aggregators are different because they're data *publishers*. GBIF and ALA fit within the information chain after collector and museum, and before end-user. Unlike (generally) the collector and the museum (and often the knowledgeable end-user as well), the aggregators do not do their best to ensure that the information they're passing on is correct. They issue disclaimers instead.

The aggregators seem to believe they don't have to do any checking, because if they 'expose' the data on the Web, some end-user out there will find any data problems and report them to the aggregator, who will then pass the report on to the museum or other data provider. If there's no feedback, there are no errors, right? It's all 'fit for use', QED.

As an error-detecting and -fixing protocol, this stinks. It's chancy, slow and piecemeal. Even the clue-y specialist isn't going to go that route. She'll fix the data herself and use/publish the results, or (in my case) talk directly with the data provider.

(3) The only thing 'manual' about most of the checking I did was typing out some simple commands in a terminal. My PC did the rest. Maybe ALA programmers should take a refresher course in sort, uniq, comm and AWK? They can all be incorporated into scripts.
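To give a flavour of what such a script might look like, here is a minimal sketch. It is not the set of checks from the paper: the file names, the four-column tab-separated layout (id, name, latitude, longitude), the sample records and the "southern hemisphere only" rule are all assumptions made up for illustration.

```shell
#!/bin/sh
# Sketch only: all file names, columns and sample records below are hypothetical.
LC_ALL=C; export LC_ALL          # one collation order for both sort and comm

workdir=$(mktemp -d)

# Made-up records: id, name, latitude, longitude (tab-separated).
printf '1\tAus bus\t-41.1\t146.1\n'  >  "$workdir/records.tsv"
printf '2\tAus bus\t-41.1\t146.1\n'  >> "$workdir/records.tsv"
printf '3\tAus  bus\t-41.2\t146.2\n' >> "$workdir/records.tsv"
printf '4\tCus dus\t41.3\t146.3\n'   >> "$workdir/records.tsv"

# Made-up list of accepted names, sorted so that comm can use it.
printf 'Aus bus\nCus dus\n' | sort > "$workdir/accepted.txt"

# Check 1: tally every distinct name string. Spelling and spacing
# variants show up as separate lines with their own counts.
names_tally=$(cut -f2 "$workdir/records.tsv" | sort | uniq -c)

# Check 2: names used in the records that are not in the accepted list.
cut -f2 "$workdir/records.tsv" | sort -u > "$workdir/names.txt"
unmatched=$(comm -23 "$workdir/names.txt" "$workdir/accepted.txt")

# Check 3: ids of records whose latitude is not negative
# (assumes all records should lie in the southern hemisphere).
bad_lat=$(awk -F'\t' '$3 >= 0 {print $1}' "$workdir/records.tsv")

# Check 4: records that are identical once the id column is ignored.
dupes=$(cut -f2- "$workdir/records.tsv" | sort | uniq -d)

echo "Name tally:";            echo "$names_tally"
echo "Names not accepted:";    echo "$unmatched"
echo "Suspect latitude, ids:"; echo "$bad_lat"
echo "Duplicated records:";    echo "$dupes"

rm -rf "$workdir"
```

Each check is one short pipeline, so adding a new one costs a line or two, and the whole script can be rerun whenever a provider sends an updated file.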

(6) I don't think Lee was persuaded by my suggestion in an earlier post that aggregators would enhance their prestige, and make their data publishing more responsible, by doing data checks and collaborating with providers to fix problems.

Another way to arrive at the point where the data are a lot more reliable would be for aggregators to *refuse to publish* records that have not been checked according to agreed protocols. If the provider can't do the checking, the aggregator would do it on the provider's behalf. That would sort out the worst of the problems *before* the aggregators published the data.

Everyone would benefit from that kind of filtering. As it stands, the aggregators have opted for quantity over quality. The more data they publish, the more useful they are in their own regard. My paper suggests that there could be a different view away from the mirror.
-- 
Dr Robert Mesibov
Honorary Research Associate
Queen Victoria Museum and Art Gallery, and
School of Agricultural Science, University of Tasmania
Home contact: PO Box 101, Penguin, Tasmania, Australia 7316
Ph: (03) 64371195; 61 3 64371195



