[Taxacom] Data quality in aggregated datasets

Lee Belbin leebelbin at gmail.com
Tue Apr 23 22:28:43 CDT 2013


Like Rich occasionally, I can't refrain from commenting.

Bob's paper on the audit of Australian millipedes highlights the range of
data issues that arise with human observations and where there are multiple
sources of the 'same' data. Errors in data are a community responsibility.
The following are a few observations from the perspective of someone who
has contributed to the development of the Atlas of Living Australia (ALA: A
'super-aggregator' :).

1. We are all 'aggregators' of one form or another. Bob is an Australian
millipede aggregator, as are museums and herbaria. This highlights the
complexity of the 'data error' issue. In the case of the ALA, the steps a
record may follow may be collector -> domain aggregator -> museum
(aggregator) -> museum aggregator (e.g., OZCAM) -> ALA (aggregator) ->
GBIF (super-aggregator!) -> anyone. This is only one of many possible
pathways, each being subject to potential translation or interpretation
issues.

2. 'Data quality' is seen by the ALA and other 'super-aggregators' as a
high priority issue. I would however suggest that 'fitness for use' is a
more appropriate term in many circumstances. For example, a locational
inaccuracy of 20 km on a record will not invalidate its use in regional or
continental-scale studies. Whether something counts as a 'data error' may
depend on context.
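
The 20km example can be read as a rough rule of thumb: compare a record's
locational uncertainty with the spatial resolution of the intended analysis.
The sketch below is illustrative only. The function name, its parameters and
the simple "uncertainty no larger than the grid cell" threshold are
assumptions made for this example, not an ALA or GBIF rule.

```python
def fit_for_use(uncertainty_km: float, cell_size_km: float) -> bool:
    """Return True if a record's locational uncertainty is no larger than
    the grid cell size of the intended analysis.

    Hypothetical rule of thumb for illustration, not an ALA or GBIF check.
    """
    if uncertainty_km < 0 or cell_size_km <= 0:
        raise ValueError("uncertainty must be >= 0 and cell size must be > 0")
    return uncertainty_km <= cell_size_km


# The same record, 20 km out, judged against two different uses.
continental = fit_for_use(uncertainty_km=20, cell_size_km=100)  # coarse grid
local = fit_for_use(uncertainty_km=20, cell_size_km=1)          # fine grid
```

Under these assumptions the record is usable for the coarse continental grid
and not for the fine local one, although the record itself has not changed.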

3. We are all limited in what resources are available to address data
issues. This appears to be increasingly true of museums and herbaria.
'Manual checking' as Bob has done is time consuming yet necessary for a
range of issues where automated checking cannot be guaranteed to find
and/or correct issues. To be efficient, the ALA does have an extensive
suite of 'automated' checks (see
https://docs.google.com/spreadsheet/ccc?key=0AjNtzhUIIHeNdHJOYk1SYWE4dU1BMWZmb2hiTjlYQlE#gid=0).
As Bob has pointed out, they don't always work. Such checks and
corrections are nonetheless a cost-effective and necessary step. We need to
continue to build a more robust 'rule set'. Bob's paper will help with this
and all other contributions would be appreciated.
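
To make the idea of a 'rule set' concrete, here is a minimal sketch of how
automated checks might be organised as a list of small, independent rules.
It is not the ALA's implementation. The three rules, the function names and
the sample records are hypothetical; the field names are borrowed from
Darwin Core (scientificName, decimalLatitude, decimalLongitude).

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]
Check = Callable[[Record], Optional[str]]  # returns an issue message or None


def _as_float(value: object) -> Optional[float]:
    """Convert a field value to float, or return None if that is not possible."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def check_coordinate_range(record: Record) -> Optional[str]:
    lat = _as_float(record.get("decimalLatitude"))
    lon = _as_float(record.get("decimalLongitude"))
    if lat is None or lon is None:
        return "coordinates missing or not numeric"
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return "coordinates out of range"
    return None


def check_zero_coordinates(record: Record) -> Optional[str]:
    lat = _as_float(record.get("decimalLatitude"))
    lon = _as_float(record.get("decimalLongitude"))
    if lat == 0 and lon == 0:
        return "coordinates are exactly 0,0"
    return None


def check_name_present(record: Record) -> Optional[str]:
    name = record.get("scientificName")
    if not isinstance(name, str) or not name.strip():
        return "scientific name missing"
    return None


# Hypothetical rule set; new rules are added by appending to this list.
RULES: List[Check] = [
    check_coordinate_range,
    check_zero_coordinates,
    check_name_present,
]


def run_checks(record: Record, rules: List[Check] = RULES) -> List[str]:
    """Apply every rule and collect the issues that were flagged."""
    issues = []
    for rule in rules:
        message = rule(record)
        if message is not None:
            issues.append(message)
    return issues


# Made-up sample records for illustration only.
good = {"scientificName": "Placeholder name",
        "decimalLatitude": -42.0, "decimalLongitude": 147.0}
suspect = {"scientificName": "", "decimalLatitude": "0", "decimalLongitude": "0"}

good_issues = run_checks(good)
suspect_issues = run_checks(suspect)
```

Rules like these only flag a record. Deciding whether a flagged record is
actually wrong, and what the correct value is, may still need manual
checking.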

4. For a range of error types, specialist domain expertise is required, for
example Bob's expertise with Australian millipedes. 'Super-aggregators'
such as the ALA and GBIF do not generally have this type of expertise (and
museums and herbaria may no longer have staff with expertise in various
taxonomic areas). They do however have expertise to build infrastructure
that enables integrated data to be exposed and discovered in an open
manner. In my experience, errors are more likely to be exposed in
integrated datasets. The 'super-aggregators' are also in a good position to
provide infrastructure and processes that contribute to addressing data
issues.  A two-way table with four cells, titled "domain expertise
required", can be envisioned, where the rows represent error detection and
the columns represent error correction. The first row and column are "Yes"
and the second row and column are "No". There are errors that require
domain expertise to detect and correct (Type 1). There are errors where no
domain expertise is required to detect or correct (Type 4). There are
errors where domain expertise is not required for detection but is
required for correction (Type 3). The last cell in the table seems
unlikely: errors where domain expertise is required for detection but not
for correction (Type 2).
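
The four cells of that table can also be written as a small lookup. This is
only a sketch of the classification described above; the function and
parameter names are made up for illustration.

```python
def error_type(expert_to_detect: bool, expert_to_correct: bool) -> int:
    """Map the two 'domain expertise required' answers to Types 1-4.

    Rows of the table are detection, columns are correction.
    """
    table = {
        (True, True): 1,    # expertise needed to detect and to correct
        (True, False): 2,   # needed to detect only (seems unlikely)
        (False, True): 3,   # needed to correct only
        (False, False): 4,  # no expertise needed for either
    }
    return table[(expert_to_detect, expert_to_correct)]


# An error anyone can spot but only a specialist can fix is Type 3.
spot_but_not_fix = error_type(expert_to_detect=False, expert_to_correct=True)
```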

5. On David and Dean's point about annotations: Good point! The ALA does
have a sophisticated annotation service that enables comments to be
attached to any field of any record. In addition, anyone registered on the
ALA site can also set up alerts for any annotations (and a range of other
additions and changes) made against any suite of records (
http://www.ala.org.au/blogs-news/annotations-alerts-about-new-annotations-and-annotations-of-interest/
).  'Crowd-sourcing' is an extremely effective process.

6. Data providers (like 'aggregators') are a diverse lot. Some data
providers encourage the ALA to make corrections to the provider's records
(for provider and ALA). Other data providers would withdraw their support
if similar changes were attempted on their data by the ALA. Feedback from
the ALA to a data provider may result in immediate corrections (and data
propagation) while in other cases, the provider has no resources to resolve
an issue. There is currently no single process that will work
effectively in all circumstances. We do however take Bob's paper as a prod
to seek best current practice among providers and 'aggregators' to improve
'data quality'.

7. One point among a few made by Donald Hobern (GBIF) to me yesterday:
"Progress will be limited while the underlying culture of data publishing
and data management does not support stable, long-term reference to each
data record and community-based curation of those data in a way that
ensures that each act of correcting any aspect of any data element is not
lost but contributes to the development of a global digital biodiversity
knowledgebase."

Lee

 --
Lee Belbin
Blatant Fabrications Pty Ltd
Tasmania


