[Taxacom] Data quality of aggregated datasets

David Remsen [GBIF] dremsen at gbif.org
Thu Apr 25 08:51:23 CDT 2013

I attended a meeting last week in Arizona that focused on the concept of
recognition and attribution for a broad set of activities within
informatics that constitute "work."

As Bob points out, "we" can develop all the enabling technology we can
imagine but at the end of the day, someone must actually do the review and
annotation in order to improve the utility of data.  The question is whether
and how they might be incentivized, recognized and rewarded for this
activity.  Strategies that rely on altruism have a limited shelf life and,
as more and more aspects of the data life cycle take place online, there
are more opportunities for different actors to assert a right to be
recognized in the process.

One answer to who should be recognized for their efforts is simple and
pragmatic: anyone who asks for it and without whom the data stops moving.
Bob is right that without the "'interest' bit", the best pipes in the
world will sit empty with nothing moving through them.  On the other hand,
recognition and attribution in the form of a citation, or thanks splashed
across a web page in response to some recommendation of a data-enabling
middleman, doesn't cut it either.  What we need are numbers, and if they
aren't dollars they need to be solid metrics that individuals can present
to demonstrate their true role.

My view is that citation and attribution need to be built into the same
infrastructure that moves and enables the data.  As the data moves from
one place to another, it accrues provenance.  As it is annotated,
additional actors and their roles are added to the audit trail. All of
this is tied to a resolvable identifier, like a DOI, that can be instantly
resolved to unravel this network of provenance.  And in the same way that
you can get metrics regarding how often your publication is cited, you
should be able to get similar metrics for how often the data you enabled
is used.
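
As a rough illustration of the idea in the paragraph above, here is a
minimal Python sketch.  It is not a description of any existing system:
the class names, the DOI-like identifier, the actor names and the roles
are all hypothetical, and the crediting rule (every actor in a record's
trail is credited with each use of that record) is just one possible
assumption.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ProvenanceEvent:
    actor: str  # hypothetical contributor identifier (e.g. an ORCID-style id)
    role: str   # hypothetical role label, e.g. "provider" or "annotator"


@dataclass
class DataRecord:
    identifier: str                             # hypothetical DOI-like identifier
    trail: list = field(default_factory=list)   # audit trail of ProvenanceEvent
    uses: int = 0                               # times the record has been used

    def add_event(self, actor: str, role: str) -> None:
        """Append an actor and their role to the record's audit trail."""
        self.trail.append(ProvenanceEvent(actor, role))

    def record_use(self) -> None:
        """Count one use (for example, one citation) of the record."""
        self.uses += 1


def usage_by_actor(records):
    """Credit each actor in a record's trail with that record's use count."""
    credit = Counter()
    for rec in records:
        # A set, so an actor with several roles is credited once per record.
        for actor in {event.actor for event in rec.trail}:
            credit[actor] += rec.uses
    return credit


# Hypothetical example: one record passes through three sets of hands.
rec = DataRecord("doi:10.9999/example-record")
rec.add_event("museum-curator", "provider")
rec.add_event("aggregator", "publisher")
rec.add_event("volunteer-checker", "annotator")
rec.record_use()
rec.record_use()

credit = usage_by_actor([rec])
print(credit["volunteer-checker"])
```

Resolving the identifier would return the trail, and a per-actor tally
like `credit` is the kind of number an individual could present.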

Since data moves from place to place, a system for keeping track of all
who touch the data needs to be in place and consistent everywhere the data
travels.  At a meeting at
NESCent in March I learned about technologies like nanopublications
(http://nanopub.org) and ORCIDs (http://orcid.org) which seem to be
pointing in the right direction.  A successful infrastructure, in other
words, incorporates "you," the experts, into "we," the network, in such a
way that the infrastructure itself starts to fade from view.  When I was
growing up in the US in the 70s, there was a shampoo ad featuring Vidal
Sassoon, who said, in his stylish accent, "if you don't look good, we don't
look good."

That's how successful cyber-infrastructure works.  It provides a venue for
brokering connections and the real metric for its success is when "we"
stop singing our own praises because "you," the users and the enablers of
the infrastructure, do it for us.

David Remsen

> Rod Page wrote:
> "There at least things we need to do to tackle this problem.." and I like
> to think the presumably unintentional gap between 'least' and 'things'
> shows that Rod had something else in mind, maybe a definition of who 'we',
> the annotators, might be?
> Donald Hobern says 'It will also involve a commitment from us all to work
> collaboratively to manage digital knowledge of biodiversity', and he used
> 'we' quite a bit in his last Taxacom post. But Doug Yanega's awesome post
> (and lots of other examples known to Taxacom listers) shows that the hard
> grind of data inspection and cleaning isn't done by a rhetorical 'we,'
> it's done by a small number of individuals around the world with the
> interest and the time (often a lot of time) to do that work.
> While persistent identifiers and an effective annotation mechanism can
> help with the 'time' bit, and ensure that any particular job only has to
> be done once, they do nothing for the 'interest' bit. Here and there in
> the landscape of biological data, isolated individuals appear, roll up
> their sleeves and try to bring order and accuracy to what's known so far.
> Is there a technical fix for increasing their numbers?
> These people aren't going to appear out of nowhere when effective
> data-item identifiers and annotation mechanisms are developed and agreed
> on, because their job has been made easier. It hasn't. In the same way
> that digital tools can't significantly accelerate taxonomy if 90% of
> taxonomists' time is spent quietly examining and curating specimens, the
> technical solutions for archiving/tagging/annotating data can't reduce the
> effort involved in the detective work needed to upgrade the data.
> I used digital tools in my Australian millipede records audit to quickly
> identify potential problems, and I then contacted the data providers with
> quite a few queries, most of which only curatorial staff (sometimes
> particular curatorial staff) could answer. Dealing with those queries took
> time. Between us, the staff members and I had enough 'interest' to pursue
> these data issues.
> So a vanishingly small percentage of the world's biodiversity data items
> got upgraded. What about all the others? Who does those? Like Rod says,
> there's a lot of arm-waving about projects, but not much talk about
> fielding an army of data-checkers, except moonshine about 'the crowd',
> which properly interpreted means 'an additional small number of patient,
> capable people we hope to find by asking for help over the Internet'.
> Which leaves us where? I can see a future with an increase in the quality
> of data 'crystallised' around particular taxa and geographical areas,
> because that's how human interest focuses. Maybe Rod has an idea?
> It would be nice if those 'crystallisations' got supported. The money in
> recent years seems to have been going mainly to aggregating data with very
> patchy quality.
> --
> Dr Robert Mesibov
> Honorary Research Associate
> Queen Victoria Museum and Art Gallery, and
> School of Agricultural Science, University of Tasmania
> Home contact: PO Box 101, Penguin, Tasmania, Australia 7316
> Ph: (03) 64371195; 61 3 64371195

David Remsen
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +1 508 274 4055   Fax: +45-35321480
Skype: dremsen
