[Taxacom] Chameleons, GBIF, and the Red List

I for one am finding this a very useful discussion (apologies to those on TAXACOM who are bored rigid by it).

Not wishing to spoil the fun regarding evil bureaucrats sucking all the biodiversity money into organisations intent on covering up their inadequacies, but here are some links to some relevant aspects of GBIF.

- GBIF recently released analytics where you can get an indication of data quality http://analytics.gbif-uat.org/global/index.html You can see all sorts of features in the data, some of which are almost certainly artefacts. Indeed one reason for showing these charts is so that artefacts can be spotted more readily.

- GBIF does a lot of data cleaning before data appears in the portal, see this blog post for example http://gbif.blogspot.co.uk/2011/05/here-be-dragons-mapping-occurrence-data.html For each individual occurrence record GBIF will often add a “flag” indicating that something is wrong (e.g., the geographic coordinates don’t match the country the specimen is reported from). These flags don’t cover all issues by any means, but they do catch some of the obvious problems.

- To get a sense of how GBIF-mobilised data is being used, see http://www.mendeley.com/groups/1068301/gbif-public-library/papers/ for a list of publications that mention GBIF or make use of data from GBIF.
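
To make the idea of a per-record "flag" in the second point above concrete, here is a minimal sketch in Python. It is not GBIF's actual procedure: the flag names, the record layout, and the rough bounding boxes are all assumptions made for illustration, and a real check would use proper country polygons rather than rectangles.

```python
# Rough, illustrative bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# These are assumptions for this sketch, not authoritative boundaries.
ROUGH_BOXES = {
    "AU": (-44.0, -10.0, 112.0, 154.0),
    "MG": (-26.0, -11.5, 43.0, 51.0),
}


def flag_record(record):
    """Return a list of (hypothetical) flag names for one occurrence record."""
    flags = []
    lat, lon = record.get("lat"), record.get("lon")

    # Without coordinates there is nothing further to test.
    if lat is None or lon is None:
        return ["missing_coordinates"]

    # Coordinates outside the valid range cannot be tested against a country.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return ["coordinates_out_of_range"]

    # 0,0 is a common placeholder for "unknown" in source databases.
    if lat == 0 and lon == 0:
        flags.append("zero_coordinates")

    # Compare the point with the box for the stated country, if we have one.
    box = ROUGH_BOXES.get(record.get("country"))
    if box is not None:
        min_lat, max_lat, min_lon, max_lon = box
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            flags.append("country_coordinate_mismatch")

    return flags
```

A record stated to be from Australia but with the sign of its latitude flipped would be flagged as a mismatch, while the record itself is left unchanged. That is the point of a flag: it warns the user without altering the provider's data.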



I agree with Rich that data cleaning would greatly enhance the value of GBIF, but I see a huge "political" roadblock here. The awkward dilemma for GBIF is to facilitate data cleaning while at the same time not publicly admitting "imperfection", for, like it or not, they are already "selling themselves" as a reliable source of data. Taxonomists like Rich won't be suckered in by this, but many 'crats working in local and national government (e.g. biosecurity agencies, etc.) will be so suckered, and quite possibly GBIF funding is driven by their needs/wants.


What you describe is EXACTLY what GBIF and others in the biodiversity informatics world are hoping to

My previous post (reply to Stephen) covered these parts of the process:
1) Data exist in thousands of databases around the world
2) Aggregators like GBIF make our lives MUCH easier in helping us to discover those data
3) We, the experts of the world, spend hours "cleaning" data after GBIF has so helpfully allowed us to locate it.

What you're talking about is the next

4) After we, the experts of the world, have spent hours "cleaning" the data, how do we allow those efforts to propagate back to the sources, so that the NEXT person who encounters those records through GBIF can benefit from the toils of us

There are two basic roadblocks to achieving this final step.

First, as has been made ABUNDANTLY clear in this thread, the data do NOT belong to GBIF. They belong to the hundreds (thousands?) of institutions around the world that manage those thousands of databases. Ultimately, those corrections have to find their way back to the source databases, so that GBIF can re-index them with the corrections included. And believe me, GBIF and others have tried to do this EXTENSIVELY -- for many years. A lot of the mechanisms are being developed (e.g., FilteredPush), but so far there has been slow adoption of those mechanisms by the thousands of source databases. There are many reasons for this, but I suspect the main reason is that institutions are barely keeping up day-to-day activities with ever-shrinking budgets, and simply do not have the time or IT expertise to implement the corrections to the datasets that they manage. Thus, because the source data remain "unclean", the aggregated data in GBIF remain unclean.

The second major roadblock is the lack of "proper" identifiers (globally unique, persistent, actionable) for these occurrence records. The only way that corrections you make in your downloaded copy of GBIF data can be passed on is if you can report back on exactly which records need cleaning (along with the corrected information). GBIF does assign its own locally unique identifier (integer), which could be used for this purpose -- but only for piping the data back to GBIF. GBIF can relay the corrections back to the source databases, but that will only be helpful to the rest of us if the source incorporates the fixes.
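
As a rough illustration of what "reporting back on exactly which records need cleaning" could look like, here is a small Python sketch that compares a downloaded copy with a cleaned copy and lists the differences per record. The field name `record_id`, the dictionary layout, and the report format are assumptions for this example; they are not a GBIF or FilteredPush format.

```python
def correction_report(downloaded, cleaned, id_field="record_id"):
    """List every field that differs between the download and the cleaned copy.

    Both arguments are lists of dicts. Records are matched on `id_field`,
    which is why a stable identifier is essential: without it there is no
    reliable way to say which record a correction belongs to.
    """
    original = {row[id_field]: row for row in downloaded}
    report = []
    for row in cleaned:
        before = original.get(row[id_field])
        if before is None:
            continue  # not part of the download, so nothing to compare
        for field, new_value in row.items():
            if field == id_field:
                continue
            old_value = before.get(field)
            if old_value != new_value:
                report.append({
                    "record_id": row[id_field],
                    "field": field,
                    "old": old_value,
                    "new": new_value,
                })
    return report


# Hypothetical example data: one latitude with the wrong sign.
downloaded = [
    {"record_id": 1, "country": "AU", "lat": 41.1},
    {"record_id": 2, "country": "AU", "lat": -42.9},
]
cleaned = [
    {"record_id": 1, "country": "AU", "lat": -41.1},
    {"record_id": 2, "country": "AU", "lat": -42.9},
]
report = correction_report(downloaded, cleaned)
```

Keeping the old value next to the new one lets whoever receives the report check that the correction is being applied to the record state the expert actually saw.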

There is actually a third roadblock, which has the potential to become a major roadblock, but we haven't bumped into it yet so much because we still can't get past the first two roadblocks. And that is, institutions will not automatically assume that every "correction" that is sent to them is actually "correct". Managers of those data will in almost all cases want to review the changes to ensure that they are appropriate for updating in the source database. And this process, of course, requires time and resources that most institutions simply do not have.

There may be another solution, however, which is for GBIF to cache corrections submitted by people like you and other experts, such that these annotations/corrections can be made visible to all users of GBIF data, not just the source datasets. Perhaps this feature already exists. Perhaps the politics of implementing such a feature are too daunting to overcome.
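
A minimal sketch of such a cache, in Python, might keep submitted annotations separate from the provider's record and attach them only when the record is shown. Everything here (the `Annotation` class, its fields, and the `with_annotations` helper) is hypothetical and only meant to show the shape of the idea, not an existing GBIF feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One suggested correction, stored apart from the source record."""
    record_id: int
    field: str
    suggested: str
    annotator: str


def with_annotations(record, annotations, id_field="record_id"):
    """Return a copy of the record with matching annotations attached.

    The provider's values are never overwritten: users see the source data
    and the suggested corrections side by side, and the source institution
    remains free to accept or reject each suggestion.
    """
    view = dict(record)
    view["annotations"] = [
        a for a in annotations if a.record_id == record[id_field]
    ]
    return view


# Hypothetical example: one expert suggests a corrected country code.
source = {"record_id": 7, "country": "AU", "lat": -18.9}
cache = [
    Annotation(7, "country", "MG", "an expert user"),
    Annotation(8, "lat", "-41.1", "another user"),
]
shown = with_annotations(source, cache)
```

Because the annotation carries the name of whoever made it, a user can weigh the suggestion, which also speaks to the third roadblock: nobody is asked to assume that every "correction" is correct.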

But the bottom line is that we really do need to address this fourth step, so that we can more effectively benefit from the work of others, and (conversely), so that our own efforts will benefit more than just ourselves.


Donat Agosti

feel, the discussion is too much centered on data that has not the information content needed, like studying a Landsat image at 30 meter resolution and discussing what tree species is shown"

Excellent metaphor! For most scientific uses, you need much more data than provided by any available database. Can you get everything you need online? No. Do existing aggregators like GBIF offer a helpful starting point? For some people and some uses, yes.

But now the important question: when you have all the information you need, and clean it and enrich it, do you publish it online in a usable form? I don't know what Quentin Groom's project was about, nor do I know if he published his final

In my own case, every one of my 12123 locality records for millipedes is freely available in CSV format (and in abbreviated form in KML) from the 'Millipedes of Australia' website. This store is larger and more up to date and contains fewer errors than any aggregator store, or even the data providers' stores (because certain providers have been slow to add my edits to particular records, or to upload them to their own or aggregator stores).

But if people like me and Quentin publish data freely to the Web and aggregators don't use this improved/extended data, aggregation looks less and less

Robert Mesibov
Honorary Research
Queen Victoria Museum and Art Gallery, and School of Land and Food, University of Tasmania
Home contact: Box 101, Penguin, Tasmania, Australia 7316
(03) 64371195; 61 3 64371195
