[Taxacom] GBIF Data Portal response from Donald Hobern

Beach, James H beach at ku.edu
Wed Aug 8 14:57:45 CDT 2007


From Donald Hobern, GBIF:


Many thanks to all who have contributed to the discussion of the new
GBIF data portal over the last few days.  We have been following this
with much interest. We are indeed grateful for suggestions for further
improvements to the data portal, and will consider all of them
seriously. 

As some of the portal-related procedures may not be very transparent, a
little more background information on the current state may help to
clarify what is already happening and what is planned so far:

The new indexing process is a three-phase procedure that (1) indexes the
core data items of all specimen and observation records exactly as
provided in a dataset (the "harvesting" step of the log messages), then
(2) analyses them while splitting the data out into a number of tables
(called "extraction"), and (3) is rounded off with a number of batch
processes (e.g. map layer generation) that have to run before a new
version of the index is put online. The second phase already includes a
number of checks, which correspond to the messages appearing in the
"event log" of a data provider or dataset. Additional checks can and
will be inserted at this step as data and metadata become available.
Existing checks mostly concern scientific names, which are checked for
compatibility with syntax rules (making it possible to match them to the
existing entries in the taxonomic "backbone", also taking available
information on higher taxonomy into account); geographical information
(checking the country name against available coordinates, and suggesting
possible errors, e.g. negated values, swapped lat/long values, and
missing values recorded as coordinates of "0"); and the presence of
mandatory elements like ID values or the "Basis of Record" information
that is needed to distinguish living collections from observations or
fossils, etc. There are also some other tests in place that check for,
e.g., expected data types.
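
To illustrate the kind of geographical check described above, here is a
minimal sketch in Python. It is not the portal's actual code: the
function names, the issue labels, and the rough country bounding boxes
are all assumptions made up for this example, and a real check would
need proper country geometries. It also assumes that "coordinates of 0"
means both values are zero.

```python
from typing import List, Optional, Tuple

# Hypothetical, very rough bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# Illustrative values only; not taken from the GBIF portal.
COUNTRY_BOXES = {
    "Denmark": (54.5, 57.8, 8.0, 15.2),
    "Australia": (-44.0, -10.0, 112.0, 154.0),
}


def in_box(lat: float, lon: float, box: Tuple[float, float, float, float]) -> bool:
    """Return True if the point lies inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def check_coordinates(country: str,
                      lat: Optional[float],
                      lon: Optional[float]) -> List[str]:
    """Return a list of issue labels (hypothetical names) for one record."""
    if lat is None or lon is None:
        return ["missing_coordinates"]
    # Assumption: 0/0 is treated as a placeholder for a missing value.
    if lat == 0 and lon == 0:
        return ["zero_coordinates"]
    box = COUNTRY_BOXES.get(country)
    if box is None:
        return ["unknown_country"]
    if in_box(lat, lon, box):
        return []
    # The point is outside the stated country: suggest likely causes.
    issues = ["country_coordinate_mismatch"]
    if in_box(-lat, lon, box):
        issues.append("possible_negated_latitude")
    if in_box(lat, -lon, box):
        issues.append("possible_negated_longitude")
    if in_box(lon, lat, box):
        issues.append("possible_swapped_lat_lon")
    return issues
```

For example, a record labelled "Australia" with a latitude of 35.3 and a
longitude of 149.1 falls outside the box, but negating the latitude
moves it inside, so the sketch reports a mismatch together with a
possible negated latitude. The record itself is left unchanged; the
check only suggests what might be wrong.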

The number of available tests will grow as we go along. One of the next
things to get up and working from our side is a provider and resource
registration that also captures some more meta information like
geographic and taxonomic coverage, which in turn will be used for more
consistency checks in the extraction step. Records with potential issues
are already flagged, so that the user interface could offer more choice
to users on whether to display/download all records in a result set, or
limit to unambiguous ones. Likewise, download options could default to
original data, but offer an option to use interpreted data instead or
show the flagged issues in addition. Previous test versions of the data
portal included some of these options, but they were discarded again in
the launched version because they tended to be confusing. Some more
thought would have to go into representation, but they certainly could
be re-integrated at some point. 
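
As a rough illustration of the display/download options mentioned above,
the following Python sketch shows how flagged records could be filtered
and exported. The Record class, its field names, and both functions are
hypothetical and invented for this example; they do not describe the
portal's real data model or interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical record holding provider data, interpreted data and flags."""
    original: Dict[str, str]
    interpreted: Dict[str, str]
    issues: List[str] = field(default_factory=list)


def select_records(records: List[Record],
                   unambiguous_only: bool = False) -> List[Record]:
    """Return all records, or only those without flagged issues."""
    if unambiguous_only:
        return [r for r in records if not r.issues]
    return list(records)


def export_rows(records: List[Record],
                use_interpreted: bool = False,
                include_issues: bool = False) -> List[Dict[str, str]]:
    """Build download rows; original data is the default."""
    rows = []
    for r in records:
        # Copy so that the stored record is never modified by the export.
        row = dict(r.interpreted if use_interpreted else r.original)
        if include_issues:
            row["issues"] = ";".join(r.issues)
        rows.append(row)
    return rows
```

Defaulting both export options to False mirrors the idea that a download
returns the original data unless the user explicitly asks for the
interpreted values or the flagged issues.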

Data providers currently have to actively screen the event logs
regarding their dataset for indexing issues. Email notifications were
not included initially, because of the potential hassle they might cause
data providers trying to follow up on a number of messages from
successive indexing runs. We are happy, however, to get recommendations
on that matter. Feedback messages from users of the data portal, on the
other hand, are forwarded directly to the registered contacts of the
dataset they concern. One of our main goals is to support data providers
as much as we can in the effort of correcting and completing the
original data. Pointing out possible issues is, of course, only a first
step, and we are aware that in many cases resources are scarce and
updates will take some time. We therefore agree that data users need
some help in interpreting the data they receive through the GBIF data
portal.

Many thanks again for all your comments and suggestions, which will
always be much appreciated.


Donald Hobern (on behalf of the GBIF portal team)


------------------------------------------------------------
Donald Hobern (dhobern at gbif.org)
Deputy Director for Informatics 
Global Biodiversity Information Facility Secretariat 
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
------------------------------------------------------------




