[Taxacom] new GBIF data portal

Donald Hobern dhobern at gbif.org
Wed Aug 8 12:42:46 CDT 2007

Many thanks to all who have contributed to the discussion of the new 
GBIF data portal over the last few days.  We have been following this 
with much interest. We are grateful for suggestions for further 
improvements to the data portal, and will consider all of them seriously.

As some of the portal-related procedures may not be very transparent, a 
little more background information on the current state may help to 
clarify what is already happening and what is planned so far:

The new indexing process is a three-phase procedure that (1) indexes the 
core data items of all specimen and observation records exactly as 
provided in a dataset (the "harvesting" step of the log messages), then 
(2) analyses them while splitting the data out into a number of tables 
(called "extraction"), and (3) is rounded off with a number of batch 
processes (e.g. map layer generation) that have to run before a new 
version of the index is put online. The second phase already includes a 
number of checks, which correspond to the messages appearing in the 
"event log" of a data provider or dataset. Additional checks can and 
will be inserted at this step as data and metadata become available. 
Existing checks mostly 
concern scientific names, which are checked for compatibility with 
syntax rules (making it possible to match them to the existing entries 
in the taxonomic "backbone", also taking available information on higher 
taxonomy into account), geographical information (checking the country 
name against available coordinates, and suggesting possible errors, e.g. 
negated values, swapped lat/long values, and missing values recorded as 
coordinates of "0"), and the presence of mandatory elements like ID 
values or the "Basis of Record" information that is needed to 
distinguish living collections from observations or fossils, etc. There 
are also some other tests in place that check for, e.g., expected data 
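
To make the geographical checks described above more concrete, here is a 
minimal sketch in Python. It is an illustration only and not the portal's 
actual code: the function names, the flag names and the rough country 
bounding boxes are all assumptions made for this example.

```python
# Hypothetical bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# The values are rough and purely illustrative.
COUNTRY_BOXES = {
    "DK": (54.5, 57.8, 8.0, 15.2),
    "AU": (-44.0, -10.0, 112.0, 154.0),
}


def in_box(country, lat, lon):
    """True if the point falls inside the country's bounding box."""
    min_lat, max_lat, min_lon, max_lon = COUNTRY_BOXES[country]
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def check_coordinates(country, lat, lon):
    """Return a list of issue flags (hypothetical names) for one record."""
    if lat is None or lon is None:
        return ["COORDINATES_MISSING"]
    if lat == 0 and lon == 0:
        # Coordinates of "0" usually stand for a missing value.
        return ["ZERO_COORDINATES"]
    if country not in COUNTRY_BOXES or in_box(country, lat, lon):
        return []  # nothing to compare against, or consistent

    issues = ["COUNTRY_COORDINATE_MISMATCH"]
    # Suggest a likely cause by testing simple transformations.
    if in_box(country, -lat, lon):
        issues.append("LATITUDE_POSSIBLY_NEGATED")
    if in_box(country, lat, -lon):
        issues.append("LONGITUDE_POSSIBLY_NEGATED")
    if in_box(country, lon, lat):
        issues.append("LAT_LONG_POSSIBLY_SWAPPED")
    return issues
```

For example, a record labelled "AU" with latitude 25.0 and longitude 135.0 
falls outside the Australian box but inside it once the latitude is 
negated, so the sketch flags a possibly negated latitude.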

The number of available tests will grow as we go along. One of the next 
things to get up and working from our side is a provider and resource 
registration that also captures some more meta information like 
geographic and taxonomic coverage, which in turn will be used for more 
consistency checks in the extraction step. Records with potential issues 
are already flagged, so that the user interface could offer more choice 
to users on whether to display/download all records in a result set, or 
limit to unambiguous ones. Likewise, download options could default to 
original data, but offer an option to use interpreted data instead or 
show the flagged issues in addition. Previous test versions of the data 
portal included some of these options, but they were discarded again in 
the launched version because they tended to be confusing. Some more 
thought would have to go into representation, but they certainly could 
be re-integrated at some point.
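
The display and download options discussed above could look roughly like 
the following Python sketch. The record layout (the "original", 
"interpreted" and "issues" keys) and both function names are assumptions 
made for this illustration and do not describe the portal's data model.

```python
def select_records(records, include_flagged=True):
    """Return all records, or only those without flagged issues."""
    if include_flagged:
        return list(records)
    return [r for r in records if not r["issues"]]


def export_row(record, use_interpreted=False, show_issues=False):
    """Build one download row, defaulting to the original data."""
    source = record["interpreted"] if use_interpreted else record["original"]
    row = dict(source)  # copy so the stored record is not modified
    if show_issues:
        row["issues"] = ";".join(record["issues"])
    return row


# Hypothetical example records.
records = [
    {
        "original": {"country": "Denmark", "lat": "55.7", "lon": "12.5"},
        "interpreted": {"country": "DK", "lat": 55.7, "lon": 12.5},
        "issues": [],
    },
    {
        "original": {"country": "Denmark", "lat": "12.5", "lon": "55.7"},
        "interpreted": {"country": "DK", "lat": 12.5, "lon": 55.7},
        "issues": ["LAT_LONG_POSSIBLY_SWAPPED"],
    },
]

unambiguous = select_records(records, include_flagged=False)
rows = [export_row(r, show_issues=True) for r in records]
```

Keeping the original values as the default and making interpreted values 
and issue flags opt-in mirrors the behaviour suggested in the paragraph 
above.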

Data providers currently have to actively check the event logs of 
their datasets for indexing issues. Email notifications were 
not included initially, because of the potential hassle they might cause 
data providers trying to follow up on a number of messages from 
successive indexing runs. We are happy, however, to get recommendations 
on that matter. Feedback messages from users of the data portal, on the 
other hand, are forwarded directly to the registered contacts of the 
dataset they concern. One of our main goals is to support data providers 
as much as we can in the effort of correcting and completing the 
original data. Pointing out possible issues is, of course, only a first 
step, and we are aware that in many cases resources are scarce and 
updates will take some time. We therefore agree that data users need 
some help in interpreting the data they receive through the GBIF data 
portal.

Many thanks again for all your comments and suggestions, which will 
always be much appreciated.

Donald Hobern (on behalf of the GBIF portal team)

Donald Hobern (dhobern at gbif.org)
Deputy Director for Informatics 
Global Biodiversity Information Facility Secretariat 
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
