[Taxacom] Chameleons, GBIF, and the Red List

Richard Pyle deepreef at bishopmuseum.org
Sun Aug 24 20:26:31 CDT 2014

Michael Heads wrote:

> But it's not getting any data from many of the most important collections in US, 
> UK etc. or any from the largest country in the world, the one with the largest 
> expanse of forest etc. 

Why not?  More importantly, to whom am I addressing the "Why not?" question?  To GBIF?  Or to many of the most important collections in US, UK etc. or any from the largest country in the world?

> That data (in the collections of Moscow etc) is incorporated in works such as 
> 'Birds of Russia' and from there into sites such as IUCN, but it's not in GBIF. 
> So GBIF is not really a *global* biodiversity information facility - in practice 
> it doesn't supply reliable information on global distributions, even in the 
> best-known groups. IUCN is much more useful.  

Err... last I checked, "Global" means a geographic scope that spans the entire planet -- not necessarily implying data omniscience.

Perhaps what you're suggesting is that GBIF expand its scope of sources for Occurrence data?  But its scope is limited only by what people can provide.  As I described about a half-dozen posts ago, we've developed a workflow to harvest Occurrence records from literature, so maybe that's the next frontier towards data omniscience?

Stephen Thorpe wrote:

> I agree with Rich that data cleaning would greatly enhance the value of GBIF,
> but I see a huge "political" roadblock here. The awkward dilemma for GBIF is
> to facilitate data cleaning while at the same time not publicly admitting
> "imperfection", for, like it or not, they are already "selling themselves" as a
> reliable source of data. 

Really?  And here I was thinking they were "selling" themselves as a data aggregator.  I agree there may be politics at play, but not because GBIF is fearful that the world would see the data they aggregate as "imperfect" (indeed, anyone inside or outside of GBIF who believed the data are perfect is delusional at best).  No, the politics, if any, are the extent to which the source databases would be exposed to their imperfections.  GBIF is just the mirror, remember?  It's a shame, though.  In an ideal world, data sources would be HUNGRY for the data clean-up tools and services.  But in the real world, as I said, most of those sources lack the resources and/or infrastructure to incorporate the clean-up, and I would hazard a guess that they may be reluctant to expose their content through GBIF if the "warts" were so readily visible to all.  Or maybe not?

Bob Mesibov wrote:

[A bunch of stuff that I don't have any comment on, plus...]

> Yeah, well. I don't see links on the GBIF website to alternative data sources,
> which would be a lot easier to add than data caches.

I'm not sure what you mean by this.  Do you mean alternative sources for a particular occurrence record?  I suspect most occurrence records in GBIF have only one source, and the problem with discovering occurrence records with multiple sources comes back to the globally unique identifier problem I've touched on already.  Or, do you mean a more generic "other places where Occurrence records can be found"?  I don't need GBIF to tell me that... although I sincerely wish that such sources would provide those occurrence records to GBIF.

Or, do you mean other sources for "updated taxonomy for particular names"?  GBIF is not a taxon name service (although it has some taxon-related services that help users find occurrence records).  Such taxon services are at CoL, WoRMS and the like.  What we're trying to do at GNA is get those things cross-linked.

Rod Page wrote:

> Given that GBIF has duplicate specimen records already (e.g., 
> occurrences that come directly from the original museum, and 
> also via other projects such as FishBase, or from DNA barcoding 
> projects), we can think of these different records as being 
> “annotations” (e.g., this occurrence is what the museum says 
> about this specimen, this other occurrence is what BOLD or GenBank say). 
> So, we could cluster these and end up knowing that this 
> museum specimen is the voucher for these DNA sequences. 

I could definitely get behind that!  We need a mechanism to "know" when two records in GBIF are the "same" occurrence.  We have the integer gbifID values, but these are record identifiers, not "Occurrence" (as a concept) identifiers.  Like Rod, I think the path forward is in this space, whether we intend it to be or not.
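
To make the clustering idea concrete, here is a minimal sketch, not anything GBIF actually runs. It assumes records are plain dictionaries carrying gbifID plus the Darwin Core terms institutionCode, collectionCode and catalogNumber, and that two records sharing that normalised triplet refer to the same specimen. That assumption fails often in practice (providers re-code or reformat catalog numbers), and the function name and sample values are made up for illustration.

```python
from collections import defaultdict

# Darwin Core terms assumed to identify a specimen when taken together.
TRIPLET = ("institutionCode", "collectionCode", "catalogNumber")


def cluster_by_triplet(records):
    """Group gbifID values whose normalised triplet matches.

    Returns only the clusters with more than one record, i.e. the
    candidate "same occurrence, different source" groups.
    """
    clusters = defaultdict(list)
    for rec in records:
        key = tuple((rec.get(term) or "").strip().lower() for term in TRIPLET)
        if all(key):  # skip records missing any part of the triplet
            clusters[key].append(rec["gbifID"])
    return {key: ids for key, ids in clusters.items() if len(ids) > 1}


# Hypothetical sample records; codes and numbers are invented.
records = [
    {"gbifID": 1, "institutionCode": "MUS", "collectionCode": "Fish", "catalogNumber": "12345"},
    {"gbifID": 2, "institutionCode": "mus", "collectionCode": "FISH", "catalogNumber": " 12345 "},
    {"gbifID": 3, "institutionCode": "MUS", "collectionCode": "Fish", "catalogNumber": "99999"},
    {"gbifID": 4, "institutionCode": "MUS", "collectionCode": "Fish", "catalogNumber": ""},
]

clusters = cluster_by_triplet(records)
print(clusters)
```

The point of keeping the gbifID values inside each cluster is that the cluster key, not any single gbifID, starts to behave like an identifier for the Occurrence itself.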

> If people have downloaded GBIF data, clean it, then send it back, 
> we could simply cluster the new data with the old, and people 
> can then see what has happened to the data when a user has 
> scrutinised it.

I have a snapshot download from GBIF right now that I'm working on for our Checklist of fishes of the Northwestern Hawaiian Islands.  Over the next few weeks, we'll be completing the requisite "cleanup", and I'd be DELIGHTED to use this as an example dataset that might (eventually) round-trip back to GBIF and (ultimately) the original data sources.
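
As a rough illustration of what "seeing what has happened to the data" could look like, the sketch below compares an original snapshot with a cleaned copy, field by field. It assumes both are dictionaries keyed by gbifID and that cleaning keeps those keys unchanged; the function name, field values and IDs are hypothetical and are not taken from the actual download.

```python
def diff_records(original, cleaned):
    """Return {gbifID: {field: (old_value, new_value)}} for changed fields.

    Records absent from the cleaned set are skipped rather than reported
    as deletions; that is a simplification made for this sketch.
    """
    changes = {}
    for gbif_id, old in original.items():
        new = cleaned.get(gbif_id)
        if new is None:
            continue
        changed = {
            field: (old.get(field), new.get(field))
            for field in sorted(set(old) | set(new))
            if old.get(field) != new.get(field)
        }
        if changed:
            changes[gbif_id] = changed
    return changes


# Hypothetical before/after data; values are invented for illustration.
original = {
    101: {"scientificName": "Chaetodon milliaris", "decimalLatitude": "23.75"},
    102: {"scientificName": "Zebrasoma flavescens", "decimalLatitude": "25.10"},
}
cleaned = {
    101: {"scientificName": "Chaetodon miliaris", "decimalLatitude": "23.75"},
    102: {"scientificName": "Zebrasoma flavescens", "decimalLatitude": "25.10"},
}

changes = diff_records(original, cleaned)
print(changes)
```

A per-field change list of this kind is small enough to pass back to an aggregator or to the original data source as an annotation, without resending the whole record.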

