[Taxacom] saturday morning fun
David Remsen (GBIF)
dremsen at gbif.org
Mon Nov 29 04:19:22 CST 2010
I tried to explain the state that data is in when it is published from
source collections and the challenge of organising it for
evaluation. I also said that, since the taxonomic organisation of
these source data is so inconsistent and we lack the means to rank
individual data sets (something I think we need to revisit), we
rely on additional external sources to provide taxonomic authority.
- We use the Catalogue of Life because it is available and is
purported to be curated by a network of taxonomic experts. We have
the capacity to utilise additional and alternative sources if they are
made available to us. What are they and where can I find them?
And which sectors of the Catalogue of Life are completely worthless?
Like GBIF, the Catalogue of Life is composed of component parts that
are curated elsewhere. Like GBIF, CoL does actually shoulder some
responsibility for the organisation of these parts. Is it the
higher (supra-familial) taxonomy that the CoL uses to organise the
components? Is it specific taxonomic sectors like the Porifera or
the legumes or weevils? Or, is it like the proverbial rotten egg,
where you don't need to eat the whole thing to just know it's bad?
The latter requires the least effort but doesn't actually say much.
- We are aware of particulars regarding species concept issues but
specific concept references are not provided in data sources and
concept differentiation remains a problem for nearly all biodiversity
data networks. We know how to improve this, but it requires
concept identifiers, and the concept definitions themselves, to be
provided in a structured manner with the data.
- The GBIF data portal is in need of a make-over and will be getting
one over the next year. I agree we need to make the simple thing it
is supposed to be doing simpler and with improved precision. That
simple thing is access to primary biodiversity data records shared
through the network. We will be doing a lot more processing of these
records to try to make clearer qualitative separations. There are
two basic issues, however, that are very difficult to address, the first
being comparing two records and determining if record A and record B
actually refer to the same thing.
Lastly, GBIF is an open source project and the data published through
it are (or can be) available to anyone wishing to provide improved
access to it. This, of course, requires an agreement that access to
raw collections data from a federation of sources has some sort of
merit, something that, it seems to me, is not agreed by commenters here.
If this isn't agreed then the issue is far larger than GBIF and in my
mind raises the question as to why NSF and others continue to wish to
fund such digitisation and networking efforts.
If you think there IS merit in sharing these data but that the problem
is we just continually botch it up in Copenhagen, we would support
your proposal to improve access to these meritorious data. I will
provide you with a copy of the current index (a ~75 GB text file) or
any subset for evaluation and new ideas. Just remember, the fixes
and organisations have to be re-applied next month to a new copy.
Lastly, just what did you search for in Google that provided better
recall and precision than a GBIF search in terms of specimen
records? If Google has access to a wider set of better curated
specimen records, then indeed the portal merits a real existential
On Nov 29, 2010, at 9:46 AM, <dipteryx at freeler.nl> wrote:
> From: taxacom-bounces at mailman.nhm.ku.edu on behalf of Jim Croft
> Sent: Mon 29-11-2010 1:04
>> To be fair, the only reason GBIF is 'feeding us shit' is
>> because 'shit' is what we gave them.
> Not at all sure about that. What has been playing through my
> mind is the idea that a data aggregator is an agency which can
> be characterized by "Data in, garbage out". It is a complete
> mystery to me why GBIF uses something known to be so completely
> worthless as the taxonomy of the Catalogue of Life; nothing good
> can come of that ...
> Like some other list-members, I tried a small test, for which I
> selected a genus where it is known to be essential to be explicit
> about the species concept used in order to be able to interpret
> and handle data, in anything like a meaningful manner.
> Using the GBIF data portal, the most noticeable thing is how much
> work it is to use, before getting to any data. There is indeed a
> significant degree of completely irrelevant material linked from
> this entry (the wondrous ways of computers!), but this is easily
> identifiable, so not much of an actual problem. There is no apparent
> awareness of the species-concept issue, with more than one species
> concept used happily side by side. So, a lot of work (and 'expert'
> knowledge required), but basically usable. This in contrast to the
> Wikipedia entry, which requires very little work on the part of the
> reader for him to be completely misinformed. Wikispecies is better,
> although it offers only little information, with a 25% rate of error
> (as compared to the source it was copied from), but at least it
> indicates its source, and it has selected a relevant source.
> On the whole it proves that the casual user is best advised to just
> use Google (which not only did turn up the relevant information but
> quickly showed me a very nice site unknown to me): this is less work
> and yields more useful results (a higher ratio of
> information/amount-of-work) than trying one of the self-advertised
> high-profile sites
> (obviously, the 'expert' does not need advice).
> Paul van Rijckevorsel