 'No, that's not the argument. Biodiversity data aren't like biodiversity
books or papers, for which you can (in principle) generate a complete
catalogue or index. Given such a catalogue or index, you can go further and
digitise and make available on the Web all the content. Cool, yes?'

Yes, cool, but between the "cool stuff" and the BHL scans or recently
published PDFs there is a thing called "markup".

Imho, the biggest problem of GBIF is that its main source of data is still
mostly large institutional collections. That is not bad at all, and it is
definitely a great thing to index these and make them discoverable.
Unfortunately, it comes with a trade-off: non-verified data are included
as well. GBIF is still far away from mobilizing the properly published
"small" data (data that are well documented in the literature,
peer-reviewed, registered for authorship and priority, citable, etc.).
Why is that so? Because of the same "markup".

Here comes the second big problem of GBIF: a huge amount of "small"
biodiversity data is still outside the institutional collections of the
GBIF member countries (e.g., published in the historical literature, in
many cases not even in English, or stored in museums of non-member
countries, smaller museums, private collections, observations not stored
in collections, etc.). Nobody knows the real proportion between the "big"
collections' data and "small" data, but according to some calculations, the
"small" data constitute about 80% of all data. In other words, the "small"
data are in fact the "big" data. When will GBIF be able to gather the
precious, peer-reviewed small data from historical and recently published
literature, for example? The answer is "probably never", as long as we
continue to publish in an archaic way, such as paper/PDF, forgetting that
someone has to spend a huge effort to extract data from PDFs, put these
data into a database and then upload them to GBIF or anywhere else... while
PDFs continue to pile up every day!

Even ZooKeys, despite being integrated with the GBIF Integrated Publishing
Toolkit (IPT), which gives authors the option to publish the data
associated with a taxonomic revision, or to publish them in the form of
"data papers", still publishes a lot of non-marked-up occurrence records.
Why so? Again, because of the huge effort associated with markup,
especially of complex biodiversity data types, such as occurrence records
and morphological descriptions.

Fortunately, it looks like there is a light at the end of the tunnel, called
Biodiversity Data Journal
<http://biodiversitydatajournal.com/browse_articles>. Any kind of
occurrence records (and other types of small data) must be published in
both human-readable (HTML, PDF) and computer-readable (XML, Darwin Core and
DwC Archive) formats. It is a piece of cake to download or harvest data
published in this way and to make them not just "discoverable via open
access" or "linked to taxon names, mentioned in the same article", but
REUSABLE!
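
To give a rough idea of what "reusable" means in practice, here is a
minimal Python sketch of reading a tab-delimited occurrence table whose
column headers are Darwin Core terms. The sample row is assembled from the
specimen details quoted further down in this thread; it is not an actual
download from the journal or from GBIF, and the helper names
(read_occurrences, specimen_code) are made up for illustration.

```python
import csv
import io

# Illustrative tab-delimited occurrence table using Darwin Core term names as
# column headers. The single row is assembled from details quoted in this
# thread; it is NOT an actual download from any journal or from GBIF.
SAMPLE = (
    "institutionCode\tcatalogNumber\tscientificName\teventDate\n"
    "FMNH\t147942\tCrunomys suncoides\t1993-04-10\n"
)


def read_occurrences(text):
    """Parse tab-delimited occurrence text into a list of dicts (hypothetical helper)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def specimen_code(record):
    """Join institution code and catalogue number into one specimen code."""
    return f"{record['institutionCode']} {record['catalogNumber']}"


records = read_occurrences(SAMPLE)
for rec in records:
    print(specimen_code(rec), rec["scientificName"], rec["eventDate"])
```

No text mining, no OCR clean-up: once the records are published with named
fields, a few lines are enough to load them into a database or pass them on.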

Gold open access is a good thing, but it is no longer sufficient. We need
to switch to "platinum open access" publishing, which will eliminate the
costs of markup and make data easily available and reusable to all.


> Having spent a lot of time trying to extract content from BHL for projects
> such as BioStor and BioNames, the kinds of issues you raise for specimens
> sound all too familiar.
> BHL grabs physical things, scans them, associates whatever metadata
> library catalogues have, and puts them online. Simples. Ah, but then the fun
> starts. Locating articles (i.e., the things we actually cite) in BHL is
> sometimes straightforward, but often it is anything but. Journals can
> change names, may have multiple names (sometimes in multiple languages),
> concurrent or inconsistent volume and/or page numbering, etc. Notions that
> we take for granted today (that there are "articles" and that they have
> explicit titles) may not hold, and of course every taxonomist knows that
> determining the date of publication can be a challenge (as I'm sure Neal
> Evenhuis, among others, will testify). Much of the time I spend on BioNames
> consists of taking cryptic, often misleading (if not downright erroneous)
> citations to original descriptions and matching these to BHL (or other
> sources).
> My point is that I don't think there's a world of difference between the
> two problems. For all the issues that you document in "A specialist’s audit
> of aggregated occurrence records"
> http://dx.doi.org/10.3897/zookeys.293.5111 , I could probably find
> equivalent horror stories for bibliographic data.
> As you say, many of the basic elements of a GBIF occurrence are
> potentially contested, subject to uncertainty, error, etc. I guess it's for
> everyone to decide whether the trade-off involved in simplifying the data
> so it can be aggregated in bulk is worthwhile.
> One thing I'd like to see is GBIF occurrence data integrated with the
> literature, for example by linking specimens to their citation in the
> literature (another reason to play with BHL). If we can go from a specimen
> to the associated literature we could then track some of the issues you
> mention, such as different identifications, discussion of whether the
> collection locality is correct, etc.
> For a simple example, the specimen FMNH 147942 appears in at least three
> articles in BioStor (http://biostor.org/specimen/FMNH%20147942 ). Below
> are the three article links plus text extract around the specimen code:
> http://biostor.org/reference/81423
> Crunomys suncoides Rickart et al.,
> 1998. — Mindanao Island, Bukidnon Prov-
> ince, Mount Katanglad Range, 18.5 km S,
> 4 km E Camp Phillips, elev. 2,250 m,
> 8°9'30"N, 124°5rE, 1 male (FMNH
> 147942).
> http://biostor.org/reference/65896
> Crunomys suncoides Rickart, Heaney, Tabar-
> anza, and Balete, 1998
> The Kitanglad shrew-mouse is currently
> known only from the Kitanglad Range (Rickart
> et al., 1998), though we suspect that it is more
> widespread in mossy forest on Mindanao. The
> species was described based on a single adult
> male (FMNH 147942; 37 g) we captured in April
> 1993 in old-growth mossy forest at 2250 m (Site
> 6, Fig. 8). It had scrotal testes measuring 14 X
> 8 mm.
> http://biostor.org/reference/95679
> Crunomys suncoides, new species
> (Figs. 2, 4-9)
> HoLOTYPE — Adult male, fmnh 147942; collect-
> ed 10 April 1993 (original number 5330 of L. R.
> Heaney); initially fixed in formalin, now pre-
> served in ethyl alcohol with the skull removed
> and cleaned. The stomach and both femora have
> been removed; otherwise the specimen is in ex-
> cellent condition. It is deposited at fmnh but will
> be transferred to pnm.
> Each tells us something about the specimen (and more than GBIF does). So,
> what if we linked this information together so that GBIF users could learn
> more about that record?
> > Hi, Rod.
> >
> > 'So, if the argument is that GBIF should be looking beyond museum
> collections then I completely agree...'
> >
> > No, that's not the argument. Biodiversity data aren't like biodiversity
> books or papers, for which you can (in principle) generate a complete
> catalogue or index. Given such a catalogue or index, you can go further and
> digitise and make available on the Web all the content. Cool, yes? Anyone
> anywhere with access to the Web can view a biodiversity publication at the
> click of a mouse. This works because biodiversity publications are very
> well-defined objects which either exist or don't. BHL is hugely valuable
> and 'intrinsically' successful because the goal of digitising all
> biodiversity publications is achievable, in principle.
> >
> > GBIF is intrinsically unsuccessful because it treats occurrence records
> as very well-defined objects, which they aren't. Each record is instead an
> entry point into an investigation (minimally) of the identity of the
> organism(s) observed, of the location of the observation, of the timing of
> the observation, of the observer and of the fate of any specimen(s) or
> images which are hard evidence for the observation. I say 'minimally'
> because the museum records that wind up in GBIF often have more than these
> basics in their 'pre-GBIF' form, and are sometimes only condensed versions
> of even more information available elsewhere. You don't get that from GBIF.
> >
> > Records aren't open-ended, but some users will go much further with them
> than other users. GBIF best suits users who accept the data as-is and can
> find trivial purposes for which those untested, sparse data are 'fit'.
> >
> > The argument that GBIF in fact suits everyone - because it lets everyone
> know where to find out more - fails because GBIF is a lousy index. It
> contains lots of errors, it's taxonomically, geographically, ecologically
> and 'literature-wise' grossly incomplete, and for many biodiversity studies
> (see Meier and Dikow) you're better off starting with your own plan of
> attack and chasing sources independently.
> >
> > It would be possible to rebuild GBIF from scratch as the thing its title
> suggests (an information facility), namely a 'meta' resource that points to
> and introduces data sources, but I don't think that's going to happen,
> because it's too hard. GBIF has taken an easier approach and has been
> accumulating records as though they were coins, and measuring its
> usefulness by counting its 'wealth' of records, so that if it has twice as
> many records it must be twice as useful, right? Other people in this thread
> have pointed out how raw counts are meaningless for assessing usefulness.
> Here I just wanted to say that what works for BHL doesn't work for GBIF,
> because the items being made Web-available are inherently different.
