[Taxacom] A new way to view taxonomic publications

Richard Pyle deepreef at bishopmuseum.org
Fri Jun 21 23:15:22 CDT 2013

This is why the model of "clean bucket" and "dirty bucket" makes the most
sense.

Rod (through Bionames, etc.), as well as other efforts such as GNI (for
taxon names) and RefBank (for literature citations) and -- to some extent --
GBIF (for specimen & occurrence data) are focused on the "dirty bucket"
side.  In this context, "dirty" is not derogatory -- it's an accurate
characterization.  The advantage of the "dirty bucket" approach is that it
harnesses the power of very large & comprehensive datasets, which get you
most of the answer you need in most cases; it just requires some
on-the-fly cleanup/filtration by the end user, and the associated caveat
emptor disclaimers.  You can do some very powerful things with dirty
buckets: generate very large datasets relatively quickly and easily, and
get first-approximation results via algorithms.

Donat (through Plazi), as well as other efforts such as GNUB, ORCID, IPNI,
Index Fungorum, Tropicos, various nomenclators (Catalog of Fishes, Diptera
database, Hymenoptera Name Server, etc.), and others are focused on
"clean buckets". These represent much more limited volumes of data, but much
higher quality of information.  They're rarely free of error, but they
strive towards correcting those errors and trending the datasets to ever
"cleaner" states.  It's much, much harder to generate a large "clean bucket"
dataset, but you can do MUCH more powerful things with it once it's
in place.

For the past several years, it has become increasingly clear to me that the
pathway we should all be focused on is the pathway that includes both the
dirty buckets and the clean buckets, and the services that bridge the two.
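
As a rough illustration of what a bridging service might do, here is a
minimal sketch that reconciles "dirty bucket" name strings against a small
"clean bucket" list using fuzzy string matching. The name list, the function
name `reconcile`, and the cutoff value are all assumptions made up for this
example; real services would be far more involved.

```python
from difflib import get_close_matches

# Hypothetical "clean bucket": a tiny curated list of names (illustrative only).
CLEAN_NAMES = ["Chromis abyssus", "Centropyge boylei", "Apolemichthys arcuatus"]


def reconcile(dirty_name, clean_names=CLEAN_NAMES, cutoff=0.85):
    """Return the closest curated name for a dirty string, or None.

    The cutoff is an arbitrary similarity threshold chosen for illustration.
    """
    # Map lower-cased curated names back to their canonical spelling.
    lookup = {name.lower(): name for name in clean_names}
    # Normalize case and collapse runs of whitespace in the dirty string.
    normalized = " ".join(dirty_name.lower().split())
    matches = get_close_matches(normalized, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


if __name__ == "__main__":
    for raw in ["Chromis  abysus", "centropyge boylei", "Homo sapiens"]:
        print(repr(raw), "->", reconcile(raw))
```

Strings that fall below the cutoff stay unmatched in the dirty bucket, where
they can be flagged for later human review.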

If you watch the segment from 05:29-09:45 on this video:
http://www.youtube.com/watch?v=PSzL2NwRemU -- you can see a bit of what I'm
on about here.  Though this emphasizes taxon names, the same principle
applies to authors, literature citations, specimen data, and others.

We're wasting our time arguing about apples and oranges (dirty bucket data
and clean bucket data).  We should be focused on how we can use each to
empower the other.


> -----Original Message-----
> From: taxacom-bounces at mailman.nhm.ku.edu [mailto:taxacom-
> bounces at mailman.nhm.ku.edu] On Behalf Of Roderic Page
> Sent: Friday, June 21, 2013 5:58 PM
> To: Donat Agosti
> Cc: <taxacom at mailman.nhm.ku.edu>; David.King
> Subject: Re: [Taxacom] A new way to view taxonomic publications
> Hi Donat,
> On 22 Jun 2013, at 03:29, Donat Agosti <agosti at amnh.org> wrote:
> > For my purpose I want to have an OCR accuracy rate between 99.9 and
> > 99.99%
> So this is the crux of the problem. You set a very high bar that BHL will
> struggle to meet in a lot of cases. This then sets limits on what you can
> achieve.
> An alternative is to accept that things will be messier than that, and set
> expectations appropriately. Plus we can think about ways to cope with
> text. It strikes me that there is a misplaced obsession with "clean" data
> that gets in the way of making progress. You want the world to be one way, but
> it's the other way.
> Regards
> Rod
