[Taxacom] A new way to view taxonomic publications

Donat Agosti agosti at amnh.org
Fri Jun 21 23:59:26 CDT 2013

Hi Rod

There is not one world; there are different uses. For BHL, I am happy that they scan everything (and unhappy that the pace is slowing down), and the only bar I would impose is a scanning standard that is widely used in the digitization world. All the rest can be done later.

The same goes for extracting names: get whatever you can, make it open, and assume that somebody will pick it up and do things with it, just as you do for BHL what BHL doesn't do itself.

You have a use case that requires certain specs (such as digital content from BHL).

We have a different one that requires OCR at a different accuracy rate.

I am not saying do one or the other. I am saying: make products that can be used, and give anybody a chance to develop their own ideas for dealing with this new digital resource. In a sense more importantly, realize the potential of digital resources for scientific study, and feed back what shape they need to be in to be useful.

I don't believe that there is only one way to progress; on the contrary, there are many, and by no means should we compete in the sense of saying that one is better than the other. I admire big data, but I also admire being able to read Biologia Centrali Americana with all the links to external resources, and being able to ask questions like "where do the red things live?"

Finally, my conclusion is not to OCR everything to a certain accuracy rate, but to pursue a strategy that avoids doing this by providing content in a form that others can do things with, as you did with Zookeys in your demo that started this thread. My strategy, if you want to call it that, is to be guided by ongoing research (e.g. a revision or monograph, a biogeographic analysis), which will also define what content needs to be refined, and to what level, to be used and brought into the digital space. That also includes tools to find the content in the first place in ever more efficient ways, sometimes with no effort, sometimes with a bit of effort that is dwarfed by what is gained through this new access.


-----Original Message-----
From: Roderic Page [mailto:r.page at bio.gla.ac.uk] 
Sent: Saturday, June 22, 2013 8:28 AM
To: Donat Agosti
Cc: David.King; <taxacom at mailman.nhm.ku.edu>
Subject: Re: [Taxacom] A new way to view taxonomic publications

Hi Donat,

On 22 Jun 2013, at 03:29, Donat Agosti <agosti at amnh.org> wrote:

> For my purpose I want to have an OCR accuracy rate between 99.9% and 99.99%

So this is the crux of the problem. You set a very high bar that BHL will struggle to meet in a lot of cases. This then sets limits on what you can achieve.

An alternative is to accept that things will be messier than that, and set your expectations appropriately. Plus, we can think about ways to cope with messy text. It strikes me that there is a misplaced obsession with "clean" data that gets in the way of making progress. You want the world to be one way, but it's the other way.


