[Taxacom] A new way to view taxonomic publications

Roderic Page r.page at bio.gla.ac.uk
Sat Jun 22 00:47:56 CDT 2013


Hi Donat,

My sense was that you were saying the accuracy of OCR is too low, and therefore you can't do what you want to do. But there may be various ways to get to your goal. For example, one way to extract articles from BHL is to attempt to get that information from the OCR text itself; this has met with limited success. The approach I've taken is to match external bibliographic data to BHL bibliographic data and OCR text, and locate articles that way. This works pretty well. So we get to the goal (article-level resolution) despite variable OCR quality.
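
A minimal sketch of that matching idea, not the actual BHL or BioNames code: take an article title from an external bibliographic source and look for the page whose OCR text contains the closest fuzzy match. The function names, the 0.8 threshold, and the sample pages are all hypothetical, chosen only for illustration; the similarity measure is Python's standard difflib, which may well differ from what is used in practice.

```python
import re
from difflib import SequenceMatcher


def normalise(text):
    """Lowercase and collapse everything except letters and digits to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def best_window_score(title, page_text):
    """Best similarity between the title and any same-length word window on the page."""
    title_words = normalise(title).split()
    page_words = normalise(page_text).split()
    n = len(title_words)
    if n == 0 or not page_words:
        return 0.0
    target = " ".join(title_words)
    best = 0.0
    # Slide a window of n words over the page; short pages get one window.
    for i in range(max(1, len(page_words) - n + 1)):
        window = " ".join(page_words[i:i + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def locate_article(title, pages, threshold=0.8):
    """Return (page_index, score) for the best-matching page, or None if below threshold."""
    scored = [(idx, best_window_score(title, text)) for idx, text in enumerate(pages)]
    if not scored:
        return None
    idx, score = max(scored, key=lambda pair: pair[1])
    return (idx, score) if score >= threshold else None


if __name__ == "__main__":
    # Hypothetical OCR pages; page 1 contains typical OCR errors ("revisi0n", "genns").
    pages = [
        "PROCEEDINGS OF THE SOCIETY. Minutes of the annual meeting held in March.",
        "A revisi0n of the ant genns Pheidole in the Neotropics. By A. Author. The genus",
        "Index to volume 12.",
    ]
    print(locate_article("A revision of the ant genus Pheidole in the Neotropics", pages))
```

The point of the sketch is that the title is still found on the right page even though the OCR has mangled two of its words, because the known title from the external record does the work that the OCR alone cannot.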

Likewise BioNames uses a bunch of strategies (bibliographic matching, external identifiers, existing indexes to taxonomic literature, searching BHL OCR, etc.) to build a mapping between names and the literature. The result is (I believe) the largest available mapping between animal names and the digital literature.

My point is that if you set a threshold below which OCR quality is too low, you may miss out on important data, data which you could recover if you tackle the problem using a variety of tools. There are lots of things we could do to tackle the kind of quality issues we face, and again we've barely begun to explore the tools available.
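
To put rough numbers on what such a threshold means, here is a back-of-envelope sketch. The 99.9% and 99.99% figures are the ones quoted further down this thread; the 3000-character page, the 20-character name, and the assumption that character errors are independent are illustrative assumptions only, and real OCR errors tend to cluster.

```python
def expected_errors(char_accuracy, n_chars):
    """Expected number of wrong characters in a text of n_chars characters."""
    return (1.0 - char_accuracy) * n_chars


def intact_probability(char_accuracy, length):
    """Chance that a string of the given length has no errors (independent errors assumed)."""
    return char_accuracy ** length


if __name__ == "__main__":
    page_chars = 3000   # assumed size of a printed page
    name_chars = 20     # assumed length of a scientific name
    for accuracy in (0.999, 0.9999):
        errors = expected_errors(accuracy, page_chars)
        intact = intact_probability(accuracy, name_chars)
        print(f"accuracy {accuracy:.2%}: about {errors:.1f} errors per page, "
              f"name intact with probability {intact:.4f}")
```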

Regards 

Rod

Sent from my iPhone

On 22 Jun 2013, at 05:59, Donat Agosti <agosti at amnh.org> wrote:

> Hi Rod
> 
> There is not One World - there are different uses. For BHL, I am happy that they scan everything (and unhappy that the pace is slowing down), and the only bar I would impose is a scan standard that is widely used in the digitization world. All the rest can be done later.
> 
> The same for extracting names: get whatever you can, make it open, and assume that somebody will pick it up and do things with it, as you do for BHL what BHL doesn't do itself.
> 
> You have a use case that requires certain specs (such as digital content from BHL).
> 
> We have a different one that requires OCR at a different accuracy rate.
> 
> I do not say do one or the other. I say make products that can be used, give anybody a chance to develop their own ideas for dealing with this new digital resource and, in a sense more importantly, realize the potential of the digital resources for scientific study, and feed back what shape they need to be in to be useful.
> 
> I don't believe there is only one way to progress; on the contrary, there are many, and by no means should we compete in the sense of saying that one is better than the other. I admire big data, but I also admire being able to read Biologia Centrali-Americana with all the links to external resources, and being able to ask questions like "where are the red things living?"
> 
> Finally, my conclusion is not to OCR everything to a certain accuracy rate, but to drive a strategy that avoids doing this by providing content in a form that others can do things with, like you did with ZooKeys in your demo that started this thread. My strategy, if you want to call it that, is to be guided by ongoing research (e.g. a revision or monograph, a biogeographic analysis) that will also define what content you need to refine, and to what level, for it to be used and brought into the digital space. That also includes tools to find the content in the first place in ever more efficient ways, sometimes with no effort, sometimes with a bit of effort that is dwarfed by what is gained through this new access.
> 
> 
> Donat
> 
> -----Original Message-----
> From: Roderic Page [mailto:r.page at bio.gla.ac.uk] 
> Sent: Saturday, June 22, 2013 8:28 AM
> To: Donat Agosti
> Cc: David.King; <taxacom at mailman.nhm.ku.edu>
> Subject: Re: [Taxacom] A new way to view taxonomic publications
> 
> Hi Donat,
> 
> Sent from my iPhone
> 
> On 22 Jun 2013, at 03:29, Donat Agosti <agosti at amnh.org> wrote:
> 
>> For my purpose I want to have an OCR accuracy rate between 99.9 and 99.99%
> 
> So this is the crux of the problem. You set a very high bar that BHL will struggle to meet in a lot of cases. This then sets limits on what you can achieve.
> 
> An alternative is to accept that things will be messier than that, and set your expectations appropriately. Plus we can think about ways to cope with messy text. It strikes me that there is a misplaced obsession with "clean" data that gets in the way of making progress. You want the world to be one way, but it's the other way.
> 
> Regards
> 
> Rod
> 
> 



