[Taxacom] OCR and A new way to view taxonomic publications
bayshark at bigfoot.com
Sat Jun 22 20:47:07 CDT 2013
Good OCR is still a big problem, even though the OmniPage software keeps
getting better because it is based on built-in dictionaries.
I remember the times when PC / computer exhibitions were big events, when
not only large companies such as Microsoft, Adobe, WordStar and WordPerfect
were present with large stands, but even small software groups were
represented at every exhibition.
I carried a book with me, A. Winkler's Catalogus Coleopterorum Regionis
Palaearcticae, and visited the OCR software stands, where the OCR companies
showed how accurate, magnificent and perfect their software was.
So I asked them to try their OCR on my book, on just any page (which is
simply a list of names, for example > 1148 tenuicollis Rossi 90 Med.Ca.),
and every time it turned out to be a big bummer because the accuracy of the
results was only around 65%.
From: taxacom-bounces at mailman.nhm.ku.edu
[mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Roderic Page
Sent: Saturday, 22 June 2013 3:48 PM
To: Donat Agosti
Cc: <taxacom at mailman.nhm.ku.edu>; David.King
Subject: Re: [Taxacom] A new way to view taxonomic publications
My sense was that you were saying the accuracy of OCR is too low, therefore
you can't do what you want to do. But there may be various ways to get to
your goal. For example, one way to extract articles from BHL is to attempt
to get that information from the OCR text itself, but this has met with limited
success. The approach I've taken is to match external bibliographic data to
BHL bibliographic data and OCR text, and locate articles that way. This
works pretty well. So we get to the goal (article level resolution) despite
variable OCR quality.
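
As a rough illustration of the matching idea described above (not the actual
BHL or BioNames code), here is a minimal Python sketch. It assumes we already
have an article title from an external bibliographic source and a list of OCR
page texts, and it looks for the page whose text best matches the title even
when the OCR contains errors. The function names, the 0.8 threshold and the
sample data are all hypothetical.

```python
from difflib import SequenceMatcher
import re


def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def best_window_score(title, page_text):
    """Best similarity between the title and any title-length window of the page."""
    title, page = normalize(title), normalize(page_text)
    if not title or not page:
        return 0.0
    if len(page) <= len(title):
        return SequenceMatcher(None, title, page, autojunk=False).ratio()
    # Coarse step keeps the scan cheap; a step of 1 would be more thorough.
    step = max(1, len(title) // 4)
    return max(
        SequenceMatcher(None, title, page[i:i + len(title)], autojunk=False).ratio()
        for i in range(0, len(page) - len(title) + 1, step)
    )


def find_article_start(title, pages, threshold=0.8):
    """Return the index of the best-matching page, or None if nothing is close."""
    scores = [best_window_score(title, p) for p in pages]
    if not scores:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


if __name__ == "__main__":
    # Made-up OCR pages with typical character errors ("1" for "i").
    pages = [
        "Plate IV. Explanation of figures.",
        "A rev1sion of the ant genus Pheidole in Central Amer1ca. By J. Smith.",
        "Index",
    ]
    print(find_article_start(
        "A revision of the ant genus Pheidole in Central America", pages))
```

The point of the sketch is that the external title acts as a clean reference,
so the OCR only has to be good enough to be recognizable, not error-free.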
Likewise BioNames uses a bunch of strategies (bibliographic matching,
external identifiers, existing indexes to taxonomic literature, searching
BHL OCR, etc.) to build a mapping between names and the literature. The
result is (I believe) the largest available mapping between animal names and
the digital literature.
My point is that if you set a threshold below which OCR quality is too low
you may miss out on important data, data which you could recover if you
tackle the problem using a variety of tools. There are lots of things we
could do to tackle the kind of quality issues we face and again we've barely
begun to explore the tools available.
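
One simple example of such a tool, offered only as a sketch: fuzzy matching
of OCR tokens against a list of known names, so that a slightly garbled name
can still be recovered. The code below uses Python's standard difflib module.
Only "tenuicollis" comes from this thread; the other epithets, the function
name and the 0.85 cutoff are illustrative assumptions.

```python
from difflib import get_close_matches

# Illustrative vocabulary; in practice this would come from a name index.
KNOWN_EPITHETS = ["tenuicollis", "tenuicornis", "laticollis", "ruficollis"]


def correct_token(token, vocabulary=KNOWN_EPITHETS, cutoff=0.85):
    """Return the closest known name for an OCR token, or None if none is close."""
    matches = get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # "tenuicoilis" simulates an OCR error ("i" read instead of "l").
    print(correct_token("tenuicoilis"))
    print(correct_token("xyzzy"))
```

A cutoff this high trades recall for safety: closely related epithets such as
"tenuicornis" are not confused with "tenuicollis", at the cost of missing
tokens that are more badly damaged.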
Sent from my iPhone
On 22 Jun 2013, at 05:59, Donat Agosti <agosti at amnh.org> wrote:
> Hi Rod
> There is not One World - there are different uses. For BHL, I am happy
that they scan everything (and unhappy that the pace is slowing down), and
the only bar I would impose is a scan standard that is widely used in
the digitization world. All the rest can be done later.
> The same goes for extracting names: get whatever you can, make it open and
assume that somebody will pick it up and do things with it, just as you do
for BHL what BHL doesn't do itself.
> You have a use case that requires certain specs (such as digital content
> We have a different one that requires OCR at a different accuracy rate.
> I do not say do one or the other. I say make products that can be used,
give anybody a chance to develop their own ideas for dealing with this new
digital resource and, in a sense more importantly, realize the potential of
the digital resources for scientific study, and feed back what shape they
have to be in to be useful.
> I don't believe that there is only one way to progress. On the
contrary, there are many, and by no means should we compete in the sense of
saying that one is better than the other. I admire big data, but I also admire
being able to read Biologia Centrali-Americana with all the links to
external resources, and being able to ask questions like where are the red
> Finally, my conclusion is not to OCR everything to a certain accuracy
rate, but to drive a strategy that avoids doing this by providing content in
a form that others can do things with, like you did with ZooKeys in your demo
that started this thread. My strategy, if you want to call it that, is to be
guided by ongoing research (e.g. a revision or monograph, a biogeographic
analysis) that will also define what content needs to be refined, and to what
level, to be used and brought into the digital space. That also includes
tools to find the content in the first place in ever more efficient ways,
sometimes with no effort, sometimes with a bit of effort that is dwarfed by
what is gained through this new access.
> -----Original Message-----
> From: Roderic Page [mailto:r.page at bio.gla.ac.uk]
> Sent: Saturday, June 22, 2013 8:28 AM
> To: Donat Agosti
> Cc: David.King; <taxacom at mailman.nhm.ku.edu>
> Subject: Re: [Taxacom] A new way to view taxonomic publications
> Hi Donat,
> Sent from my iPhone
> On 22 Jun 2013, at 03:29, Donat Agosti <agosti at amnh.org> wrote:
>> For my purpose I want to have an OCR accuracy rate between 99.9 and 99.99%
> So this is the crux of the problem. You set a very high bar that BHL will
struggle to meet in a lot of cases. This then sets limits on what you can
> An alternative is to accept that things will be messier than that, and set
your expectations appropriately. Plus we can think about ways to cope with
messy text. It strikes me that there is a misplaced obsession with "clean"
data that gets in the way of making progress. You want the world to be one
way, but it's the other way.