[Taxacom] OCR and A new way to view taxonomic publications

Tony.Rees at csiro.au Tony.Rees at csiro.au
Sat Jun 22 23:10:41 CDT 2013


Hi Vratislav,

Have you tried the default OCR that comes with Adobe Acrobat? It is supposed to be slightly inferior to the best of OmniPage etc., but I have found it quite good for simply formatted text (depending, of course, on the quality of the original as well). It might be interesting to know how it performs with a standard corpus such as yours.

Regards - Tony Rees




> -----Original Message-----
> From: taxacom-bounces at mailman.nhm.ku.edu [mailto:taxacom-
> bounces at mailman.nhm.ku.edu] On Behalf Of Ricardo
> Sent: Sunday, 23 June 2013 11:47 AM
> To: taxacom at mailman.nhm.ku.edu
> Subject: Re: [Taxacom] OCR and A new way to view taxonomic publications
> 
> Hi,
> 
> Good OCR is still a big problem, even though the OmniPage software is
> getting better and better because it is based on built-in dictionaries.
> 
> 
> 
> I remember the times when PC / computer exhibitions were big events,
> when not only large companies such as Microsoft, Adobe, WordStar and
> WordPerfect were present with large stands, but even small software
> groups were represented at every exhibition.
> 
> 
> 
> I used to carry a book with me, A. Winkler's Catalogus Coleopterorum
> Regionis Palaearcticae, and visit the OCR software stands, where the
> companies showed how accurate, magnificent and perfect their software
> was.
> 
> So I asked them to try their OCR on my book, on just any page (which is
> just a list of names, for example: 1148 tenuicollis Rossi 90    Med.Ca.),
> and every time it turned out to be a big bummer, because the accuracy of
> the results was only around 65%.
> 
> 
> 
> Regards
> 
> Vratislav
> 
> www.coleoptera.org <http://www.coleoptera.org/>
> 
> 
> 
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: taxacom-bounces at mailman.nhm.ku.edu
> [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Roderic Page
> Sent: Saturday, 22 June 2013 3:48 PM
> To: Donat Agosti
> Cc: <taxacom at mailman.nhm.ku.edu>; David.King
> Subject: Re: [Taxacom] A new way to view taxonomic publications
> 
> 
> 
> Hi Donat,
> 
> 
> 
> My sense was that you were saying the accuracy of OCR is too low, and
> therefore you can't do what you want to do. But there may be various ways
> to get to your goal. For example, one way to extract articles from BHL is
> to attempt to get that information from the OCR text itself; this has met
> with limited success. The approach I've taken is to match external
> bibliographic data to BHL bibliographic data and OCR text, and locate
> articles that way. This works pretty well. So we get to the goal (article
> level resolution) despite variable OCR quality.
> 
> 
> 
> Likewise BioNames uses a bunch of strategies (bibliographic matching,
> external identifiers, existing indexes to taxonomic literature, searching
> BHL OCR, etc.) to build a mapping between names and the literature. The
> result is (I believe) the largest available mapping between animal names
> and the digital literature.
> 
> 
> 
> My point is that if you set a threshold below which OCR quality is too
> low, you may miss out on important data, data which you could recover if
> you tackle the problem using a variety of tools. There are lots of things
> we could do to tackle the kind of quality issues we face, and again we've
> barely begun to explore the tools available.
> 
> 
> 
> Regards
> 
> 
> 
> Rod
> 
> 
> 
> Sent from my iPhone
> 
> 
> 
> On 22 Jun 2013, at 05:59, Donat Agosti <agosti at amnh.org> wrote:
> 
> 
> 
> > Hi Rod
> 
> >
> 
> > There is not One World - there are different uses. For BHL, I am happy
> > that they scan everything (and unhappy that the pace is slowing down),
> > and the only bar I would impose is to get a scan standard that is
> > widely used in the digitization world. All the rest can be done later.
> 
> >
> 
> > The same goes for extracting names: get whatever you can, make it open
> > and assume that somebody will pick it up and do things with it, just as
> > you do for BHL what BHL doesn't do itself.
> 
> >
> 
> > You have a use case that requires certain specs (such as digital
> > content from BHL).
> 
> >
> 
> > We have a different one that requires OCR at a different accuracy rate.
> 
> >
> 
> > I do not say do one or the other. I say make products that can be
> > used, give anybody a chance to develop their own ideas to deal with
> > this new digital resource and, in a sense more importantly, realize the
> > potential of the digital resources for scientific study, and feed back
> > what shape they have to be in to be useful.
> 
> >
> 
> > I don't believe there is only one way to progress. On the contrary,
> > there are many, and by no means should we compete in the sense of
> > saying that one is better than the other. I admire big data, but I also
> > admire being able to read Biologia Centrali-Americana with all the
> > links to external resources, and being able to ask questions like where
> > the red things live.
> 
> >
> 
> > Finally, my conclusion is not to OCR everything to a certain accuracy
> > rate, but to pursue a strategy that avoids doing so by providing
> > content in a form that others can do things with, as you did with
> > ZooKeys in your demo that started this thread. My strategy, if you want
> > to call it that, is to be guided by ongoing research (e.g. a revision
> > or monograph, a biogeographic analysis), which will also define what
> > content needs to be refined, and to what level, to be used and brought
> > into the digital space. That also includes tools to find the content in
> > the first place in ever more efficient ways, sometimes with no effort,
> > sometimes with a bit of effort that is dwarfed by what is gained
> > through this new access.
> 
> >
> 
> >
> 
> > Donat
> 
> >
> 
> > -----Original Message-----
> 
> > From: Roderic Page [mailto:r.page at bio.gla.ac.uk]
> 
> > Sent: Saturday, June 22, 2013 8:28 AM
> 
> > To: Donat Agosti
> 
> > Cc: David.King; <taxacom at mailman.nhm.ku.edu>
> 
> > Subject: Re: [Taxacom] A new way to view taxonomic publications
> 
> >
> 
> > Hi Donat,
> 
> >
> 
> > Sent from my iPhone
> 
> >
> 
> > On 22 Jun 2013, at 03:29, Donat Agosti <agosti at amnh.org> wrote:
> 
> >
> 
> >> For my purpose I want to have an OCR accuracy rate between 99.9 and
> >> 99.99%
> 
> >
> 
> > So this is the crux of the problem. You set a very high bar that BHL
> > will struggle to meet in a lot of cases. This then sets limits on what
> > you can achieve.
> 
> >
> 
> > An alternative is to accept that things will be messier than that, and
> > set your expectations appropriately. Plus we can think about ways to
> > cope with messy text. It strikes me that there is a misplaced obsession
> > with "clean" data that gets in the way of making progress. You want the
> > world to be one way, but it's the other way.
> 
> >
> 
> > Regards
> 
> >
> 
> > Rod
> 
> >
> 
> >
> 
> 
> 
> _______________________________________________
> 
> Taxacom Mailing List
> 
> Taxacom at mailman.nhm.ku.edu
> 
> http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
> 
> 
> 
> The Taxacom Archive back to 1992 may be searched with either of these
> methods:
> 
> 
> 
> (1) by visiting http://taxacom.markmail.org
> 
> 
> 
> (2) a Google search specified as:
> site:mailman.nhm.ku.edu/pipermail/taxacom
> your search terms here
> 
> 
> 
> Celebrating 26 years of Taxacom in 2013.
> 



