[Taxacom] A new way to view taxonomic publications
agosti at amnh.org
Fri Jun 21 21:29:49 CDT 2013
There is some contradiction between "In general OCR is very good" and "Sadly OCR does struggle in two very useful sections of a document". The experience we are documenting right now in Pro-iBiosphere points to the latter.
What OCR accuracy rate did you find? Is it 99%, or 99.99%?
What is the error rate for getting two-column text properly resolved?
What is the error rate for tables?
What is the error rate for microcitations?
What is the error rate for scientific names or morphological terms?
How does this vary over time?
Rob is right to point out that we need stats.
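As a rough illustration of the kind of statistic asked for above, character accuracy can be estimated by comparing OCR output against a hand-corrected transcription of the same text and counting the edits needed to turn one into the other. This is only a minimal sketch, not the method used in Pro-iBiosphere or ViBRANT; the function names and sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Return 1 - (edit distance / ground-truth length).

    The value can drop below 0 if the OCR text is much longer than the
    ground truth, so treat it as an estimate rather than a true percentage.
    """
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return 1.0 - edit_distance(ground_truth, ocr_text) / len(ground_truth)


# Hypothetical example: a ligature at the end of a name misread as "ce".
truth = "Formicidae"
ocr = "Formicidce"
print(f"accuracy: {character_accuracy(truth, ocr):.2%}")
```

Running the same comparison separately on body text, tables, microcitations and scientific names would give the per-section error rates asked about above.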
For my purpose I want an OCR accuracy rate between 99.9 and 99.99%, that is approximately 1 error per page, with the tables properly resolved and other structural issues not interfering with text flow. This means we spend about 2 min per page cleaning OCR artifacts (http://wiki.tdwg.org/twiki/pub/.../Agosti_Tdwg_literature_interoperability.ppt ). Our purpose is to reuse the text for display to humans and for machine harvesting of characters, and eventually to republish some of it. This pretty much defines my goals and requirements. And I want to contribute to HNS (ca. >200,000 names) and to Zoobank's list of "clean" names by adding "clean" names linked to either the page of occurrence or the treatment.
I don't want to deal with a large number of (additional) artifacts such as misspelled terms (see Wei et al., http://tinyurl.com/mgvg6bq ), because we don't care about the OCR. That doesn't mean others have to care or have similar requirements, if the results are good enough for their purpose. And we even use those results to find content in BHL that we can use in our work.
Names and treatments are quasi-legal documents, so I will find out how lawyers handle the conversion of a similarly huge corpus of legacy literature, a case similar to ours: a case number pointing to a case has to be right, because otherwise you can't find the cited case, especially by machine. And that is the ultimate goal.
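To make the "1 error per page" figure concrete, here is a small back-of-the-envelope calculation. The page length of 3,000 characters is an assumption chosen for illustration, not a measured value, and the function name is hypothetical.

```python
# Assumed average number of characters on a printed page (illustrative only).
CHARS_PER_PAGE = 3000


def errors_per_page(accuracy: float, chars_per_page: int = CHARS_PER_PAGE) -> float:
    """Expected number of wrong characters on a page at a given accuracy."""
    return (1.0 - accuracy) * chars_per_page


for acc in (0.99, 0.999, 0.9999):
    print(f"{acc:.2%} accuracy -> {errors_per_page(acc):.1f} errors per page")
```

Under that assumption, 99% accuracy leaves about 30 errors per page, 99.9% about 3, and 99.99% about 0.3, so the 99.9 to 99.99% range brackets roughly one error per page.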
From: taxacom-bounces at mailman.nhm.ku.edu [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of David.King
Sent: Friday, June 21, 2013 8:20 PM
To: taxacom at mailman.nhm.ku.edu
Subject: Re: [Taxacom] A new way to view taxonomic publications
I'm with Rod: BHL OCR can be excellent. In ViBRANT we tried to replicate Qin Wei's work with BHL to re-assess how 'bad' the OCR is, but just couldn't get the same poor-quality results.
In general OCR is very good with body text, where there are enough cues for the software to work on. Errors still creep in, particularly with special characters. For example, English OCR software really doesn't like Latin ligatures and tends to mangle the end of taxon names even though the other characters in the name are accurately identified. Another common problem arises from male and female symbols in text, made worse because the symbols are normally found in the middle of a very terse description, often full of abbreviations and so devoid of cues for non-specialist software to follow.
Indeed, we gave up on some of our experimental ViBRANT work using parts-of-speech tagging to identify anomalies in OCR text because the OCR was not bad enough.
Sadly OCR does struggle in two very useful sections of a document: the table of contents and the index. Part of the problem lies with a page full of 'funny' words not in the software's dictionary ;-) Then there are other problems, usually to do with layout, such as non-aligned columns and leading lines, which reduce OCR accuracy.
> Date: Thu, 20 Jun 2013 14:21:48 +0100
> From: Roderic Page <r.page at bio.gla.ac.uk>
> Subject: Re: [Taxacom] A new way to view taxonomic publications
> To: Donat Agosti <agosti at amnh.org>
> Cc: taxacom taxacom <taxacom at mailman.nhm.ku.edu>
> Message-ID: <3D62E585-50CF-451A-BD62-5CCAE0D779B6 at bio.gla.ac.uk>
> Content-Type: text/plain; charset=windows-1252
> Again, this is repeating myths. The OCR in BHL ranges from excellent in places to crappy in places. If it was uniformly bad we couldn't have indexed it for names, nor would http://biostor.org be possible. The quality is variable, but we can quantify this. We don't control the original OCR, but we can always redo bits if we need to.
> Indexing OCR documents is a well known problem, and there is a wealth of literature on various techniques that can be used (see http://www.mendeley.com/groups/752871/ocr-optical-character-recognition/ for an introduction to the literature). Why do we simply say OMG the BHL OCR is bad? Why not be scientific, quantify its quality, and exploit the existing technology to improve things?
> I am constantly flummoxed by our community's assumption that it knows what is possible, and what the limit of the state of the art is outside its domain. We have barely scratched the surface of what is possible.
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK
Email: r.page at bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
ORCID id: http://orcid.org/0000-0002-7101-9767
The Open University is incorporated by Royal Charter (RC 000391), an exempt charity in England & Wales and a charity registered in Scotland (SC 038302).
Taxacom Mailing List
Taxacom at mailman.nhm.ku.edu
The Taxacom Archive back to 1992 may be searched with either of these methods:
(1) by visiting http://taxacom.markmail.org
(2) a Google search specified as: site:mailman.nhm.ku.edu/pipermail/taxacom your search terms here
Celebrating 26 years of Taxacom in 2013.