[Taxacom] A new way to view taxonomic publications
David.King at open.ac.uk
Fri Jun 21 10:49:57 CDT 2013
I'm with Rod: BHL OCR can be excellent. In ViBRANT we tried to replicate Qin Wei's work with BHL to re-assess how 'bad' the OCR is, but we just couldn't get the same poor-quality results.
In general, OCR is very good with body text, where there are enough cues for the software to work on. Errors still creep in, particularly with special characters. For example, English OCR software really doesn't like Latin ligatures and tends to mangle the end of taxon names even though the other characters in the name are accurately identified. Another common problem arises from male and female symbols in text, made worse because the symbols are normally found in the middle of a very terse description, often full of abbreviations, and so devoid of cues for non-specialist software to follow.
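As a rough illustration of the ligature point, here is a minimal Python sketch that expands Latin ligatures to plain letters before a name is looked up. It only helps when the OCR engine did emit the ligature character; it cannot repair an ending that was already mangled. The mapping table, the function name and the example name "Formicidæ" are my own assumptions for illustration, not anything taken from BHL or ViBRANT.

```python
# Hypothetical ligature table; extend it for the corpus at hand.
# Capital ligatures map to "Ae"/"Oe", which suits a capitalised genus
# name but not text set entirely in capitals.
LIGATURES = {
    "æ": "ae",
    "Æ": "Ae",
    "œ": "oe",
    "Œ": "Oe",
    "ﬁ": "fi",
    "ﬂ": "fl",
}

# Build the translation table once; str.translate accepts string values.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its plain-letter spelling."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("Formicidæ"))
```

Expanding ligatures before matching means a name index only needs to store one spelling of each name.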
Indeed, we gave up on some of our experimental ViBRANT work using parts-of-speech tagging to identify anomalies in OCR text because the OCR was not bad enough.
Sadly, OCR does struggle in two very useful sections of a document: the table of contents and the index. Part of the problem lies with a page full of 'funny' words not in the software's dictionary ;-) Then there are other problems, usually to do with layout, such as non-aligned columns and leading lines, which reduce the OCR accuracy.
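To make the dictionary point concrete, here is a small Python sketch that scores a page by the fraction of its words found in a word list. It is not how any particular OCR engine works internally; the function name, the tiny vocabulary and the sample strings are all made up for illustration. The idea is simply that a page of running text scores high while an index page of taxon names scores low, even if every character was recognised correctly.

```python
import re


def dictionary_coverage(text: str, vocabulary: set) -> float:
    """Return the fraction of alphabetic words in `text` found in `vocabulary`.

    `vocabulary` is assumed to hold lower-case words. Digits and
    punctuation are ignored, so page numbers in an index do not count.
    """
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    known = sum(1 for word in words if word.lower() in vocabulary)
    return known / len(words)


if __name__ == "__main__":
    # Toy vocabulary standing in for a general English dictionary.
    vocab = {"the", "species", "is", "found", "in"}
    body_text = "The species is found in Europe"
    index_text = "Formica rufa 12 Lasius niger 14"
    print(dictionary_coverage(body_text, vocab))
    print(dictionary_coverage(index_text, vocab))
```

A low score on its own therefore says little about OCR quality for an index or contents page; it may only mean the page is full of names.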
> Date: Thu, 20 Jun 2013 14:21:48 +0100
> From: Roderic Page <r.page at bio.gla.ac.uk>
> Subject: Re: [Taxacom] A new way to view taxonomic publications
> To: Donat Agosti <agosti at amnh.org>
> Cc: taxacom taxacom <taxacom at mailman.nhm.ku.edu>
> Message-ID: <3D62E585-50CF-451A-BD62-5CCAE0D779B6 at bio.gla.ac.uk>
> Again, this is repeating myths. The OCR in BHL ranges from excellent in places to crappy in places. If it was uniformly bad we couldn't have indexed it for names, nor would http://biostor.org be possible. The quality is variable, but we can quantify this. We don't control the original OCR, but we can always redo bits if we need to.
> Indexing OCR documents is a well-known problem, and there is a wealth of literature on various techniques that can be used (see http://www.mendeley.com/groups/752871/ocr-optical-character-recognition/ for an introduction to the literature). Why do we simply say OMG the BHL OCR is bad? Why not be scientific, quantify its quality, and exploit the existing technology to improve things?
> I am constantly flummoxed by our community's assumption that it knows what is possible, and what the limit of the state of the art is outside its domain. We have barely scratched the surface of what is possible.
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: r.page at bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
ORCID id: http://orcid.org/0000-0002-7101-9767