Hi Dauvit

There is some contradiction between "In general OCR is very good" and "Sadly OCR does struggle in two very useful sections of a document". The experience we are documenting right now in Pro-iBiosphere points to the latter.

What OCR accuracy rate did you find? 99%? 99.99%?
What is the error rate for getting two-column text properly resolved?
What is the error rate for tables?
What is the error rate for microcitations?
What is the error rate for scientific names or morphological terms?
How does this vary over time?
Rod is right to point out that we need stats.
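
A minimal sketch of how such rates could be measured, assuming a hand-corrected reference transcription is available for each page or section (table, index, microcitation, and so on). The function names are illustrative, and this is not the method used in any of the studies mentioned in this thread:

```python
def edit_distance(reference: str, ocr: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning `ocr` into `reference`."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the OCR text.
    previous = list(range(len(ocr) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, ocr_char in enumerate(ocr, start=1):
            cost = 0 if ref_char == ocr_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR text
                current[j - 1] + 1,      # extra character in the OCR text
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr: str) -> float:
    """Character accuracy = 1 - (edit distance / reference length)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1.0 - edit_distance(reference, ocr) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: a Latin ligature at the end of a name is mangled.
    print(character_accuracy("Formicæ", "Formicae"))
```

Running the same function separately over body text, tables, indexes and scientific names would give the per-section figures asked for above.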

For my purpose I want an OCR accuracy rate between 99.9 and 99.99%, that is approximately 1 error per page, with the tables properly resolved and no other structural issues interfering with the text flow. This means spending about 2 min per page to clean OCR artifacts (http://wiki.tdwg.org/twiki/pub/.../Agosti_Tdwg_literature_interoperability.ppt). Our purpose is to reuse the text for display to humans and for machine harvesting of characters, and eventually to republish some of it. This pretty much defines my goals and requirements. And I want to contribute to the lists of "clean" names in HNS (ca >200,000 names) and Zoobank by adding "clean" names linked to either the page of occurrence or the treatment.
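
As a rough check on the "1 error per page" figure, the arithmetic can be sketched as follows. The page length of 2,500 characters is purely an assumption for illustration (real pages vary widely), and the function names are hypothetical:

```python
def expected_errors_per_page(accuracy: float, chars_per_page: int) -> float:
    """Expected number of wrong characters on a page at a given accuracy."""
    return chars_per_page * (1.0 - accuracy)


def accuracy_for_error_budget(errors_per_page: float, chars_per_page: int) -> float:
    """Character accuracy needed to stay within an error budget per page."""
    return 1.0 - errors_per_page / chars_per_page


if __name__ == "__main__":
    chars_per_page = 2500  # assumed page length, not a measured value
    for accuracy in (0.99, 0.999, 0.9999):
        errors = expected_errors_per_page(accuracy, chars_per_page)
        print(f"{accuracy:.2%} accuracy: about {errors:.2f} errors per page")
    needed = accuracy_for_error_budget(1, chars_per_page)
    print(f"1 error per page needs about {needed:.2%} accuracy")
```

Under that assumption, 99.9% leaves roughly 2.5 errors per page and 99.99% roughly 0.25, so one error per page does fall between the two rates.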

I don't want to deal with a large number of (additional) artifacts of misspelled terms (see Wei et al., http://tinyurl.com/mgvg6bq), because we don't care about the OCR. That doesn't mean others have to care or have similar requirements, if the results are good enough for their purpose. And we even use those results to find content in BHL that we can use in our work.

Names and treatments are quasi-legal documents, so I will find out what lawyers do when converting a similarly huge corpus of legacy literature. Their situation is similar to ours: a case number pointing to a case has to be right, because otherwise you can't find the cited case, especially by machine. And that is the ultimate goal.


Hi Donat

I'm with Rod: BHL OCR can be excellent. In ViBRANT we tried to replicate Qin Wei's work with BHL to re-assess how 'bad' the OCR is, but we just couldn't get the same poor-quality results.

In general OCR is very good with body text where there are enough cues for the software to work on. Errors still creep in, particularly with special characters. For example, English OCR software really doesn't like Latin ligatures and tends to mangle the end of taxon names even though the other characters in the name are accurately identified. Another common problem arises from the use of male and female symbols in text, made worse because the symbols are normally found in the middle of a very terse description, often full of abbreviations and so devoid of cues for non-specialist software to follow.

Indeed, we gave up on some of our experimental ViBRANT work using parts-of-speech tagging to identify anomalies in OCR text because the OCR was not bad enough.

Sadly OCR does struggle in two very useful sections of a document: the table of contents and the index. Part of the problem lies with a page full of 'funny' words not in the software's dictionary ;-) Then there are other problems, usually to do with layout, such as non-aligned columns and leading lines, which break the OCR accuracy.



> Donat,
> Again, this is repeating myths. The OCR in BHL ranges from excellent in places to crappy in places. If it was uniformly bad we couldn't have indexed it for names, nor would http://biostor.org be possible. The quality is variable, but we can quantify this. We don't control the original OCR, but we can always redo bits if we need to.
> Indexing OCR documents is a well-known problem, and there is a wealth of literature on various techniques that can be used (see http://www.mendeley.com/groups/752871/ocr-optical-character-recognition/ for an introduction to the literature). Why do we simply say OMG the BHL OCR is bad? Why not be scientific, quantify its quality, and exploit the existing technology to improve things?
> I am constantly flummoxed by our community's assumption that it knows what is possible, and what the limit of the state of the art is outside its domain. We have barely scratched the surface of what is possible.
> Regards
> Rod

