David Remsen dremsen at MBL.EDU
Tue Jan 4 15:29:20 CST 2005

I wonder whether, once the NZ has gone through its editorial passes, it
might be useful to atomize it into its constituent terms, specifically
for use as a taxonomically informed dictionary for future OCR and other
text-processing exercises.  It would include all those crazy publication
and author abbreviations that are so difficult to review by eye.  It
might be useful to have such a library.  One of the big drawbacks to
proofing the NZ OCR was that very few terms were actual words; most
were names or abbreviations.
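
To make the idea concrete, here is a rough sketch in Python of what
"atomizing" might look like.  The sample lines and the function names
(atomize, build_dictionary) are made up for illustration and are not
real NZ entries; the only assumption is that a term is a run of letters,
optionally followed by a period so that abbreviations stay intact.

```python
import re
from collections import Counter

def atomize(text):
    """Split a line into terms, keeping a trailing period on abbreviations."""
    # A term is a run of letters (no digits or underscores), optionally
    # followed by one period.  Years and page numbers are dropped.
    return re.findall(r"[^\W\d_]+\.?", text)

def build_dictionary(lines):
    """Count how often each term occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(atomize(line))
    return counts

# Hypothetical catalog lines, invented for this example.
sample = [
    "Exemplus Smith 1900 Proc. zool. Soc. Lond.",
    "Fictus Jones 1912 Proc. Acad. nat. Sci.",
]
terms = build_dictionary(sample)
```

The resulting term list could then be loaded as a custom word list for
an OCR engine or spell checker.  Note that this simple rule also keeps a
sentence-final period attached to an ordinary word, so a real version
would need something smarter there.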

Also, I hope I didn't sound like I disparage OCR. Although we found
that OCR was insufficient for NZ, the same company employed it
successfully for a conversion of a Smithsonian Bulletin, Catalog of
Living Whales.  We built an application that demonstrates the
conversion and its subsequent use within a name service.  We wanted to
determine whether we could repurpose elements of a taxonomic catalog
within a more generalized name service without affecting the precision
of the original document.  Thus, the application shows three versions
of the same document: the original page images, a parsed stand-alone
conversion, and a hybrid version that draws names from the service and
couples them with local data.

It would be really useful to accumulate tools that make future efforts
easier.  I'd love a dictionary that would find the misspelled names and
suggest corrections.  A Google search on Loligo pealeii returns more
hits for misspelled forms ("Loligo pealei" (1820), "Loligo pealii"
(403)) than for the correct form (896).  Furthermore, the misspellings
often come from scientific sources.
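
As a rough illustration of the kind of checker I mean, the sketch below
uses Python's standard difflib module to suggest close matches from a
list of accepted names.  The short KNOWN_NAMES list, the suggest
function, and the 0.85 cutoff are assumptions chosen for the example; a
real tool would draw its names from a name service or a full catalog.

```python
import difflib

# Stand-in for a list of accepted names (assumed for this example).
KNOWN_NAMES = ["Loligo pealeii", "Loligo vulgaris", "Sepia officinalis"]

def suggest(name, known=KNOWN_NAMES, cutoff=0.85):
    """Return close matches for a name that is not in the known list."""
    if name in known:
        return []  # already an accepted spelling, nothing to suggest
    # get_close_matches ranks candidates by similarity ratio and keeps
    # only those scoring at least `cutoff`.
    return difflib.get_close_matches(name, known, n=3, cutoff=cutoff)

for candidate in ["Loligo pealei", "Loligo pealii", "Loligo pealeii"]:
    print(candidate, "->", suggest(candidate))
```

The high cutoff is deliberate: congeneric names share the genus, so a
loose threshold would offer every Loligo species as a "correction".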

> That's a good one.  Another thing I've done is make extensive use of
> the "Add" button when spell-checking a scanned document full of
> scientific names, jargon, abbreviations, etc.; so the real
> mis-transcriptions are more likely to stand out.
> Aloha,
> Rich

David Remsen
uBio Project Developer
Marine Biological Laboratory
Woods Hole, MA 02543
