dremsen at MBL.EDU
Tue Jan 4 15:29:20 CST 2005
I wonder whether it might be useful, once the NZ has gone through its
editorial passes, to atomize it into its constituent terms
specifically for use as a taxonomically-informed dictionary for future
OCR and other text-processing exercises. It would include all these
crazy publication and author abbreviations that are so difficult to
review by eye. It might be useful to have such a library. One of the
big drawbacks to proofing the NZ OCR was that very few terms were
actual words; most were names or abbreviations.
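To make the atomizing idea concrete, here is a minimal Python sketch. It assumes the catalog is available as plain text and that a "term" is a run of letters, optionally followed by a period (so abbreviations survive as units). The sample entries, the regular expression, and the function name atomize are all made up for illustration; they are not real NZ data or an existing tool.

```python
import re
from collections import Counter

# Invented entries in the general style of a nomenclator; not real NZ data.
SAMPLE = (
    "Aus Smith 1900, Proc. zool. Soc. Lond., 12, 34. "
    "Aus Jones 1901, Proc. zool. Soc. Lond., 13, 5."
)

# Assumption: a term is a run of letters, optionally ending in a period,
# so that abbreviations such as "Proc." are kept whole. Numbers are skipped.
TERM_RE = re.compile(r"[A-Za-z]+\.?")

def atomize(text):
    """Return a Counter mapping each term to the number of times it occurs."""
    return Counter(TERM_RE.findall(text))

terms = atomize(SAMPLE)
for term, count in terms.most_common():
    print(term, count)
```

The resulting term list could be loaded into a spell-checker or OCR engine as a custom dictionary. A real catalog would need a richer pattern (diacritics, hyphenated names, multi-part abbreviations) than this sketch handles.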
Also, I hope I didn't sound like I disparage OCR. Although we found
that OCR was insufficient for NZ, the same company employed it
successfully for a conversion of a Smithsonian Bulletin, Catalog of
Living Whales. We built an application that demonstrates the
conversion and subsequent utilization of the conversion within a name
service. We wanted to determine whether we could repurpose elements of a
taxonomic catalog within a more generalized name service without
affecting the precision of the original document. Thus, the
application shows three versions of the same document: the original
page images, a parsed stand-alone conversion, and a hybrid version
drawing names from the service and coupled with local data.
It would be really useful to accumulate tools that make future
efforts easier. I'd love a dictionary that would find the misspelled
names and suggest corrections. A Google search on Loligo pealeii
returns more hits for misspelled forms ("Loligo pealei" (1820),
"Loligo pealii" (403)) than for the correct form (896).
The misspellings, furthermore, are often from scientific sources.
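As a rough sketch of what such a correcting dictionary might look like, the Python below uses the standard library's difflib.get_close_matches to compare a name against a list of accepted names. The three-name list, the function name suggest, and the 0.85 similarity cutoff are assumptions chosen for illustration; a real tool would draw its names from a name service and would need a tuned threshold.

```python
import difflib

# Tiny stand-in for a reference list of accepted names (illustrative only).
KNOWN_NAMES = ["Loligo pealeii", "Loligo vulgaris", "Loligo forbesii"]

def suggest(name, known=KNOWN_NAMES, cutoff=0.85):
    """Return likely corrections for name, or an empty list if it is already known.

    The cutoff is an assumed similarity threshold between 0 and 1.
    """
    if name in known:
        return []
    return difflib.get_close_matches(name, known, n=3, cutoff=cutoff)

for candidate in ["Loligo pealei", "Loligo pealii", "Loligo pealeii"]:
    print(candidate, "->", suggest(candidate))
```

Both misspelled forms mentioned above differ from the correct spelling by a single letter, so a similarity cutoff this high still catches them while leaving the other names in the list out of the suggestions.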
> That's a good one. Another thing I've done is make extensive use of
> "Add" button when spell-checking a scanned document full of scientific
> names, jargon, abbreviations, etc.; so the real mis-transcriptions are
> likely to stand out.
uBio Project Developer
Marine Biological Laboratory
Woods Hole, MA 02543