[Taxacom] iDigBio Augmenting OCR October Workshop, February Hackathon Invitation

Deb Paul dpaul at fsu.edu
Wed Aug 22 14:49:51 CDT 2012

iDigBio Augmented OCR Best Practices Workshop and Hack-a-thon Planning

iDigBio (https://www.idigbio.org/) is running a workshop (October 1-2, 2012) and hack-a-thon (February 2013) to identify best practices and develop tools to get information from museum labels into computer databases.

We are seeking individuals to participate in the "iDigBio Augmenting OCR" workshop on October 1-2. The objective of the workshop is to improve OCR output and the algorithms that subsequently process it, so that the content of biological collection specimen labels and notes can be extracted and inserted into a database efficiently and accurately for future use. Participants in the October workshop plan to narrow the focus to specific programmatic goals for the software developers working at the hack-a-thon to be held in February 2013.

Broadly, digitization has four main steps: create an image; convert the image to text using Optical Character Recognition (OCR) and/or human typists; break the text into semantically useful fields such as family, scientific name, collector, date collected, location, habitat and growth habit; and finally format this information for insertion into a database. The participants will help to identify and collect images that are representative of those that will be needed by the biology community. This collection of images will serve as the working set for developers in the February hack-a-thon.
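As a rough illustration of steps three and four, here is a minimal Python sketch. It assumes the OCR step has already produced plain text; the sample label, the regular expressions, the field names and the table layout are all hypothetical and are not taken from any iDigBio tool. Real labels vary far more than these naive patterns allow.

```python
import re
import sqlite3

# Hypothetical OCR output for one specimen label (what step 2 might produce).
SAMPLE_OCR_TEXT = """Asteraceae
Solidago altissima L.
Florida, Leon Co., roadside ditch near Tallahassee
Coll. J. Smith  12 May 1987
"""


def parse_label(text):
    """Step 3: split raw label text into a few named fields (naive heuristics)."""
    record = {"family": None, "scientific_name": None,
              "collector": None, "date_collected": None}
    for line in (ln.strip() for ln in text.splitlines() if ln.strip()):
        # Plant family names conventionally end in "-aceae".
        if record["family"] is None and re.fullmatch(r"[A-Z][a-z]+aceae", line):
            record["family"] = line
        # Assumption: the first "Genus species" line after the family is the name.
        elif record["scientific_name"] is None and re.match(r"[A-Z][a-z]+ [a-z]+", line):
            record["scientific_name"] = line
        # Assumption: collector and date share a line, separated by 2+ spaces.
        m = re.search(r"Coll\.\s+(.+?)\s{2,}(\d{1,2} \w+ \d{4})", line)
        if m:
            record["collector"], record["date_collected"] = m.group(1), m.group(2)
    return record


def store(record, conn):
    """Step 4: insert the parsed record into a (hypothetical) specimen table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS specimen "
        "(family TEXT, scientific_name TEXT, collector TEXT, date_collected TEXT)"
    )
    conn.execute(
        "INSERT INTO specimen VALUES "
        "(:family, :scientific_name, :collector, :date_collected)",
        record,
    )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store(parse_label(SAMPLE_OCR_TEXT), connection)
    print(connection.execute("SELECT * FROM specimen").fetchall())
```

Keeping parsing and storage in separate functions means each can be swapped or evaluated on its own.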

The October workshop participants plan to identify OCR output products that will be useful for the community, as well as metrics that help evaluate how well different automation approaches produce these products. These may include measures of OCR accuracy, of the accuracy of automated error correction, and of the effectiveness of breaking text into meaningful semantic units, using metrics such as precision, recall and F-score. We seek biologists, programmers and others involved in the digitization process to take part in the October workshop, help plan the February hack-a-thon, and participate in the hack-a-thon itself.
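To make the precision, recall and F-score idea concrete, here is a small Python sketch that scores one automatically parsed record against a hand-checked one. The exact-match comparison of (field, value) pairs and the example records are assumptions for illustration only; the workshop may settle on different definitions.

```python
def precision_recall_f1(predicted, gold):
    """Score extracted (field, value) pairs against a hand-checked gold record.

    A pair counts as correct only if both the field name and the value match
    exactly. This is a strict, simplifying assumption.
    """
    pred_pairs = {(k, v) for k, v in predicted.items() if v is not None}
    gold_pairs = {(k, v) for k, v in gold.items() if v is not None}
    true_positives = len(pred_pairs & gold_pairs)

    # Precision: what fraction of the extracted fields are correct.
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    # Recall: what fraction of the gold fields were recovered.
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score (F1): harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical hand-checked record and parser output for the same label.
gold = {"family": "Asteraceae", "scientific_name": "Solidago altissima L.",
        "collector": "J. Smith", "date_collected": "12 May 1987"}
predicted = {"family": "Asteraceae", "scientific_name": "Solidago altlssima L.",
             "collector": "J. Smith", "date_collected": None}

if __name__ == "__main__":
    p, r, f = precision_recall_f1(predicted, gold)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this example two of the three extracted fields are correct and two of the four gold fields are recovered, so precision is 2/3, recall is 1/2, and the F-score is 4/7.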

Anyone can view our wish list at
of some possible goals we have for optimizing machine and natural language processing algorithms used on OCR output from specimen labels.

If you are interested in participating or would like to know more, please email as soon as possible:
Debbie Paul, dpaul at fsu.edu
The deadline to sign up for the October 1-2 workshop is Thursday, August 30th.

Looking forward to your participation,
From all of us in the iDigBio Augmenting OCR Working Group
Please forward to other interested listservs - thanks!

Deborah Paul
User Services, iDigBio
Institute for Digital Information, iDigInfo
Florida State University
Tallahassee, Florida 32308
