Richard Pyle deepreef at BISHOPMUSEUM.ORG
Mon Jan 3 10:04:37 CST 2005

Wow!  What a wonderful way to start the new year -- several of my favorite
topics of discussion (databases, online access to taxonomic information,
converting the content of published manuscripts into structured electronic
databases, taxonomic registration...) wrapped up into a single Taxacom
thread!  If we can only find a way to integrate the discussion of advanced
undersea diving technology, I will have achieved email-list Nirvana!

First, I want to wholeheartedly join the chorus of "thank you's" and
"congratulations" to Dave Remsen & the uBio team for pulling this off.  In
answer to Martin Spies' question, "why put an unfinished product on the
'market', thereby increasing the chances for confusion?", I agree with what
Paul Kirk said (i.e., that these sorts of projects are *perpetually*
"unfinished").  Further, I would point out that projects of this magnitude
CANNOT be realistically managed by a single group or entity.  At some point,
the only way to achieve accuracy is to expose it to a virtual army of
experts who can scrutinize it to the point where it approaches perfection.
That is -- to distribute the workload. And the internet is the perfect
medium for that sort of distributed collaboration.

Having said that, I fully agree with Martin's (and Wolfgang's) fundamental
point:

> Prominent (!) warnings signalling preliminary data that must not be
> taken for gospel seem the bare minimum one can ask for (and Wolfgang
> Lorenz and I haven't asked for anything more).

In this case, there needs to be a double-disclaimer:  one underscoring the
potential for error within NZ, and the other underscoring the potential for
error in transcription from printed page to electronic version.

As for eliminating the need for the second of these, I think the best way to
achieve this would be to develop a dead-simple web-based user-interface that
allows an online reviewer to easily go page-by-page, line-by-line, and see
the electronic version directly next to the original scanned version
(ideally with the electronic version formatted with the same font, etc. as
the printed version, so discrepancies are more obvious); and a simple way
for the reviewer to make the necessary adjustments right there on the web
page, then click a "Submit" button to go to the next name.  Dave -- I've got
a number of ideas & suggestions, if you're interested (although I suspect
that you're already WAY ahead of me on this).

As for automated editing, one technique I have been using a lot lately to
ensure accuracy when converting word-processed documents to databases, is to
develop an output report style from the database that matches exactly that
of the source.  I then use MS Word's "Compare documents" feature to
highlight any discrepancies between the original and the database output --
until the output is a perfect match for the original.
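
A minimal sketch of the same round-trip check in plain text, for cases where
MS Word is not at hand. It assumes the source document and the database report
can both be reduced to lists of text lines; the record fields (genus, author,
year), the report layout, and the sample rows are hypothetical and only
illustrate the idea.

```python
import difflib


def render_report(records):
    """Render database records in the source's line layout.

    The "genus author year" layout is an assumption made for illustration.
    """
    return [f"{r['genus']} {r['author']} {r['year']}" for r in records]


def find_discrepancies(source_lines, report_lines):
    """Return (source_line_no, source_chunk, report_chunk) for each mismatch."""
    matcher = difflib.SequenceMatcher(a=source_lines, b=report_lines, autojunk=False)
    issues = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Line numbers are 1-based so they match what a proofreader sees.
            issues.append((i1 + 1, source_lines[i1:i2], report_lines[j1:j2]))
    return issues


# Made-up sample rows: the second database record contains a deliberate typo.
source = ["Abax Bonelli 1810", "Agonum Bonelli 1810", "Amara Bonelli 1810"]
records = [
    {"genus": "Abax", "author": "Bonelli", "year": 1810},
    {"genus": "Agonum", "author": "Bonelii", "year": 1810},
    {"genus": "Amara", "author": "Bonelli", "year": 1810},
]

for line_no, src, out in find_discrepancies(source, render_report(records)):
    print(f"line {line_no}: source={src} database={out}")
```

The loop is repeated after each correction to the database (or to the report
layout) until find_discrepancies returns an empty list.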

I realize this specific technique will not be helpful in the case of NZ (due
to the way it was converted from the image files into text files), but I
wonder if there isn't an analogous method that could be applied to image
files.  The idea would be to output the data from the database in such a way
that it produces a formatted document that matches as closely as possible
the original NZ font, kerning, spacing, etc.; then create image files from
the database output for each page.  These page images could then be compared
with the original scans using some image comparison algorithm to highlight
discrepancies.
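
One very simple form such an image comparison could take is sketched below.
It is not a tested proofing tool. It assumes both page images are already
aligned, have identical dimensions, and are held as rows of grayscale values
(0 = black, 255 = white). The function name, tile size, and threshold are all
hypothetical; real scans would first need deskewing and registration, most
likely with an imaging library.

```python
def differing_tiles(scan, rendered, tile=2, threshold=32):
    """Flag tiles whose mean absolute grayscale difference exceeds threshold.

    Returns the (row, col) of the top-left pixel of every flagged tile.
    """
    if len(scan) != len(rendered) or any(
        len(a) != len(b) for a, b in zip(scan, rendered)
    ):
        raise ValueError("pages must have identical dimensions")

    height = len(scan)
    width = len(scan[0]) if height else 0
    flagged = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            total = 0
            count = 0
            # Sum pixel differences inside this tile (clipped at page edges).
            for y in range(top, min(top + tile, height)):
                for x in range(left, min(left + tile, width)):
                    total += abs(scan[y][x] - rendered[y][x])
                    count += 1
            if total / count > threshold:
                flagged.append((top, left))
    return flagged


# Toy 4x4 "pages": the rendered page has ink where the scan is blank.
scan_page = [[255] * 4 for _ in range(4)]
rendered_page = [row[:] for row in scan_page]
for y in (2, 3):
    for x in (0, 1):
        rendered_page[y][x] = 0

print(differing_tiles(scan_page, rendered_page))
```

Averaging over tiles rather than comparing single pixels is meant to tolerate
scanner noise while still pointing the reviewer at the region of the page
where the database output and the original disagree.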

I suspect that the amount of effort to develop the sort of comparison
feature just described would probably approach or exceed the effort to
simply correct it manually, 100 pages at a time -- so probably a moot point
for this project.  However, if (as I hope) this same technique will
ultimately be applied to a much larger body of scientific literature
(witness the Stanford Library), the development of such automated proofing
tools might serve the greater good in the long run.

On a final note (for now), I was hoping to see a commentary from Doug Yanega
on this thread emphasizing once again how we (as a taxonomic community) can
ALL gain from convergence upon some sort of Code-enforced name-registration
system.  I was glad to see that Paul Kirk touched on it; and the various
GBIF-related gatherings that are happening this year *might* just push us a
bit closer to that goal. One can only hope (and continue to rant on email


P.S. Dave -- out of curiosity, have you tried Adobe's Acrobat Capture 3.0
software? It's their industrial-strength OCR product, which has impressive
tools for collaborative scanning/OCRing/proofing of printed documents. Was
this among the tools you tried when determining that in-house OCRing was not

Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at

> -----Original Message-----
> From: Taxacom Discussion List [mailto:TAXACOM at LISTSERV.NHM.KU.EDU]On
> Behalf Of David Remsen
> Sent: Monday, January 03, 2005 4:51 AM
> Subject: Re: NomenclatorZoologicus
> Dr. Lorenz/ Dr. Spies
> Thank you for your points and observations regarding the conversion of
> Nomenclator Zoologicus.   With the new year we are focusing our efforts
> on continuing the revision of the conversion and an extension of the
> existing Agenda and Corrigenda sections of the NZ.  The current online
> NZ is still a work in progress.
> Our first priority is to ensure that the digitized conversion
> represents an accurate transcription of the existing volumes 1-9 of NZ.
>    This is the intent of locating reviewers.   This round requires less
> taxonomic expertise than simply a willingness to endure 100 pages of
> proofing.  The result however, is a clean community resource reviewed
> by members of the community itself.  Reviewers will also receive a copy
> of the final work on CD.   Breaking the volumes into 100 page sections
> appeared to me to be more systematic in that we would then know which
> sections have or have not been reviewed.  However, we could also
> provide the means to review taxonomic groups as well and simply place
> review attributes at the record level.  My experience in doing this
> myself is that once you are on a page it is easier to simply stay on
> that page rather than jump around.  We hope to have an editorial system
> running sometime this month.
> Once we have a clean and reviewed conversion we can focus on the
> accuracy of the printed document. Our second goal is an online
> supplement to the Agenda and Corrigenda sections where corrections to
> the correctly transcribed volumes can be added as supplements to the
> original records.  For example, Alatotrochus Cair 1994 (within the
> volumes) should actually be Alatotrochus Cairns 1994.   In this case,
> the original must remain as it factually exists within the printed
> document and the correction will be a separate and linked record.
> Our third effort will be to catalog genera not contained within NZ and to
> provide a means to include or not include these data within a search as
> a separate resource.  We have received numerous offers of additional
> genera data.
> Lastly, although I do like the look of "Dr. Remsen", I'm afraid that I
> am neither an M.D. nor Ph.D. (yet) but instead remain a curious amalgam
> of biologist and informatician who has yet to free himself from all
> this to actually finish a graduate degree.
> Cheers,
> David Remsen
> On Jan 3, 2005, at 5:05 AM, Faunaplan at AOL.COM wrote:
> > Dear All,
> > the online version of NEAVE's Nomenclator Zoologicus (via the uBio website)
> > is certainly among the outstanding achievements of last year,
> > congratulations!
> > However, since it probably will be consulted by many more users than the
> > book version, wouldn't it be helpful to point to the pitfalls and limits
> > of this source?
> > In my mind, things like the following should be made explicit:
> > - the Nomenclator is by no means a complete directory of available
> > genus-group names for the time period 1758-1994 and availability of names
> > listed in the Nomenclator is not guaranteed.
> > - in some cases, the attribution of names to authors & dates is not Code
> > compliant. For example, there is misleading information on the following
> > 31 important genus-group names in Coleoptera Carabidae, which should be
> > attributed to BONELLI 1810: Abax, Agonum, Alpaeus, Amara, Anchomenus,
> > Aptinus, Blethisa, Calathus, Callistus, Cephalotes, Chlaenius, Demetrias,
> > Dinodes, Ditomus, Dolichus, Dromius, Dyschirius, Epomis, Laemostenus,
> > Lamprias, Melanius, Molops, Oodes, Pelor, Percus, Platynus, Platysma,
> > Poecilus, Polistichus, Procrustes, Pterostichus.  (for details see MADGE
> > 1975: The type-species of BONELLI's genera of Carabidae (Coleoptera,
> > Carabidae). - Quaest. Ent., 11: 579-586).
> >
> > Best wishes,
> > Wolfgang Lorenz, Tutzing, Germany
> >
> >
> _______________________________________________
> David Remsen
> uBio Project Developer
> Marine Biological Laboratory
> Woods Hole, MA 02543
> 508-289-7632
