[Taxacom] BHL survey: scan quality

Dean Pentcheff pentcheff at gmail.com
Fri May 7 15:13:26 CDT 2010

I will confirm what Karl Magnacca had to say.

The scanning strategy for most materials at BHL (via Internet Archive,
of course) seems to be to make the visual appearance of the pages as
close to the original as possible, including yellowed paper, etc. This
comes at the expense (since there are always tradeoffs) of highly
resolved text. When it comes to plates, the results are visually
appealing, but often of poor actual resolution.

The PDFs are generated using "Luradocument", which achieves excellent
compression of those images, but at the cost of long rendering times
for the pages (again, as mentioned by Karl Magnacca). The result is
that the PDFs can be very cumbersome to use on anything but the very
fastest desktop computers.

I would recommend instead a scan-processing strategy that prioritizes
readability (including OCRability) of text and high resolution on
figures. With the rare exception of color plates, retaining color at
all is useless for the reader and should be avoided.

In many cases, what we do at the Decapoda literature website
(http://decapoda.nhm.org/references) is take the PDF from BHL, use
adaptive thresholding to make the text and line art binary, reassemble
the document, OCR it again, and make that alternative version
available at our site. The size is often much smaller (since the text
portions are now binary B&W), but sometimes much bigger (if there are
many plates that we have to keep in greyscale).
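
To make the thresholding step concrete, here is a minimal sketch of one
common form of adaptive thresholding (local-mean thresholding computed
with an integral image). It is an illustration only, not the code used at
the Decapoda site: the function name, the window size, and the offset are
all assumptions chosen for the example, and it works on a plain list of
rows of 0-255 grey values so that it needs nothing beyond standard Python.

```python
def adaptive_threshold(gray, window=15, offset=10):
    """Binarize a greyscale image given as a list of rows of 0-255 ints.

    A pixel becomes black (0) when it is more than `offset` grey levels
    darker than the mean of the window x window neighbourhood around it;
    otherwise it becomes white (255). `window` and `offset` are
    illustrative defaults, not tuned values.
    """
    height, width = len(gray), len(gray[0])

    # Integral image with an extra zero row and column, so the sum of any
    # rectangle can be read off with four lookups.
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    half = window // 2
    result = []
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        row = []
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            count = (y1 - y0) * (x1 - x0)
            # Same test as "pixel < local_mean - offset", kept in integers.
            is_ink = gray[y][x] * count < total - offset * count
            row.append(0 if is_ink else 255)
        result.append(row)
    return result
```

Because each pixel is compared with its own neighbourhood rather than with
one global cutoff, yellowed or unevenly lit paper is pushed to white while
the ink stays black. Plates that need to stay in greyscale would be left
out of this step.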

Our strategy when doing our own scanning is to use 400dpi (for the
non-U.S. world, that's dots per inch [sorry]) binary (black & white
only) for text and line art and 600dpi greyscale (or color when
needed) for plates. That gives crisp readable/OCRable text with small
file sizes, combined with good resolution on figures (though of course
many figures make for a big file).
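
A rough worked example of why plates dominate the file size: the sketch
below computes the uncompressed size of one scanned page. The 8.5 x 11
inch page size and the helper name are assumptions made for illustration,
and real files are compressed, so only the ratio between the two cases is
meaningful here.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page (no row padding)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Hypothetical 8.5 x 11 inch page, scanned both ways.
text_page = raw_scan_bytes(8.5, 11, dpi=400, bits_per_pixel=1)   # binary
plate_page = raw_scan_bytes(8.5, 11, dpi=600, bits_per_pixel=8)  # greyscale

print(text_page)    # 1870000
print(plate_page)   # 33660000
print(plate_page // text_page)  # 18
```

Before any compression, then, a 600dpi greyscale plate is about 18 times
the size of a 400dpi binary text page of the same dimensions, and binary
text generally compresses far better than greyscale images, so the gap
tends to widen in the final file.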

Dean Pentcheff
pentcheff at gmail.com
dpentch at nhm.org

On Fri, May 7, 2010 at 10:27 AM, Francisco Welter-Schultes
<fwelter at gwdg.de> wrote:
> Dear Taxacomers,
> thank you all so much for having participated in the BHL Survey 2010.
> We obtained more than 1000 answers, of which more than 60 % were by
> taxonomists. This gives us really good preconditions to continue our
> work.
> The next step for us consists in evaluating the results. We will talk
> about these in our BHL/BHL-Europe conference in Vienna (Austria) at
> the end of this month and then we are certainly going to present the
> results for all of you in an internet page that we will set up so
> that you can see what the participants answered.
> I have one question. There is one result in the survey that I do not
> understand. When we were asking "how satisfied are you with the
> following functions of BHL" the levels of agreement with all
> functions were surprisingly high. The differences were only
> finely tuned. It is nice to get such positive feedback, but on
> the other hand this makes it more difficult to improve our
> service. One of the highest levels of agreement (73 %) was recorded
> for the scan or image quality. Being a taxonomist myself I know that
> the scan quality (for example when I look up plate figures) provided
> by Smithsonian, Natural History Museum London, Harvard and Missouri
> Botanical Garden is relatively low. Does the high level of agreement
> with the scan quality mean that plate figures are not needed for your
> work, only the texts? Or does it mean that you are happy that
> anything is provided at open access at all, so that you did not dare
> to complain about the quality? Or was a misunderstanding provoked by
> the wording of the bullet point ("The scan or image quality is fine"
> - Strongly agree, Agree, Neither agree or disagree, Disagree,
> Strongly disagree)?
> I anticipate that there will be discussions about this point in the
> meeting in Vienna. Since I have no idea for an answer, we would be
> left in speculations, so I have decided to ask you.
> Thank you for your precious help.
> Francisco
> University of Goettingen, Germany
> www.animalbase.org