> "The experience I have says that if the number of names in a list exceeds
> a number of around 2000, getting the error rate down to zero will become
> most unlikely. 2000 seems to be some kind of a limit. Usual error rates in
> longer lists range around 2-5 %, also depending on how many data are
> incorporated to each entry."
> Why is the "2-5%" error rate you experience so much higher than in other
> big lists of records, such as the customer account databases kept by many
> businesses, i.e. the ones in which records start off with manual entry of
> the personal data supplied by customers on paper forms?

I have no control over such customer databases, how they are compiled,
what is considered as an error there and what are their true error rates.

> How is the error split up between
> (a) errors in data sources, such as original publications

Zero. What do you mean by "original publications"?

> (b) inaccurate manual data entry during the listing process

usually near 100 %

> (c) errors in processing of data from one format or compilation to
> another?

If there are such processes, then it depends on the nature of these copying
processes.

> Does your experience include formal data cleaning (e.g.
> http://en.wikipedia.org/wiki/Data_cleansing,
> http://www.datacleansing.net.au/Data_Cleansing_Services)?

Yes it does, of course.

> If so, what are
> the residual errors?

These ones. I only meant the final error rates, the errors that appear in
the final result.

> Are they all (or almost all) problems in original publications?

I did not understand this question.

> My own experience with *good*, low-error compilations is that most of the
> inconsistencies have to do with the spelling of names. Is that also your
> experience?

No, it isn't. Errors are more or less evenly distributed over the field,
with some data having higher and others lower average error rates. The
spelling of a name is a rather unlikely place to find an error
(fortunately). Overlooking established available names seems to be by far
the most frequent error source, in Sherborn as well as in AnimalBase.
Incorrect original sources, incorrect page numbers, incorrectly cited
original genera and incorrect availability status are also quite frequent.

In an excellently compiled list of available molluscan genera (MNHN Paris,
Bouchet & Rocroi) I checked some 4500 entries and documented the error
classes. Only 10 names were misspelled in the final list, and a few of
those were debatable (o/oe problem). In 140 cases author and date were
incorrectly cited, and in 55 cases the name was unavailable. 50 entries had
an incorrectly given original source, 200 had problems with the page
number, 200 genera had an incorrect type species, and 400 had an
incorrectly given mode of type designation. I found only 8 overlooked
names, but had no method to obtain a reliable figure for that error class.
The data were copied manually from a photocopy of the original source into
a dBase database.
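
To put these counts on the same footing as a percentage figure, each class
can be divided by the number of entries checked. The following is only a
rough sketch in Python: the denominator is the approximate "some 4500", the
class labels are shorthand introduced here, and the variable and function
names are purely illustrative. One entry may carry more than one error, so
the per-class rates should not simply be added up to an entry-level rate.

```python
# Error counts per class, as reported for the checked list of genera.
# ENTRIES_CHECKED is "some 4500", so all rates are approximate.
ENTRIES_CHECKED = 4500

error_counts = {
    "misspelled name": 10,
    "author/date incorrectly cited": 140,
    "name unavailable": 55,
    "incorrect original source": 50,
    "page number problem": 200,
    "incorrect type species": 200,
    "incorrect mode of type designation": 400,
    # Only a lower bound: no reliable method existed for this class.
    "overlooked name": 8,
}


def rate_percent(count, total=ENTRIES_CHECKED):
    """Return count as a percentage of the entries checked."""
    return 100.0 * count / total


rates = {label: rate_percent(n) for label, n in error_counts.items()}

# Print the classes from most to least frequent.
for label, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:36s} {error_counts[label]:4d}  {rate:5.2f} %")
```

On these figures the mode of type designation is the most error-prone
field, and misspelled names and overlooked names the least frequent
classes, the latter with the caveat that its count is only a lower bound.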


