[Taxacom] Errors in compilations

Francisco Welter-Schultes fwelter at gwdg.de
Thu Jan 16 04:14:13 CST 2014


Dear Bob,

> "In an excellently compiled list of available molluscan genera (MNHN
> Paris, Bouchet & Rocroi) I controlled some 4500 entries and documented the
> error classes. Only 10 names were misspelled in the final list, and a few
> of them were debatable (o/oe problem). In 140 cases author and date were
> incorrectly cited, in 55 cases the name was unavailable. 50 concerned an
> incorrectly given original source, 200 entries had problems with the page
> number, 200 genera had incorrect type species, 400 an incorrectly given
> mode of type designation. I found only 8 overlooked names, but had no
> method to obtain a reliable figure on that error class."
>
> Many thanks for letting us know the error categories in this particular
> list. So these are errors you found in the list *after* basic data
> cleaning - they are errors in what you might call the 'meaningful content'
> of the cleaned list, and they amount to *at most* ca 20-25% of the 4500
> entries (some entries might have more than one of the error categories),
> i.e. no more than 10+140+55+50+200+200+400/4500.

You are basically right. But we should subtract systematic errors from
the individual errors. Initially they did not pay much attention to the
type species, which makes this more of a systematic error, so the error
rates concerning type species in the final list turned out to be
higher.
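
As a rough check of the figures quoted above, the following Python sketch
sums the per-category counts over the 4500 controlled entries (the whole
sum is divided by 4500). The category labels are illustrative shorthand.
Treating both type-related categories as the systematic part is an
assumption made only for illustration; the text above says no more than
that the type species errors were of a more systematic nature.

```python
# Error counts per category among the 4500 controlled entries, as quoted
# above. The dictionary keys are illustrative labels, not official terms.
CHECKED_ENTRIES = 4500
error_counts = {
    "misspelled": 10,
    "author_date": 140,
    "unavailable": 55,
    "original_source": 50,
    "page_number": 200,
    "type_species": 200,
    "type_designation_mode": 400,
}


def percent(count, total=CHECKED_ENTRIES):
    """Return count as a percentage of the checked entries."""
    return 100.0 * count / total


# Upper bound: assumes no entry carries more than one error.
total_errors = sum(error_counts.values())
upper_bound_pct = percent(total_errors)

# ASSUMPTION: both type-related categories count as systematic errors.
systematic = {"type_species", "type_designation_mode"}
individual_errors = sum(
    count for name, count in error_counts.items() if name not in systematic
)
individual_pct = percent(individual_errors)

print(f"all categories: {total_errors} errors, {upper_bound_pct:.1f} %")
print(f"without type-related: {individual_errors} errors, {individual_pct:.1f} %")
for name, count in error_counts.items():
    print(f"{name}: {percent(count):.1f} %")
```

Under that assumption the upper bound drops from roughly 23 % to roughly
10 % of the entries.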

>
> Also, you suggest that nearly 100% of these errors arise at the time that
> data are manually entered, but looking at your categories I can imagine
> that many of them actually arise before the entries were compiled, and
> that some of them come from the publications used as sources.

Rocroi made xerox copies from the original sources where the names had
been established. Bouchet took the xerox copies, derived the
nomenclatural conclusions in front of a computer screen, and produced the
electronic file at that very moment.
They did not use secondary sources. In some rare cases Rocroi did not copy
correctly or completely. Maybe 10-15 % of the errors in the final list were
due to that. The other 85-90 % arose in Bouchet's manual work. These were
much less often spelling errors than failures to draw the correct
nomenclatural conclusions in a complex situation. For example,
availability was incorrectly interpreted in 1 % of the cases. In
AnimalBase we had the same problem, and so did Sherborn. Deriving correct
conclusions on the availability status of a name is not easy. Author and
date were wrong in 3 % of the cases, but many cases were debatable: the
Code regulations were unclear, or Rocroi did not copy or highlight the
important things.

The Code is also a problem. If the Code were written in a more
service-directed language with more examples, this would help
dramatically. It would also help if ambiguous regulations were converted
into clear rules. Gender is of course the worst problem.

Later in the course of their work they improved their collaboration.
Bouchet gave Rocroi feedback to improve the quality of the xerox copy
compilation, and so the electronic database records improved over the
years. It was less that Rocroi did not copy the correct information, it
was more that initially it took Bouchet too long to find the information,
for example the page number.
In AnimalBase we also experienced that data quality improves over time.

> I apologise
> for using the phrase 'original publications', which is ambiguous. What I
> meant was 'source publications', i.e. the publications used by the
> compiler.

I distinguish between original publications and secondary publications. A
xerox copy of an original publication, or a digitised copy somewhere
online, is an original publication in the sense in which I use this term.
Rocroi always used the original publications. In AnimalBase we used
secondary sources only if we could not get the original work, which
happened quite rarely. Sherborn also worked mostly with the original
sources.

>
> And you estimate that even after correcting the errors noted above, there
> is still something like a 2-5% error rate? Would that consist entirely of
> overlooked names or publications?

After correcting you have to consider the errors produced by the corrector.
The corrector's error rate is now more than 5 %, because the corrector
accumulates the more problematic cases, where the chance of error is
significantly higher. So in Paris I was very probably the person who made
the most errors of all.
So initially you have, say, 3 % errors, and after correction 0.2 % errors,
which is still too much (we have millions of names in zoology). Even after
a second correction, 0.02 % remain. Philippe Bouchet drew the clear
conclusion that the LAN idea was not good, and I draw the same
conclusion. We both agree that the list of 30,000 molluscan genus-group
names, compiled mainly by Jean-Pierre Rocroi during 40 years of work,
should not be submitted to become a LAN list, even if independently
corrected by 2 persons doing this as accurately as I did it (4500 names in
3 months, full-time).
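
To give a feeling for what these rates mean in absolute numbers, here is a
small Python sketch. The rates (3 %, 0.2 %, 0.02 %) and the list size of
30,000 are the ones given above; the figure of 1,000,000 names is only a
stand-in for "millions of names", and the function name is hypothetical.

```python
def expected_errors(n_names, error_rate_percent):
    """Expected number of erroneous entries, rounded to whole names."""
    return round(n_names * error_rate_percent / 100)


# Error rates (in percent) at each stage, as estimated in the text.
STAGES = [
    ("initial compilation", 3.0),
    ("after first correction", 0.2),
    ("after second correction", 0.02),
]

# 30,000 is the molluscan genus-group list; 1,000,000 is an ASSUMED
# round number standing in for "millions of names in zoology".
LIST_SIZES = {
    "molluscan genus-group list": 30_000,
    "hypothetical large list": 1_000_000,
}

for list_name, size in LIST_SIZES.items():
    for stage, rate in STAGES:
        print(f"{list_name}, {stage}: ~{expected_errors(size, rate)} errors")
```

Even at 0.02 %, a list of this kind would still be expected to contain a
handful of wrong entries, and a list of a million names a few hundred.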

An electronic registration perhaps, but it needs escape rules for the
error case.

Overlooked names: it depends on your method. If you start from 1758 and
just read every book that was published (as Sherborn did, and as we did in
AnimalBase), then you have different preconditions than Bouchet & Rocroi,
who had to know the molluscan literature very well and select the books
from I don't want to know where. They must have had really good sources,
given that I found only 8 overlooked names. But finding genera that were
cited and recorded is easier than finding species. For species this would
be different.

Overlooked literature is less of a problem.
Just take a book by Fabricius and extract all new names. Maybe there are
200, and in the control you see that 4 names are missing from your list.
This is what often happened to Sherborn, and also to AnimalBase. Only
Sherborn did not see it, because he was the first to extract such a list.
He should have done it twice - and worked an additional 43 years...
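
The control described above amounts to a simple miss-rate estimate. The
Python sketch below uses the 200 and 4 from the example. The second step
assumes that a repeated extraction would miss names independently of the
first one and at the same rate; this is an optimistic assumption, since
names that are hard to spot once are probably hard to spot twice.

```python
# Figures from the example: about 200 new names in the book, 4 of them
# missing from the extracted list when it is checked.
names_in_book = 200
missed_first_pass = 4

miss_rate = missed_first_pass / names_in_book

# ASSUMPTION: a second, independent extraction misses names at the same
# rate, so a name stays overlooked only if both passes miss it.
miss_rate_two_passes = miss_rate ** 2
expected_missed_after_two = names_in_book * miss_rate_two_passes

print(f"miss rate after one pass: {100 * miss_rate:.2f} %")
print(f"miss rate after two passes: {100 * miss_rate_two_passes:.2f} %")
print(f"expected names still missing: {expected_missed_after_two:.2f}")
```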

Francisco

> --
> Dr Robert Mesibov
> Honorary Research Associate
> Queen Victoria Museum and Art Gallery, and
> School of Agricultural Science, University of Tasmania
> Home contact:
> PO Box 101, Penguin, Tasmania, Australia 7316
> (03) 64371195; 61 3 64371195
>


Francisco Welter-Schultes
Zoologisches Institut, Berliner Str. 28, D-37073 Goettingen
Phone +49 551 395536
http://www.animalbase.org




