[Taxacom] ZooBank Progress

Francisco Welter-Schultes fwelter at gwdg.de
Sun Apr 28 17:13:40 CDT 2013


Just some thoughts on two issues you brought up.

"Fossil: No": I would still remove it. The example I gave you was not a
result of a systematic search but I just saw it coincidentally when
passing by. Linnaeus 1758 described many fossil species, so finding such a
case would probably not be such a rare event.

As you know, in AnimalBase we display a field for gender treatment: a
specific name is marked as either changeable or unchangeable. Shortly
after we began doing this around 2005, we received angry feedback from
users who discovered wrong information. Our team members had selected the
wrong item. We initially answered "thanks, I corrected it, took me 10
seconds", but this did not help.
So we quickly modified our strategy and instructed our team members very
strictly to give this information only if they were absolutely sure about
what they wrote. If they were not sure, we required them to select the
option "I don't know". The feedback we had received was that saying "I
don't know" was much better than giving incorrect information.
I would compare this experience with the fossil information. Only display
it if you have verified it. Otherwise, write "I don't know" or "not
verified".
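
A minimal sketch of such a three-state field, in Python. The names
Fossil and display_fossil are hypothetical and do not describe how
ZooBank or AnimalBase actually store this information; the only point is
that an unverified value is never displayed as a plain "No".

```python
from enum import Enum


class Fossil(Enum):
    """Three possible labels for a 'Fossil' field (hypothetical names)."""
    YES = "Yes"
    NO = "No"
    NOT_VERIFIED = "not verified"


def display_fossil(value, verified):
    """Return the label to show for a record.

    value    -- True, False, or None if nothing was recorded
    verified -- True only if somebody has checked the value
    """
    if not verified or value is None:
        # An unchecked value is never presented as fact.
        return Fossil.NOT_VERIFIED.value
    return Fossil.YES.value if value else Fossil.NO.value
```

With this, a default "No" that nobody has checked is shown as "not
verified" until a person confirms it.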



The long s. I had discussions with others and the same arguments came up,
and we also had this problem in AnimalBase. We discussed it intensively
and came to a different conclusion.
Let me summarize the arguments, maybe you find them interesting. Perhaps
you already know them, but these things are rarely discussed.

(1) The long s is a rare letter. Many non-insiders do not know it, and
when they see it, they cite it as f. The result is that we observe a
number of Latin names spelled with f instead of s. Many people do not
know Latin and cannot tell whether a word is refinella or resinella. If
ZooBank (which is developing into a very important database) decides to
display the long s and is consulted by many people, these errors will
increase and we will have to search for names like Musca and Mufca,
Ostrea and Oftrea, pusio and pufio, refinella and resinella, and so on, a
never-ending list. Citing the long s would create more problems than
benefits.
We repeatedly had to teach AnimalBase team members to pay close attention
to the f and long s problem; also, if they did not find a name with f,
for example in Sherborn, we instructed them to look for it spelled with s.
This had the funny result that in the weekly lectures when they reported
their problems with names they had not found in the literature, they
always knew where they had to add the sentence "I also looked for the name
with f and did not find it either".
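
The "also look for it with s" rule can be sketched in a few lines of
Python. The function name f_s_variants is made up for illustration; it
simply lists every spelling obtained by reading each f either as f or as
s, so that a lookup can try all of them.

```python
from itertools import product


def f_s_variants(name):
    """List all spellings obtained by reading each 'f' as 'f' or as 's'.

    The original spelling always comes first. The number of variants
    doubles with every 'f' in the name, which is fine for short names.
    """
    choices = [("f", "s") if ch == "f" else (ch,) for ch in name]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for name in ("Mufca", "Oftrea", "pufio", "refinella"):
        print(name, "->", f_s_variants(name))
```

Each candidate spelling would then be looked up in turn, for example in
a name index such as Sherborn.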

(2) The long s and other special characters have never been cited as such
by taxonomists. We quickly said, let's avoid introducing a totally new
standard, we can only make mistakes and we don't help people. This was
probably the most important argument: we intended to present a service for
the community, to facilitate people's work, and not to complicate it.
The ae and oe ligatures have been cited in the past, so these characters
are well known and we decided to cite them.

(3) The argument "let's display all the UTF-8 characters as in the
original source" seemed weak to us. In UTF-8 you can display almost
anything. We would have needed to teach our team members a whole new world
of Latin script characters. Would have been like teaching Chinese...
In ZooBank you now have the problem of being inconsistent, in that some
UTF-8 characters were ignored and automatically corrected, while others
were incorrectly cited. For example, the ct ligature seems to have been
consistently ignored, and the Linnean u with tilde (ũ) at the end of a
name ending in -um is incorrectly displayed.

http://zoobank.org/NomenclaturalActs/6FCAAF6C-3DAE-4237-B61E-6CAABCE1AAD3
Alcyonium arboreú Linnæus, 1758 - ú was the incorrect UTF-8 character,
correct would be u with tilde.
This seems to be a systematic error, incorrectly cited at all instances.
http://zoobank.org/NomenclaturalActs/41F4F85A-33EC-49BF-AA41-872E0D8F1294
Alcyonium digitatú Linnæus, 1758, same problem.

However, only extremely skilled experts know this u tilde spelling mode at
all, so, in order not to make zoology more complicated than it needs to
be, there seems to be a general convention to cite such a name as
arboreum, and to ignore the u with tilde.

Looking for the name Alcyonium digitatum in the ZooBank search box does
not return a result; you have to search for digitatu.

Sometimes I also saw m with tilde, which stands for mm.
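
A small Python sketch of how the two tilde abbreviations could be
expanded before a name is cited or indexed for searching. The function
name is hypothetical, and expanding u with tilde everywhere (not only at
the end of a name) is a simplifying assumption. It relies only on the
standard unicodedata module: after NFD normalization, a letter with
tilde is the base letter followed by the combining tilde U+0303.

```python
import unicodedata

COMBINING_TILDE = "\u0303"


def expand_tilde_abbreviations(name):
    """Expand u-with-tilde to 'um' and m-with-tilde to 'mm'."""
    # NFD splits a precomposed letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", name)
    expanded = (
        decomposed
        .replace("u" + COMBINING_TILDE, "um")
        .replace("m" + COMBINING_TILDE, "mm")
    )
    # Recompose whatever is left (e.g. other accented letters).
    return unicodedata.normalize("NFC", expanded)


if __name__ == "__main__":
    print(expand_tilde_abbreviations("Alcyonium digitat\u0169"))
```

This only helps if the record really contains u with tilde; records
entered with a different character (such as the u with acute accent
mentioned above) would have to be corrected first.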

Names with ligatures in the original sources which also have UTF-8 characters:

http://zoobank.org/NomenclaturalActs/F3685154-FA47-4085-BFAE-5CF4CF0445DA
Tubipora muſica Linnæus, 1758, with long s - i ligature

http://zoobank.org/NomenclaturalActs/0BD4B2B0-E5E6-4161-BD5B-63678A694474
Dermestes pectinicornis Linnæus, 1758, originally with ct ligature

http://zoobank.org/NomenclaturalActs/4F7B08AD-13A7-4C58-94FD-FEFCFF188399
Gadus aeglefinus Linnæus, 1758, originally with fi ligature (and moreover,
originally spelled Æglefinus, with AE ligature, in ZooBank cited as
aeglefinus, probably unintended)

I guess there are half a dozen more such ligatures; about 10 % of the
Linnean names are involved, and we would have gone crazy. A whole hell of
devils in the details.
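
For what it is worth, the convention we chose (short s instead of long
s, typographic ligatures resolved, but the ae and oe ligatures kept)
happens to coincide with Unicode compatibility normalization (NFKC) for
the characters discussed here. A minimal Python sketch, with a made-up
function name; NFKC also changes other characters not discussed in this
message, so this is an illustration rather than a finished rule.

```python
import unicodedata


def fold_typographic_forms(name):
    """Fold long s and typographic ligatures; keep the ae/oe ligatures.

    NFKC maps the long s (U+017F) to 's' and the fi ligature (U+FB01)
    to 'fi'. The letters ae (U+00E6) and oe (U+0153) have no
    decomposition and pass through unchanged.
    """
    return unicodedata.normalize("NFKC", name)


if __name__ == "__main__":
    for name in ("Tubipora mu\u017fica", "Gadus \u00c6gle\ufb01nus"):
        print(name, "->", fold_typographic_forms(name))
```

Ligatures that, as far as I know, have no standard code point of their
own, such as ct, cannot be captured as characters in the first place, so
nothing needs to be folded for them.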

Never hesitate to ask if you have more questions or ideas.
Francisco


>
>> If it's done manually it might be worth correcting some other names that
>> were linked by ZooBank to the Linnean 1758 work.
>
> Yes, exactly!  Part of the process of cross-linking is to compare
> discrepancies.  We've found that in most datasets that we cross-link
> against, there are a relatively small fraction of discrepancies.  For
> example, out of 50,000 names, there might be only a few hundred
> discrepancies -- usually involving the date of publication, correct
> authorship, or the exact orthography of the name.  This means that it's a
> very manageable task to investigate each one of the discrepancies. Also,
> I've found that no database is perfect.  Some are better than others, to
> be sure -- but one can never assume that one database is always correct --
> which means that it's important to examine the discrepancies individually.
>  Indeed, this is one of the main reasons why we wanted to establish this
> link with BHL -- to make it easier to resolve discrepancies.
>
>> My understanding is that ZooBank is a data resource where available names
>> are contained. Unavailable names should probably not be contained at all,
>> and if yes, they should clearly be marked as such.
>> I am not sure how names should be treated which were initially made
>> available and later suppressed.
>
> This is an issue that has been debated since ZooBank was first conceived.
> In 2008, at a Commissioners meeting in Paris, it was determined that
> ZooBank *would* include unavailable names, and that those names would be
> clearly marked not only as unavailable, but also give the reason(s) why
> the name is unavailable.  We already have a very robust data model to deal
> with this (which I'd be happy to describe, if anyone is interested).  But
> as with most aspects of ZooBank development, the tricky part is how to
> implement it (devil is always in the details).  One of the things in the
> works is a policy on data verification in ZooBank.  Right now, the focus
> is on building the core infrastructure of ZooBank, populating it with
> retrospective content, and building tools to streamline the capture of
> prospective content.  However, what people *really* want from ZooBank is a
> definitive declaration of whether or not any particular name is available
> under the Code.  This is the entire process of content verification.  So
> far, we focused only on registration (these are two very different
> things).
>
>> Example:
>> http://zoobank.org/NomenclaturalActs/1E691819-76A8-492D-8AE1-DA84F9103CF8
>> Acarus telarius Linnæus, 1758 - this name should somehow be marked as
>> suppressed (ICZN Op. 968).
>> There were many other such names established in the 1758 work, which
>> were totally or partly suppressed by the Commission.
>
> Yes, indeed!  In fact, one of the projects we've been working on (with
> LARGE thanks to Charles Hussey, and also to Rod Page who defined the
> article boundaries of historical BZN volumes in BHL) is a complete
> database of Opinions.  This is effectively complete (still needs some
> verification, though), and will be one of the new features added to
> ZooBank this summer.  But again, we need to sort out exactly how this sort
> of thing will be implemented on the ZooBank website, and what the policy
> is for editing these sorts of things, etc.
>
> Many thanks for pointing out the individual issues related to Linnaeus
> names.  This sort of thing is EXTREMELY helpful!  I will definitely use
> these as test cases when we implement the next set of features involving
> ZooBank record verification/validation.  But again, it probably can't be
> implemented until later this summer (northern hemisphere summer, that is).
>
>> Maybe some other systematic things could be fixed.
>>
>> - Remove the long s throughout the original spellings, and replace it at
>> all instances by the short s.
>
> In this case, we want to maintain the precise orthography as it originally
> appeared on the printed page -- in all respects.  Basically, if a UTF-8
> character exists for a particular glyph, we want to capture it as such.
> The main exceptions are that all-caps words are not faithfully captured as
> such, and other stylistic attributes (e.g., boldface, small-caps, when
> original names were not italicized, etc.) will not be captured.  But
> characters such as the long s and diphthong "æ" will be captured as
> originally printed on the page.
>
> The next step is to build the correct algorithm to transform these things,
> so that the Code-corrected "original spelling" can be generated
> automatically.  In most cases, this is easy to do -- but there are some
> tricky ones (e.g., see Art. 32.5.2.1. -- which would require us to know
> whether the root word is German or not; or some of Art. 32.5.2.4.).  This
> is one more example of features currently in the works, that will be
> introduced over time as they rise up the priority list, and as appropriate
> policies are drafted and ratified.
>
>
>> Example Musca Linnæus, 1758: this name was spelled Musca with long s on
>> some occasions and MUSCA on others; MUSCA is usually converted to Musca
>> with short s. So all specific names should correctly be combined with
>> Musca with short s.
>
> This is a slightly separate issue (multiple spellings of the same genus
> name, and how they map to the species they are combined with).  The new
> GNUB data model (not yet implemented) deals with this by capturing
> separately the verbatim name-string, and the separate name components.  At
> the moment, this sort of issue is rare enough that it has not risen up the
> priority "to-do" list.  But it's definitely on the list.
>
>> Also, the long s is not cited consistently. Example:
>> http://zoobank.org/NomenclaturalActs/D2B4DA70-35AE-4D87-9E34-E219FC8E3DA0
>> Ostrea Puſio Linnæus, 1758 - here Pusio with long s and Ostrea with
>> short s, both had the long s in the original source.
>
> This is another example of the previous.  The genus was rendered as OSTREA
> on p. 696 (http://www.biodiversitylibrary.org/pagethumb/727611), so the
> genus is captured as such in the database (minus the all-caps).  I only
> see "Oſtrea" in the page header.  Is it rendered this way somewhere else?
>
>> - Consider presenting a field "original spelling" and another field
>> "correct spelling". This would probably reduce confusion. In the correct
>> spelling field the species would not appear capitalised, and diacritics
>> would be removed.
>
> Yes!  This is already part of the plan.  It just needs to rise up the
> priority list for implementation.
>
>> - I am confused by the statement "Fossil: No" in the ZooBank data result
>> set. Is this nomenclaturally relevant?
>
> It's not a Code-relevant issue, but it is a useful piece of information
> (just like type locality, figures, and page number).
>
>> Is there an exact definition for the term "fossil"? Since when does a
>> taxon need to be extinct for obtaining the attribute "fossil"?
>
> If you read the help section for this particular field (click on the blue
> icon when registering a new name, or editing an existing name), it
> explains it thusly:
>
> "If this new name is based on fossil material, select this checkbox.
> Otherwise, leave the checkbox unselected."
>
> In other words, it only applies to species-group names, and it is a
> specific indication of the nature of the name-bearing type material.
> Technically, if the type specimen of Latimeria chalumnae had been a
> fossil, and then it was later discovered alive, this would be "Fossil:
> Yes".  However, I am not aware of any case where a name is established
> based on a fossilized type, and then later discovered (at the species
> level) to be extant.  Generally such cases are described as separate
> species.
>
>> Can we be sure that all molluscs and brachiopods named in the early
>> Linnean works were recent?
>
> Nope.  Neither can we be sure that all the page numbers are correct, or
> all the type localities are correct -- or any number of other things.
> That doesn't mean the data field should be eliminated.  It just means we
> have to deal with cases that prove to be inaccurate (or unknown).
>
>> Would it not be better to remove the statement, to avoid running the
>> risk of giving incorrect information?
>
> I don't think so, but I'd be interested in hearing opinions from others on
> this.  As I already said, there is no such thing as a perfect database.
> One of the things Rob Whitton constantly reminds me of is not to let the
> "perfect" be the enemy of the "good".  I tend to be a perfectionist on
> these sorts of things (as many database managers are).  But sometimes it's
> better to just get what you have out there, and then provide a
> crowd-sourcing mechanism to get it corrected.
>
>> Example:
>> http://zoobank.org/NomenclaturalActs/04B5D5F4-648A-489F-ADE9-2C13971F8A69
>> Anomia Gryphus Linnæus, 1758. Here a fossil species was described, and
>> in ZooBank it was marked as "Fossil: No".
>
> Many thanks for the correction!  I have already implemented it on ZooBank
> (it took me 7 seconds to correct this -- but you did the hard part of
> finding the error, and made it extremely easy for me by providing the
> link).
>
> I want to thank you again for providing all of these VERY VALUABLE
> corrections to names in ZooBank.  I will study them in more detail (along
> with your other recent messages), and will likely come back to you with
> follow-up questions.
>
> Aloha,
> Rich
>
>


Francisco Welter-Schultes
Zoologisches Institut, Berliner Str. 28, D-37073 Goettingen
Phone +49 551 395536
http://www.animalbase.org




