[Taxacom] validation of taxon names

David Patterson dpatterson at mbl.edu
Wed Feb 15 10:08:06 CST 2012


Rod

Bizarre is not the word I would use, having (because of my association with
the Global Names project) some appreciation of the extent of the problem.
But I am certainly embarrassed that we have made so little effective
progress towards what is such an obvious goal.

Progress, I think, requires a more analytical perspective, and a
willingness to work collaboratively.

Our sub-discipline contrasts massively with, for example, the molecular
domain, where open sharing of content is the norm.  As a community we have
shown much less readiness to share names and taxonomic content.  So, a
'social' change is needed there.  We can provide the tools, but attitudes
will need to change.  And change does not just relate to people and
content, but also to the development of services and software.  New tools
need to be designed so that they can be interconnected, creating a much
better open toolkit.  Money, and especially money not tied to short
duration projects, will also be needed.

Then we require an infrastructure that allows those who are willing to
share content to annotate names as being valid now according to someone,
declare synonymies, flag chresonyms, disambiguate homonyms, interconnect
lexical variants, offer alternative taxonomic perspectives, and provide
means for integrating vernacular names and names for surrogates.  But
that's not too hard, is it?

That structure needs the means of capturing ALL new changes that occur out
there.

Rod, you have been significant in showing us that it is feasible to make
massive progress, but then you are equally aware of the spectrum of outlets
and players. That diversity defies simple and quick solutions, and
frustrates the universal fix. The result should be seen as a process that
will improve with time.  That process will need some kind of overview, but
an overview that is aware that resources for developments that go well
beyond 'proof of concept' are lacking.

Given the diversity of players and the wealth of expertise, the solution
needs to enable crowd sourcing.  Such crowd sourcing would include the
capacity of anyone to comment on any element of information, for some
players to have authority to make changes, for all initiatives to be
interlinked, and for an alert system that keeps all interested parties
aware of all changes as they happen.  The expectation is that this will
evolve to synchrony of all parts.  Crowd sourcing needs to accept that
there is more than one point of view as to what constitutes an entity, and
how the entities should be arranged.

With such a structure, what progress might we get?

1. Is this a name?
     This answer needs access to all names and all variants.   Global Names
has about 22,000,000 strings that are purported to be names, but the
contents of this are extremely dirty (intentionally so). To find a string
in GNI certainly does not mean that it will be a name.  So, that needs to
be fixed by flagging entries as 'names' and 'not names'.  Some of the other
weaknesses in Global Names are those taxonomic territories without good
coverage or where content sharing does not happen.  A lot of namestrings
from the older literature have yet to be added.  Taxonomic sources tend not
to be comprehensive with synonyms, certainly not with lexical variants, are
often contaminated with chresonyms, and vary in terms of taxonomic
currency.  There is often too much distance between expert and point of
contact with the product - making it difficult to correct errors. So, these
are a few of the issues on this front.
      It would be useful if we could run some exercises to assess how close
to the asymptote we are. It would also be of interest to get input on
priorities, given that there are many tasks.

2. Is this the correct way to write it?
    I would suggest this question needs to be rephrased, given that there
are many correct ways to write a scientific name - mostly with variations
in the authority department.  If we limit ourselves to thinking about the
correct spelling of the scientific elements, then this remains the
responsibility of the nomenclators. Making nomenclator content open and
placing nomenclators within crowd sourcing will help to overcome issues of
scale. Historically, nomenclators have defined their own taxonomic context;
but this is no longer a feasible stance.  Homonyms, as Tony Rees has
pointed out, abound; some taxa become more ambiregnal, some less so.
Openness of nomenclator content is an issue.
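
As a rough illustration of the "many correct ways to write a name" point,
here is a minimal Python sketch that reduces a name string to a comparison
key so that authority variants can be matched.  The function names are
hypothetical, and it assumes a plain binomial followed by an optional
author and year; real name strings (subgenera, infraspecific ranks,
multiple authors) would need a proper parser.  The test strings are the
variants Armand quoted.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks so that 'Guérin' and 'Guerin' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def split_binomial(name_string):
    """Assumption: the first two tokens are genus and species epithet,
    and everything after them is the authority (author, optional year)."""
    tokens = name_string.split()
    canonical = " ".join(tokens[:2])
    authorship = " ".join(tokens[2:])
    return canonical, authorship


def comparison_key(name_string):
    """Build a key that ignores case, diacritics and the year."""
    canonical, authorship = split_binomial(name_string)
    author_only = authorship.split(",")[0]  # drop ', 1844' if present
    return canonical.lower(), strip_diacritics(author_only).lower().strip()


VARIANTS = [
    "Lamia amputator Guérin-Méneville, 1844",
    "Lamia amputator Guerin-Meneville, 1844",
    "Lamia amputator Guérin-Méneville",
]

keys = {comparison_key(v) for v in VARIANTS}
print(keys)
```

All three strings collapse to a single key.  This only groups spelling
variants of the authority; it says nothing about which form is correct,
which remains a question for the nomenclators.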


3. Is this name currently in use?
     Google can probably provide an answer of sorts to that question, but
again I suspect this is a question to be refined.  Is the question: Is this
name currently considered to be nomenclaturally valid for a taxon?
Given that species concepts are rarely universally accepted, and that
understanding of relationships is improving such that binomials change,
then I am also assuming that we accept that there will be more than one
list of valid names for the same taxonomic area.  Building that polytheism
into the infrastructure is not too challenging, but will the experts accept
that, and will the consumer accept it?
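
One possible way to hold more than one list of valid names side by side is
to record every statement about a name together with who made it.  What
follows is only a minimal sketch: the class, its fields, the names "Aus
bus" and "Cus bus", and the two checklists are all invented for
illustration and do not describe any existing system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Assertion:
    """One source's opinion about one name (all field names are hypothetical)."""
    name: str
    status: str                          # "accepted" or "synonym"
    according_to: str                    # who says so
    accepted_name: Optional[str] = None  # target name when status is "synonym"


def accepted_names_by_source(assertions: List[Assertion], name: str) -> Dict[str, str]:
    """Return, per source, the name that source currently accepts for `name`."""
    result: Dict[str, str] = {}
    for a in assertions:
        if a.name != name:
            continue
        if a.status == "accepted":
            result[a.according_to] = a.name
        elif a.status == "synonym" and a.accepted_name is not None:
            result[a.according_to] = a.accepted_name
    return result


# Invented example: two checklists disagree about the same name.
ASSERTIONS = [
    Assertion("Aus bus", "accepted", "Checklist A"),
    Assertion("Aus bus", "synonym", "Checklist B", accepted_name="Cus bus"),
]

views = accepted_names_by_source(ASSERTIONS, "Aus bus")
for source, accepted in sorted(views.items()):
    print(f"{source}: {accepted}")
```

The point of the design is that "is this name valid?" has no single
answer in the data; the consumer asks "valid according to whom?" and picks
a source, or sees all views at once.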

4. What other names are related to this name (e.g., synonyms, lexical
variants)?
     Yes, this is what we called, in our TREE paper, 'reconciliation'.
Lexical variants are probably the simplest to deal with, but algorithms
that seek to scale to all name strings run into problems where the
fuzziness of lexical variants of similar names overlaps, and that leads to
massive aggregates of names, many of which are not lexical variants of all
the others.  There will never be a perfect algorithm for this, so it also
needs a human interface that allows results to be refined.
Homotypic synonyms (objective synonyms) can probably be found with fair
success by algorithms.  They are, like heterotypic synonyms, embedded in
taxonomic treatments.  Which leads us to the issue of access to taxonomic
treatments.  Can we make them more openly available, in a form that lets
the content flow to all of us, and can we create an infrastructure through
which complementary data can flow back to reward the participating
taxonomists?  Taxonomic treatments differ in completeness, may take
differing and equally valid perspectives on the same taxa, so we still have
a long way to go to merge the compatible thoughts and separate the
incompatible ones.
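
The aggregation problem with lexical variants mentioned above can be shown
with a toy example.  This is a sketch under stated assumptions, not the
method used by Global Names: it links any two strings within an edit
distance of 1 and then takes connected groups (single-linkage
clustering).  The name strings are invented.

```python
from itertools import combinations


def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def single_linkage_clusters(names, max_distance):
    """Group names so that each is within max_distance of at least one
    other member of its group (union-find over all close pairs)."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if edit_distance(a, b) <= max_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())


def widest_gap(cluster):
    """Largest pairwise distance inside one cluster."""
    return max((edit_distance(a, b) for a, b in combinations(cluster, 2)), default=0)


# Invented strings: each differs from its neighbour by one letter.
NAMES = ["Aus alba", "Aus alta", "Aus acta", "Aus acea", "Bus niger"]
clusters = single_linkage_clusters(NAMES, max_distance=1)
for c in clusters:
    print(c, "widest gap:", widest_gap(c))
```

The four "Aus" strings end up in one group because each is one edit from
its neighbour, yet "Aus alba" and "Aus acea" are two edits apart and so
fail the threshold that was used to build the group.  With millions of
strings such chains grow long, which is why a human interface for refining
the results is still needed.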

5. Where was this name published? Can I see that publication?
     The nomenclators should be our primary point of reference for the
first point.  Nomenclators for prokaryotes, fungi, plants, and viruses are
reasonably good.  The protists (especially the heterotrophs) and animals
present us with massive problems. ZooBank is setting up the infrastructure
to help make progress on animals, and with Index Animalium and (let's cross
our fingers) Nomenclator Zoologicus included, there will be a reasonably
good generic framework.  What incentives will entice Zoological Record to
make their content openly available?  After that, we are back to the
taxonomists, and helping them to make their content openly available and
ensuring they are rewarded for doing so.
A proximate task, one that you demonstrated very nicely, is to reconcile
the alternative ways of pointing to a reference.  You are probably in a
better position to estimate what proportion of the relevant literature is
in a digital format, is indexed, and is not behind a paywall or closed off
by copyright issues.


Sorry for the length, but it is my way of saying that the task is not a
small one, and I am sure I have missed out many, many issues.  This is a
task that GBIF, TDWG and others have in mind, but if anything the most
proximate problem is that the focus is still too diffuse.

David Patterson



On Wed, Feb 15, 2012 at 4:50 AM, Roderic Page <r.page at bio.gla.ac.uk> wrote:

> But isn't it bizarre that our field can't offer the kind of service Armand
> is looking for?
>
> Very simple questions are being asked:
>
> 1. Is this a name?
> 2. Is this the correct way to write it?
> 3. Is this name currently in use?
> 4. What other names are related to this name (e.g., synonyms, lexical
> variants)?
> 5. Where was this name published? Can I see that publication?
>
> Yes, there will always be edge cases, but in general these are
> straightforward questions and yet we have failed to provide a simple,
> global tool to answer them.
>
> Regards
>
> Rod
>
> On 14 Feb 2012, at 21:02, Stephen Thorpe wrote:
>
> > Hi Armand,
> > Your question opens a familiar "can of worms", as we say!
> > Currently, there is no comprehensive source of validated scientific
> > names, and certain vagueness in some ICZN Code articles makes it unlikely
> > that there could ever be a robust notion of name availability.
> > I work on Wikispecies to create something akin to what you want, except
> > that I try to make the names verifiable by the user, rather than just
> > saying "you can trust me". Hence, it involves work on the part of the user.
> > Most users want someone else to do the work, and to be "spoon fed" with
> > validated names, but this just isn't realistic ...
> >
> > Cheers,
> > Stephen
> >
> >
> >
> > ________________________________
> > From: Armand Turpel <armand.turpel.mnhn at gmail.com>
> > To: taxacom at mailman.nhm.ku.edu
> > Sent: Tuesday, 14 February 2012 9:42 PM
> > Subject: [Taxacom] validation of taxon names
> >
> > Hi,
> >
> > We have a database with over 80000 species taxon names which we want to
> > compare and validate against other databases. Doing this job isn’t very
> > easy:
> >
> > 1. The majority of organizations only provide web interfaces to search
> > for single taxon names.
> > 2. Copyrights of data are some times not very clear
> > 3. Quality of data is doubtful. >
> > - Lamia amputator Guérin-Méneville, 1844
> > - Lamia amputator Guerin-Meneville, 1844
> > - Lamia amputator Guérin-Méneville
> > - …...
> >
> >
> > The only organization we know that provide its whole database for
> > download is species2000 (catalogue of life > COL). We created a
> > postgresql version for the COL data from which it is possible to compare
> > a big number of taxon names in one run. Postgresql provide good fuzzy
> > string algorithms. But the COL data isn’t error free and it isn’t
> > complete for our region.
> >
> > The question is: Which organization provide trustful, complete (as
> > possible) and full accessible data?
> >
> > a+
> >
> > arm
> >
> > _______________________________________________
> >
> > Taxacom Mailing List
> > Taxacom at mailman.nhm.ku.edu
> > http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
> >
> > The Taxacom archive going back to 1992 may be searched with either of
> these methods:
> >
> > (1) by visiting http://taxacom.markmail.org
> >
> > (2) a Google search specified as:  site:
> mailman.nhm.ku.edu/pipermail/taxacom  your search terms here
>
> ---------------------------------------------------------
> Roderic Page
> Professor of Taxonomy
> Institute of Biodiversity, Animal Health and Comparative Medicine
> College of Medical, Veterinary and Life Sciences
> Graham Kerr Building
> University of Glasgow
> Glasgow G12 8QQ, UK
>
> Email: r.page at bio.gla.ac.uk
> Tel: +44 141 330 4778
> Fax: +44 141 330 2792
> AIM: rodpage1962 at aim.com
> Facebook: http://www.facebook.com/profile.php?id=1112517192
> Twitter: http://twitter.com/rdmpage
> Blog: http://iphylo.blogspot.com
> Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
>
>



-- 
___________________________________
David J Patterson

Senior Scientist, Marine Biological Laboratory
Life Sciences Lead, Data Conservancy
globalnames.org

7 MBL Street, Woods Hole, MASS 02543, USA.


