[Taxacom] validation of taxon names

Roderic Page r.page at bio.gla.ac.uk
Wed Feb 15 11:13:08 CST 2012


Dear Paddy,

I think we're doing what we often do, which is making things harder than they need to be. We expect to be able to provide perfect or near-perfect results, when close enough will do. If I type something into Google I don't expect the thing I'm after to be the first hit; I'm prepared to cut Google a little slack and use my own judgement about the results.

1. Is this a name? 
Why not say, for the sake of argument, that if a name is in, say, Global Names, then it's a name? Obviously there will be errors, but if I can see the source of the name then I can make some judgement (or have it made for me) about whether it really is a name.

2. Is this the correct way to write it? 
I guess I had in mind that if there are multiple variants of a name and authority, then the one that is the most complete and has the appropriate diacritic marks etc. would be offered as the preferred way to write it.
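
A minimal sketch of that rule in Python, assuming "most complete" means having a year first, then the most diacritic marks, then the most tokens. The scoring order and the function names are my assumptions for illustration, not an agreed rule; the variant strings are the ones from Armand's message.

```python
import re
import unicodedata


def count_diacritics(text: str) -> int:
    """Count combining marks after canonical decomposition (e.g. the accent in 'é')."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.combining(ch))


def completeness_score(name_string: str) -> tuple:
    """Assumed ranking: has a year, then number of diacritics, then number of tokens."""
    has_year = bool(re.search(r"\b(1[5-9]\d\d|20\d\d)\b", name_string))
    return (has_year, count_diacritics(name_string), len(name_string.split()))


def preferred_form(variants: list) -> str:
    """Return the variant with the highest completeness score."""
    return max(variants, key=completeness_score)


variants = [
    "Lamia amputator Guerin-Meneville, 1844",
    "Lamia amputator Guérin-Méneville",
    "Lamia amputator Guérin-Méneville, 1844",
]
print(preferred_form(variants))
```

The other variants are not thrown away; they simply rank lower, so a user can still see them and disagree with the choice.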

3. Is this name currently in use?
Yes, Google would work, or indeed something like Ryan Schenk's tool. Show users the usage of a name over time and they can decide if a name is in use.

4. What other names are related to this name (e.g., synonyms, lexical variants)?  
Again, I don't necessarily need a definitive answer, but showing me other names (and ideally links to who uses them) would help. If there's dispute about what a name refers to, let me have that information and I'll decide what to do about it.
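
As a rough sketch of what "show me other names and who uses them" could look like, here is some Python that ranks candidate strings by similarity to a query and reports a source for each, leaving the decision to the reader. The record layout, the source labels, the 0.8 threshold and the function names are all hypothetical, and difflib is just the standard-library matcher, not what any existing names service uses. The third record is there only to illustrate a non-match.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalise(name_string: str) -> str:
    """Strip diacritics, case and punctuation so lexical variants compare closely."""
    decomposed = unicodedata.normalize("NFKD", name_string)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^0-9a-z]+", " ", unaccented.casefold()).split())


def related_names(query: str, records: list, threshold: float = 0.8) -> list:
    """Return (score, name string, source) for records similar to the query, best first."""
    target = normalise(query)
    hits = []
    for name_string, source in records:
        score = SequenceMatcher(None, target, normalise(name_string)).ratio()
        if score >= threshold:
            hits.append((round(score, 2), name_string, source))
    return sorted(hits, reverse=True)


# Hypothetical records: (name string, where it was seen).
records = [
    ("Lamia amputator Guérin-Méneville, 1844", "source A"),
    ("Lamia amputator Guérin-Méneville", "source B"),
    ("Lamia textor Linnaeus, 1758", "source C"),
]

hits = related_names("Lamia amputator Guerin-Meneville, 1844", records)
for score, name_string, source in hits:
    print(score, name_string, source)
```

The scores are only a hint. A pairwise threshold like this will sometimes pull in strings that are not variants of each other, which is exactly why the source is shown alongside each hit.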

5. Where was this name published? Can I see that publication?  
This is where I'm spending a lot of time at the moment, and there's a lot more literature available than people might expect - cue Paul telling me that Fungi have had this for years ;)  Zoological Record has been serving LSIDs for a while now, so in a sense a lot of their basic data is out there (we just need to convert it into something useful).

I guess I'd argue that the task is challenging, but nowhere near as hard as we've made it. All of the above we could provide right now if we focussed a little more on what users might want and were willing to accept that we won't always be right.

I also think the issue is less about money than we might think. An awful lot of money has been spent already and we don't have a whole lot to show for it.

Regards

Rod


On 15 Feb 2012, at 16:08, David Patterson wrote:

> Rod
> 
> Bizarre is not the word I would use, having (because of my association with the Global Names project) some appreciation of the extent of the problem.  But I am certainly embarrassed that we have made so little effective progress towards what is such an obvious goal.
> 
> Progress, I think, requires a more analytical perspective, and a willingness to work collaboratively.
> 
> Our sub-discipline contrasts massively with, for example, the molecular domain, where open sharing of content is the norm.  As a community we have shown much less readiness to share names and taxonomic content.  So, a 'social' change is needed there.   We can provide the tools, but attitudes will need to change.  And change relates not just to people and content, but also to the development of services and software.  New tools need to be designed so that they can be interconnected, creating a much better open toolkit.  Money, and especially money not tied to short duration projects
> 
> Then we require an infrastructure that allows those who are willing to share content, annotate names as being valid now according to someone, declare synonymies, flag chresonyms, disambiguate homonyms, interconnect lexical variants, offer alternative taxonomic perspective, and provide means for integrating vernacular names and names for surrogates.  But that's not too hard, is it?
> 
> That structure is in need of the means of capturing ALL new changes that occur out there.  
> 
> Rod, you have been significant in showing us that it is feasible to make massive progress, but then you are equally aware of the spectrum of outlets and players. That diversity  defies simple and quick solutions, and frustrates the universal fix. The result should be seen as a process that will improve with time.  That process will need some kind of overview, but an overview that is aware that resources for developments that go well beyond 'proof of concept' are lacking.
> 
> Given the diversity of players and the wealth of expertise, the solution needs to enable crowd sourcing.  Such crowd sourcing would include the capacity for anyone to comment on any element of information, for some players to have authority to make changes, for all initiatives to be interlinked, and for an alert system that keeps all interested parties aware of all changes as they happen.  The expectation is that this evolves to synchrony of all parts. Crowd sourcing needs to accept that there is more than one point of view as to what constitutes an entity, and how the entities should be arranged.
> 
> With such a structure, what progress might we get?
> 
> 1. Is this a name?  
>      This answer needs access to all names and all variants.   Global Names has about 22,000,000 strings that are purported to be names, but the contents of this are extremely dirty (intentionally so). To find a string in GNI certainly does not mean that it will be a name.  So, that needs to be fixed by flagging entries as 'names' and 'not names'.  Some of the other weaknesses in Global Names are those taxonomic territories without good coverage or where content sharing does not happen.  A lot of namestrings from the older literature have yet to be added.  Taxonomic sources tend not to be comprehensive with synonyms, certainly not with lexical variants, are often contaminated with chresonyms, and vary in terms of taxonomic currency.  There is often too much distance between expert and point of contact with the product - making it difficult to correct errors. So, these are a few of the issues on this front.  
>       It would be useful if we could run some exercises to assess how close to the asymptote we are. It would also be of interest to get input on priorities, given that there are many tasks.
> 
> 2. Is this the correct way to write it?  
>     I would suggest this question needs to be rephrased, given that there are many correct ways to write a scientific name - mostly with variations in the authority department.  If we limit ourselves to thinking about the correct spelling of the scientific elements, then this remains the responsibility of the nomenclators. Making nomenclator content open and placing nomenclators within crowd sourcing will help to overcome issues of scale. Historically, nomenclators have defined their own taxonomic context, but this is no longer a feasible stance.  Homonyms, as Tony Rees has pointed out, abound; some more taxa become ambiregnal, some less so.   Openness of nomenclatural content is an issue. 
> 
> 
> 3. Is this name currently in use?  
>      Google can probably provide an answer of sorts to that question, but again I suspect this is a question to be refined.  Is the question: Is this name, currently, considered to be a nomenclaturally valid name for a taxon?  Given that species concepts are rarely universally accepted, and that understanding of relationships is improving such that binomials change, I am also assuming that we accept that there will be more than one list of valid names for the same taxonomic area.  Building that polytheism into the infrastructure is not too challenging, but will the experts accept that, and will the consumer accept it?
> 
> 4. What other names are related to this name (e.g., synonyms, lexical variants)?  
>      Yes, this is what we called, in our TREE paper, 'reconciliation'.  Lexical variants are probably the simplest to deal with, but algorithms that seek to scale to all name strings run into problems where the fuzziness of lexical variants of similar names overlaps, and that leads to massive aggregates of names - many of which are not lexical variants of all the others.  There will never be a perfect algorithm for this, so it also needs a human interface that allows results to be refined.  Homotypic synonyms (objective synonyms) can probably be found with fair success by algorithms.  They are, like heterotypic synonyms, embedded in taxonomic treatments.  Which leads us to the issue of access to taxonomic treatments.  Can we make them more openly available, in a form in which the content will flow to all of us, and can we create an infrastructure through which any complementary data can flow back to reward the participating taxonomists?  Taxonomic treatments differ in completeness and may take differing and equally valid perspectives on the same taxa, so we still have a long way to go to merge the compatible thoughts and separate the incompatible ones.
> 
> 5. Where was this name published? Can I see that publication?  
>      The nomenclators should be our primary point of reference for the first point.  Nomenclators for prokaryotes, fungi, plants, and viruses are reasonably good.  The protists (especially the heterotrophs) and animals present us with massive problems. ZooBank is setting up the infrastructure to help make progress on animals, and with Index Animalium and (let's cross our fingers) Nomenclator Zoologicus included, there will be a reasonably good generic framework.  What incentives will entice Zoological Record to make their content openly available?  After that, we are back to the taxonomists, and helping them to make their content openly available and ensuring they are rewarded for doing so.  
> A proximate task, one that you demonstrated very nicely, is to reconcile the alternative ways of pointing to a reference.  You are probably in a better position to estimate what proportion of the relevant literature is in a digital format, is indexed, and is not behind a paywall or closed off by copyright issues.
> 
> 
> Sorry for the length, but it is my way of saying that the task is not a small one, and I am sure I have missed out many many issues.  This is a task that GBIF, TDWG and others have in mind, but if anything the most proximate problem is that the focus is still too diffuse.
> 
> David Patterson
> 
> 
> 
> On Wed, Feb 15, 2012 at 4:50 AM, Roderic Page <r.page at bio.gla.ac.uk> wrote:
> But isn't it bizarre that our field can't offer the kind of service Armand is looking for?
> 
> Very simple questions are being asked:
> 
> 1. Is this a name?
> 2. Is this the correct way to write it?
> 3. Is this name currently in use?
> 4. What other names are related to this name (e.g., synonyms, lexical variants)?
> 5. Where was this name published? Can I see that publication?
> 
> Yes, there will always be edge cases, but in general these are straightforward questions and yet we have failed to provide a simple, global tool to answer them.
> 
> Regards
> 
> Rod
> 
> On 14 Feb 2012, at 21:02, Stephen Thorpe wrote:
> 
> > Hi Armand,
> > Your question opens a familiar "can of worms", as we say!
> > Currently, there is no comprehensive source of validated scientific names, and certain vagueness in some ICZN Code articles makes it unlikely that there could ever be a robust notion of name availability.
> > I work on Wikispecies to create something akin to what you want, except that I try to make the names verifiable by the user, rather than just saying "you can trust me". Hence, it involves work on the part of the user. Most users want someone else to do the work, and to be "spoon fed" with validated names, but this just isn't realistic ...
> >
> > Cheers,
> > Stephen
> >
> >
> >
> > ________________________________
> > From: Armand Turpel <armand.turpel.mnhn at gmail.com>
> > To: taxacom at mailman.nhm.ku.edu
> > Sent: Tuesday, 14 February 2012 9:42 PM
> > Subject: [Taxacom] validation of taxon names
> >
> > Hi,
> >
> > We have a database with over 80000 species taxon names which we want to
> > compare and validate against other databases. Doing this job isn’t very
> > easy:
> >
> > 1. The majority of organizations only provide web interfaces to search
> > for single taxon names.
> > 2. Copyrights of data are sometimes not very clear
> > 3. Quality of data is doubtful:
> > - Lamia amputator Guérin-Méneville, 1844
> > - Lamia amputator Guerin-Meneville, 1844
> > - Lamia amputator Guérin-Méneville
> > - …...
> >
> >
> > The only organization we know of that provides its whole database for
> > download is Species 2000 (Catalogue of Life, COL). We created a
> > PostgreSQL version of the COL data, with which it is possible to compare
> > a large number of taxon names in one run. PostgreSQL provides good fuzzy
> > string algorithms. But the COL data isn’t error free and it isn’t
> > complete for our region.
> >
> > The question is: Which organization provides trustworthy, complete (as far
> > as possible) and fully accessible data?
> >
> > a+
> >
> > arm
> >
> > _______________________________________________
> >
> > Taxacom Mailing List
> > Taxacom at mailman.nhm.ku.edu
> > http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
> >
> > The Taxacom archive going back to 1992 may be searched with either of these methods:
> >
> > (1) by visiting http://taxacom.markmail.org
> >
> > (2) a Google search specified as:  site:mailman.nhm.ku.edu/pipermail/taxacom  your search terms here
> 
> ---------------------------------------------------------
> Roderic Page
> Professor of Taxonomy
> Institute of Biodiversity, Animal Health and Comparative Medicine
> College of Medical, Veterinary and Life Sciences
> Graham Kerr Building
> University of Glasgow
> Glasgow G12 8QQ, UK
> 
> Email: r.page at bio.gla.ac.uk
> Tel: +44 141 330 4778
> Fax: +44 141 330 2792
> AIM: rodpage1962 at aim.com
> Facebook: http://www.facebook.com/profile.php?id=1112517192
> Twitter: http://twitter.com/rdmpage
> Blog: http://iphylo.blogspot.com
> Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
> 
> 
> 
> -- 
> ___________________________________
> David J Patterson
> 
> Senior Scientist, Marine Biological Laboratory
> Life Sciences Lead, Data Conservancy
> globalnames.org
> 
> 7 MBL Street, Woods Hole, MASS 02543, USA.
> 

---------------------------------------------------------
Roderic Page
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email: r.page at bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
AIM: rodpage1962 at aim.com
Facebook: http://www.facebook.com/profile.php?id=1112517192
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html



