[Taxacom] validation of taxon names

Roderic Page r.page at bio.gla.ac.uk
Thu Feb 16 00:18:49 CST 2012

Dear Tony,

Yes, in many ways we are close, we just haven't put the bits together in one highly visible place that people can use.

I used to run the "Taxonomic Search Engine" described in http://dx.doi.org/10.1186/1471-2105-6-48 , which was an example of a federated search engine that queried multiple sources on the fly (a bit like http://botany.si.edu/ing/ but I used web services and extracted and reformatted the data rather than show results as web pages inside frames).

The two biggest problems were the source services going offline, or changing their web services (breaking my code). There were also performance issues when doing "live" searching as opposed to searching a local list. You've covered all these issues nicely. There's also the problem of redundancy. Because some lists are aggregations of other lists, we can end up with the same source being represented several times in the results, but without the user being aware of this.
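
To make two of these problems concrete (a source being offline, and the same underlying list turning up via several services), here is a minimal sketch in Python. It is not the code of the Taxonomic Search Engine; the names Hit, federated_search, service_a, service_b, service_c and "list_a" are all hypothetical, and the "services" are local stand-in functions rather than real web service calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Hit:
    name: str     # the taxon name that matched the query
    service: str  # the (hypothetical) service that returned the record
    origin: str   # the list the record ultimately derives from


# A source is any callable that takes a query and returns hits.
Source = Callable[[str], List[Hit]]


def federated_search(query: str, sources: Dict[str, Source]) -> Tuple[List[Hit], List[str]]:
    """Query every source, skip those that fail, collapse redundant hits."""
    hits: List[Hit] = []
    offline: List[str] = []
    for label, fetch in sources.items():
        try:
            hits.extend(fetch(query))
        except Exception:
            # A source that is down (or has changed its interface) is
            # reported to the caller instead of breaking the whole search.
            offline.append(label)

    # Keep one hit per (name, origin) pair, so a list that is reached both
    # directly and through an aggregator is only shown once.
    unique: Dict[Tuple[str, str], Hit] = {}
    for hit in hits:
        unique.setdefault((hit.name.lower(), hit.origin), hit)
    return list(unique.values()), offline


# Stand-in sources (assumptions for illustration only).
def service_a(query: str) -> List[Hit]:
    return [Hit(query, "service_a", "list_a")]


def service_b(query: str) -> List[Hit]:
    # An aggregator that republishes records from list_a.
    return [Hit(query, "service_b", "list_a")]


def service_c(query: str) -> List[Hit]:
    raise ConnectionError("service_c is offline")


SOURCES: Dict[str, Source] = {
    "service_a": service_a,
    "service_b": service_b,
    "service_c": service_c,
}
```

Calling federated_search("Aus bus", SOURCES) returns the deduplicated hits together with the labels of the sources that could not be reached. Deduplication here relies on each record carrying its original provenance, which is an assumption; in practice many aggregated lists do not expose it.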

I just think there's clearly scope for bringing these sorts of services together in one place and providing people with some tools to address the problems they face, rather than the chaotic landscape we present at the moment.



On 16 Feb 2012, at 03:19, <Tony.Rees at csiro.au> wrote:

> Dear all,
> My take on the issue/s below...
> Basically what is sought is a taxonomic name reconciliation service (TNRS) - now, where have I heard that before...
> In my mind this comprises three coupled components:
> (1) a web (human and machine) interface to search a "master list" or global taxon register
> (2) a fuzzy match component to cope with misspelled queries
> (3) the master list itself.
> Prototypes of just such a system have been in existence for several years, notably:
> - the GRIN Taxonomic Nomenclature Checker http://pgrdoc.bioversity.cgiar.org/taxcheck/ (the granddaddy of such systems), operating over the GRIN database for higher plants
> - my own IRMNG http://www.cmar.csiro.au/datacentre/irmng/, operating over many/most genus names plus species names from the Catalogue of Life in the main, and more recently
> - the iPlant TNRS http://tnrs.iplantcollaborative.org/TNRSapp.html, operating over the TROPICOS plant database
> - the World Register of Marine Species portal http://www.marinespecies.org/aphia.php operating over its own data, and
> - the Australian "National Species Lists" search service http://biodiversity.org.au/service/taxamatch (working over different master lists as their reference database) as mentioned by Greg; plus no doubt others I have not mentioned.
> So we could say that the technical aspects of (1) and (2) are basically not a problem. The residual problem is the construction of the (actual or virtual) "master list". An example of a virtual "master list" (for plants, at genus level only) is provided by the Index Nominum Genericorum (ING) portal at http://botany.si.edu/ing/ : entering a query in the first text box does a real time distributed search of designated resources which between them provide a (partly overlapping) coverage of most/all plant genus names, comprising ING itself, IPNI, Index Nominum Algarum, TROPICOS, and Index Fungorum in this instance (however without any fuzzy match function). This is one approach; benefits are the removal of the need for one site to hold all the names to search over, plus the removal of any synchronization issues between a remote "point of truth" for the records and a central cache of the data. Downsides are the fact that deduplication/data harmonization of potential duplicates from multiple sources is not done; multiple taxonomic concepts/hierarchies may be in use at the various providers; it is difficult to provide fuzzy search over remotely hosted data; plus if any provider is off line at time of query, their data are not searched.
> The alternative is for remote provider content to be regularly crawled or exported, then cached centrally for the search process. This also provides the option for additional QA / data deduplication and harmonization and also notionally improved performance (provided that sufficient resources can be thrown at the one machine where the searches execute in real time). Disadvantages then include the need for continuous assembly and reassembly of the aggregate dataset, and the possibility of the central "view" of the data being out of sync with the latest changes at the provider; but this is something which can be managed in the main (as is already done by many similar "aggregators" of species distribution records).
> So the residual question is, where are all the data - also in some cases, whose content is the most up-to-date / complete / authoritative / accepted in case of potential multiple sources; plus of course, filling gaps where currently there is no obvious source of content for a particular group or region. These are questions which concern key projects at the present time, exemplified by the Catalogue of Life partnership (Sp2000 plus ITIS) for extant data, other sources for data on fossils, and the "Global Names" partnership, to name a few. Others more qualified than myself can answer this question and look at associated issues of resourcing, persistence, data completeness, data sharing culture and the like but at least what we have is a start...
> So the upshot of the above is:
> - For plants, GRIN and iPlant already provide most of the desired functionality (also ING distributed search for genera)
> - for marine species, try WoRMS
> - for (e.g.) Australian species, Greg's "NSLs" project resources as above, for European species, PESI http://www.eu-nomen.eu/portal/, and so on
> - for Cat. of Life species, plus genera from ING (plants) and Nomenclator Zoologicus (animals) plus elsewhere, my own IRMNG.
> None of the above resources are as yet complete or completely populated (in my case, definitely not...) however they are not only pointers along the road but useable resources today.
> How do we get to where we want to be? Improve the master list, keep it up-to-date, continuously improve the quality and completeness of the accessible data... But it does require a "client focus" which provides strong directions for the types of services to be provided, and their actual useability when accessed. (Who has the mandate / who pays are different issues of course).
> Probably little of this is news to Taxacomers, but I thought I would just show that it is not all gloom and doom. And this is not to mention the myriad (and often excellent) taxon-specific database projects out there, of which Paul Kirk and Chris Thompson have already mentioned shining examples, to name but two... - most of which are already engaged in either Sp2000, Global Names Architecture, or both.
> Regards - Tony
> Tony Rees
> Manager, Divisional Data Centre,
> CSIRO Marine and Atmospheric Research,
> GPO Box 1538,
> Hobart, Tasmania 7001, Australia
> Ph: 0362 325318 (Int: +61 362 325318)
> Fax: 0362 325000 (Int: +61 362 325000)
> e-mail: Tony.Rees at csiro.au
> Manager, OBIS Australia regional node, http://www.obis.org.au/
> Biodiversity informatics research activities: http://www.cmar.csiro.au/datacentre/biodiversity.htm
> Personal info: http://www.fishbase.org/collaborators/collaboratorsummary.cfm?id=1566
> LinkedIn profile: http://www.linkedin.com/pub/tony-rees/18/770/36
>> -----Original Message-----
>> From: taxacom-bounces at mailman.nhm.ku.edu [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Roderic Page
>> Sent: Thursday, 16 February 2012 6:24 AM
>> To: taxacom
>> Subject: Re: [Taxacom] validation of taxon names
>> Dear Doug,
>> Regarding
>>> #4 is something that cannot be objectively determined, because
>>> synonymy is almost invariably subjective.
>> presumably the fact that person x asserted that two names are synonyms
>> can be determined objectively, and that's all we need to know.
>> Regards
>> Rod
>> On 15 Feb 2012, at 18:40, Doug Yanega wrote:
>>> I would observe that, for zoological names, of the following list:
>>>> 1. Is this a name?
>>>> 2. Is this the correct way to write it?
>>>> 3. Is this name currently in use?
>>>> 4. What other names are related to this name (e.g., synonyms,
>>>> lexical variants)?
>>>> 5. Where was this name published? Can I see that publication?
>>> at least 1 and 5 are questions for which an objective and definitive
>>> answer (via application of the ICZN for #1) can be arrived at, and
>>> that the answer will not change. Thus, these are things which could
>>> be made part of a permanent public archive (hopefully, something like
>>> ZooBank).
>>> #2 and 3 are things that can, in essence, be objectively determined
>>> under the Code, but are subject to the nuance of "prevailing usage" -
>>> that is, a sudden change in how taxonomists treat a name can shift
>>> the answer from "no" to "yes" (in both cases) or from "yes" to "no"
>>> (for #2). One hope that I have is that a mechanism for Registration
>>> can be implemented in the future which will prevent such fluctuation,
>>> and thus make the answers to 2 and 3 immutable, as well.
>>> #4 is something that cannot be objectively determined, because
>>> synonymy is almost invariably subjective.
>>> Realistically, then, this list represents a mixed bag of the
>>> immediately attainable, the potentially attainable, and the
>>> unattainable. It might be more productive to focus on the former
>>> categories, in terms of a community-wide goal. I'll further note that
>>> if taxonomists want a system of Registration that will result in
>>> permanently stable names, then they are probably going to have to
>>> insist upon it, *and* be willing to participate in the process
>>> (because such a process is likely to require public review). I'm not
>>> 100% sure whether botanical names would work exactly the same way,
>>> but I expect that the situation would be pretty much the same.
>>> Peace,
>>> --
>>> Doug Yanega        Dept. of Entomology         Entomology Research Museum
>>> Univ. of California, Riverside, CA 92521-0314        skype: dyanega
>>> phone: (951) 827-4315 (standard disclaimer: opinions are mine, not UCR's)
>>>             http://cache.ucr.edu/~heraty/yanega.html
>>>  "There are some enterprises in which a careful disorderliness
>>>        is the true method" - Herman Melville, Moby Dick, Chap. 82

Roderic Page
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email: r.page at bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
AIM: rodpage1962 at aim.com
Facebook: http://www.facebook.com/profile.php?id=1112517192
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
