[Taxacom] validation of taxon names
armand.turpel.mnhn at gmail.com
Thu Feb 16 07:12:38 CST 2012
I have some doubts while reading the architecture description here:
Amongst the problems you described, my experience is that web services are
slow and technically difficult to maintain. I don’t think that such a
service can handle my demand in the initial post of this thread. But there
is a tool which may be worth mentioning. Document-based databases such as
couchdb can solve a lot of those problems. Here is an excerpt from the couchdb
“CouchDB is a peer based distributed database system. Any number of CouchDB
hosts (servers and offline-clients) can have independent “replica copies”
of the same database, where applications have full database interactivity
(query, add, edit, delete). When back online or on a schedule, database
changes are replicated bi-directionally...”
Another advantage of such a system is that it also serves as an
application server from which applications can be replicated. Setting up and
maintaining a network based on this takes much less effort.
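To make the replication idea concrete, here is a minimal sketch of the JSON bodies one would POST to CouchDB's `/_replicate` endpoint. Each replication request runs in one direction only, so a two-way sync is a pair of requests with source and target swapped. The database URLs and helper names below are hypothetical, and the sketch only builds the request bodies; it does not contact any server.

```python
import json


def replication_request(source, target, continuous=False):
    """Build the JSON body for a POST to CouchDB's /_replicate endpoint."""
    return {"source": source, "target": target, "continuous": continuous}


def bidirectional(local, remote, continuous=False):
    # Replication is one-way per request, so a two-way sync is simply
    # two requests with source and target swapped.
    return [
        replication_request(local, remote, continuous),
        replication_request(remote, local, continuous),
    ]


# Hypothetical database URLs, for illustration only.
LOCAL = "http://localhost:5984/taxon_names"
REMOTE = "http://names.example.org:5984/taxon_names"

if __name__ == "__main__":
    for body in bidirectional(LOCAL, REMOTE, continuous=True):
        print(json.dumps(body))
```

With `continuous` set to true the server keeps the replication running, which corresponds to the "when back online or on a schedule" behaviour in the excerpt.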
What do you think? As far as I know you have worked with couchdb:
2012/2/16 Roderic Page <r.page at bio.gla.ac.uk>
> Dear Tony,
> Yes, in many ways we are close, we just haven't put the bits together in
> one highly visible place that people can use.
> I used to run the "Taxonomic Search Engine" described in
> http://dx.doi.org/10.1186/1471-2105-6-48 , which was an example of a
> federated search engine that queried multiple sources on the fly (a bit
> like http://botany.si.edu/ing/ but I used web services and extracted and
> reformatted the data rather than show results as web pages inside frames).
> The two biggest problems were the source services going offline, or
> changing their web services (breaking my code). There were also performance
> issues when doing "live" searching as opposed to searching a local list.
> You've covered all these issues nicely. There's also the problem of
> redundancy. Because some lists are aggregations of other lists, we can end
> up with the same source being represented several times in the results, but
> without the user being aware of this.
> I just think there's clearly scope for bringing these sorts of services
> together in one place and providing people with some tools to address the
> problems they face, rather than the chaotic landscape we present at the
> On 16 Feb 2012, at 03:19, <Tony.Rees at csiro.au> <Tony.Rees at csiro.au> wrote:
> > Dear all,
> > My take on the issue/s below...
> > Basically what is sought is a taxonomic name reconciliation service
> (TNRS) - now, where have I heard that before...
> > In my mind this comprises three coupled components:
> > (1) a web (human and machine) interface to search a "master list" or
> global taxon register
> > (2) a fuzzy match component to cope with misspelled queries
> > (3) the master list itself.
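As a rough illustration of how components (2) and (3) fit together, the sketch below checks a query against a tiny in-memory "master list" and falls back to approximate matching when there is no exact hit. It uses Python's standard `difflib` as a generic stand-in; this is not the matching algorithm used by any of the services named in this thread, and the list contents, function name, and cutoff are assumptions chosen for the example.

```python
import difflib

# Tiny stand-in for the master list; a real register would be far larger.
MASTER_LIST = ["Homo sapiens", "Pan troglodytes", "Quercus robur", "Quercus rubra"]


def reconcile(query, names=MASTER_LIST, cutoff=0.8):
    """Return an exact hit if there is one, else close matches, best first."""
    by_lower = {n.lower(): n for n in names}
    # Normalise case and runs of whitespace before comparing.
    key = " ".join(query.split()).lower()
    if key in by_lower:
        return [by_lower[key]]
    close = difflib.get_close_matches(key, list(by_lower), n=3, cutoff=cutoff)
    return [by_lower[c] for c in close]


if __name__ == "__main__":
    print(reconcile("homo  sapiens"))   # exact after normalisation
    print(reconcile("Homo sapeins"))    # misspelled query
```

A generic string-similarity cutoff like this knows nothing about nomenclature (authorship, rank, gender endings), so it should be read only as a placeholder for the fuzzy-match component.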
> > Prototypes of just such a system have been in existence for several
> years, notably:
> > - the GRIN Taxonomic Nomenclature Checker
> http://pgrdoc.bioversity.cgiar.org/taxcheck/ (the granddaddy of such
> systems), operating over the GRIN database for higher plants
> > - my own IRMNG http://www.cmar.csiro.au/datacentre/irmng/, operating
> over many/most genus names plus species names from Catalogue of Life in the
> main, and more recently
> > - the iPlant TNRS http://tnrs.iplantcollaborative.org/TNRSapp.html,
> operating over the TROPICOS plant database
> > - the World Register of Marine Species portal
> http://www.marinespecies.org/aphia.php operating over its own data, and
> > - the Australian "National Species Lists" search service
> http://biodiversity.org.au/service/taxamatch (working over different
> master lists as their reference database) as mentioned by Greg; plus no
> doubt others I have not mentioned.
> > So we could say that the technical aspects of (1) and (2) are basically
> not a problem. The residual problem is the construction of the (actual or
> virtual) "master list". An example of a virtual "master list" (for plants,
> at genus level only) is provided by the Index Nominum Genericorum (ING)
> portal at http://botany.si.edu/ing/ : entering a query in the first text
> box does a real time distributed search of designated resources which
> between them provide a (partly overlapping) coverage of most/all plant
> genus names, comprising ING itself, IPNI, Index Nominum Algarum, TROPICOS,
> and Index Fungorum in this instance (however without any fuzzy match
> function). This is one approach; benefits are the removal of the need for
> one site to hold all the names to search over, plus the removal of any
> synchronization issues between a remote "point of truth" for the records
> and a central cache of the data. Downsides are the fact that
> deduplication/data harmonization of potential duplicates from multiple
> sources is not done; multiple taxonomic concepts/hierarchies may be in use
> at the various providers; it is difficult to provide fuzzy search over
> remotely hosted data; plus if any provider is off line at time of query,
> their data are not searched.
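The trade-offs of the distributed approach can be sketched in a few lines. In the example below each provider is a plain Python function standing in for a remote service (all names are hypothetical); a provider that is off line is skipped and reported, and nothing deduplicates records that reach the user through more than one route.

```python
def federated_search(query, providers):
    """Ask each provider in turn; note, rather than fail on, any that are down."""
    hits, offline = [], []
    for name, search in providers.items():
        try:
            hits.extend((name, record) for record in search(query))
        except ConnectionError:
            offline.append(name)
    return hits, offline


# Hypothetical stand-ins for remote services; real ones would be HTTP calls.
def list_a(query):
    return [n for n in ["Quercus", "Querula"] if n.lower() == query.lower()]


def list_b(query):
    raise ConnectionError("provider is offline")


def aggregator_c(query):
    # An aggregator that simply republishes list_a's content.
    return list_a(query)


PROVIDERS = {"list_a": list_a, "list_b": list_b, "aggregator_c": aggregator_c}

if __name__ == "__main__":
    hits, offline = federated_search("quercus", PROVIDERS)
    print(hits)     # the same record appears twice, via list_a and aggregator_c
    print(offline)  # providers whose data were not searched
```

The duplicated hit and the silently missing provider are exactly the two downsides described above; a real client would at least want to show the `offline` list to the user.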
> > The alternative is for remote provider content to be regularly crawled
> or exported, then cached centrally for the search process. This also
> provides the option for additional QA / data deduplication and
> harmonization and also notionally improved performance (provided that
> sufficient resources can be thrown at the one machine where the searches
> execute in real time). Disadvantages then include the need for continuous
> assembly and reassembly of the aggregate dataset, and the possibility of
> the central "view" of the data being out-of-synch with the latest changes
> at the provider; but something which can be managed in the main (as is
> already done by many similar "aggregators" of species distribution records).
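A minimal sketch of the central-cache alternative, assuming each provider's content has already been exported as a simple list of names: the exports are merged into one index keyed by a normalised form of the name, and each entry remembers which providers supplied it. The normalisation here (case and whitespace only) and all names in the example are assumptions; real harmonisation would also have to deal with authorship, spelling variants, and differing taxonomic concepts.

```python
from collections import defaultdict


def build_cache(exports):
    """Merge per-provider name lists into one index keyed by a normalised name."""
    index = defaultdict(set)
    for provider, names in exports.items():
        for name in names:
            # Collapse whitespace and case so trivial variants share one key.
            key = " ".join(name.split()).lower()
            index[key].add(provider)
    return dict(index)


if __name__ == "__main__":
    # Hypothetical exports from two providers.
    exports = {
        "provider_a": ["Quercus robur", "Homo sapiens"],
        "provider_b": ["quercus  robur"],
    }
    for key, sources in sorted(build_cache(exports).items()):
        print(key, sorted(sources))
```

Because the index is rebuilt from exports, it is only as current as the last crawl, which is the out-of-synch risk noted above.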
> > So the residual question is, where are all the data - also in some
> cases, whose content is the most up-to-date / complete / authoritative /
> accepted in case of potential multiple sources; plus of course, filling
> gaps where currently there is no obvious source of content for a particular
> group or region. These are questions which concern key projects at the
> present time, exemplified by the Catalogue of Life partnership (Sp2000 plus
> ITIS) for extant data, other sources for data on fossils, and the "Global
> Names" partnership, to name a few. Others more qualified than myself can
> answer this question and look at associated issues of resourcing,
> persistence, data completeness, data sharing culture and the like but at
> least what we have is a start...
> > So the upshot of the above is:
> > - For plants, GRIN and iPlant already provide most of the desired
> functionality (also ING distributed search for genera)
> > - for marine species, try WoRMS
> > - for (e.g.) Australian species, Greg's "NSLs" project resources as
> above, for European species, PESI http://www.eu-nomen.eu/portal/, and so
> > - for Cat. of Life species, plus genera from ING (plants) and
> Nomenclator Zoologicus (animals) plus elsewhere, my own IRMNG.
> > None of the above resources are as yet complete or completely populated
> (in my case, definitely not...) however they are not only pointers along
> the road but useable resources today.
> > How do we get to where we want to be? Improve the master list, keep it
> up-to-date, continuously improve the quality and completeness of the
> accessible data... But it does require a "client focus" which provides
> strong directions for the types of services to be provided, and their actual
> useability when accessed. (Who has the mandate / who pays are different
> issues of course).
> > Probably little of this is news to Taxacomers, but I thought I would
> just show that it is not all gloom and doom. And this is not to mention the
> myriad (and often excellent) taxon-specific database projects out there, of
> which Paul Kirk and Chris Thompson have already mentioned shining examples,
> to name but two... - most of which are already engaged in either Sp2000,
> Global Names Architecture, or both.
> > Regards - Tony
> > Tony Rees
> > Manager, Divisional Data Centre,
> > CSIRO Marine and Atmospheric Research,
> > GPO Box 1538,
> > Hobart, Tasmania 7001, Australia
> > Ph: 0362 325318 (Int: +61 362 325318)
> > Fax: 0362 325000 (Int: +61 362 325000)
> > e-mail: Tony.Rees at csiro.au
> > Manager, OBIS Australia regional node, http://www.obis.org.au/
> > Biodiversity informatics research activities:
> > Personal info:
> > LinkedIn profile: http://www.linkedin.com/pub/tony-rees/18/770/36
> >> -----Original Message-----
> >> From: taxacom-bounces at mailman.nhm.ku.edu [mailto:taxacom-
> >> bounces at mailman.nhm.ku.edu] On Behalf Of Roderic Page
> >> Sent: Thursday, 16 February 2012 6:24 AM
> >> To: taxacom
> >> Subject: Re: [Taxacom] validation of taxon names
> >> Dear Doug,
> >> Regarding
> >>> #4 is something that cannot be objectively determined, because
> >>> synonymy is almost invariably subjective.
> >> presumably the fact that person x asserted that two names are synonyms
> >> can be determined objectively, and that's all we need to know.
> >> Regards
> >> Rod
> >> On 15 Feb 2012, at 18:40, Doug Yanega wrote:
> >>> I would observe that, for zoological names, of the following list:
> >>>> 1. Is this a name?
> >>>> 2. Is this the correct way to write it?
> >>>> 3. Is this name currently in use?
> >>>> 4. What other names are related to this name (e.g., synonyms,
> >>>> lexical variants)?
> >>>> 5. Where was this name published? Can I see that publication?
> >>> at least 1 and 5 are questions for which an objective and definitive
> >>> answer (via application of the ICZN for #1) can be arrived at, and
> >>> that the answer will not change. Thus, these are things which could
> >>> be made part of a permanent public archive (hopefully, something like
> >>> ZooBank).
> >>> #2 and 3 are things that can, in essence, be objectively determined
> >>> under the Code, but are subject to the nuance of "prevailing usage" -
> >>> that is, a sudden change in how taxonomists treat a name can shift
> >>> the answer from "no" to "yes" (in both cases) or from "yes" to "no"
> >>> (for #2). One hope that I have is that a mechanism for Registration
> >>> can be implemented in the future which will prevent such fluctuation,
> >>> and thus make the answers to 2 and 3 immutable, as well.
> >>> #4 is something that cannot be objectively determined, because
> >>> synonymy is almost invariably subjective.
> >>> Realistically, then, this list represents a mixed bag of the
> >>> immediately attainable, the potentially attainable, and the
> >>> unattainable. It might be more productive to focus on the former
> >>> categories, in terms of a community-wide goal. I'll further note that
> >>> if taxonomists want a system of Registration that will result in
> >>> permanently stable names, then they are probably going to have to
> >>> insist upon it, *and* be willing to participate in the process
> >>> (because such a process is likely to require public review). I'm not
> >>> 100% sure whether botanical names would work exactly the same way,
> >>> but I expect that the situation would be pretty much the same.
> >>> Peace,
> >>> --
> >>> Doug Yanega, Dept. of Entomology, Entomology Research Museum
> >>> Univ. of California, Riverside, CA 92521-0314 skype: dyanega
> >>> phone: (951) 827-4315 (standard disclaimer: opinions are mine, not UCR's)
> >>> http://cache.ucr.edu/~heraty/yanega.html
> >>> "There are some enterprises in which a careful disorderliness
> >>> is the true method" - Herman Melville, Moby Dick, Chap. 82
> Roderic Page
> Professor of Taxonomy
> Institute of Biodiversity, Animal Health and Comparative Medicine
> College of Medical, Veterinary and Life Sciences
> Graham Kerr Building
> University of Glasgow
> Glasgow G12 8QQ, UK
> Email: r.page at bio.gla.ac.uk
> Tel: +44 141 330 4778
> Fax: +44 141 330 2792
> AIM: rodpage1962 at aim.com
> Facebook: http://www.facebook.com/profile.php?id=1112517192
> Twitter: http://twitter.com/rdmpage
> Blog: http://iphylo.blogspot.com
> Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html