[Taxacom] validation of taxon names
r.page at bio.gla.ac.uk
Thu Feb 16 10:24:38 CST 2012
Yes, federated searches are problematic, and only scale with caching, so in effect you end up building a local database anyway.
If there's a single entry point for the web service, and it is using a local database, then you could batch the queries and still get a reasonable response time.
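As a rough illustration of the batching idea, here is a minimal Python sketch. It assumes a service that accepts a list of names in one request; `lookup_batch` and `fake_lookup_batch` are hypothetical stand-ins for that call, not any real API, and the batch size of 500 is arbitrary.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def batched(names: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists holding at most `size` names."""
    batch: List[str] = []
    for name in names:
        batch.append(name)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def check_names(
    names: Iterable[str],
    lookup_batch: Callable[[List[str]], Dict[str, bool]],
    size: int = 500,
) -> Dict[str, bool]:
    """Send names to the service one batch at a time and merge the answers.

    `lookup_batch` is a hypothetical function standing in for a single
    request to the web service: it takes a list of names and returns a
    dict mapping each name to True (known) or False (unknown).
    """
    results: Dict[str, bool] = {}
    for batch in batched(names, size):
        results.update(lookup_batch(batch))
    return results


# Hypothetical stand-in for the remote service, backed by a tiny local set.
KNOWN = {"Homo sapiens", "Pan troglodytes"}


def fake_lookup_batch(batch: List[str]) -> Dict[str, bool]:
    return {name: name in KNOWN for name in batch}
```

With a real service, each call to `lookup_batch` would be one HTTP request, so the number of round trips drops from one per name to one per batch.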
But if you have LOTS of names then having a local copy of the data is great. CouchDB is a wonderful tool, especially for semi-structured data. It excels at some things, but bulk checking of names probably isn't one of them (you'd probably get more joy doing an SQL join in a relational database, then handling the names that don't match).
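To make the SQL join concrete, here is a small sketch using Python's built-in sqlite3 module. The table names (`reference`, `query`) and the function `split_matches` are made up for illustration, and matching is exact string equality only, so anything left unmatched would still need separate handling (fuzzy matching, manual review, and so on).

```python
import sqlite3
from typing import Iterable, List, Tuple


def split_matches(
    reference_names: Iterable[str], query_names: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (matched, unmatched) query names using a single LEFT JOIN.

    Table and column names are illustrative assumptions; a real setup
    would load the reference list once and keep it in a persistent database.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reference (name TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE query (name TEXT)")
    conn.executemany(
        "INSERT OR IGNORE INTO reference VALUES (?)",
        ((n,) for n in reference_names),
    )
    conn.executemany(
        "INSERT INTO query VALUES (?)", ((n,) for n in query_names)
    )
    # Every query name appears once; `found` is 1 if the join hit a row.
    rows = conn.execute(
        """
        SELECT q.name, r.name IS NOT NULL AS found
        FROM query AS q
        LEFT JOIN reference AS r ON r.name = q.name
        ORDER BY q.rowid
        """
    ).fetchall()
    conn.close()
    matched = [name for name, found in rows if found]
    unmatched = [name for name, found in rows if not found]
    return matched, unmatched


matched, unmatched = split_matches(
    ["Homo sapiens", "Pan troglodytes"],
    ["Homo sapiens", "Homo sapien", "Pan troglodytes"],
)
```

Here `unmatched` holds only the misspelled "Homo sapien", which is the short list you would then pass to whatever slower checking you have.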
On 16 Feb 2012, at 13:12, Armand Turpel wrote:
> Dear Roderic,
> I have some doubts while reading the architecture description here:
> Amongst the problems you described, my experience is that web services are
> slow and technically difficult to maintain. I don’t think that such a
> service can handle the demand I described in the initial post of this thread. But there
> is a tool which may be worth mentioning. Document-based databases such as
> CouchDB can solve a lot of those problems. Here is an excerpt from the CouchDB
> “CouchDB is a peer based distributed database system. Any number of CouchDB
> hosts (servers and offline-clients) can have independent “replica copies”
> of the same database, where applications have full database interactivity
> (query, add, edit, delete). When back online or on a schedule, database
> changes are replicated bi-directionally...”
> Another advantage of such a system is that it also serves as an
> application server from which applications can be replicated. Setting up and
> maintaining a network based on this needs much less effort.
> What do you think? As far as I know you have worked with couchdb:
> 2012/2/16 Roderic Page <r.page at bio.gla.ac.uk>
>> Dear Tony,
>> Yes, in many ways we are close, we just haven't put the bits together in
>> one highly visible place that people can use.
>> I used to run the "Taxonomic Search Engine" described in
>> http://dx.doi.org/10.1186/1471-2105-6-48 , which was an example of a
>> federated search engine that queried multiple sources on the fly (a bit
>> like http://botany.si.edu/ing/ but I used web services and extracted and
>> reformatted the data rather than show results as web pages inside frames).
>> The two biggest problems were the source services going offline, or
>> changing their web services (breaking my code). There were also performance
>> issues when doing "live" searching as opposed to searching a local list.
>> You've covered all these issues nicely. There's also the problem of
>> redundancy. Because some lists are aggregations of other lists, we can end
>> up with the same source being represented several times in the results, but
>> without the user being aware of this.
>> I just think there's clearly scope for bringing these sorts of services
>> together in one place and providing people with some tools to address the
>> problems they face, rather than the chaotic landscape we present at the
>> On 16 Feb 2012, at 03:19, <Tony.Rees at csiro.au> wrote:
>>> Dear all,
>>> My take on the issue/s below...
>>> Basically what is sought is a taxonomic name reconciliation service
>> (TNRS) - now, where have I heard that before...
>>> In my mind this comprises three coupled components:
>>> (1) a web (human and machine) interface to search a "master list" or
>> global taxon register
>>> (2) a fuzzy match component to cope with misspelled queries
>>> (3) the master list itself.
>>> Prototypes of just such a system have been in existence for several
>> years, notably:
>>> - the GRIN Taxonomic Nomenclature Checker
>> http://pgrdoc.bioversity.cgiar.org/taxcheck/ (the granddaddy of such
>> systems), operating over the GRIN database for higher plants
>>> - my own IRMNG http://www.cmar.csiro.au/datacentre/irmng/, operating
>> over many/most genus names plus species names from Catalogue of Life in the
>> main, and more recently
>>> - the iPlant TNRS http://tnrs.iplantcollaborative.org/TNRSapp.html,
>> operating over the TROPICOS plant database
>>> - the World Register of Marine Species portal
>> http://www.marinespecies.org/aphia.php operating over its own data, and
>>> - the Australian "National Species Lists" search service
>> http://biodiversity.org.au/service/taxamatch (working over different
>> master lists as their reference database) as mentioned by Greg; plus no
>> doubt others I have not mentioned.
>>> So we could say that the technical aspects of (1) and (2) are basically
>> not a problem. The residual problem is the construction of the (actual or
>> virtual) "master list". An example of a virtual "master list" (for plants,
>> at genus level only) is provided by the Index Nominum Genericorum (ING)
>> portal at http://botany.si.edu/ing/ : entering a query in the first text
>> box does a real time distributed search of designated resources which
>> between them provide a (partly overlapping) coverage of most/all plant
>> genus names, comprising ING itself, IPNI, Index Nominum Algarum, TROPICOS,
>> and Index Fungorum in this instance (however without any fuzzy match
>> function). This is one approach; benefits are the removal of the need for
>> one site to hold all the names to search over, plus the removal of any
>> synchronization issues between a remote "point of truth" for the records
>> and a central cache of the data. Downsides are the fact that
>> deduplication/data harmonization of potential duplicates from multiple
>> sources is not done; multiple taxonomic concepts/hierarchies may be in use
>> at the various providers; it is difficult to provide fuzzy search over
>> remotely hosted data; plus if any provider is off line at time of query,
>> their data are not searched.
>>> The alternative is for remote provider content to be regularly crawled
>> or exported, then cached centrally for the search process. This also
>> provides the option for additional QA / data deduplication and
>> harmonization and also notionally improved performance (provided that
>> sufficient resources can be thrown at the one machine where the searches
>> execute in real time). Disadvantages then include the need for continuous
>> assembly and reassembly of the aggregate dataset, and the possibility of
>> the central "view" of the data being out-of-synch with the latest changes
>> at the provider; but something which can be managed in the main (as is
>> already done by many similar "aggregators" of species distribution records).
>>> So the residual question is, where are all the data - also in some
>> cases, whose content is the most up-to-date / complete / authoritative /
>> accepted in case of potential multiple sources; plus of course, filling
>> gaps where currently there is no obvious source of content for a particular
>> group or region. These are questions which concern key projects at the
>> present time, exemplified by the Catalogue of Life partnership (Sp2000 plus
>> ITIS) for extant data, other sources for data on fossils, and the "Global
>> Names" partnership, to name a few. Others more qualified than myself can
>> answer this question and look at associated issues of resourcing,
>> persistence, data completeness, data sharing culture and the like but at
>> least what we have is a start...
>>> So the upshot of the above is:
>>> - For plants, GRIN and iPlant already provide most of the desired
>> functionality (also ING distributed search for genera)
>>> - for marine species, try WoRMS
>>> - for (e.g.) Australian species, Greg's "NSLs" project resources as
>> above, for European species, PESI http://www.eu-nomen.eu/portal/, and so on
>>> - for Cat. of Life species, plus genera from ING (plants) and
>> Nomenclator Zoologicus (animals) plus elsewhere, my own IRMNG.
>>> None of the above resources are as yet complete or completely populated
>> (in my case, definitely not...) however they are not only pointers along
>> the road but useable resources today.
>>> How do we get to where we want to be? Improve the master list, keep it
>> up-to-date, continuously improve the quality and completeness of the
>> accessible data... But it does require a "client focus" which provides
>> strong directions for the types of services to be provided, and their actual
>> useability when accessed. (Who has the mandate / who pays are different
>> issues of course).
>>> Probably little of this is news to Taxacomers, but I thought I would
>> just show that it is not all gloom and doom. And this is not to mention the
>> myriad (and often excellent) taxon-specific database projects out there, of
>> which Paul Kirk and Chris Thompson have already mentioned shining examples,
>> to name but two... - most of which are already engaged in either Sp2000,
>> Global Names Architecture, or both.
>>> Regards - Tony
>>> Tony Rees
>>> Manager, Divisional Data Centre,
>>> CSIRO Marine and Atmospheric Research,
>>> GPO Box 1538,
>>> Hobart, Tasmania 7001, Australia
>>> Ph: 0362 325318 (Int: +61 362 325318)
>>> Fax: 0362 325000 (Int: +61 362 325000)
>>> e-mail: Tony.Rees at csiro.au
>>> Manager, OBIS Australia regional node, http://www.obis.org.au/
>>> Biodiversity informatics research activities:
>>> Personal info:
>>> LinkedIn profile: http://www.linkedin.com/pub/tony-rees/18/770/36
>>>> -----Original Message-----
>>>> From: taxacom-bounces at mailman.nhm.ku.edu [mailto:taxacom-
>>>> bounces at mailman.nhm.ku.edu] On Behalf Of Roderic Page
>>>> Sent: Thursday, 16 February 2012 6:24 AM
>>>> To: taxacom
>>>> Subject: Re: [Taxacom] validation of taxon names
>>>> Dear Doug,
>>>>> #4 is something that cannot be objectively determined, because
>>>>> synonymy is almost invariably subjective.
>>>> presumably the fact that person x asserted that two names are synonyms
>>>> can be determined objectively, and that's all we need to know.
>>>> On 15 Feb 2012, at 18:40, Doug Yanega wrote:
>>>>> I would observe that, for zoological names, of the following list:
>>>>>> 1. Is this a name?
>>>>>> 2. Is this the correct way to write it?
>>>>>> 3. Is this name currently in use?
>>>>>> 4. What other names are related to this name (e.g., synonyms,
>>>>>> lexical variants)?
>>>>>> 5. Where was this name published? Can I see that publication?
>>>>> at least 1 and 5 are questions for which an objective and definitive
>>>>> answer (via application of the ICZN for #1) can be arrived at, and
>>>>> that the answer will not change. Thus, these are things which could
>>>>> be made part of a permanent public archive (hopefully, something like
>>>>> #2 and 3 are things that can, in essence, be objectively determined
>>>>> under the Code, but are subject to the nuance of "prevailing usage" -
>>>>> that is, a sudden change in how taxonomists treat a name can shift
>>>>> the answer from "no" to "yes" (in both cases) or from "yes" to "no"
>>>>> (for #2). One hope that I have is that a mechanism for Registration
>>>>> can be implemented in the future which will prevent such fluctuation,
>>>>> and thus make the answers to 2 and 3 immutable, as well.
>>>>> #4 is something that cannot be objectively determined, because
>>>>> synonymy is almost invariably subjective.
>>>>> Realistically, then, this list represents a mixed bag of the
>>>>> immediately attainable, the potentially attainable, and the
>>>>> unattainable. It might be more productive to focus on the former
>>>>> categories, in terms of a community-wide goal. I'll further note that
>>>>> if taxonomists want a system of Registration that will result in
>>>>> permanently stable names, then they are probably going to have to
>>>>> insist upon it, *and* be willing to participate in the process
>>>>> (because such a process is likely to require public review). I'm not
>>>>> 100% sure whether botanical names would work exactly the same way,
>>>>> but I expect that the situation would be pretty much the same.
>>>>> Doug Yanega Dept. of Entomology Entomology Research
>>>>> Univ. of California, Riverside, CA 92521-0314 skype: dyanega
>>>>> phone: (951) 827-4315 (standard disclaimer: opinions are mine, not
>>>>> "There are some enterprises in which a careful disorderliness
>>>>> is the true method" - Herman Melville, Moby Dick, Chap. 82
>> Roderic Page
>> Professor of Taxonomy
>> Institute of Biodiversity, Animal Health and Comparative Medicine
>> College of Medical, Veterinary and Life Sciences
>> Graham Kerr Building
>> University of Glasgow
>> Glasgow G12 8QQ, UK
>> Email: r.page at bio.gla.ac.uk
>> Tel: +44 141 330 4778
>> Fax: +44 141 330 2792
>> AIM: rodpage1962 at aim.com
>> Facebook: http://www.facebook.com/profile.php?id=1112517192
>> Twitter: http://twitter.com/rdmpage
>> Blog: http://iphylo.blogspot.com
>> Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
>> Taxacom Mailing List
>> Taxacom at mailman.nhm.ku.edu
>> The Taxacom archive going back to 1992 may be searched with either of
>> these methods:
>> (1) by visiting http://taxacom.markmail.org
>> (2) a Google search specified as: site:
>> mailman.nhm.ku.edu/pipermail/taxacom your search terms here