[Taxacom] TAXAMATCH

Tony.Rees at csiro.au Tony.Rees at csiro.au
Tue Nov 25 16:32:26 CST 2008


Dear all,

Have you ever entered "Caelorinchus" into a species search page, only to discover that the data you want is held under the name "Coelorinchus" (or even Coelorynchus, Coelorhynchis, etc.), or typed "Panulirus" when you really meant "Palinurus" (or vice versa)? Or can you really remember the correct spelling of "Syzygotettix boettcheri"? (I know I can't...)

If any of the above seems applicable to you, you may be interested in work I have been doing to develop TAXAMATCH - a fuzzy matching algorithm for species (and/or genus) scientific names; as a bonus, it also does authority comparisons. My initial experiments in this area started from the premise that users looking for data could not always spell their search terms correctly. Over the last 7 years, however, I have also realised that many stored names vary in spelling from a single canonical form, whether through mis-typing, phonetic errors, OCR errors or other causes. Relevant experts may also disagree over what the correct canonical form is, or that form may change through time (e.g. the Caelorinchus / Coelorinchus case cited above). The algorithm is therefore also useful for comparing names on stored lists to find candidate approximate matches, and/or for deduplication of existing data systems (ensuring that the same taxon or name is included only once, or alternatively, linking variant instances of the same name to each other to avoid double counting, etc.).
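
For readers who have not met fuzzy name matching before, the general idea can be illustrated with a minimal sketch in Python. This is not the TAXAMATCH algorithm, whose rules are not described in this message; it uses plain Levenshtein edit distance as a stand-in, and the function names, the sample list and the max_distance threshold are all assumptions made for the example.

```python
# Illustrative only: plain edit-distance matching, NOT the TAXAMATCH rules.
# edit_distance, near_matches and stored_genera are hypothetical names.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character inserts,
    deletes and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def near_matches(query, names, max_distance=2):
    """Return (name, distance) pairs within max_distance, closest first."""
    q = query.strip().lower()
    scored = [(name, edit_distance(q, name.lower())) for name in names]
    hits = [(name, d) for name, d in scored if d <= max_distance]
    return sorted(hits, key=lambda pair: (pair[1], pair[0]))


# A tiny stand-in for a stored list of genus names.
stored_genera = ["Coelorinchus", "Palinurus", "Syzygotettix"]

print(near_matches("Caelorinchus", stored_genera))
# -> [('Coelorinchus', 1)]
```

Under this simple measure "Panulirus" and "Palinurus" are 4 edits apart, so a query for one would only find the other if max_distance were raised to 4, which in a large list would also admit many unrelated names. That is one reason a purpose-built matcher for scientific names may be preferable to raw edit distance.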

I have had prototype versions of TAXAMATCH in operation on my own system for over 12 months now (accessible via http://www.cmar.csiro.au/datacentre/irmng/) and last week produced the first distributable version for others to look at and, if sufficiently impressed, install on their own systems. This initial version is written in the Oracle PL/SQL programming language, so it is only useable "as is" with Oracle databases; however, a small group of interested parties has expressed the intention to port it to other languages and database environments as well (probably PHP and MySQL in the first instance, with others possibly to follow).

If you would like a copy of the present "version 1.0" of TAXAMATCH (for Oracle), or have an interest in participating in ongoing code development activities, please contact me at the email address below. Some further information about TAXAMATCH is also available via the following URL:

http://www.cmar.csiro.au/datacentre/biodiversity.htm#taxamatch

Regards - Tony

Tony Rees
Manager, Divisional Data Centre,
CSIRO Marine and Atmospheric Research,
GPO Box 1538,
Hobart, Tasmania 7001, Australia
Ph: 0362 325318 (Int: +61 362 325318)
Fax: 0362 325000 (Int: +61 362 325000)
e-mail: Tony.Rees at csiro.au
Biodiversity informatics research activities: http://www.cmar.csiro.au/datacentre/biodiversity.htm
Personal info: http://www.fishbase.org/collaborators/collaboratorsummary.cfm?id=1566
