[Taxacom] the hurdle for all biodiv informatics initiatives
dipteryx at freeler.nl
Thu Feb 25 02:50:16 CST 2010
From: Tony.Rees at csiro.au [mailto:Tony.Rees at csiro.au]
Sent: Wed 24-2-2010 23:44
Dear Paul, all,
I like your postal analogy, but in fact the effort we are discussing (I think) is directed not towards misprinted stamps (which indeed would be mere curiosities) but mis-spelled or mis-addressed recipients (typically with something of value in the included content). The fact that the recipients can also change their names and addresses, or two persons can share a common name or address yet be distinct, is also relevant, and compounds the issue.
The postal system is not a bad analogy, although for this purpose
it would probably be better to take into account everything that has
ever been mailed.
* * *
So either we end up with a lot of undeliverable or wrongly delivered mail, or we try to handle the various problems - in part indeed with algorithms to deal with misspellings (a personal interest here, e.g. see my own attempts at http://www.cmar.csiro.au/datacentre/taxamatch.htm)
An algorithm is indeed a logical avenue to explore (so I am glad to
see this was not disregarded), but it is not necessarily misspellings
that have to be dealt with. There is a huge variation in author citation
that does not fall in that category.
An algorithm should for the most part be customized rather strictly:
there is a limited set of variations that are very likely to be found,
and it should be possible to deal with these efficiently.
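
To illustrate what a strictly customized approach to author citations might look like, here is a minimal Python sketch. It is not taken from any existing tool; the function name, the rules chosen, and the tiny abbreviation table are all assumptions made for the example. It only handles a few predictable variations (spacing, "&" versus "et" or "and", a trailing year, and a known abbreviation).

```python
import re

# Hypothetical abbreviation table, for illustration only; a real one would
# have to come from a curated list of standard author abbreviations.
ABBREVIATIONS = {"linnaeus": "l.", "linn.": "l.", "linné": "l."}

def normalize_author(citation: str) -> str:
    """Reduce an author citation to a comparison key (not a display form)."""
    s = citation.strip().lower()
    s = re.sub(r"\s+", " ", s)               # collapse runs of whitespace
    s = re.sub(r"\s+(and|et)\s+", " & ", s)  # "et" / "and" -> "&"
    s = re.sub(r"\s*&\s*", " & ", s)         # uniform spacing around "&"
    s = re.sub(r"\.\s+(?=\w)", ".", s)       # "a. gray" -> "a.gray"
    s = re.sub(r"\s*,\s*\d{4}$", "", s)      # drop a trailing ", 1753"
    tokens = [ABBREVIATIONS.get(t, t) for t in s.split(" ")]
    return " ".join(tokens)

# Two citations are treated as the same if their keys are equal.
print(normalize_author("Hook.  et Arn.") == normalize_author("Hook.&Arn."))
print(normalize_author("Linnaeus, 1753") == normalize_author("L."))
```

Each rule covers one known kind of variation, so the set stays small and can be extended as new patterns turn up in the data.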
* * *
, and in part by building reference lists of either "clean" names alone, or "clean plus dirty" (i.e. misspelled) names.
The extent to which the latter is required or desirable is a matter for debate. Obviously, if you have a lot of incoming data (such as museum specimens, field observations, and literature / nomenclator citations) labelled with such misspellings, you cannot throw them away, and it is probably advantageous to keep them, adequately reconciled and cross-referenced as required. More "secondary" misspellings, such as those from OCR, non-specialist authored web content and database errors, I am inclined not to keep (in my systems at least), since otherwise, where do you stop? However, as with many such matters, there is no exact line that is easily drawn.
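
A "clean plus dirty" reference list with cross-references could be sketched roughly as follows in Python. This is only an illustration of the idea, not a description of any actual system; the names `CLEAN_NAMES`, `KNOWN_MISSPELLINGS` and `reconcile` are hypothetical, and the misspellings are invented for the example.

```python
from typing import Optional, Tuple

# Illustrative data only; the misspellings below are invented for the example.
CLEAN_NAMES = {"Aloe", "Aloe vera"}
KNOWN_MISSPELLINGS = {
    "Alloe": "Aloe",
    "Aloe verra": "Aloe vera",
}

def reconcile(name: str) -> Optional[Tuple[str, str]]:
    """Return (clean_name, status), or None if the name is on neither list."""
    if name in CLEAN_NAMES:
        return name, "clean"
    if name in KNOWN_MISSPELLINGS:
        # The dirty name is kept, but always resolved to its clean form.
        return KNOWN_MISSPELLINGS[name], "known misspelling"
    return None

print(reconcile("Aloe verra"))  # resolves to the clean name
print(reconcile("Aleo"))        # not on either list
```

Names that return `None` are the ones that would then go to fuzzy matching or manual review.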
So my approach, developed and iterated over more than a decade of handling such issues, is pragmatic but does not exclude "known" misspellings as well as those which are attached to data/information of interest. Of course one could try to predict and store the possible variants of even a four letter word such as "Aloe" but you soon run out of fingers or hands on which to count them, so definitely not worth it, and probably a waste of effort as you suggest...
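
To give a rough sense of how quickly predicted variants multiply, the following Python sketch enumerates every string that is a single edit away from a word (one deletion, one swap of adjacent letters, one substitution, or one insertion). It assumes lowercase a to z only and is a generic enumeration, not the Taxamatch algorithm.

```python
import string

def edits1(word: str) -> set:
    """All distinct strings exactly one simple edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    # Substituting a letter by itself reproduces the word, so remove it.
    return (deletes | transposes | replaces | inserts) - {word}

variants = edits1("aloe")
print(len(variants))
```

Even for this four letter word, a single edit already yields over two hundred candidates, and allowing a second edit multiplies that again.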
Manager, Divisional Data Centre,
CSIRO Marine and Atmospheric Research,
GPO Box 1538,
Hobart, Tasmania 7001, Australia
Ph: 0362 325318 (Int: +61 362 325318)
Fax: 0362 325000 (Int: +61 362 325000)
e-mail: Tony.Rees at csiro.au
Manager, OBIS Australia regional node, http://www.obis.org.au/
Biodiversity informatics research activities: http://www.cmar.csiro.au/datacentre/biodiversity.htm
Personal info: http://www.fishbase.org/collaborators/collaboratorsummary.cfm?id=1566