[Taxacom] FW: Real time batch spell checking of scientific names now available via IRMNG
Tony.Rees at csiro.au
Wed Nov 13 23:46:36 CST 2013
Dear Quentin (also copying to the list for interest),
Further to your request as below I have added a facility for pre-formatting of exact and near match search results to the IRMNG search interface - look for the entry "Response format:" and check the option for "Delimited (for upload e.g. to spreadsheet)". I have chosen the pipe delimiter ("|") between fields (columns) but you can replace this by tabs or other preferred delimiters as desired after copying the content locally (commas are not recommended since they may appear within the included text strings).
Now, if you enter a batch of names (a subset is shown below), request the "delimited" format and press "Check species name(s)", you will get output like the following (embedded in some generic informative text):
At the IRMNG data search page http://www.cmar.csiro.au/datacentre/irmng/:
(input names entered):
(result: - NB for readability in this email I have added an additional line break between rows. The input name is given first, and may be reflected in multiple rows if multiple exact and/or near matches are detected; also note, as desired, matches can be restricted to just within a selectable higher taxonomic unit e.g. animal phyla, vertebrate classes/superclass "Pisces", land plants, algae, fungi, bacteria, archaea)
Acanthophora glomerata|none||||||||||||Found the following exact or phonetic near matches on input genus name: Acanthopora Moseley, 1876 in family Stylasteridae (Animalia-Cnidaria-Hydrozoa-Anthoathecata) - phonetic near match; Acanthopora d'Orbigny, 1849 in family Cyclostomatida (awaiting allocation) (Animalia-Bryozoa-Stenolaemata-Cyclostomatida) - phonetic near match; Acanthophoria Gorjanovic-Kramberger, 1895 in family Pisces (awaiting allocation) (Animalia-Chordata-Pisces (awaiting allocation)-Pisces (awaiting allocation)) - phonetic near match; Acanthophora Borgmeier, 1922 in family Phoridae (Animalia-Arthropoda-Insecta-Diptera) - exact match; Acanthophora Sollas, 1873 in family Astrophorida (awaiting allocation) (Animalia-Porifera-Demospongea-Astrophorida) - exact match; Acanthophora J.V.F. Lamouroux, 1813 in family Rhodomelaceae (Plantae-Rhodophyta-Florideophyceae-Ceramiales) - exact match; Acanthopora Verrill, 1864 in family Faviidae (Animalia-Cnidaria-Anthozoa-Scleractinia) - phonetic near match; Acanthopora Young & Young, 1876 in family Bryozoa (awaiting allocation) (Animalia-Bryozoa-Bryozoa (awaiting allocation)-Bryozoa (awaiting allocation)) - phonetic near match; Acanthophora Hulst, 1896 in family Geometridae (Animalia-Arthropoda-Insecta-Lepidoptera) - exact match; Acanthopora Valiukevicius, 2003 in family Ischnacanthidae (Animalia-Chordata-Acanthodii-Ischnacanthiformes) - phonetic near match; Acanthophora Merrill, 1918 in family Araliaceae (Plantae-Magnoliophyta-Magnoliopsida-Apiales) - exact match
Acrosterigma vlamigi|near|1|Acrosterigma vlamingi|Wilson & Stevenson, 1977|11121471|Cardiidae|Animalia-Mollusca-Bivalvia-Veneroida|E|M||||Found the following exact or phonetic near matches on input genus name: Acrosterigma Dall, 1900 in family Cardiidae (Animalia-Mollusca-Bivalvia-Veneroida) - exact match
Aeverrillia pilosa|none||||||||||||Found the following exact or phonetic near matches on input genus name: Aeverrillia Marcus, 1941 in family Aeverrilliidae (Animalia-Bryozoa-Gymnolaemata-Ctenostomata) - exact match
Alcospira rosea|near|1|Alocospira rosea|Macpherson, 1956|10889877|Olividae|Animalia-Mollusca-Gastropoda-Neogastropoda|E|M||||(no exact or phonetic match on input genus found)
Alliodoris hedley|near|2|Alloiodoris hedleyi|O'Donoghue, 1924|11888095|Discodorididae|Animalia-Mollusca-Gastropoda-Nudibranchia|E|M|Sebadoris fragilis|11463592||Found the following exact or phonetic near matches on input genus name: Aleodorus Say, 1839 in family Staphylinidae (Animalia-Arthropoda-Insecta-Coleoptera) - phonetic near match; Aleodorus Say, 1833 in family Staphylinidae (Animalia-Arthropoda-Insecta-Coleoptera) - phonetic near match
Anadara articulata|near|2|Anadara auriculata|(Lamarck, 1819)|11118075|Arcidae|Animalia-Mollusca-Bivalvia-Arcoida|E|M|||Authority given in CoL2006/ITS as Lamarck.|Found the following exact or phonetic near matches on input genus name: Anadora Kerremans, 1898 in family Buprestidae (Animalia-Arthropoda-Insecta-Coleoptera) - phonetic near match; Anadara Moore, 1883 in family Danaidae (Animalia-Arthropoda-Insecta-Lepidoptera) - exact match; Anadara Deshayes, 1830 in family Mollusca (awaiting allocation) (Animalia-Mollusca-Mollusca (awaiting allocation)-Mollusca (awaiting allocation)) - exact match; Anadara Gray, 1847 in family Arcidae (Animalia-Mollusca-Bivalvia-Arcoida) - exact match; Anadaria Kobelt, 1910 in family Mollusca (awaiting allocation) (Animalia-Mollusca-Mollusca (awaiting allocation)-Mollusca (awaiting allocation)) - phonetic near match
If you copy this text locally, replace all instances of "|" with the tab character, and paste it into e.g. an Excel spreadsheet, I believe you should have what you want: output suitable for more standardized processing or input to other systems, as compared with the alternative/default format previously offered, which was designed primarily for human reading on the screen.
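For anyone who would rather script the replace-and-paste step than do it by hand, here is a minimal sketch in Python. It assumes the delimited rows have been copied into a text string, and that "|" never occurs inside a field (true of the sample rows above). The function name pipes_to_tabs is purely illustrative and is not part of IRMNG.

```python
import csv
import io


def pipes_to_tabs(delimited_text: str) -> str:
    """Convert pipe-delimited result rows to tab-separated text.

    Assumption: "|" is used only as the field delimiter, never inside a field.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in delimited_text.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. those added between rows for readability
        writer.writerow(line.split("|"))
    return out.getvalue()


# One of the result rows shown above, used as sample input.
sample = (
    "Alcospira rosea|near|1|Alocospira rosea|Macpherson, 1956|10889877|Olividae|"
    "Animalia-Mollusca-Gastropoda-Neogastropoda|E|M||||"
    "(no exact or phonetic match on input genus found)\n"
)
print(pipes_to_tabs(sample))
```

The resulting text can be saved with a .tsv or .txt extension and opened directly in a spreadsheet program. Empty fields are kept as empty columns, so every row retains the same number of columns whether or not a match was found.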
In some cases the nearest match returned may be lexically rather distant, as indicated by the value for "sp_edit_distance" (0 = exact match, 1 = 1 character different, etc.), so the more distant matches should be viewed with some caution as compared to the nearer ones. In other words, human review is still required to accept or reject each case: you could say that the computer provides a coarse filter which the human reviewer can then refine according to their specialist knowledge and expectations. Also, as previously stated, the reference database being searched (IRMNG) is not currently as complete at species level as it is for genera, so species names or combinations with no plausible exact or near match may simply be awaiting entry to the reference name set and may well be correctly spelled.
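To illustrate what an edit distance of 1 or 2 means in practice, here is a short Python sketch using plain Levenshtein distance (single-character insertions, deletions and substitutions). This is an assumption made for illustration only: the exact algorithm behind "sp_edit_distance" is not described here and may differ in detail. The name pairs are taken from the sample rows above, where the third field appears to hold the distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two strings (illustrative only)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


# (input name, nearest match) pairs from the sample result rows.
pairs = [
    ("Acrosterigma vlamigi", "Acrosterigma vlamingi"),
    ("Alliodoris hedley", "Alloiodoris hedleyi"),
    ("Anadara articulata", "Anadara auriculata"),
]
for supplied, matched in pairs:
    print(supplied, "->", matched, "distance", edit_distance(supplied, matched))
```

For these three pairs the plain Levenshtein values agree with the distances shown in the sample rows (1, 2 and 2 respectively). That agreement is not a guarantee that the two measures coincide for every name.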
In addition (as pointed out by another off-list comment) the lexically nearest match might be in a taxonomically remote group from that of the input name and therefore misleading: in such a case, at least where the higher taxonomic affinities of the input names are known and are consistent (e.g. all land plants, all fishes), selection of the relevant higher taxonomic filter before searching will produce a more taxonomically meaningful result.
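If a batch has already been run without a filter, a rough local check of the same kind can be made on the delimited output. The Python sketch below is not the IRMNG filter itself. It assumes, from the sample rows above, that the first field is the input name, the fourth the matched name and the eighth the hyphen-separated classification; the function name matches_in_group is illustrative.

```python
def matches_in_group(rows, group_prefix):
    """Yield (input name, matched name) for rows whose classification
    starts with group_prefix, e.g. "Animalia-Mollusca".

    Field positions are assumed from the sample rows, not documented.
    """
    for row in rows:
        fields = row.split("|")
        if len(fields) < 8:
            continue  # not a complete result row
        input_name, matched_name, classification = fields[0], fields[3], fields[7]
        if matched_name and classification.startswith(group_prefix):
            yield input_name, matched_name


# Two of the result rows shown above: one with no match, one near match.
rows = [
    "Aeverrillia pilosa|none||||||||||||Found the following exact or phonetic "
    "near matches on input genus name: Aeverrillia Marcus, 1941 in family "
    "Aeverrilliidae (Animalia-Bryozoa-Gymnolaemata-Ctenostomata) - exact match",
    "Alcospira rosea|near|1|Alocospira rosea|Macpherson, 1956|10889877|Olividae|"
    "Animalia-Mollusca-Gastropoda-Neogastropoda|E|M||||"
    "(no exact or phonetic match on input genus found)",
]
for input_name, matched_name in matches_in_group(rows, "Animalia-Mollusca"):
    print(input_name, "->", matched_name)
```

Near matches that fall outside the expected group are simply dropped here, so this can only discard a misleading match. Unlike selecting the filter before searching, it cannot find a better match within the group.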
Hoping this is of value to yourself and other list persons (and thanks for comments thus far on this facility). I also received another off-list comment requesting the same (or similar) as a web service - I am not quite sure what form that response should take at this time but will think further on it...
Regards - Tony
> -----Original Message-----
> From: Quentin Groom [mailto:quentin.groom at br.fgov.be]
> Sent: Tuesday, 29 October 2013 11:51 PM
> To: Rees, Tony (CMAR, Hobart)
> Subject: Re: [Taxacom] Real time batch spell checking of scientific
> names now available via IRMNG
> Hi Tony,
> thanks for posting this! I was just looking for such a service.
> the output isn't quite what I need and I think it might be worth
> considering in future versions.
> I want to be able to link the names I have in my document with
> names. So my preferred output would be a comma separated file with the
> input name in the first column and the rest of the information in
> subsequent columns, whether there is a match or not.
> The very human readable output you have at the moment does allow me to
> cut and paste long lists as much as I want to.
> Dr. Quentin Groom
> (Botany and Information Technology)
> National Botanic Garden of Belgium
> Domein van Bouchout
> B-1860 Meise
> ORCID: 0000-0002-0596-5376
> Tony.Rees at csiro.au wrote:
> > Dear Taxacomers,
> > Mindful of the present round of acronym-bashing, I thought I might
> let you know of a useful new feature added today to my own aggregated
> biodiversity database "IRMNG": real-time spelling correction of
> multiple supplied species names (previously this feature was only
> activated on single supplied names for performance reasons, to avoid
> potentially lengthy delays, now worked around).
> > Here's how to use it:
> > - Take a list of species names including potentially misspelled ones
> (can also have authorities appended too as available), one per line,
> max. around 1,500-2,000 at a time depending on word length.
> > ** Here is an example small set of real world marine species data I
> am currently working on for a user in my agency, from a field survey
> list, excluding names which already have an exact match in the main
> IRMNG list):
> > Acanthophora glomerata
> > Acrosterigma vlamigi
> > Aeverrillia pilosa
> > Alcospira rosea
> > Alliodoris hedley
> > Anadara articulata
> > Ancilla cingulata
> > Angula sphaeruia
> > Anquipecten aurantiacus
> > Arca avellana_MTQ
> > Arca avellana_QMS
> > Arcania foleolata
> > Ashtoret planipes
> > Australium tentoriformis
> > Austrolabidia gracilipes
> > Beania spinulosa
> > Biflustra limosa
> > Botryocladia skottsbergi
> > Bufonia margaritula
> > Bugula johnsoni
> > Bursa thersites
> > Calliostoma monile
> > Callyspongia schultzi
> > Cancilla fillaris
> > Caulerpa urvilliana
> > (and so on)
> > - Go to the IRMNG data access page at
> http://www.cmar.csiro.au/datacentre/irmng/, copy-and-paste the list
> into the search box
> > - Press "check species names"
> > Look for "Species names not found" at the bottom (obviously, names
> found will be resolved first, however in this case there are none).
> > After each name not found there will be information about whether at
> least the genus name is held in that form (for something, may not be
> the intended target of course) then either the nearest matching species
> name or names, or a "no match" message at species level. Click on any
> near match name to get the full taxonomic hierarchy, synonym status
> where known, and other information as presently held in the database.
> > For the record IRMNG could not be built without drawing on other
> "names aggregator" activities (acronyms if you must) including
> Catalogue of Life (only 2006 version as yet), WoRMS (World register of
> Marine Species) and more - eventually also to include names from The
> Plant List when that data is re-usable as advised earlier today (thanks
> Rafaël!) -
> > building on the efforts of their respective aggregators of course
> (since entering names individually would not be tractable). It is also
> not complete at this time (only 1.9 million species names held, lots
> including many fossil species and certain higher plants still missing),
> but will be added to further as time and resources may be available.
> Once I have all The Plant List data added plus more recent updates to
> Catalogue of Life included it should be useful to more people again.
> > I hope at least some on this list may find this feature useful in
> your work and I am very happy for you to recommend the site to others
> as appropriate.
> > Regards to all- Tony
> > Dr Tony Rees
> > Manager | Divisional Data Centre
> > Marine and Atmospheric Research
> > CSIRO
> > E Tony Rees at csiro.au T +61 3 6232 5318
> > CSIRO Marine and Atmospheric Research, GPO Box 1538, Hobart, TAS
> 7001, Australia
> > www.cmar.csiro.au/datacentre
> > Manager, OBIS Australia regional Node, http://www.obis.au
> > LinkedIn profile: http://www.linkedin.com/pub/tony-rees/18/770/36