[Taxacom] Data query

Stephen Thorpe stephen_thorpe at yahoo.co.nz
Tue Jun 25 04:44:26 CDT 2013

I still don't quite understand! If a human has to make the link between two identifiers, then why not just use the name, Wikispecies style? That would be completely automatic in all but a very few cases of homonymy, where a human could then intervene. The net result would surely be far less human work! If all linked databases used the name as the identifier, then they could talk to each other quite happily! I really don't buy that variable identifier length would be a big problem. If everything is done correctly, a link to Wikispecies using a homonym will go to a disambiguation page from which you can choose which homonym you want. The individual URLs for the homonyms may have an arbitrary bit, but the number of cases is very small compared to the total, so only these relatively few would require human matching between databases, not the whole bloody lot!
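As a rough illustration of the name-as-identifier idea above, here is a minimal Python sketch. The record IDs ("db:1" and so on), the names, and the function names are all made up for illustration; this is not how Wikispecies or any real database is implemented. A unique name resolves directly, and a homonym returns a list of candidates for a human to choose from.

```python
from collections import defaultdict

def build_name_index(records):
    """Group record IDs by name; records is an iterable of (name, record_id)."""
    index = defaultdict(list)
    for name, record_id in records:
        index[name].append(record_id)
    return dict(index)

def resolve(index, name):
    """Return ('match', id), ('disambiguate', [ids]) or ('missing', None)."""
    ids = index.get(name, [])
    if len(ids) == 1:
        return ("match", ids[0])        # automatic case
    if len(ids) > 1:
        return ("disambiguate", ids)    # homonym: a human picks one
    return ("missing", None)

# Hypothetical records: "Aus bus" is unique, "Cus dus" is a homonym pair.
records = [("Aus bus", "db:1"), ("Cus dus", "db:2"), ("Cus dus", "db:3")]
index = build_name_index(records)
```

Only the "disambiguate" results would need human attention in this scheme; everything else links on the name alone.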

From: Richard Pyle <deepreef at bishopmuseum.org>
To: 'Stephen Thorpe' <stephen_thorpe at yahoo.co.nz>; Tony.Rees at csiro.au; mesibov at southcom.com.au 
Cc: taxacom at mailman.nhm.ku.edu 
Sent: Tuesday, 25 June 2013 9:30 PM
Subject: RE: [Taxacom] Data query

> I may be wrong, but in order to record mappings of the kind
> “ZB: 8BDC0735-FEA4-4298-83FA-D04F67C3FBEC maps to WoRMS:345844”,
> doesn't this have to be done manually and painstakingly?

It certainly *can* be done manually, but it usually is not painstaking in any way.  The vast majority of cross links can be done automatically, such as when a dataset is bulk imported.  For example, when I imported all the names from Eschmeyer's Catalog of Fishes database, all the cross links for the names AND the references AND the Journals came automatically.  Likewise, I built a routine that cross-links to ITIS TSNs (when they exist) more or less automatically as well.  In doing so, I can now bridge CoF to ITIS.  Meanwhile, FishBase already has links to CoF and to CoL.  This means we can automatically cross-link ZooBank to FB (and therefore automatically link ITIS to FB), and conversely link CoL back to CoF, ITIS, and ZB.  This is the magic of persistent identifiers.  Once a cross-link is established *once*, it can be automatically inherited.
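The inheritance of cross-links described above can be sketched with a small union-find structure in Python. This is only an illustration under assumptions: the identifiers ("ZB:zb-1", "CoF:cof-1", and so on) are placeholders rather than real records, and none of the databases mentioned is claimed to work this way internally. Each directly established link merges two identifiers into one group, and any two identifiers in the same group count as linked.

```python
def find(parent, x):
    """Return the representative of x's group, registering x if unseen."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps lookups short
        x = parent[x]
    return x

def union(parent, a, b):
    """Record a direct cross-link between identifiers a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a

def linked(parent, a, b):
    """True if a and b are connected by any chain of cross-links."""
    return find(parent, a) == find(parent, b)

# Placeholder identifiers; only these four links are entered directly.
direct_links = [
    ("ZB:zb-1", "CoF:cof-1"),    # bulk import from CoF
    ("CoF:cof-1", "ITIS:tsn-1"), # routine matching ITIS TSNs
    ("FB:fb-1", "CoF:cof-1"),    # link already held by FishBase
    ("FB:fb-1", "CoL:col-1"),    # link already held by FishBase
]
parent = {}
for a, b in direct_links:
    union(parent, a, b)
```

With those four direct links, `linked(parent, "ZB:zb-1", "FB:fb-1")` and `linked(parent, "ITIS:tsn-1", "CoL:col-1")` both hold, although neither pair was ever entered directly.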

Even the manual process can be streamlined.  For example, the plan with the GNI service is to put check boxes next to each link in the GNI resultset on ZB.  After a human confirms that the match is good, they can just check the boxes for the confirmed links and press a single button to capture all the cross-links.  It wouldn't be too hard to come up with a metric for name-match confidence, such that when the confidence is above a certain threshold, the links are created automatically.
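One possible shape for such a confidence threshold, sketched in Python. The similarity measure here (the standard library's `difflib.SequenceMatcher`), the 0.95 cutoff, and the function names are assumptions chosen for illustration; they are not what GNI or ZooBank actually use.

```python
from difflib import SequenceMatcher

def match_confidence(a, b):
    """Crude name similarity in [0, 1]; a stand-in for a real metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(candidates, threshold=0.95):
    """Split candidate (local, remote) name pairs into auto-links and review."""
    auto, review = [], []
    for local, remote in candidates:
        score = match_confidence(local, remote)
        bucket = auto if score >= threshold else review
        bucket.append((local, remote, score))
    return auto, review

# Hypothetical candidate pairs from a name-matching result set.
candidates = [("Aus bus", "Aus bus"), ("Aus bus", "Aus buus")]
auto, review = triage(candidates)
```

Pairs in `auto` would be linked without intervention, while pairs in `review` would get the check boxes for a human to confirm.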

The cool thing about this process is that it is accelerative.  That is, the more links you make, the faster new links are established.  It also moves in the direction of what Tony was getting at -- that is, once someone makes a link or data update on one database, the update should be propagated to all linked databases.  This doesn't mean that the records in the other databases are automatically updated, but rather that the managers of the other databases get a notification along the lines of "Stephen Thorpe just corrected the spelling of 'Aus buus' to 'Aus bus' on ZooBank.  Click here to make this change in your database." ... or some such thing (the FilteredPush team is already building this sort of service).
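The notify-rather-than-overwrite idea can be sketched as follows in Python. Everything here is hypothetical: the record identifiers, the dictionary layout, and the function name are invented for illustration, and this does not represent the FilteredPush service or its API.

```python
def notifications_for(change, cross_links):
    """Build one notification per linked record instead of editing it.

    change: dict with 'database', 'source_id', 'editor', 'old', 'new'.
    cross_links: dict mapping a record ID to the IDs it is linked to.
    """
    messages = []
    for target_id in cross_links.get(change["source_id"], []):
        database = target_id.split(":", 1)[0]  # prefix names the database
        text = (
            f'{change["editor"]} just corrected the spelling of '
            f'"{change["old"]}" to "{change["new"]}" on {change["database"]}. '
            f"Review this change for your record {target_id}."
        )
        messages.append({"to": database, "record": target_id, "text": text})
    return messages

# Placeholder cross-links and a single spelling correction.
cross_links = {"ZB:zb-1": ["CoF:cof-1", "ITIS:tsn-1"]}
change = {
    "database": "ZooBank",
    "source_id": "ZB:zb-1",
    "editor": "Stephen Thorpe",
    "old": "Aus buus",
    "new": "Aus bus",
}
messages = notifications_for(change, cross_links)
```

The linked records are never modified here; each database manager decides whether to accept the change.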

The real painstaking process is the reconciliation bit.  That is, comparing two separate databases to see the overlap.  This is not so hard for taxon names, but is a bear for literature citations.  I've spent MANY hours manually cross-linking Journal names between and among 76 different data sources of Journal names.  It's slow going at first, but as I add more and more links, the process becomes easier and easier (and faster and faster).  Donat did this for HNS journals against ZooBank journals.  I'm now doing this between ZooBank and BHL. When I'm done, that means I'll have also created links between BHL and HNS (as well as many of the other journal sources).

