[Taxacom] Data query

Stephen Thorpe stephen_thorpe at yahoo.co.nz
Mon Jun 24 22:54:43 CDT 2013


Tony: Again I see the HUGE assumption lurking behind your words that there are going to be available trustworthy and up-to-date data providers for all the major taxa, and that error correction times aren't going to be measured in years ...
 
Cheers, Stephen


________________________________
From: "Tony.Rees at csiro.au" <Tony.Rees at csiro.au>
To: stephen_thorpe at yahoo.co.nz; mesibov at southcom.com.au 
Cc: taxacom at mailman.nhm.ku.edu; deepreef at bishopmuseum.org 
Sent: Tuesday, 25 June 2013 3:47 PM
Subject: RE: [Taxacom] Data query



Sorry, I do not see the distinction – EoL, AFD, Catalogue of Life, ALA and so on are all taxonomic information systems and each aspires to eventual completeness within its stated bounds. They vary with respect to how much associated information is available for each name and thus provided on the taxon-centric web pages i.e. species page, genus page, etc. In theory, once you can gain access to the underlying data, you can run queries such as I have described against any of them. IRMNG is also a taxonomic information system but aspires to more completeness at genus level and less on species, and currently has less information to display against any particular taxon name (by choice), but that is a difference of degree only. There are differences amongst the above cited systems about the degree of originality versus re-use of data, for example everything (?) in AFD has been compiled there while the reverse is true for CoL (everything comes from elsewhere), but that is again a detail.
 
The only difference I see is that in addition, IRMNG supports a “multiple supplied taxonomic name” resolution service as per the example supplied, but then again is not unique in this regard, such a service being offered by WoRMS (World Register of Marine Species) at least, though in that case obviously not including non-marine taxa, at least in the main.
 
Basically we are talking about the generic area of computerized compilations of taxonomic data and associated attributes, and the benefits of storing these in static web pages as opposed to generating dynamic ones on demand from a data store of some sort. Incidentally, if you investigate the data structures behind EoL, AFD, Catalogue of Life, ALA and most others you will find that, unless I am mistaken, they almost invariably use a back-end data store and generate dynamic web pages from that content as needed, as opposed to using static pages (same for ZooBank, Index Fungorum, WoRMS, Fauna Europaea, Zoological Record and most everything else I could mention).
 
Regards - Tony
 
From: Stephen Thorpe [mailto:stephen_thorpe at yahoo.co.nz] 
Sent: Tuesday, 25 June 2013 1:30 PM
To: Rees, Tony (CMAR, Hobart); mesibov at southcom.com.au
Cc: r.page at bio.gla.ac.uk; taxacom at mailman.nhm.ku.edu; deepreef at bishopmuseum.org
Subject: Re: [Taxacom] Data query
 
I suspect also that there is some confusion here along these lines. Wikispecies is basically a rival to the likes of EoL, AFD, and now ALA. All of them purport to provide the general end user with authoritative and up-to-date information on taxa. The sort of database that Tony dreams of is along different lines. It purports to answer questions of a kind that might be posed by conservation managers, and other "meta-biologists", like other bioinformaticians. In other words, the sort of database that Costello should be using before coming out with his "conclusions". There are different issues involved in these two types of project, which aren't being clearly distinguished in discussion so far, it seems to me, and that is partly my fault ...
 
Stephen
 
From: "Tony.Rees at csiro.au" <Tony.Rees at csiro.au>
To: mesibov at southcom.com.au; stephen_thorpe at yahoo.co.nz 
Cc: r.page at bio.gla.ac.uk; taxacom at mailman.nhm.ku.edu; deepreef at bishopmuseum.org 
Sent: Tuesday, 25 June 2013 2:34 PM
Subject: RE: [Taxacom] Data query

Bob Mesibov wrote:

> I'm having trouble understanding Tony's argument. Wikispecies is not
> primarily a data storage and management structure. It's the equivalent
> of the 'web pages as by-products' made from databases, a way to get the
> results of taxonomic activity made widely known. Stephen builds web
> pages by hand, databasers export web pages from their databases, but in
> both cases the information put online comes from the taxonomic
> literature, yes? From the point of view of the Web user looking at the
> information, there's no in-principle difference.
...
> It's not a criticism of Wikispecies to say that it's no good for 'bulk
> importing, internal data management and review, bulk queries (including
> machine-machine as well as human), and bulk export of relevant
> content'. That's like criticising cars because they don't fly like
> airplanes do. But they get you from A to B just the same.

OK - if all you want to serve is a human user / eye, presenting information for one taxon at a time on a web page or a printed list is fine. My use case is slightly different: typically I need to parse lists of taxa to find out information about them, starting with "what is this critter" as the most simple question, and when the lists start to get longer than a few hundred names then this is something you want to ask a machine rather than a human to do. Hence the need for machine-accessible data rather than (or in addition to) human-readable web pages. Also machines can do lots of other things like point out minor discrepancies between lists (these items are not identical but are very close, this is the same name but the families or other higher taxa do not match, this name is quoted as an extant taxon on list A but a fossil one on list B, this name is lacking an authority but the latter can be found here, and much more).
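
As a rough illustration of the kind of list-against-list checking meant here, a minimal Python sketch follows. Everything in it is an assumption made for the example: the function name `compare_lists`, the record layout (a family and an extant/fossil status per name), the similarity threshold, and the placeholder names, none of which are real taxa or real data from any of the systems mentioned.

```python
from difflib import SequenceMatcher

def compare_lists(list_a, list_b, threshold=0.85):
    """Report discrepancies between two name lists.

    Each list is a dict: name -> {"family": ..., "status": ...}.
    Returns tuples of (name, kind of issue, value in A, value in B).
    """
    issues = []
    for name_a, rec_a in list_a.items():
        if name_a in list_b:
            # Same name on both lists: check that the attributes agree.
            rec_b = list_b[name_a]
            for field in ("family", "status"):
                if rec_a.get(field) != rec_b.get(field):
                    issues.append((name_a, field + " mismatch",
                                   rec_a.get(field), rec_b.get(field)))
            continue
        # No exact match: look for names that are close but not identical.
        for name_b in list_b:
            ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
            if ratio >= threshold:
                issues.append((name_a, "near match", name_a, name_b))
    return issues

# Placeholder names and values only (hypothetical, not real taxa).
list_a = {
    "Alphagenus": {"family": "Familyidae", "status": "extant"},
    "Betagenus": {"family": "Otheridae", "status": "extant"},
    "Gammagenus": {"family": "Familyidae", "status": "extant"},
}
list_b = {
    "Alphagenus": {"family": "Familyidae", "status": "fossil"},
    "Betagenuss": {"family": "Otheridae", "status": "extant"},
    "Gammagenus": {"family": "Differentidae", "status": "extant"},
}

issues = compare_lists(list_a, list_b)
for issue in issues:
    print(issue)
```

The point is only that each of these checks needs the name, family and status to be available as separate fields; the string similarity measure used here (from Python's standard `difflib`) is one simple choice among many.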

The reason for supporting bulk import of data is similar to that for the Catalogue of Life (which also presents human-readable web pages but stores the data as atomized entities suitable for machine query behind the scenes): if a source you trust, such as Index Fungorum, Catalogue of Diptera, etc. has already painstakingly assembled the information you require, it is much better to import that to integrate with other information than to re-enter everything by hand from the same source. Similarly for bulk export: if you want to gain maximum benefit from your own data compilation efforts, it is good to be able to export it in some tabular or other format that others can ingest as opposed to a set of web pages which have to be crawled and parsed via custom routines (touching on your other question here). That among other things is the reason that TDWG, the Taxonomic Databases Working Group, has spent the past 10+ years defining exactly such standards for taxonomic data transfer, which are based in the main on Darwin Core and/or ABCD, neither of which has anything intrinsic to do with how the same data might be presented for human readable web pages.
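
To make the bulk export / import point concrete, here is a small Python sketch of a tabular round trip. It assumes a flat CSV file whose column headers are Darwin Core term names (`taxonID`, `scientificName`, `scientificNameAuthorship`, `taxonRank`, `family`); the function names and the single placeholder record are hypothetical, and a real Darwin Core Archive involves more than this (for instance a `meta.xml` descriptor), so treat it as a sketch only.

```python
import csv
import io

# Column headers taken from Darwin Core term names.
DWC_FIELDS = ["taxonID", "scientificName", "scientificNameAuthorship",
              "taxonRank", "family"]

def export_dwc_csv(records):
    """Write a list of dicts to CSV text with Darwin Core headers."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=DWC_FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        # Missing fields are exported as empty strings.
        writer.writerow({f: rec.get(f, "") for f in DWC_FIELDS})
    return buf.getvalue()

def import_dwc_csv(text):
    """Read CSV text back into a list of dicts, one per row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

# Placeholder record (hypothetical, not a real taxon).
records = [{
    "taxonID": "1",
    "scientificName": "Alphagenus",
    "scientificNameAuthorship": "Author, 1900",
    "taxonRank": "genus",
    "family": "Familyidae",
}]

exported = export_dwc_csv(records)
reimported = import_dwc_csv(exported)
```

Whoever receives such a file can load it with a few lines of code and no page-scraping, which is the whole argument for agreeing on the field names in advance.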

So "getting from A to B just the same" is only a subset of what can be done with the data if presenting a web page is your only current final destination. Since computerisation of taxonomic information began several decades ago, we have the facility to enter once, use many times and also leverage off the work of others via machine-machine transactions where possible to bring in additional content at minimal cost where desired, as I previously hinted. The usefulness of the Catalogue of Life, the World Register of Marine Species, Australian Faunal Directory and many more such taxonomic compilations is no longer limited to human, one-at-a-time queries for which an HTML web result is sufficient, but now supports many more use cases in providing a "taxonomic backbone" for other projects, which then have no need to recompile all that base data but can use it in new contexts.

Here is a quick recent example - I was contacted last week by a representative of a team interested in bringing together phylogenetic trees for different faunal groups who has access to the present GBIF NUB classification for machine access. He said:

Hi Everyone,
Here is another study where [genus-level] fossils are not mapping [i.e. to the present GBIF classification].
Study ID: 1804

Aistopoda
Amphibamus
Apateon
Captorhinidae
Doleserpeton
Eryops
Lysorophia
Microbrachis
Nectridea
Osteolepiformes
Procolophonidae
Seymouria
Synapsida
Tersomius
Triadobatrachus

Because of the way my IRMNG taxonomic database is set up, it is a simple task for either a human user or a machine to ask IRMNG in essence, what does it know about these names (supplied as genera, although obviously some are not so). You can go to the human access point at http://www.cmar.csiro.au/datacentre/irmng/, paste in the list, press "Check genus names" and see what comes out. The result is only possible because the underlying data are stored in atomized form (in a database) as opposed to static HTML-based web pages for purely human consumption (though of course I can generate the latter from the base data as needed, and am also not limited to a single presentation style). So, I guess what I am saying is that in the context of "getting from A to B" where A is the taxonomic literature, my "B" is probably not the same as your "B" although yours (or something equivalent) can potentially be produced as a point along the way as desired.
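
For readers who want to see what "atomized form" buys you, the following Python sketch checks a list of supplied names against a small in-memory SQLite table. It is not the IRMNG schema or interface: the table layout, the column names and the `check_names` function are assumptions made for the example, and the table is loaded with only three of the names from the list above, so a `None` result means "not in this toy table" and nothing more.

```python
import sqlite3

# Toy table: one row per name, with rank and fossil status as separate fields.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (name TEXT PRIMARY KEY, taxon_rank TEXT, is_fossil INTEGER)"
)
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?)",
    [
        ("Eryops", "genus", 1),
        ("Seymouria", "genus", 1),
        ("Captorhinidae", "family", 1),
    ],
)

def check_names(conn, names):
    """Look up each supplied name; return its rank and fossil flag, or None."""
    results = {}
    for name in names:
        row = conn.execute(
            "SELECT taxon_rank, is_fossil FROM taxon WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            results[name] = None  # name not present in this toy table
        else:
            results[name] = {"rank": row[0], "fossil": bool(row[1])}
    return results

report = check_names(conn, ["Eryops", "Captorhinidae", "Tersomius"])
for name, info in report.items():
    print(name, info)
```

The same lookup answers "is this really a genus?" for a supplied name such as Captorhinidae, and it scales to thousands of names with no change to the code.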

Of course there are many other types of query which can be run against my system, as well as a potentially unlimited range of possible reporting formats.

Hope this helps you to better understand where I am coming from,

Best regards - Tony



> -----Original Message-----
> From: Bob Mesibov [mailto:mesibov at southcom.com.au]
> Sent: Tuesday, 25 June 2013 11:14 AM
> To: stephen_thorpe at yahoo.co.nz; Rees, Tony (CMAR, Hobart)
> Cc: Rod Page; TAXACOM; Richard Pyle
> Subject: Re: [Taxacom] Data query
> 
> Stephen Thorpe wrote:
> 
> "You mean until we have one place where anyone can "just do it" and add
> missing data at their leisure with minimum fuss, to help build a
> comprehensive catalogue of world biota ... oh, wait! We do have such a
> place ... Wikispecies!"
> 
> Tony Rees wrote:
> 
> "Also, most informatics persons (biological data specialists) would
> probably contend that there are more appropriate data structures than
> wikispecies for bulk importing, internal data management and review,
> bulk queries (including machine-machine as well as human), and bulk
> export of relevant content (which is why the bulk of present taxonomic
> information resides in databases, with web pages as by-products, rather
> than in web pages as their native format)."
> 
> I'm having trouble understanding Tony's argument. Wikispecies is not
> primarily a data storage and management structure. It's the equivalent
> of the 'web pages as by-products' made from databases, a way to get the
> results of taxonomic activity made widely known. Stephen builds web
> pages by hand, databasers export web pages from their databases, but in
> both cases the information put online comes from the taxonomic
> literature, yes? From the point of view of the Web user looking at the
> information, there's no in-principle difference. Wikispecies is less
> complete than some databases, but on the other hand Wikispecies is
> often more up to date, and sometimes more accurate.
> 
> It's not a criticism of Wikispecies to say that it's no good for 'bulk
> importing, internal data management and review, bulk queries (including
> machine-machine as well as human), and bulk export of relevant
> content'. That's like criticising cars because they don't fly like
> airplanes do. But they get you from A to B just the same.
> 
> It's also not a criticism of database managers to say that they don't
> allow just anyone to edit their data, the way Wikispecies does, at the
> Web-output stage. Database managers ask that users suggest their edits
> 'off-Web', so the database can be changed, then the changes exported to
> the Web. It's a different mechanism for editing.
> 
> Which brings me back to that first post of mine, that pushed some
> unintended buttons. What *are* appropriate data structures for storing
> the complex relationships involved in taxonomic, nomenclatural and
> bibliographic data? And are those structures capable of being marked up
> on a webpage? If so, then the gap between Wikispecies and Big Databases
> disappears. The marked-up page can be generated by a database, or
> built/edited by hand.
> 
> I asked if anyone had experience with graph databases because they seem
> to me to be the logical way to store and manipulate objects ('nodes' =
> names, authors, publications, type specimens...) and relationships
> ('published by', 'cited by', 'synonym of'...). Doing this with RDBMS
> and joins seems to be out of the question for anything but very simple
> cases.
> --
> Dr Robert Mesibov
> Honorary Research Associate
> Queen Victoria Museum and Art Gallery, and
> School of Agricultural Science, University of Tasmania
> Home contact:
> PO Box 101, Penguin, Tasmania, Australia 7316
> (03) 64371195; 61 3 64371195
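
On the graph question raised in the message above, a minimal Python sketch of the node-and-relationship model may help make the idea concrete. It is an in-memory toy rather than a graph database, and it takes no position on whether a relational system could do the same job; the class `TaxonGraph`, its methods, and the placeholder node identifiers are all hypothetical.

```python
class TaxonGraph:
    """Nodes are names, publications, etc.; edges are labelled relationships."""

    def __init__(self):
        self.nodes = {}   # node id -> {"kind": ..., "label": ...}
        self.edges = {}   # source id -> list of (relation, target id)

    def add_node(self, node_id, kind, label):
        self.nodes[node_id] = {"kind": kind, "label": label}

    def add_edge(self, source, relation, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends of an edge must be existing nodes")
        self.edges.setdefault(source, []).append((relation, target))

    def follow(self, start, relation):
        """Follow one relation transitively from start; return nodes reached."""
        seen, stack, reached = {start}, [start], []
        while stack:
            current = stack.pop()
            for rel, target in self.edges.get(current, ()):
                if rel == relation and target not in seen:
                    seen.add(target)
                    reached.append(target)
                    stack.append(target)
        return reached

# Placeholder nodes (hypothetical names and publication).
g = TaxonGraph()
g.add_node("name:A", "name", "Alphagenus")
g.add_node("name:B", "name", "Betagenus")
g.add_node("name:C", "name", "Gammagenus")
g.add_node("pub:1", "publication", "Author (1900)")
g.add_edge("name:A", "synonym of", "name:B")
g.add_edge("name:B", "synonym of", "name:C")
g.add_edge("name:A", "published in", "pub:1")

chain = g.follow("name:A", "synonym of")
```

Following a chain of "synonym of" links to any depth is a single traversal here, which is the kind of query the graph model is meant to make natural.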

