[Taxacom] WTaxa and data harvesting via CoL, EoL, etc.

Stephen Thorpe stephen_thorpe at yahoo.co.nz
Fri Jul 29 18:42:07 CDT 2011

thanks Chris, for the full reply, though my issue was less with WTaxa itself and 
more with those who harvest its data

the fact is that, for whatever reason, the WTaxa derived data in CoL, EoL, etc. 
is utterly awful at all taxonomic levels, and does not reflect your own work (in 
collaboration with Miguel) on the weevils, even though it is ultimately credited 
to you (and Miguel) via WTaxa. IMHO, the weevil data currently on CoL, EoL, etc. 
is far worse than no data at all, and it is way too early for them to start 
presenting data on Curculionoidea

one unfortunate problem is that if you don't know what the data should look 
like, then you simply don't see all the problems with it...and if you do know 
what the data should look like, then you are probably too busy elsewhere to be 
checking the quality of data in CoL, EoL, etc. So, yet again, the average user 
is held completely hostage to the claimed authority of the secondary database 
and its data provider. I am beginning to see that there must be significant 
problems associated with harvesting the data from the source database, even when 
the source database *is* reliable

it is hard to give good examples at this stage, because there is so much work 
still to do on Wikispecies to populate it with good data (and I can't rely on 
CoL, EoL, etc. as reliable data sources), but consider: 

although my data comes largely from A.-Z. & Lyal (1999), which could be 
considered a tad dated now, the only subsequent changes have been a new genus by 
Anderson (2005), and an accepted reclassification by Marvaldi et al. (including 
you) 2006. It is hard to prove to a skeptic that my data is better than CoL's 
data, but just look at the sorry state of the latter (see the bottom of the 
Wikispecies page, under 'links')...

...pretty much all the Curculionoidea are like this in CoL, EoL, so I stand by 
my contention that something is *very* wrong here ...



From: Chris Lyal <C.lyal at nhm.ac.uk>
To: Stephen Thorpe <stephen_thorpe at yahoo.co.nz>; taxacom at mailman.nhm.ku.edu
Sent: Sat, 30 July, 2011 1:04:34 AM
Subject: RE: [Taxacom] WTaxa and data harvesting via CoL, EoL, etc.

Apologies for the issues with names in WTaxa.  We are still in the
process of completing the database, so many of the names are not of
valid species.  The first pass in the project was to enter as many names
as possible, from the secondary literature; we received funds from GBIF
to help us with that.  The second pass is to check the original papers
and correct entries, working from oldest to newest, checking
availability and validity as we go, and this is underway.  We have also
had a problem in the database with displaying links between original
and subsequent combinations, which is the issue that Stephen highlights,
and which is fixed in WTaxa but will not transmit through to CoL until
the next data upload.  We are lucky that we have been able to obtain
some funds through partnership with Species 2000 in an EU project, and
later this year we will be able to use some of those funds to improve
the harvesting from WTaxa to Species 2000-CoL.  The fundamental problem
still pertains - a small number of taxonomists who are working to
complete a large task with insufficient time and resources.  However,
without the 'acronyms' we would not have been able to achieve anything
at all.  

Aside from natural disappointment that despite the rather intensive
efforts of a number of people to capture data and disseminate them the
data are not yet perfect, we might consider several serious questions.

First, how do we develop opportunities for funding data population on a
large scale?  Given the amount of data currently available on the web and
the relatively low investment there has been in data population (leading
many of us to work in 'spare' time on this activity), how do we press the
arguments to finish the job?  There are global-level policy agreements
through the CBD that this work should be done, so what are people's
experiences in successful arguments for funding?

Secondly, should we (as taxonomists) expose incomplete
information (it was a condition of the first grant that we received for
WTaxa that we do so)?  I have been in meetings where users were appalled
that nomenclators were freely available, since they were using them as
if they listed only valid names (actually a similar situation to WTaxa
as it currently is), but I guess we would generally agree that
nomenclators are a useful tool. 

Finally, a related point: should we develop a standard means in metadata
of indicating fitness for use of any record or item of data?  Perhaps
TDWG might consider this.

This is not an invitation to debate (again) the relative merits of
different means of putting information on the web - we've really done
that to death.  Suffice to say that I know very few idle taxonomists (or
people in CoL, GBIF etc, come to that) - we are all trying to populate
systems with data in the ways we see fit.  Nor is it an invitation to
argue (again) that money obtained by initiatives exploring and
catalysing dissemination techniques should have been spent in a
different way - it wouldn't have been, and our project for one has
benefitted - and we're not alone.  

