[Taxacom] Chameleons, GBIF, and the Red List

Chuck Miller Chuck.Miller at mobot.org
Mon Aug 25 13:09:44 CDT 2014


Rod,
I was referring more to the commenting & rating aspects of Yelp & TripAdvisor when you mentioned a "social" approach.

In TripAdvisor, each commenter makes a rating of a record (a hotel or restaurant in this case), then the individual ratings are summarized into summary scores.  The comments are displayed together, both positive and negative, for a record.  The commenters get "badges" but only for the quantity of their posts.  But, there are also "votes" on the comments by the users of the data if a comment is considered "helpful".  The records are ranked #1 to #Last based on the ratings. And finally there can be a response by the "owner" of the record. 

All of it adds up to "social annotation" - a collective think without rigor or standardization.  And for hotels and restaurants, it's very helpful.  Would it be as helpful for biodiversity data records?

Chuck


-----Original Message-----
From: Roderic Page [mailto:Roderic.Page at glasgow.ac.uk] 
Sent: Monday, August 25, 2014 11:34 AM
To: Chuck Miller
Cc: TAXACOM; Bob Mesibov
Subject: Re: [Taxacom] Chameleons, GBIF, and the Red List

Hi Chuck,

No, or at least not in the way I think that you mean.

The "TripAdvisor" model is that the hotel publishes data about itself, and users then add comments supporting or disputing the attributes of the hotel. So, it's assumed that there is an authoritative source of data on the hotel, and we get to put sticky notes on that information. These notes may be ignored by the hotel.

The model I'm proposing (based on http://fluidinfo.com ) is that everyone gets to publish the same kind of data, and then we reconcile that (based, in part, on how much we trust the sources).

Imagine, for example, that a Hotel says "our address is Cool Street". In TripAdvisor someone may add a comment saying "the address is 130 Cool Street", and somebody else might add "the post code is 12345". At this point, there's no mechanism for the hotel description to be updated to include the street number and post code. TripAdvisor will keep saying the hotel is in Cool Street until somebody at TripAdvisor reads the comments, talks to the hotel, and updates the information.

Imagine, instead, that we treat the three sources as equivalent, then we can add

"Cool Street"
"130"
"12345"

and get "130 Cool Street 12345".
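A minimal sketch of the merge described above, assuming each source contributes a partial record keyed by field name. The field names and the rendering order are hypothetical, chosen only to reproduce the example; nothing here is taken from fluidinfo or TripAdvisor.

```python
from typing import Dict, List

# Hypothetical field order used to render an address (an assumption for this example).
ADDRESS_FIELDS = ["number", "street", "postcode"]


def merge_assertions(assertions: List[Dict[str, str]]) -> Dict[str, str]:
    """Combine partial records, keeping the first value seen for each field."""
    merged: Dict[str, str] = {}
    for assertion in assertions:
        for field, value in assertion.items():
            merged.setdefault(field, value)
    return merged


def render_address(record: Dict[str, str]) -> str:
    """Join whichever address fields are present, in a fixed order."""
    return " ".join(record[f] for f in ADDRESS_FIELDS if f in record)


# Three sources treated as equivalent: the hotel and two commenters.
sources = [
    {"street": "Cool Street"},
    {"number": "130"},
    {"postcode": "12345"},
]

combined = render_address(merge_assertions(sources))
print(combined)  # 130 Cool Street 12345
```

Here the sources never disagree, so "first value wins" is enough. Conflicting values are where trust in the sources would have to come in.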

So, it's a bit like TripAdvisor, but imagine that we restrict ourselves to just the comments, and we don't treat the hotel as the definitive source of information. So, we combine the information from the comments, and from that make a summary of the data.

Of course, we might trust some commenters more than others - I find it useful to ignore any complaints about room size if the comments come from the US, because their expectations are frankly ridiculous ;) We might give extra weight to information provided by the hotel itself, or we may choose to trust someone else.

In the context of a museum or herbarium specimen, I would imagine that we'd have multiple sources of data, which might include:

1. the digitised museum catalogue
2. the literature that mentions the specimen
3. the voucher information recorded in GenBank

Given this we can do a number of things:

1. If we trust the museum, we can simply ignore the other sources and go with the "primary source"
2. If we trust the literature more, we may accept that
3. If the sequences suggest a different identification to what the museum says, we may choose to accept GenBank
4. We may choose to take the consensus of all sources, perhaps weighted by some measure of their past performance
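The options above can be sketched roughly as follows. This is only an illustration: the source names, the trust weights, and the placeholder identifications "Species A" and "Species B" are all invented, and the trust weight stands in for "some measure of past performance", which is not defined here.

```python
from collections import defaultdict
from typing import Dict, Optional


def primary_source(assertions: Dict[str, str], source: str) -> Optional[str]:
    """Options 1-3: accept whatever one chosen source says (None if it is silent)."""
    return assertions.get(source)


def weighted_consensus(assertions: Dict[str, str], trust: Dict[str, float]) -> str:
    """Option 4: pick the value whose supporting sources carry the most total trust."""
    if not assertions:
        raise ValueError("no assertions to reconcile")
    scores: Dict[str, float] = defaultdict(float)
    for source, value in assertions.items():
        scores[value] += trust.get(source, 0.0)  # unknown sources get no weight
    return max(scores, key=scores.get)


# Invented identifications for one specimen, keyed by source.
identification = {
    "museum": "Species A",
    "literature": "Species B",
    "genbank": "Species B",
}
# Invented trust weights.
trust = {"museum": 0.5, "literature": 0.3, "genbank": 0.3}

museum_view = primary_source(identification, "museum")
consensus_view = weighted_consensus(identification, trust)
print(museum_view)     # Species A
print(consensus_view)  # Species B
```

With these made-up weights the literature and GenBank together outweigh the museum, so the consensus differs from the "primary source" answer.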

One advantage of this approach is that it doesn't rely on waiting for the museum to accept or reject corrections or annotations. I could geo-reference a bunch of specimens, upload those, and they'd be immediately available to anyone to use. But these wouldn't overwrite the original museum's data (in the same way that if I add a comment to a hotel listing, it doesn't overwrite yours). Users could elect to ignore my georeferencing (for example, by saying "give me only data from the original provider"), they may elect to take just my version of the data, or they may take a synthesis of the data (a bit like the overall hotel rating TripAdvisor computes).
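As a rough sketch of the "no overwriting" idea, assume annotations are stored side by side and the user chooses a view when reading. The class names, specimen identifier, and values below are hypothetical placeholders, not a real API or real data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    specimen_id: str
    source: str   # who asserted this, e.g. the museum or an outside annotator
    field: str
    value: str


class AnnotationStore:
    """Append-only store: adding an annotation never replaces an earlier one."""

    def __init__(self) -> None:
        self._annotations: List[Annotation] = []

    def add(self, annotation: Annotation) -> None:
        self._annotations.append(annotation)

    def values(self, specimen_id: str, field: str,
               source: Optional[str] = None) -> List[str]:
        """Return values for a field, optionally restricted to a single source."""
        return [
            a.value
            for a in self._annotations
            if a.specimen_id == specimen_id
            and a.field == field
            and (source is None or a.source == source)
        ]


store = AnnotationStore()
# Placeholder records: the museum supplies a locality, an annotator adds coordinates.
store.add(Annotation("specimen-1", "museum", "locality", "Cool Street"))
store.add(Annotation("specimen-1", "annotator", "coordinates", "-1.0, 36.0"))

provider_only = store.values("specimen-1", "coordinates", source="museum")
annotator_only = store.values("specimen-1", "coordinates", source="annotator")
everything = store.values("specimen-1", "coordinates")
```

Asking for "only data from the original provider" returns nothing for coordinates here, while the annotator's georeference is available immediately to anyone who wants it. A synthesis would be computed over the unrestricted list.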

Hope this makes sense, I suspect I've not explained this terribly well. One nice outcome of this approach is that the problem of duplicate records becomes less a disaster and more of an opportunity.

Regards

Rod

---------------------------------------------------------
Roderic Page
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email:  Roderic.Page at glasgow.ac.uk
Tel:  +44 141 330 4778
Skype:  rdmpage
Facebook:  http://www.facebook.com/rdmpage
LinkedIn:  http://uk.linkedin.com/in/rdmpage
Twitter:  http://twitter.com/rdmpage
Blog:  http://iphylo.blogspot.com
ORCID:  http://orcid.org/0000-0002-7101-9767
Citations:  http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ


On 25 Aug 2014, at 16:47, Chuck Miller <Chuck.Miller at mobot.org> wrote:

Rod,
Re: "I prefer a different model, where data is considered to be "social" and we can all annotate it (in effect, the museums are themselves simply one annotator)."

Are you talking about a "TripAdvisor" or "Yelp" kind of application for biodiversity data records?

Chuck

-----Original Message-----
From: Roderic Page [mailto:Roderic.Page at glasgow.ac.uk]
Sent: Thursday, August 21, 2014 5:52 PM
To: TAXACOM
Cc: Bob Mesibov
Subject: Re: [Taxacom] Chameleons, GBIF, and the Red List

A couple of quick comments.

Regarding expertise, I agree that there is lots that non-experts can do, but also take Doug's point about the value of taxonomic input. I once saw a talk by Charles Godfray where he was describing the role taxonomic expertise played in building maps of mosquitoes that transmitted malaria (see, e.g. http://dx.doi.org/10.1371/journal.pmed.1000209 ). He said the role of the taxonomist wasn't the oft-assumed one of identifying specimens, instead it was to interpret distributional data from the literature in the light of changing taxonomies.

Where I differ from James is that I'm not really a fan of an annotation model where the focus is on annotating data and pushing those annotations back to the "primary providers". Given the scale of the problem, and that evidence is likely to be widely distributed (problems are often only uncovered when data is aggregated from different sources) I prefer a different model, where data is considered to be "social" and we can all annotate it (in effect, the museums are themselves simply one annotator). There's a bit more about this here: http://iphylo.blogspot.co.uk/2014/04/more-on-annotating-biodiversity-data.html Note that I'm not disputing that it would be nice to feed annotations back to collections, but that this isn't the main goal (and I think that it's pretty clear that there is going to be a huge bottleneck involving this process).

Regards

Rod



On Thu, Aug 21, 2014 at 11:52 AM -0700, "James Macklin" <james.macklin at gmail.com> wrote:

Hi Rod,

Sorry, a little slow... I also think it is important to stress the data quality life cycle here. What we still do not do well is connect the expert work done on these specimens or their digital derivatives (or observations, I guess), which is not done by the source/owner, back to them so the source/owner can clean/update the record and provide it to GBIF and/or other aggregators. The literature is one path where there is reference to the specimens used but as we know not everything ends up published this way. Further, extracting the information from the literature can be challenging even today. Lyubomir and Pensoft make this easy (thanks!) but we are still a long way from convincing other publishers to include the specimen data in a readily accessible form (or even mandating its presence as evidence).

Another way to get expert knowledge back to the source is through annotation. Those of you who know me realize that my colleagues and I have spent a fair bit of time studying this problem and coming up with solutions (FilteredPush). I would say that in general there are now reasonable solutions for achieving distributed annotation at various levels of complexity but there is still a challenge/bottleneck in pushing these annotations back to the source and into their collection management system.

The bottleneck is potentially at the source that must process the annotations. If we automate (or even semi-automate) the annotation process through curation workflows, something my colleagues and I are now focusing on, we could potentially flood the "curators" of the specimens/data. Then the question becomes how much the owners are committed to processing potentially valuable modifications/additions and adding them to their database. Certainly data curation and positions to support it are in their infancy. The annotations that are not processed by the source still have value and can inform the aggregators but have to be dealt with in a slightly different manner.

So, this returns to the issue of when GBIF takes in a record update (or a new record), what metadata follows it to say it has been changed (created) based on some form of expertise...

I think we also need to be careful of the use of the term "expert."  I think it is reasonable to assume that a taxonomist is not going to be any better at georeferencing a specimen based on the collecting event data  (assuming this person was not associated with the collecting event) than a geographer, historian or even a citizen that happens to live near where the event took place. So, in the case of the Chameleon paper, and others like it, the issue really relates to taxonomic expertise and thus the name that appears associated with the record and not the entire record necessarily.

Papers like the Chameleon paper are quick to judge the end product but do not take into consideration what an achievement it is to simply have a GBIF resource and the challenges the greater "we" have overcome just to get this far! Let's stop highlighting the problem yet again and get to work on solving it and making the GBIF resource more valuable to all ;-)

Best,  James

James Macklin, Ph.D.
Research Scientist
Botany and Biodiversity Informatics
Associate Curator of the AAFC National Vascular Plant Collection (DAO)
Agriculture and Agri-Food Canada
Ottawa, Ontario, Canada


On Thu, Aug 21, 2014 at 6:20 AM, Roderic Page <Roderic.Page at glasgow.ac.uk> wrote:
Just to follow up on this discussion:

Stephen, I think I often come across as grumpy, but your cynicism makes me look like a fanboy, so thank you for that ;) Can we maybe assume that GBIF's primary goal isn't to keep bureaucrats happy, that it's genuinely trying to provide access to basic biodiversity information in one place because that seems like a worthwhile goal - leaving aside whether GBIF is the best way to tackle that goal?

Bob, if I understand your argument correctly, it's that access to mostly unvetted biodiversity data isn't much use, and in your view that's mostly what GBIF is serving up. Assuming that it would be nice to have access to good-quality distributional data in one place, what if GBIF provided, say, distributions of species that had been cleaned and had some degree of expert scrutiny? In other words, say a researcher publishes an evidence-based distribution map: what if that was stored on GBIF in a citable form (e.g., had a DOI), and others could download that distribution and make use of it?

I guess this was the thinking behind the now abandoned SDR project (see https://code.google.com/p/gbif-sdr/wiki/PortalIntegration ), and is perhaps where the Map of Life http://mol.org is headed (although at the moment it's simply showing you a bunch of distributions from different sources).

Lyubo, I couldn't agree more, having links to literature related to a record would be great. Many of our online biodiversity databases are devoid of links to the evidence for a particular assertion, but as more and more literature comes online we can do something to fix that. +1 for extracting from the literature, especially if we can automate this at scale (although that will give Bob nightmares).

Regards

Rod


_______________________________________________
Taxacom Mailing List
Taxacom at mailman.nhm.ku.edu
http://mailman.nhm.ku.edu/cgi-bin/mailman/listinfo/taxacom
The Taxacom Archive back to 1992 may be searched at: http://taxacom.markmail.org

Celebrating 27 years of Taxacom in 2014.