[Taxacom] Data quality of aggregated datasets

Mary Barkworth Mary.Barkworth at usu.edu
Tue May 7 13:58:29 CDT 2013

I agree - and we often have a comment in our georeferencing field that we have modified the data provided by Geolocate. 

-----Original Message-----
From: taxacom-bounces at mailman.nhm.ku.edu [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Doug Yanega
Sent: Tuesday, May 7, 2013 12:38 PM
To: taxacom at mailman.nhm.ku.edu
Subject: Re: [Taxacom] Data quality of aggregated datasets

On 5/7/13 5:15 AM, Mary Barkworth wrote:
> The basic reason that the data will always be "raw" is that we have no reliable means of communicating with the dead. When a label says Logan, Utah, I am told to use the city's current boundaries. Technically, I could look up its boundaries at the time the specimen was collected but perhaps all the collector was doing was naming the nearest settlement that he or she knew of, or the postal district, or home base for that day or week. Moreover, no one is willing to pay the herbarium for the additional work required to check into alternative estimates. Actually, they are not willing to pay for anything; we (like all collections) provide the data for free. When it comes to collection data, we can adhere to standard protocols but all that provides are estimates calculated in a standard way. Whether that is good enough depends on the question being addressed and the organism(s) involved. Data users should always evaluate the data they wish to use - and be grateful for the quantity being made available (gratitude can be expressed by informing the head of the collection of any errors that need fixing, mention in the acknowledgements, and an email to the head of the collection who may not otherwise know that its records have been used).

This is an example of how different standards and protocols make a difference. In our database, we accept the USGS GNIS georef placement of Logan, Utah as 41 44 08 N, 111 50 04 W. However, we use an error radius of 10 km around that point. This is an *arbitrary* error radius used to account for the very real potential that someone whose label simply said "Logan" could have been outside of the boundary of the city proper (which, incidentally, has a radius of around 8 km, if one uses a satellite image to determine the extent of the densely populated zone).
The decision to use a 10 km radius is part of our in-house standard protocol for contending with "populated place" category names, which uses several criteria, virtually all of which include the process "...and then round UP". This does not require any investment of time or energy to look for alternative estimates; we simply opt to play things conservatively, and use the largest minimum error radius (even though that sounds like an oxymoron), to avoid false precision while giving *realistic* accuracy. To further clarify, if (hypothetically) the next nearest city was only 10 km from Logan, then the largest minimum error radius for "Logan" would extend to roughly halfway between the two cities (5 km), because the protocol assumes that a person collecting in between two towns will make labels referring to the nearest one, if they do not otherwise specify displacement.
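
For anyone who wants to see the arithmetic, here is a minimal Python sketch of one possible reading of the rule described above. It is not the actual in-house protocol, which uses several criteria not spelled out here. The function names, the whole-kilometre rounding step, and the idea of passing in the distance to the nearest neighbouring town as a single number are all assumptions made for illustration; only the 10 km default, the "round UP" habit, the halfway rule, and the GNIS coordinates come from the text.

```python
import math


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by the usual convention.
    return -value if hemisphere in ("S", "W") else value


def populated_place_radius_km(default_km=10.0, nearest_neighbor_km=None):
    """Hypothetical 'largest minimum error radius' for a bare town name.

    Assumed reading of the rule: start from the default radius, cap it at
    half the distance to the nearest other town (if that is known), and
    round the result UP to a whole kilometre.
    """
    radius = default_km
    if nearest_neighbor_km is not None:
        radius = min(radius, nearest_neighbor_km / 2)
    return float(math.ceil(radius))


# GNIS placement of Logan, Utah, as quoted above: 41 44 08 N, 111 50 04 W
logan_lat = dms_to_decimal(41, 44, 8, "N")
logan_lon = dms_to_decimal(111, 50, 4, "W")
print(round(logan_lat, 6), round(logan_lon, 6))

print(populated_place_radius_km())                        # no close neighbour
print(populated_place_radius_km(nearest_neighbor_km=10))  # hypothetical town 10 km away
```

With no close neighbour the sketch returns the 10 km default; with a hypothetical town 10 km away it returns 5 km, matching the halfway case described above.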

A few rules of thumb like these can serve to make georeferencing easier and more practical than the rather elaborate set of "best practices" that Dean Pentcheff linked here; those "best practices" are spectacular IF you can afford the time and energy and IF you are really, really focused on precision and objectivity rather than accuracy (especially if you want a computer to do your work for you). This is linked to the desire of the authors of that set of guidelines to automate the process of georeferencing, while developing the Biogeomancer georeferencing tool. But guidelines that are intended to work for automation do not necessarily correspond to protocols that are intended for a human being using, say, Google Earth.

A pertinent example is the label in our collection that reads "campground 4 mi E Logan". Biogeomancer uses the GNIS point I mentioned above as the origin and then measures exactly 4 miles from that, and draws a rather large error radius around the resulting point (based on the theoretical possibility that the angle of displacement could have been anything between NE and SE). Very objective, and very precise - and utterly wrong; the campground in question is not within this circle, because most of that circle is inside the city limits of Logan, which is more than 4 miles in radius. A human-powered protocol would start measuring from the eastern edge of the city, rather than its center, and measure actual distances along roads, rather than fixed compass directions in straight lines. A human using Google Earth can see that there is indeed a campground along the highway almost exactly four miles east of the mouth of Logan Canyon, which abuts the eastern edge of the city, and one can plot that point with a very small error radius (basically, the limits of the campground itself) - which is in fact both more accurate AND more precise than the "objective" protocol.

The reason I bother to go through this example in such detail is the end result: a data provider using an automated georef tool will give a point that is 5 miles away from the actual collecting site, AND in a completely different habitat. That is an extremely significant error, resulting solely from the reliance on automation - two data providers starting with original data of the exact same quality (a label reading "campground 4 mi E Logan") and following different "standard protocols" will produce data sets of completely *different* quality. I doubt that data aggregators or users are paying any attention to WHAT the georeferencing protocols are behind the datasets they are using.
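
The effect of the choice of origin can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it is not Biogeomancer's actual algorithm, it uses straight great-circle offsets on a spherical Earth rather than road distances, and the "eastern edge" is a hypothetical point placed 8 km due east of the GNIS coordinates, borrowing the rough city radius mentioned earlier. The function names and constants are mine.

```python
import math

EARTH_RADIUS_KM = 6371.0088   # mean Earth radius, spherical approximation
KM_PER_MILE = 1.609344


def destination(lat, lon, bearing_deg, distance_km):
    """Point reached from (lat, lon) along an initial bearing on a sphere."""
    phi1, lam1 = math.radians(lat), math.radians(lon)
    theta = math.radians(bearing_deg)
    delta = distance_km / EARTH_RADIUS_KM
    phi2 = math.asin(math.sin(phi1) * math.cos(delta)
                     + math.cos(phi1) * math.sin(delta) * math.cos(theta))
    lam2 = lam1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2))
    return math.degrees(phi2), math.degrees(lam2)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# GNIS point for Logan (41 44 08 N, 111 50 04 W) in decimal degrees.
center = (41.735556, -111.834444)

# "Automated" reading: 4 miles due east of the city's center point.
auto_point = destination(*center, 90.0, 4 * KM_PER_MILE)

# "Human" reading (ASSUMED geometry): start from a hypothetical eastern
# edge 8 km east of the center, then go 4 miles east from there.
edge = destination(*center, 90.0, 8.0)
human_point = destination(*edge, 90.0, 4 * KM_PER_MILE)

gap_km = haversine_km(*auto_point, *human_point)
print(f"gap between the two readings: {gap_km:.2f} km "
      f"({gap_km / KM_PER_MILE:.2f} mi)")
```

By construction the two readings end up roughly 8 km (about 5 miles) apart, which is simply the assumed shift in origin; the point is that the disagreement comes entirely from where the measurement starts, not from the label.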


Doug Yanega      Dept. of Entomology       Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314     skype: dyanega
phone: (951) 827-4315 (disclaimer: opinions are mine, not UCR's)
   "There are some enterprises in which a careful disorderliness
         is the true method" - Herman Melville, Moby Dick, Chap. 82

Taxacom Mailing List
Taxacom at mailman.nhm.ku.edu


Celebrating 26 years of Taxacom in 2013.
