[Taxacom] Data quality of aggregated datasets

Doug Yanega dyanega at ucr.edu
Tue May 7 13:37:41 CDT 2013


On 5/7/13 5:15 AM, Mary Barkworth wrote:
> The basic reason that the data will always be "raw" is that we have no reliable means of communicating with the dead. When a label says Logan, Utah, I am told to use the city's current boundaries. Technically, I could look up its boundaries at the time the specimen was collected, but perhaps all the collector was doing was naming the nearest settlement that he or she knew of, or the postal district, or home base for that day or week. Moreover, no one is willing to pay the herbarium for the additional work required to check into alternative estimates. Actually, they are not willing to pay for anything; we (like all collections) provide the data for free. When it comes to collection data, we can adhere to standard protocols, but all that provides are estimates calculated in a standard way. Whether that is good enough depends on the question being addressed and the organism(s) involved. Data users should always evaluate the data they wish to use - and be grateful for the quantity being made available (gratitude can be expressed by informing the head of the collection of any errors that need fixing, mention in the acknowledgements, and an email to the head of the collection, who may not otherwise know that its records have been used).
>
This is an example of how different standards and protocols make a 
difference. In our database, we accept the USGS GNIS georef placement of 
Logan, Utah as 41 44 08 N, 111 50 04 W. However, we use an error radius 
of 10 km around that point. This is an *arbitrary* error radius used to 
account for the very real potential that someone whose label simply said 
"Logan" could have been outside of the boundary of the city proper 
(which, incidentally, has a radius of around 8 km, if one uses a 
satellite image to determine the extent of the densely populated zone). 
The decision to use a 10 km radius is part of our in-house standard 
protocol for contending with "populated place" category names, which 
uses several criteria, virtually all of which include the process 
"...and then round UP". This does not require any investment of time or 
energy to look for alternative estimates; we simply opt to play things 
conservatively, and use the largest minimum error radius (even though 
that sounds like an oxymoron), to avoid false precision while giving 
*realistic* accuracy. To further clarify: if (hypothetically) the next 
nearest city were only 10 km from Logan, then the largest minimum error 
radius for "Logan" would shrink to roughly half the distance between the 
two cities (5 km), because the protocol assumes that a person collecting 
between two towns will make labels referring to the nearer one, if they 
do not otherwise specify displacement.
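For what it's worth, the rules of thumb above can be sketched in a few lines of code. This is purely illustrative: the function name, treating the 10 km floor as a parameter, and the test values are my own framing, not part of any official protocol.

```python
import math

def min_error_radius_km(city_radius_km, dist_to_nearest_city_km,
                        floor_km=10.0):
    """Pick a conservative ("largest minimum") error radius for a
    populated-place locality: take the city's own extent, round UP to
    the floor value, but never claim ground closer to a neighboring
    town than to this one."""
    # Start from the city's extent, rounded up to the floor.
    radius = max(math.ceil(city_radius_km), floor_km)
    # Cap at half the distance to the nearest other populated place,
    # since a collector between two towns cites the nearer one.
    radius = min(radius, dist_to_nearest_city_km / 2.0)
    return radius

# "Logan" with its ~8 km built-up radius and no nearby town -> 10 km
print(min_error_radius_km(8, 100))   # 10.0
# Hypothetical neighbor only 10 km away -> radius shrinks to 5 km
print(min_error_radius_km(8, 10))    # 5.0
```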

A few rules of thumb like these can serve to make georeferencing easier 
and more practical than the rather elaborate set of "best practices" 
that Dean Pentcheff linked here; those "best practices" are spectacular 
IF you can afford the time and energy and IF you are really, really 
focused on precision and objectivity rather than accuracy (especially if 
you want a computer to do your work for you). This is linked to the 
desire of the authors of that set of guidelines to automate the process 
of georeferencing, while developing the Biogeomancer georeferencing 
tool. But guidelines that are intended to work for automation do not 
necessarily correspond to protocols that are intended for a human being 
using, say, Google Earth. A pertinent example is the label in our 
collection that reads "campground 4 mi E Logan". Biogeomancer uses the 
GNIS point I mentioned above as the origin and then measures exactly 4 
miles from that, and draws a rather large error radius around the 
resulting point (based on the theoretical possibility that the angle of 
displacement could have been anything between NE and SE). Very 
objective, and very precise - and utterly wrong; the campground in 
question is not within this circle, because most of that circle is 
inside the city limits of Logan, which is more than 4 miles in radius. A 
human-powered protocol would start measuring from the eastern edge of 
the city, rather than its center, and measure actual distances along 
roads, rather than fixed compass directions in straight lines. A human 
using Google Earth can see that there is indeed a campground along the 
highway almost exactly four miles east of the mouth of Logan Canyon, 
which abuts the eastern edge of the city, and one can plot that point 
with a very small error radius (basically, the limits of the campground 
itself) - which is in fact both more accurate AND more precise than the 
"objective" protocol. The reason I bother to go through this example in 
such detail is the end result: a data provider using an automated georef 
tool will give a point that is 5 miles away from the actual collecting 
site, AND in a completely different habitat. That is an extremely 
significant error, resulting solely from the reliance on automation - 
two data providers starting with original data of the exact same quality 
(a label reading "campground 4 mi E Logan") and following different 
"standard protocols" will produce data sets of completely *different* 
quality. I doubt that data aggregators or users are paying any attention 
to WHAT the georeferencing protocols are behind the datasets they are using.
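As a rough back-of-the-envelope illustration of why the two protocols diverge, one can compare a 4-mile eastward offset measured from the city centroid versus from the city's eastern edge. The 5-mile city radius is an assumed value (the text says only "more than 4 miles"), and the sketch uses straight-line east offsets, whereas the human protocol would actually measure along roads:

```python
import math

MI_PER_DEG_LAT = 69.0  # rough conversion; adequate at this scale

def offset_east(lon_deg, lat_deg, miles):
    """Shift a longitude east by a straight-line distance in miles
    (simple spherical approximation, fine for an illustration)."""
    mi_per_deg_lon = MI_PER_DEG_LAT * math.cos(math.radians(lat_deg))
    return lon_deg + miles / mi_per_deg_lon

# GNIS placement of Logan, UT from the text: 41 44 08 N, 111 50 04 W
lat = 41 + 44/60 + 8/3600
lon = -(111 + 50/60 + 4/3600)

city_radius_mi = 5.0  # assumed; text says "more than 4 miles in radius"

automated = offset_east(lon, lat, 4.0)                   # from centroid
human     = offset_east(lon, lat, city_radius_mi + 4.0)  # from east edge

# The two protocols disagree by the city radius itself, ~5 miles:
mi_per_deg_lon = MI_PER_DEG_LAT * math.cos(math.radians(lat))
print((human - automated) * mi_per_deg_lon)  # ~5.0 miles apart
```

The point of the sketch is simply that the discrepancy between the two "standard protocols" is as large as the city itself, which matches the 5-mile error described above.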

Sincerely,

-- 
Doug Yanega      Dept. of Entomology       Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314     skype: dyanega
phone: (951) 827-4315 (disclaimer: opinions are mine, not UCR's)
              http://cache.ucr.edu/~heraty/yanega.html
   "There are some enterprises in which a careful disorderliness
         is the true method" - Herman Melville, Moby Dick, Chap. 82





