# [Taxacom] Data quality of aggregated datasets

Doug Yanega dyanega at ucr.edu
Tue May 7 15:21:17 CDT 2013

```
On 5/7/13 12:04 PM, Dave Vieglais wrote:
> In addition, the confidence associated with the error estimate should also be recorded. For example, does "+/- 100m" refer to the 95% confidence interval? Circular Error Probable (50%)? 3 sigma ellipse (98.9%)? etc.
>
This sounds like what I mean by false precision. If your error radius is
"The ballpark estimate for the wandering radius of this particular
entomologist after he parked his car" then there is NO utility in trying
to assign confidence limits and probability values to that radius. If
you just say "We arbitrarily draw a 2 km radius to accommodate all
reasonable sources of error LESS than 2 km in extent", then,
likewise, no further parameters are necessary or appropriate. Even for
georeferences that I personally recorded using a GPS, I will at least
double the actual distance I walked from that point when providing an
error radius, rounded to the nearest multiple of 100 meters, just so it
is NOT necessary to specify anything more detailed - because it is
vastly simpler to just use a large enough error value to SUBSUME any of
the error values one could painstakingly calculate or quantify using
other means. Why would anyone bother trying to calculate a confidence
interval for a label that says only "6.35 miles S of Chicago, IL", or
"Delhi, India"? Worrying about false precision just wastes time, since
it doesn't increase the value of one's data; that is, in effect, the
criterion that distinguishes *false* precision. More numbers, or more
decimal places, are valuable *only up to a point*. What is important, as
Rich Pyle notes, is ACCURACY.

On 5/7/13 12:35 PM, Dean Pentcheff wrote:
> I agree. I see those "elaborate best practices" (as encoded by Chapman
> & Wieczorek in the Geomancer document) as a codification of
> well-thought-through rules of thumb that can be applied in the
> absence of other information. The key (in my mind) is that the
> "objective, and very precise" estimates always yield to additional
> information.
>
> "campground". If the hypothetical label was just "4 mi E Logan", I
> don't think you could do much better than the automatic estimated
> location. But "campground 4 mi E Logan" lets you (yes you, an expert
> :) snuffle around for that feature, find it, assess whether it's
> likely to be the campground in question (how many other campgrounds
> are in that area?), and if it seems reasonable, assign that as the
> high-probability collection location.
>
Actually, a human would still do much better than Biogeomancer if the
label was just "4 mi E Logan", because the tool would start from the
center of the city, whereas a human would start from the edge AND
measure along a road. The bigger the city, and the more its roads
deviate from straight lines along cardinal compass headings, the worse
an automated system will perform. One does not have to be an expert, one simply has
to realize that (1) human beings driving cars measure distances from
boundaries and landmarks, using their odometers, (2) roads can curve,
and (3) roads can go at angles other than increments of 45 degrees
relative to a starting point. Show me an automated georeferencing tool
that incorporates all three of those realities and I'll be the first
person to hail that tool as the answer to our prayers. For the time
being, I simply don't accept that you can get high-quality georef data
except by human analysis. Remember, empirical comparisons of automated
and human georeferencing indicate only about a 40% match between the two
protocols, and that only 60% of records overlap within a reasonable
error radius.

Sincerely,

--
Doug Yanega      Dept. of Entomology       Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314     skype: dyanega
phone: (951) 827-4315 (disclaimer: opinions are mine, not UCR's)
http://cache.ucr.edu/~heraty/yanega.html
"There are some enterprises in which a careful disorderliness
is the true method" - Herman Melville, Moby Dick, Chap. 82

```
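Doug's rule of thumb for a subsuming error radius (double the distance walked from the GPS point, fold in the other error sources, round to a multiple of 100 m) can be sketched as below. The function name is invented for illustration, and rounding *up* rather than to the nearest multiple is my assumption, chosen to keep the radius genuinely subsuming:

```python
import math

def subsuming_radius(component_errors_m, walked_m=0.0):
    """Return a single error radius (meters) large enough to subsume
    all individual error sources: double the distance actually walked
    from the recorded point, add the other error components, and round
    UP to the next multiple of 100 m."""
    total = 2 * walked_m + sum(component_errors_m)
    return math.ceil(total / 100.0) * 100

# e.g. a 15 m GPS fix plus a 30 m datum uncertainty, having walked
# about 120 m from the recorded point:
print(subsuming_radius([15, 30], walked_m=120))  # -> 300
```

The point of the rounding is exactly the one made above: a single deliberately generous radius makes confidence intervals and per-source error bookkeeping unnecessary.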
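The center-versus-edge point in the "4 mi E Logan" example can be illustrated numerically. This is a minimal sketch of reality (1) only, the starting-point difference, using hypothetical coordinates for Logan and an assumed town edge; modelling road curvature and odometer-measured distances (realities 2 and 3) would require actual road-network data:

```python
import math

M_PER_MILE = 1609.344
M_PER_DEG_LAT = 111_320  # approximate meters per degree of latitude

def offset_east(lat, lon, miles):
    """Naive point-radius step: move straight east (cardinal heading)."""
    dlon = (miles * M_PER_MILE) / (M_PER_DEG_LAT * math.cos(math.radians(lat)))
    return lat, lon + dlon

# Hypothetical coordinates, for illustration only:
center = (41.735, -111.834)  # assumed centroid of Logan
edge   = (41.735, -111.800)  # assumed eastern edge of town

auto  = offset_east(*center, 4)  # automated tool: 4 mi E of the centroid
human = offset_east(*edge, 4)    # human: 4 mi E of the town edge

# The two candidate points differ by the centroid-to-edge distance,
# here roughly 2.8 km of systematic offset before road curvature is
# even considered:
diff_m = (human[1] - auto[1]) * M_PER_DEG_LAT * math.cos(math.radians(41.735))
print(round(diff_m))
```

Even this simplest of the three realities produces a kilometer-scale disagreement, which is consistent with the low match rates between automated and human georeferencing cited above.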