[Taxacom] Data quality of aggregated datasets

Quentin Groom quentin.groom at br.fgov.be
Tue May 7 05:33:24 CDT 2013

Hi Rich,
I take your point about the description of accuracy and precision;
however, for all practical purposes occupancy within a grid is much more
useful. Currently, all GBIF data is converted from points into grid
occupancy before any modelling is done, even though much of the data in
GBIF was gridded data to start with. The conversion of squares to
circles and circles back to squares could be the origin of many of the
discrepancies between original data and GBIF, and will lead to the
rejection of potentially useful data when the circle borders overlap
with those of the square. I'd like to see anyone use those radii for
anything other than determining whether a record belongs within a grid
square. Why shouldn't taxonomists collect gridded data in the first
place, just as ecologists have been doing for years?
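
To make the square-to-circle-to-square problem concrete, here is a minimal
sketch. It assumes planar coordinates in metres (for example UTM eastings and
northings) and a grid aligned to multiples of the cell size. The function
names and the example square are hypothetical and do not describe how GBIF
actually processes records.

```python
import math

def square_to_circle(x_min, y_min, size):
    """Centre and radius of the smallest circle that covers a grid square."""
    half = size / 2.0
    return x_min + half, y_min + half, half * math.sqrt(2.0)

def circle_fits_cell(cx, cy, radius, size):
    """True if the circle lies wholly inside the grid cell holding its centre."""
    x_min = math.floor(cx / size) * size
    y_min = math.floor(cy / size) * size
    return (cx - radius >= x_min and cx + radius <= x_min + size and
            cy - radius >= y_min and cy + radius <= y_min + size)

# A record originally captured as the 1 km square with origin (532000, 181000).
cx, cy, r = square_to_circle(532000, 181000, 1000)
print(round(r, 1))                          # 707.1
print(circle_fits_cell(cx, cy, r, 1000))    # False: the circle spills over the borders
print(circle_fits_cell(cx, cy, 100, 1000))  # True: a 100 m radius stays inside
```

Under these assumptions, a record that started life as a grid square can never
be assigned unambiguously back to that square once it has been stored as a
covering circle, because the circle's radius (about 707 m for a 1 km square)
always exceeds half the cell width.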

Richard Pyle wrote:
> The problem with using number of decimal places to represent accuracy is
> that you're limited to such representation only in powers of ten. Moreover,
> assuming you store these values as numbers, ten percent of your values will
> be one order of magnitude off in terms of precision (i.e., ten percent of
> values will have the final digit as "0", which numeric data fields will
> trim); one percent will be off by two orders of magnitude, and so on.  The
> combination of an arbitrarily precise point plus a radius, when used as Bob
> describes below, is a far more flexible & powerful method for representing
> both place and accuracy.  As Bob says, the correct way to interpret a
> point+radius is as the definition of a circle, within which there is a high
> probability for the occurrence to have happened. There are methods for
> calculating the radius, such that unknown datum, datum error, and other
> factors are taken into account.
> Note: Place some emphasis on the word "arbitrarily" above when I talk about
> "arbitrarily precise". As long as an appropriate radius value is provided,
> there is absolutely no harm in representing the "point" part via arbitrarily
> precise numbers, such as -17.6000003814697, 145.699996948242.  There is,
> however, non-trivial harm when doing so while relying on the number of
> included digits as a representation of accuracy.
> Note also the difference between precision and accuracy.  The location of a
> collected insect could (theoretically) be represented by coordinates with a
> precision of a few cm to a few mm (depending on what kind of insect we're
> talking about).  The point+radius approach is intended to represent
> accuracy, not precision.  The precision of the true location will always be
> limited by the physical size of the organism; but the accuracy will
> generally be much larger than this.
> Aloha,
> Rich
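
Rich's figures for trailing zeros can be checked with a short sketch. It
assumes coordinates recorded to exactly three decimal places with uniformly
distributed digits, and it simulates a numeric field by stripping trailing
zeros from the text. The helper name is hypothetical.

```python
def decimals_after_trim(text):
    """Decimal places left once trailing zeros are dropped, as a numeric field would."""
    fraction = text.split(".")[1].rstrip("0")
    return len(fraction)

# Every coordinate fraction written to exactly three decimal places.
recorded = [f"{k / 1000:.3f}" for k in range(1000)]
kept = [decimals_after_trim(t) for t in recorded]

# Share of values whose implied precision is off by at least one,
# and by at least two, orders of magnitude.
lost_one_or_more = sum(1 for d in kept if d <= 2) / len(kept)
lost_two_or_more = sum(1 for d in kept if d <= 1) / len(kept)
print(lost_one_or_more)  # 0.1
print(lost_two_or_more)  # 0.01
```

Note that the one percent counted here is "two or more" digits lost, so it is
a subset of the ten percent.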
>> -----Original Message-----
>> From: taxacom-bounces at mailman.nhm.ku.edu
>> [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Robert Mesibov
>> Sent: Monday, May 06, 2013 11:20 PM
>> To: Quentin Groom
>> Subject: Re: [Taxacom] Data quality of aggregated datasets
>> Quentin Groom wrote:
>> "This is the problem I've always had with the point-radius method. It
>> encourages people to document a very precise coordinate and then account
>> for the error in the radius. The error should be obvious from the number of
>> decimal places you write, just like any other measurement."
>> I think that depends on how you understand the point-radius method. The
>> idea is that there's a circle which completely contains the area searched or
>> sampled. The point is simply an estimate of that circle's centre, and is not
>> meant to be an estimate of the location of the actual collecting site (assuming
>> there was just one) plus a measurement error. The point+radius define a
>> circular *area* containing the collecting site in an easily understandable way.
>> How many decimal places you use for the point's location should obviously
>> (to me, anyway) depend on the magnitude of the radius. Recording
>> 22°06'57.54"S 117°53'15.31"E +/- 100 m is, I think, bizarre. [We've had a
>> discussion about this on Taxacom before, and some listers think that
>> rounding off is throwing away data.]
>> You can also define a collecting *area* by the implied uncertainty in a single
>> point estimate, as you suggest. However, I don't think many people on this
>> list would know the uncertainty at a glance in 22.116°S 117.888°E. A computer
>> can pull it out, but eyeballing the area in a point-radius record is easier.
>> Another difficulty with implied uncertainty occurs with the above-mentioned
>> computer when UTM data are converted to lat/lon, or vice-versa, or for that
>> matter with lat/lon format conversions. In my audit paper in ZooKeys I cite a
>> wonderful GBIF/ALA example where '12 km SE of Millaa Millaa' (Queensland,
>> 1971) got processed from 17°36'S 145°42'E to -17.6000003814697
>> 145.699996948242. Implied uncertainty of a few atomic radii, maybe?
>> --
>> Dr Robert Mesibov
>> Honorary Research Associate
>> Queen Victoria Museum and Art Gallery, and School of Agricultural Science,
>> University of Tasmania Home contact: PO Box 101, Penguin, Tasmania,
>> Australia 7316
>> Ph: (03) 64371195; 61 3 64371195
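
The Millaa Millaa digits quoted above can be reproduced with a short sketch.
It assumes, as one plausible cause that nobody in the thread states, that the
decimal degrees passed through 32-bit floating-point storage before being
printed at 64-bit precision. The helper names are hypothetical.

```python
import struct

def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    return -value if hemisphere in ("S", "W") else value

def through_float32(value):
    """Round-trip a 64-bit Python float through 32-bit storage."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

lat = dms_to_decimal(17, 36, 0, "S")   # 17°36'S
lon = dms_to_decimal(145, 42, 0, "E")  # 145°42'E

# Printed to 15 significant digits, the 32-bit rounding error becomes visible.
print(f"{through_float32(lat):.15g}")  # -17.6000003814697
print(f"{through_float32(lon):.15g}")  # 145.699996948242
```

The extra digits carry no information about the locality: the source
coordinates were given only to the nearest minute of arc, so the long decimal
tail is purely an artefact of number storage.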

Dr. Quentin Groom
(Botany and Information Technology)

National Botanic Garden of Belgium
Domein van Bouchout
B-1860 Meise

ORCID: 0000-0002-0596-5376

Landline: +32 (0) 226 009 20 ext. 364
FAX:      +32 (0) 226 009 45

E-mail:     quentin.groom at br.fgov.be
Skype name: qgroom
Website:    www.botanicgarden.be
