[Taxacom] Data quality of aggregated datasets

Richard Pyle deepreef at bishopmuseum.org
Tue May 7 12:03:45 CDT 2013

> I take your point about the description of accuracy and precision however,
> for all practical purposes occupancy within a grid is much more useful.

That's debatable.  It depends on what you want to do with the data. 

Also, you can easily convert point+radius values into a defined grid (which
includes rejecting datapoints whose circle footprint spans more than one
grid cell -- or defining rules about whether or not an occurrence is known
with enough accuracy to be included in a particular grid analysis).  But if
you store your data in the form of occurrence within a particular grid cell,
you can't convert the other direction.  Grid-cell structures may be great
for certain kinds of data analysis, but not so great for capturing/storing
the data.  I'm not sure whether your point in your previous post was that
data should be captured/stored that way (grid-cell), or simply presented to
end-users that way.
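
As a rough illustration of the point+radius to grid conversion described
above, here is a minimal Python sketch.  It assumes projected coordinates in
metres and a square grid; the function name cell_for_record and its
parameters are hypothetical, not part of any existing tool.  A record is
rejected when its circle footprint spans more than one grid cell.

```python
import math
from typing import Optional, Tuple


def cell_for_record(
    x: float,
    y: float,
    radius: float,
    cell_size: float,
    origin: Tuple[float, float] = (0.0, 0.0),
) -> Optional[Tuple[int, int]]:
    """Return the (column, row) of the grid cell containing the whole
    circle footprint, or None if the circle spans more than one cell.

    Assumption: x, y, radius, cell_size and origin share one planar unit
    (e.g. metres in a projected coordinate system).
    """
    ox, oy = origin
    # A circle fits in an axis-aligned cell exactly when its bounding box
    # does, so only the four extreme coordinates need checking.
    col_min = math.floor((x - radius - ox) / cell_size)
    col_max = math.floor((x + radius - ox) / cell_size)
    row_min = math.floor((y - radius - oy) / cell_size)
    row_max = math.floor((y + radius - oy) / cell_size)
    if col_min != col_max or row_min != row_max:
        return None  # footprint crosses a cell border: reject
    return (col_min, row_min)
```

The reverse conversion has no equivalent: once only the cell index is
stored, the original point and its radius cannot be recovered.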

Personally, I'd rather have the data presented to me as point+radius, then
I'll make my own rules about how to reject points from the analysis and what
cell to score each point in (if I want to do a cell analysis).  So, in
either case, while you have a good point for how people use and transform
these data to perform certain kinds of analysis, I'm not sure it has bearing
on how data should be captured/stored/presented to end users.

> The conversion of squares to circles and circles back to squares could be
> origin of many of the discrepancies between original data and GBIF and
> lead to the rejection of potentially useful data when the circle borders 
> overlap with those of the square. 

Yes -- which is why you don't want to convert the data back and forth.  But
you have to store it somehow, and as I said above, it's better to start with
a point-radius that is captured/stored as accurately as the original data
allow, and then convert to a grid (=bounding box) if/when the analysis calls
for it, than to start with an arbitrary grid (i.e., as opposed to a
bounding box whose shape/borders are optimized to represent the smallest
footprint within which the occurrence is likely to have happened).  In other
words, the error is not symmetrical when doing the conversion.  Converting
from point-radius (or even from a custom bounding box) to an arbitrary grid
involves less data loss than the other way around.

> I'd like to see anyone uses those radii for anything else except for
> if a record belongs within a grid square. Why shouldn't taxonomists
> gridded data in the first place, just as ecologist have been for years?

I use the radius to reject points with insufficient accuracy.  When data
have been converted to a grid, you lose that information.  For example, if
you capture the data for a particular grid, you cannot later analyze the
data based on a different grid pattern.  If your data are in point+radius
format, you can very easily determine which points are appropriate for use
within a grid of different scale or offset.
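
To make the re-gridding point concrete, the following Python sketch scores
the same point+radius records against grids of different scale and offset.
It is only an illustration under stated assumptions: planar coordinates in
metres, a square grid, and hypothetical names (Record, regrid) and sample
values that do not come from any real dataset.

```python
import math
from typing import Dict, List, NamedTuple, Tuple


class Record(NamedTuple):
    name: str      # hypothetical record identifier
    x: float       # easting, metres (assumed projected coordinates)
    y: float       # northing, metres
    radius: float  # uncertainty radius, metres


def regrid(
    records: List[Record],
    cell_size: float,
    origin: Tuple[float, float] = (0.0, 0.0),
) -> Tuple[Dict[Tuple[int, int], List[str]], List[str]]:
    """Assign each record to a grid cell, or reject it if its circle
    footprint spans more than one cell.  Returns (cells, rejected)."""
    ox, oy = origin
    cells: Dict[Tuple[int, int], List[str]] = {}
    rejected: List[str] = []
    for rec in records:
        # Cell indices of the circle's bounding-box corners.
        lo = (math.floor((rec.x - rec.radius - ox) / cell_size),
              math.floor((rec.y - rec.radius - oy) / cell_size))
        hi = (math.floor((rec.x + rec.radius - ox) / cell_size),
              math.floor((rec.y + rec.radius - oy) / cell_size))
        if lo == hi:
            cells.setdefault(lo, []).append(rec.name)
        else:
            rejected.append(rec.name)
    return cells, rejected


# Made-up sample records, for illustration only.
records = [
    Record("a", 5500, 2500, 100),
    Record("b", 5950, 2900, 100),
    Record("c", 4000, 4000, 2000),
]

fine = regrid(records, 1000)                       # 1 km grid
shifted = regrid(records, 1000, origin=(500, 500)) # same scale, offset
coarse = regrid(records, 10000)                    # 10 km grid
```

With these sample values, a different set of records qualifies under each
grid, which is only possible to work out because the point and radius were
kept.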

