[Taxacom] Errors in compilations
mesibov at southcom.com.au
Wed Jan 15 23:50:07 CST 2014
"The experience I have says that if the number of names in a list exceeds a number of around 2000, getting the error rate down to zero will become most unlikely. 2000 seems to be some kind of a limit. Usual error rates in longer lists range around 2-5 %, also depending on how many data are incorporated to each entry."
Why is the "2-5%" error rate you experience so much higher than in other big lists of records, such as the customer account databases kept by many businesses, i.e. the ones in which records start off with manual entry of the personal data supplied by customers on paper forms?
How is the error split up between
(a) errors in data sources, such as original publications
(b) inaccurate manual data entry during the listing process
(c) errors in processing of data from one format or compilation to another?
Does your experience include formal data cleaning (e.g. http://en.wikipedia.org/wiki/Data_cleansing, http://www.datacleansing.net.au/Data_Cleansing_Services)? If so, what are the residual errors? Are they all (or almost all) problems in original publications?
My own experience with *good*, low-error compilations is that most of the inconsistencies have to do with the spelling of names. Is that also your experience?
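As a rough illustration of how spelling inconsistencies of this kind might be flagged automatically, here is a minimal Python sketch using the standard-library difflib module. The function name near_duplicates, the 0.9 similarity threshold and the sample names are all assumptions made up for this example; they are not taken from any real compilation or cleaning service.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(names, threshold=0.9):
    """Return (name_a, name_b, ratio) for distinct names that look alike.

    The threshold is an arbitrary assumption; a real list would need tuning.
    """
    unique = sorted(set(names))  # exact duplicates are collapsed first
    pairs = []
    for a, b in combinations(unique, 2):
        # Case-insensitive similarity between 0.0 and 1.0
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs


# Invented placeholder names, not real taxa
sample = [
    "Exemplum primum",
    "Exemplum primun",   # likely a misspelling of the entry above
    "Alterum secundum",
    "Exemplum primum",   # exact repeat, ignored by the comparison
]
suspects = near_duplicates(sample)
for a, b, ratio in suspects:
    print(f"check: {a!r} vs {b!r} (similarity {ratio})")
```

Comparing every pair is quadratic in the number of names, so for a list of several thousand entries it would probably make sense to compare names only within the same genus or the same first letter. A flagged pair is only a candidate for checking against the original publication, not proof of an error.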
Dr Robert Mesibov
Honorary Research Associate
Queen Victoria Museum and Art Gallery, and
School of Agricultural Science, University of Tasmania
PO Box 101, Penguin, Tasmania, Australia 7316
(03) 64371195; 61 3 64371195