Here are a couple of interesting comments that have come up in response to my recent article on Smart Data Collective: “Big Data Blasphemy: Why Sample?”
From Simon Geletta, Associate professor at Des Moines University, commenting in the SAS Analytics & BI group on LinkedIn:
“That (everything being equal) bigger samples result in narrower confidence intervals is not a matter of opinion… The argument for sampling as presented in this blog, would benefit (become worth following) if the blogger can demonstrate that estimations that are based on sample (from a bad sampling frame) yield better results as compared to estimations that are based on the bad sampling frame itself.”
My thoughts:
Yes, it is a matter of fact that bigger samples result in narrower confidence intervals.
The case for sampling does not depend on samples producing better estimates than all of the available data would. There are cases in which sample data can be more carefully inspected, corrected and otherwise cleaned, and in those cases estimates may indeed be better than those made from a larger, dirtier data resource. However, even when sampling brings no improvement in data quality, there is always the issue of balancing the resources consumed against the value of the information obtained. The question is not whether the estimate obtained from a sample is better than the estimate that would be obtained from all available data. Rather, it is a matter of obtaining “good-enough” information to address the business problem at hand, and of doing so in a manner that does not waste resources that could yield greater returns if used in some other way.
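To put rough numbers on that trade-off, here is a minimal Python sketch. It assumes a simple random sample and a normal-approximation confidence interval for a mean. The standard deviation and the sample sizes are hypothetical values chosen only for illustration, and the function name `ci_half_width` is my own, not taken from any library.

```python
import math
from statistics import NormalDist


def ci_half_width(std_dev: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of a normal-approximation confidence interval for a mean."""
    # Two-sided critical value, e.g. roughly 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * std_dev / math.sqrt(n)


# Hypothetical inputs, chosen only for illustration.
ASSUMED_STD_DEV = 10.0
SAMPLE_SIZES = [1_000, 10_000, 100_000, 1_000_000]

half_widths = {n: ci_half_width(ASSUMED_STD_DEV, n) for n in SAMPLE_SIZES}

for n, hw in half_widths.items():
    print(f"n = {n:>9,}: 95% CI half-width is about {hw:.4f}")
```

Under these assumptions the half-width shrinks in proportion to one over the square root of the sample size, so processing 100 times as much data narrows the interval only by a factor of 10. The interval does keep getting narrower, as the commenter says, but each additional record buys less precision than the one before it, which is why a "good-enough" sample can be the sensible use of resources.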
From Blaise Egan, Lead Network Infrastructure Analyst at British Telecommunications PLC, commenting in the Predictive Analytics group on LinkedIn:
“This reflects my own experience encountering data miners with a background in computer science rather than statistical science. To some extent they have been hoodwinked by misleading sales material from vendors of large-scale computing systems, both hardware and software.
It’s an important message that you’re putting out.”