A lot of people talk big about analytics, and sometimes talk is just talk. If you want to separate the women from the girls in this business, ask a few questions about sampling.
Why sampling? Sampling is the very heart of statistical analysis. In fact, a “statistic,” by definition, is a measure derived from a sample. Each and every statistical technique is based on sampling. Data miners, historically, have avoided the study of sampling. But as datasets grow ever larger, so do the costs for hardware, software, and the time of the data miner.
Often, a sample would provide information that, for any practical purpose, is just as good. Maybe it's even better, since a smaller quantity of data is easier to prepare and faster to explore and model, so the results are available sooner. In the age of big data, data analysts of every stripe need to know about sampling methods.
You don’t have to be an expert yourself to get a good sense of whether that big talker is a genuine expert or full of hot air. When you shop for analytics talent, ask questions like these:
- How would you determine the quantity of data we need to address a particular business issue?
- Some groups of customers are hard to reach with surveys. How could that affect our analysis?
- What can we do to avoid introducing bias when we collect data?
You would also do well to ask directly about what training the analyst has had in sampling theory and methods. Here are some key terms to help you along in that discussion.
Simple random sample
This is the most commonly used sample type in statistical analysis, by far. Every individual in the population has an equal chance to be included in the sample. Imperfections in the selection process, which cause some individuals to have greater or lesser chances of inclusion than others, introduce what statisticians call “bias.”
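To make the idea concrete, here is a minimal sketch in Python. The customer ID list and the function name are made up for illustration; the only real machinery is the standard library's `random.sample`, which draws without replacement and gives every item the same chance of selection.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement, each with an equal chance of selection."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, n)

# Hypothetical population: 10,000 customer IDs
customer_ids = list(range(1, 10_001))

# Pull 500 customers at random
srs = simple_random_sample(customer_ids, 500, seed=42)
print(len(srs), "customers selected")
```

Notice what the code does not do: it does not take the first 500 rows, or the customers who answered the phone. Shortcuts like those are where bias creeps in.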
Stratified sample
Several random samples are taken from different segments of a population, usually to ensure representation of segments that are small but important. Special techniques are required to properly analyze data from stratified samples.
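Here is a rough sketch of why those special techniques matter, using made-up data. Assume a hypothetical customer base of 9,900 consumer accounts that each spend 10 and 100 enterprise accounts that each spend 1,000. One common approach, shown here, is to weight each segment's sample average by that segment's share of the population. The function names and the data are assumptions for illustration only.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, n_per_stratum, seed=None):
    """Take a separate random sample from each segment (stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = {
        name: rng.sample(members, min(n_per_stratum, len(members)))
        for name, members in strata.items()
    }
    sizes = {name: len(members) for name, members in strata.items()}
    return sample, sizes

def weighted_mean(sample, sizes, value_of):
    """Weight each stratum's sample mean by its share of the population."""
    total = sum(sizes.values())
    return sum(
        sizes[name] / total * (sum(value_of(r) for r in rows) / len(rows))
        for name, rows in sample.items()
    )

# Hypothetical records: (segment, spend)
customers = [("consumer", 10)] * 9_900 + [("enterprise", 1_000)] * 100

sample, sizes = stratified_sample(customers, lambda r: r[0], 50, seed=1)

# Properly weighted estimate of average spend
estimate = weighted_mean(sample, sizes, lambda r: r[1])

# Naive estimate: pool the sampled rows and ignore the segment sizes
pooled = [r for rows in sample.values() for r in rows]
naive = sum(r[1] for r in pooled) / len(pooled)

print(estimate, naive)
```

In this toy example the true average spend is 19.9. The weighted estimate lands there, while the naive pooled average comes out at 505, because the small enterprise segment makes up half of the sample but only 1 percent of the population.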
Cluster sample
Cluster sampling is typically used when travel or other requirements make it expensive or difficult to collect data. A cluster sample addresses this problem by first taking a sample of locations, perhaps just a handful of cities, and then sampling individuals within those locations. As with stratified samples, special techniques are needed to properly analyze this data.
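The two stages described above can be sketched in a few lines of Python. The cities, residents, and function name here are hypothetical, and the sketch covers only the selection step, not the analysis that follows.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly pick clusters. Stage 2: randomly pick individuals within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }

# Hypothetical population: 20 cities with 1,000 residents each
cities = {
    f"city_{i:02d}": [f"city_{i:02d}/resident_{j}" for j in range(1_000)]
    for i in range(20)
}

# Visit only 4 cities, and interview 25 residents in each
cluster_sample = two_stage_cluster_sample(cities, n_clusters=4, n_per_cluster=25, seed=7)

for city, residents in cluster_sample.items():
    print(city, len(residents))
```

The savings are easy to see: 100 interviews in 4 cities rather than 100 interviews scattered across 20. The price is that people in the same city tend to resemble one another, which is one reason the analysis needs extra care.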
Convenience sample
You have some data lying around. It may not have been collected thoughtfully, and it’s probably far from a random selection from the population, but you just might use it anyway, because it’s handy. That’s a convenience sample.
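A toy illustration of the risk, with invented numbers: suppose a hypothetical file lists 1,000 customers sorted by tenure in days, and the handy data is simply the first 100 rows. The sketch below compares that convenience sample with a random sample of the same size.

```python
import random
import statistics

# Hypothetical file: customer tenure in days, sorted from newest to oldest customer
tenure_days = list(range(1, 1_001))

# Convenience sample: whatever is at the top of the file
convenience = tenure_days[:100]

# Simple random sample of the same size
rng = random.Random(3)
random_rows = rng.sample(tenure_days, 100)

population_mean = statistics.mean(tenure_days)
convenience_mean = statistics.mean(convenience)
random_mean = statistics.mean(random_rows)

print(population_mean, convenience_mean, random_mean)
```

The convenience sample contains only the newest customers, so its average tenure (50.5 days) is nowhere near the population average (500.5 days). The random sample will not hit the true figure exactly, but it has no built-in reason to miss in one direction.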
Want to boost your knowledge about sampling methods? You’ll find basics in most introductory statistics textbooks. Most large university and public libraries stock advanced texts. Online, you can start with the old standby, Wikipedia.
How do you separate those who really know sampling from those who don't? Tell me below.