The focus in discovery is finding something "truly interesting" in the "potentially interesting" data gathered in step one. The discovery step often starts with descriptive statistics, which provide a summary of what the data tell us about themselves. Let's consider a specific problem: understanding the reaction in social media to different methods of presenting advertisements. I'll consider three descriptions of possible interest and three distributions of data errors.
- Average behavior: What is the average, or typical, reaction to an advertisement (e.g., the middle of the distribution)?
- Variation: How variable is the reaction (e.g., how much spread is there, what separates a "good" reaction from an "excellent" one, are distinct "categories" of advertisements evident)?
- Extreme behavior/rare events: What is the reaction in the tails (e.g., what goes viral)?
- Random errors: The errors are randomly distributed throughout the datasets.
- Skewed errors: The errors are skewed in some fashion. For example, there may be more errors in some important sub-sample, more errors on one side of the overall distribution, or more errors in some datasets than in others.
- Way-out-of-line errors: Whether random or skewed, some data errors are extreme.
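To make the three error patterns concrete, here is a minimal Python sketch of how each one might disturb the three descriptions. Everything in it is an assumption made for illustration: the "reaction" scores are simulated from a lognormal distribution, and the error sizes (noise with a standard deviation of 5, a 50% over-count on about 20% of records, ten records off by a factor of 1,000) are invented rather than taken from real advertising data.

```python
import random
import statistics


def summarize(values):
    """Return a typical value, a measure of spread, and an upper-tail value."""
    ordered = sorted(values)
    p99 = ordered[int(0.99 * (len(ordered) - 1))]  # simple 99th-percentile pick
    return {
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),
        "stdev": statistics.stdev(ordered),
        "p99": p99,
    }


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical reaction scores for 10,000 advertisements (made-up data).
clean = [rng.lognormvariate(3.0, 1.0) for _ in range(10_000)]

# Random errors: zero-mean noise added to every record.
random_err = [x + rng.gauss(0, 5) for x in clean]

# Skewed errors: roughly 20% of records are over-counted by 50%.
skewed_err = [x * 1.5 if rng.random() < 0.2 else x for x in clean]

# Way out of line: ten records are off by a factor of 1,000.
extreme_err = list(clean)
for i in rng.sample(range(len(clean)), 10):
    extreme_err[i] *= 1000

datasets = {
    "clean": clean,
    "random": random_err,
    "skewed": skewed_err,
    "extreme": extreme_err,
}
results = {name: summarize(data) for name, data in datasets.items()}

for name, stats in results.items():
    print(name, {k: round(v, 1) for k, v in stats.items()})
```

Under these assumptions, the ten extreme records should barely move the median while inflating the mean and standard deviation, and the one-sided over-count should push the mean upward. Real data will behave differently depending on where the errors fall.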
The following table summarizes the likely implications for summary statistics. The table is color-coded -- green means little impact, yellow means some impact, and red means trouble. I've included a light green for potentially dangerous situations that good analysts should be able to accommodate.
In the short term, analysts must take rigorous steps to identify and eliminate errors. When I was at Bell Labs, we called the process "rinse, wash, scrub." They must make data quality statistics part and parcel of their overall descriptive statistics. And, overall, they must be extremely cautious! At best, these steps are time-consuming, expensive, and fraught. They are suitable only for the short term.
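One way to make data quality statistics part of the descriptive statistics is to report them in the same summary. The sketch below is only an illustration: the function name `describe_with_quality`, the plausibility bounds `low` and `high`, and the click counts are all hypothetical, and a real pipeline would need checks suited to its own data.

```python
import statistics


def describe_with_quality(values, low, high):
    """Summarize a column and report how much of it was unusable.

    `low` and `high` are assumed plausibility bounds chosen by the analyst.
    Expects at least two usable values, since stdev needs two data points.
    """
    present = [v for v in values if v is not None]
    usable = [v for v in present if low <= v <= high]
    n = len(values)
    return {
        "n_total": n,
        "pct_missing": 100 * (n - len(present)) / n,
        "pct_out_of_range": 100 * (len(present) - len(usable)) / n,
        "n_usable": len(usable),
        "median": statistics.median(usable),
        "stdev": statistics.stdev(usable),
    }


# Made-up click counts: None marks a missing record; -3 and 250000 are suspect.
clicks = [12, 15, None, 9, -3, 14, 250000, 11, None, 13]
report = describe_with_quality(clicks, low=0, high=10_000)
print(report)
```

Reporting the share of missing and out-of-range records next to the median and spread lets a reader judge how far to trust the summary, rather than seeing only the cleaned-up numbers.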
Garbage in, garbage out.
What are you doing to ensure data quality for big-data analytics?