As I mentioned in my previous post, data quality is critically important to most big-data work, as tracing its impact through the D4 process, "data to discovery to delivery to dollars," makes clear. While that post zeroed in on the quality implications of data gathering, here I focus on the discovery phase.
The focus in discovery is finding something "truly interesting" in the "potentially interesting" data gathered in step one. The discovery step often starts with descriptive statistics, which summarize what the data tell us about themselves. Let's consider a specific problem: understanding the reaction in social media to different methods of presenting advertisements. I'll consider three descriptions of possible interest and three distributions of data errors; a short sketch after the two lists illustrates how the error patterns can distort the statistics.
- Average behavior: What is the average, or typical, reaction to an advertisement (e.g., the middle of the distribution)?
- Variation: How variable is the reaction (e.g., how much spread is there, what distinguishes a "good" reaction from an "excellent" one, and are distinct "categories" of advertisements evident)?
- Extreme behavior/rare events: What is the reaction in the tails (e.g., what goes viral)?
- Random: The errors are randomly distributed throughout the datasets.
- Skewed: The errors are concentrated in some fashion. For example, there may be more errors in an important sub-sample, more errors on one side of the overall distribution, or more errors in some datasets than in others.
- Way out of line: Whether random or skewed, some data errors are extreme.
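To make this concrete, here is a minimal sketch of how each error pattern can shift the three summaries. It uses simulated engagement counts rather than real social-media data, and the distributions, error rates, and magnitudes are illustrative assumptions, not measurements.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" reactions: heavy-tailed engagement counts per ad.
true_reactions = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

def summarize(x, label):
    # The three descriptions of interest: average, variation, extreme behavior.
    print(f"{label:15s} mean={np.mean(x):9.1f}  std={np.std(x):10.1f}  "
          f"p50={np.percentile(x, 50):7.1f}  p99={np.percentile(x, 99):9.1f}")

summarize(true_reactions, "clean")

# Random errors: symmetric measurement noise spread across the whole dataset.
random_err = true_reactions + rng.normal(0, 10, size=true_reactions.size)
summarize(random_err, "random errors")

# Skewed errors: one important sub-sample (say, 20% of ads) is undercounted.
skewed_err = true_reactions.copy()
subsample = rng.random(true_reactions.size) < 0.2
skewed_err[subsample] *= 0.5
summarize(skewed_err, "skewed errors")

# Way-out-of-line errors: a handful of wildly wrong records.
extreme_err = true_reactions.copy()
bad = rng.choice(true_reactions.size, size=20, replace=False)
extreme_err[bad] = 1e6
summarize(extreme_err, "extreme errors")
```

In a run like this, the random noise leaves the mean nearly intact, the skewed errors bias it, and the handful of extreme values wrecks both the mean and the standard deviation while the median barely moves. That is roughly the pattern the table below captures.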
The following table summarizes the likely implications for summary statistics. The table is color-coded: green means little impact, yellow means some impact, and red means trouble. I've included a light green for potentially dangerous situations that good analysts should be able to accommodate.
Overall, the picture is pretty dismal. Good analysts need to be aware of and deal with these issues. At a minimum (and as noted in the previous post) they must know the provenance and quality of all data. Over the long term, they must strive to improve the quality of data, getting data to "trusted quality levels." And they must understand the patterns of errors that remain.
In the short term, they must take rigorous steps to identify and eliminate errors. When I was at Bell Labs, we called the process "rinse, wash, scrub." They must make data quality statistics part and parcel of their overall descriptive statistics. And, above all, they must be extremely cautious! At best, these steps are time-consuming, expensive, and fraught; they are suitable only for the short term.
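As one way of reading that advice, here is a minimal sketch of reporting a few data-quality statistics right beside the usual descriptive ones. The reactions table, column names, thresholds, and rules are hypothetical placeholders, not a prescription.

```python
import pandas as pd
import numpy as np

def describe_with_quality(df: pd.DataFrame, col: str) -> pd.Series:
    """Report descriptive statistics alongside basic data-quality statistics."""
    x = df[col]
    valid = x.dropna()
    q1, q3 = valid.quantile([0.25, 0.75])
    iqr = q3 - q1
    return pd.Series({
        # Descriptive statistics: average, variation, extremes.
        "mean": valid.mean(),
        "std": valid.std(),
        "p50": valid.median(),
        "p99": valid.quantile(0.99),
        # Data-quality statistics: how much of the data we cannot trust.
        "n_records": len(x),
        "pct_missing": x.isna().mean() * 100,
        "pct_negative": (valid < 0).mean() * 100,   # impossible reaction counts
        "pct_outliers": ((valid < q1 - 3 * iqr) |
                         (valid > q3 + 3 * iqr)).mean() * 100,
    })

# Hypothetical example: per-ad reaction counts with a few bad records.
ads = pd.DataFrame({
    "ad_id": range(6),
    "reactions": [120, 95, np.nan, -4, 88, 50_000],  # missing, negative, extreme
})
print(describe_with_quality(ads, "reactions"))
```

The point is not the particular thresholds but that the quality measures travel with the summary statistics, so anyone reading the numbers also sees how much of the data they can trust.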
Garbage in, garbage out.
What are you doing to ensure data quality for big-data analytics? Share on the message board below.