Garbage in, Garbage out: What It Means for Big-Data Quality

As I mentioned in my previous post, data quality matters enormously to most big-data work, as tracing its impact through the D4, or "data to discovery to delivery to dollars," process shows. That post zeroed in on the quality implications during data gathering; here I'm focusing on the discovery phase.

The focus in discovery is finding something "truly interesting" in the "potentially interesting" data gathered in step one. The discovery step often starts with descriptive statistics, which provide a summary of what the data tell us about themselves. Let's consider a specific problem: understanding the reaction in social media to different methods of presenting advertisements. I'll consider three descriptions of possible interest and three distributions of data errors.


  1. Average behavior: What is the average, or typical, reaction to an advertisement (e.g., the middle of the distribution)?
  2. Variation: How variable is the reaction (e.g., how much spread is there, what constitutes "good" and "excellent" reactions, respectively, are "categories" of advertisements evident, etc.)?
  3. Extreme behavior/rare events: What is the reaction in the tails (e.g., what goes viral)?

Error patterns:
  1. Random: The errors are randomly distributed throughout the datasets.
  2. Skewed: The errors are skewed in some fashion. It may be that there are more errors in some important sub-sample, more errors on one side of the overall distribution, or more errors in some datasets than others, for example.
  3. Way out of line: Whether random or skewed, some data errors are extreme.
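These descriptions and error patterns interact in predictably different ways. A minimal simulation sketches the effect (the "reaction scores," error magnitudes, and seed are all invented for illustration, not drawn from any real study):

```python
import random
import statistics

random.seed(42)

# Hypothetical "reaction scores" to an advertisement (clean data).
clean = [random.gauss(50, 10) for _ in range(10_000)]

def summarize(label, data):
    print(f"{label:>8}: mean={statistics.mean(data):7.2f}  "
          f"stdev={statistics.stdev(data):7.2f}  max={max(data):9.2f}")

summarize("clean", clean)

# 1. Random errors: symmetric noise. The mean barely moves, but the
#    spread (and hence any "good"/"excellent" cutoffs) inflates.
noisy = [x + random.gauss(0, 20) for x in clean]
summarize("random", noisy)

# 2. Skewed errors: one-sided bias on a subsample. The mean shifts,
#    quietly corrupting the "typical reaction" estimate.
skewed = [x + 15 if i % 5 == 0 else x for i, x in enumerate(clean)]
summarize("skewed", skewed)

# 3. Extreme errors: a handful of wildly wrong values. The tails --
#    exactly where "what goes viral" lives -- become untrustworthy.
extreme = clean[:]
for i in range(5):
    extreme[i] = 10_000
summarize("extreme", extreme)
```

Each pattern corrupts a different part of the summary: random errors mostly hurt the variation estimate, skewed errors the average, and extreme errors the tails.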

The following table summarizes the likely implications for summary statistics. The table is color-coded -- green means little impact, yellow means some impact, and red means trouble. I've included a light green for potentially dangerous situations that good analysts should be able to accommodate.

Overall, the picture is pretty dismal. Good analysts need to be aware of and deal with these issues. At a minimum (and as noted in the previous post) they must know the provenance and quality of all data. Over the long term, they must strive to improve the quality of data, getting data to "trusted quality levels." And they must understand the patterns of errors that remain.

In the short term, they must take rigorous steps to identify and eliminate errors. When I was at Bell Labs, we called the process "rinse, wash, scrub." They must make data quality statistics part and parcel of their overall descriptive statistics. And, overall, they must be extremely cautious! At best these steps are time-consuming, expensive, and fraught. They are only suitable for the short term.
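One concrete way to make quality statistics part and parcel of the descriptive statistics is to report them side by side in a single field profile. A minimal sketch, with hypothetical values and a made-up validity range:

```python
import statistics

def profile(values, valid_range=(0, 100)):
    """Return descriptive stats plus simple quality stats for one field."""
    lo, hi = valid_range
    missing = sum(1 for v in values if v is None)
    present = [v for v in values if v is not None]
    out_of_range = sum(1 for v in present if not lo <= v <= hi)
    return {
        "n": len(values),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),
        # Quality statistics reported alongside the summary:
        "pct_missing": 100 * missing / len(values),
        "pct_out_of_range": 100 * out_of_range / len(values),
    }

# Hypothetical reaction scores with two missing values and one
# out-of-range entry (250 falls outside the assumed 0-100 scale).
reactions = [42, 55, None, 61, 250, 48, 47, None, 53, 50]
report = profile(reactions)
print(report)
```

Putting "pct_missing" and "pct_out_of_range" in the same report as the mean makes it harder to quote the summary statistics while ignoring the quality of the data behind them.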

Garbage in, garbage out.

What are you doing to ensure data quality for big-data analytics? Share on the message board below.

Dr. Thomas C. Redman, Founder, Navesink Consulting
Dr. Thomas C. Redman (the Data Doc) is an innovator, advisor, and teacher. He was first to extend quality principles to data and information, in the late 1980s. Since then he has crystallized a body of tools, techniques, roadmaps, and organizational insights that help organizations make order-of-magnitude improvements. More recently he has developed keen insights into the nature of data and formulated the first comprehensive approach to "putting data to work." Taken together, these enable organizations to treat data as assets of virtually unlimited potential. Tom has personally helped dozens of leaders and organizations better understand data and data quality and start their data programs. He is a sought-after lecturer and the author of dozens of papers and four books. The most recent, Data Driven: Profiting From Your Most Important Business Asset (Harvard Business Press, 2008), was a Library Journal best buy of 2008. Prior to forming Navesink in 1996, Dr. Redman conceived the Data Quality Lab at AT&T Bell Laboratories in 1987 and led it until 1995. He holds a PhD in statistics from Florida State University and holds two patents.


Re: Quality personnel
  • 8/13/2012 10:50:15 AM

Seth, and let's not forget that S&P $2 trillion error from last year! I don't know if we can link that to spreadsheets, but certainly there was a data quality issue there with widespread ramifications.

Re: Quality personnel
  • 8/13/2012 9:57:59 AM

The potential errors are mind-boggling, especially from companies that should know better. (For example, a company that specializes in metrics apparently did not realize that Excel truncates numbers after a certain number of digits. Yes, this really happened.)
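The Excel behavior mentioned here stems from IEEE 754 double precision, which carries roughly 15 significant decimal digits, so longer numeric identifiers are silently rounded. A quick sketch shows the same limit in Python (the 19-digit account number is invented for illustration):

```python
# Spreadsheets (and most languages) store numbers as IEEE 754 doubles,
# which carry roughly 15-17 significant decimal digits. A long numeric
# identifier loses its trailing digits when stored that way.
account_id = 1234567890123456789    # hypothetical 19-digit ID (exact as int)
as_double = float(account_id)       # what a numeric spreadsheet cell keeps
recovered = int(as_double)

print(recovered)                    # trailing digits have been rounded
print(recovered == account_id)      # the ID was silently changed
```

This is why long identifiers (account numbers, UPCs, tracking IDs) should be stored as text, not numbers.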

Re: Quality personnel
  • 8/13/2012 9:46:43 AM

I'm apprehensive about the number of companies that still rely on spreadsheets -- and the potential errors created by less than adequately trained employees who use them.

Re: Quality personnel
  • 8/12/2012 2:30:29 AM

@Beth, that's a good question. Big data could lead to big errors. The more data there is, the more potential for error. For example, audits of financial institutions can involve tens of millions of spreadsheets, any one of which could have an error. In October 2011, the German government announced it was €55bn richer after an accountancy error undervalued assets at the state-owned mortgage lender Hypo Real Estate. It was due to a sum being entered twice.

Richard Cuthbert, CEO of UK outsourcing specialist Mouchel, stepped down after a spreadsheet-based accounting error reduced Mouchel's full-year profits by more than £8.5 million, to below £6 million. Oops!

With these spreadsheets connected to various databases, a simple error can spread like wildfire.

Quality personnel
  • 8/10/2012 10:13:06 AM

Hi Tom. I'm wondering what you see internally these days in how companies handle data quality, and how that might change as big-data comes into play. Do you see distinct data-quality groups being set up? If not, do you think they will become a necessity going forward?