Big-Data Will Suffer From Low Quality

As I’ve opined on the importance of data quality to big-data over the last few posts, I fear I’ve come across as a bit of a curmudgeon, cynical about the prospects for big-data. Nothing could be further from the truth.

I’m not just a supporter, but an ardent fan of big-data. I believe big-data is crucial to solving some of the most important problems of our day, from improving healthcare and making it more affordable, to making the world safer and advancing individual liberties, to growing the economy and protecting the environment.

It's also true that I think all of these things are going to prove far more difficult than we can imagine. Improving data quality is not the only difficulty, but it's up there.

I find two examples especially instructive: the search for the Higgs boson, or Higgs particle, and collateralized debt obligations. The discovery of the Higgs boson is a crowning achievement. Collateralized debt obligations, not so much.

One can't help but be impressed by the rigor scientists follow as they create and analyze data. They design the experiments carefully, specify definitions of key terms, carefully define and manage their data collection processes, calibrate their instruments, and scrutinize the data for inadvertent errors. They analyze their data from numerous perspectives. Many seek connections with other disciplines. “Control” is the watchword throughout.

In the case of the Higgs boson, physicists waited until they could rule out the chance of a spurious result with confidence (less than one chance in three million). The thoroughness has paid off: Just yesterday, as widely reported, the European Organization for Nuclear Research, or CERN as it's commonly known, announced that its official findings have received peer-review approval. But even now they aren’t claiming they’ve found it, only that they’ve found a “Higgs-like” particle.
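For readers curious where a figure like "one chance in three million" comes from, here is a rough sketch. It assumes the threshold in question is the "five sigma" convention that particle physicists commonly use for a discovery claim, and that the chance is computed as a one-sided tail of a normal distribution; neither detail is spelled out above, and the function name is just illustrative.

```python
import math


def one_sided_tail_probability(sigma: float) -> float:
    """Chance that a standard normal variable exceeds `sigma` standard deviations."""
    # erfc is the complementary error function; half of erfc(x / sqrt(2))
    # is the upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(sigma / math.sqrt(2))


# Assumption: the discovery threshold is five standard deviations.
p = one_sided_tail_probability(5.0)
odds = 1 / p
print(f"p = {p:.3e}, or about 1 chance in {odds:,.0f}")
```

Under those assumptions the result is on the order of one chance in 3.5 million, consistent with the "less than one chance in three million" quoted above.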

The Higgs boson example stands in marked contrast to the rush to package and sell collateralized debt obligations. If financial institutions pride themselves on one thing, it's their ability to price risk. And collateralized debt obligations held enormous promise to slice and dice mortgages, packaging risk to suit individual investors. But, as everyone knows, the underlying data proved bad. And all of us are still suffering from the fallout.

In this new age of big-data, the old truism surely persists: Massive quantities of garbage in, massive quantities of garbage out.


Thomas Redman,

Dr. Thomas C. Redman, Founder, Navesink Consulting
Dr. Thomas C. Redman (the Data Doc) is an innovator, advisor, and teacher. He was first to extend quality principles to data and information, in the late 80s. Since then he has crystallized a body of tools, techniques, roadmaps, and organizational insights that help organizations make order-of-magnitude improvements. More recently he has developed keen insights into the nature of data and formulated the first comprehensive approach to "putting data to work." Taken together, these enable organizations to treat data as assets of virtually unlimited potential. Tom has personally helped dozens of leaders and organizations better understand data and data quality and start their data programs. He is a sought-after lecturer and the author of dozens of papers and four books. The most recent, Data Driven: Profiting From Your Most Important Business Asset (Harvard Business Press, 2008), was a Library Journal best buy of 2008. Prior to forming Navesink in 1996, Dr. Redman conceived the Data Quality Lab at AT&T Bell Laboratories in 1987 and led it until 1995. He holds a PhD in statistics from Florida State University and holds two patents.

Re: Good points
  • 9/17/2012 4:46:13 PM

@ Lyndon, yes, yes, double check the work always. 

Re: Good points
  • 9/17/2012 9:46:52 AM

Yes, agreed. It is important in making sense of the data as well as reports. A successful solution means better insights. This further builds trust in the organization, making the data more accurate and more beneficial to the decision-making process.

Re: Good points
  • 9/16/2012 9:23:58 PM


Seth writes

When it comes to the quantity of data, there are only so many human eyes to look at it and only so much one person can digest. This is why automated data cleansing and analysis becomes so much more important to do the dirty and tedious work when possible.


I'd think, however, that what's critical are the criteria the automated process (software, presumably?) uses to perform the analysis and cleansing. I'd assume that those criteria are established and coded by humans.

I've seen numerous instances of attempts to "clean up" data one way or the other through mods to a software program, but oops, we forgot to account for...

...sometimes resulting in a big mess or even a disaster.

Moral:  Trust in the robots, but verify...
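To make "trust, but verify" concrete, here is a minimal sketch of an automated cleansing step that checks its own work. Everything in it is hypothetical: the `clean_ages` function, the rule that ages outside 0 to 120 are treated as missing, and the 5 percent alarm threshold are illustrative assumptions, not anyone's actual process.

```python
from dataclasses import dataclass


@dataclass
class CleansingReport:
    rows_in: int
    rows_changed: int


def clean_ages(ages, max_changed_fraction=0.05):
    """Hypothetical rule: ages outside 0..120 are treated as missing (None)."""
    cleaned = [a if a is not None and 0 <= a <= 120 else None for a in ages]

    # Count how many values the rule actually altered.
    changed = sum(1 for before, after in zip(ages, cleaned) if before != after)
    report = CleansingReport(rows_in=len(ages), rows_changed=changed)

    # Verify: if the rule rewrites a large share of the data, the rule itself
    # is suspect (for example, ages recorded in months rather than years).
    if ages and changed / len(ages) > max_changed_fraction:
        raise ValueError(
            f"Cleansing rule changed {changed} of {len(ages)} rows; "
            "review the rule before trusting the output."
        )
    return cleaned, report


cleaned, report = clean_ages([30, 45, -1, 52] + [40] * 96)
print(report)
```

The point of the check is that the software stops and asks for a human look when its own rule behaves unexpectedly, rather than quietly producing a mess.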


Re: Good points
  • 9/15/2012 4:18:50 PM

When it comes to the quantity of data, there are only so many human eyes to look at it and only so much one person can digest. This is why automated data cleansing and analysis becomes so much more important to do the dirty and tedious work when possible.

Good points
  • 9/14/2012 11:12:24 AM

There is nothing cynical about suggesting the best strategy is to be thorough, accurate and confident in the data.