Mo' Data Blues

Recent news stories about big data have pumped up the idea that gargantuan datasets guarantee value. In fact, they don't.

Prospective software buyers are often concerned about dealing with large quantities of data. I've heard a lot of questions along the lines of "We have a 12-oodlebyte database; can the software handle that?" No matter how much data you have, there is software to handle it, but if this is the kind of question you're asking, you're doing it wrong.

Very large datasets always demand more resources than small ones. If you collect lots and lots of data, you will certainly need more hardware, pricier software licenses, and additional labor to manage it, let alone analyze it. Is the extra expense worth it? Often, it is not. The problem is that it's easy to get so wound up over the size of the data that you lose perspective.

One area where I have seen this come up time and time again is social media analytics. Sentiment analysis is relatively new and very popular; everyone would like to know what people are saying about brands online. So, many service providers are collecting tweets and other social media mentions, assessing them for sentiment (positive, neutral, or negative), and providing a summary for the client. Many use automated software to analyze every single mention they can get their virtual hands on -- sometimes millions of mentions to produce a single summary.

What's wrong with that? Think about the goal first, and then take a step back. The end goal is, say, a pie chart showing how many mentions are positive, how many are neutral, and how many are negative. Sentiment analysis, at its best, is not very precise, so the numbers cannot reasonably be expected to offer anything beyond a rough approximation. Another weakness of this process: the automated tools that select the relevant mentions are less than perfect. You could get your pie chart from a sample of a few hundred cases, and the results would not be significantly different from what you'd get analyzing millions of mentions. If you insist on using much more data than necessary to meet the goal, you're making the job harder than it needs to be, and you're pouring money down the drain.
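To see why a few hundred cases are enough, consider the standard margin-of-error formula for an estimated proportion. This little sketch (my illustration, not from the article) uses the worst-case proportion p = 0.5 and a 95% confidence level:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% confidence margin of error for an estimated proportion p
    from a simple random sample of size n (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst-case margins for a few sample sizes:
for n in (100, 400, 1000, 1_000_000):
    print(f"n = {n:>9}: +/- {margin_of_error(n):.1%}")
```

A sample of 400 already pins the estimate to within about five percentage points, and a million mentions only shrinks that to a tenth of a point. When the sentiment classifier itself is wrong several percent of the time, that extra precision buys nothing.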

Sometimes you really need big data, and sometimes you really don't. When you need to understand the behavior of many people as individuals -- perhaps as a direct marketer preparing personalized product recommendations, or a campaign manager using a similar approach in politics -- you need details about each person in a group of millions, and that adds up. If you are interested in summarizing the behavior of groups of people or things, a small dataset or sample is enough. Small data has advantages. Since the quantity isn't overwhelming, you can do a much better job of finding and correcting data quality problems early. And it is easier and cheaper to analyze small datasets.

Using small data when it makes sense will help you get the job done and still have some money left for the next project. Do you agree?

Meta S. Brown, Business Analytics Consultant

Meta S. Brown is a consultant, speaker, and writer who promotes the use of business analytics. A hands-on analyst who has tackled projects with up to $900 million at stake, she is a recognized expert in cutting-edge business analytics. She has conducted more than 4,000 hours of presentations about business analytics, and written guides on neural networks, quality improvement, statistical process control, and many other statistical methods. Meta's seminars have attracted thousands of attendees from across the US and Canada, from novices to professors.

Tell Me a Story: Why Data Analysts Must Be Storytellers, Too

Data alone won't make an analyst's work memorable or actionable in the eyes of a business executive. A story puts it into perspective.

It's the Data, Stupid

When it comes to acquiring the data that will feed your analytics initiative, "free" isn't always the best approach.

Re: What is the goal?
  • 9/6/2013 1:29:22 AM

Cats? Someone say cats? Try these ones:

Re: What is the goal?
  • 9/5/2013 5:04:39 PM

@ SaneIT, I'm sure it is mostly cat videos. I have a friend whose cat has his own Facebook page. And that is not uncommon.

So I guess the question is: what is the source of most of this data? Is it financial institutions or Yelp? What is really relevant? Are all these cat videos driving up storage costs? (And yes, I watch cat videos.)


Re: What is the goal?
  • 9/5/2013 7:17:44 AM

"5247GB of data for every person in the world."

That's a sobering thought, especially when you realize someone has to manage that data. As the data sets grow, it's just going to be more important that the quality of data rises; otherwise you'll never be able to make sense of what data you do have.

Re: What is the goal?
  • 9/4/2013 8:22:46 PM

I feel that many people think, "Why have 95% certainty when I can have 100%?" Of course, especially with sentiment analysis, even with all of the data, interpretation will never be 100%.

There have been times when I've had a small number of subjects to analyze and including everyone wasn't a problem, but when data sets become massive, one has to pick and choose.

@ SaneIT. Data hoarding is a good way to describe it. IDC projects that by 2020, there will be approximately 5247GB of data for every person in the world. Now hoard that!

Re: What is the goal?
  • 9/4/2013 7:28:48 AM

I agree that knowing what you're looking for is more important than how much data you have. I see a big database with good people pulling out good data as a smooth-running warehouse. They know where each part is, they know how to get to it in the most efficient way, and they know how the pieces go together, so they know right away if something looks off. I see the big database without thought as an episode of the show Hoarders. There are just big piles of stuff everywhere, and even though they aren't quite sure what they have or where they have it, they can't get rid of anything because they might want it later.

Re: Sentiment Analysis
  • 9/3/2013 8:01:27 PM

If the goal is to summarize the sentiment expressed across a large group of messages, then I agree that using a sample and manually assessing sentiment in each is often a good way to go. There are limitations to that approach, though.

If you have many such groups of messages to evaluate, then time is a limiting factor. Sentiment categorization often uses a complex set of rules, and many organizations have had bad experiences using junior staff or outside vendors to perform categorization, due to lack of experience and high turnover of these analysts. To be fair, lack of training is also an issue. Not every organization is willing to invest in training analysts, either. So, for speed and consistency, it may be more effective to use automated techniques.

Sentiment Analysis
  • 9/3/2013 7:11:06 PM


In sentiment analysis, I think you've chosen an especially good example of the weakness of large data sets. Sometimes, maybe even often, reading the full Tweet leads a human to a different sentiment conclusion than the software finds by just parsing words.

Maybe it would be better to randomly select a couple hundred Tweets and just read them all?  Or select a thousand and have a (paid) intern read them all?



Re: When big is small, or vice versa
  • 9/3/2013 10:32:23 AM

Gil always has interesting insight to share. Thanks!


Re: When big is small, or vice versa
  • 9/3/2013 9:58:20 AM

Speaking of buzzwords, readers might enjoy this Forbes post by Gil Press:

Data Science: What's The Half-Life Of A Buzzword? - Forbes

Gil discusses the proliferation of new university programs in data science, and presents differing views about them.

Re: When big is small, or vice versa
  • 9/3/2013 9:49:08 AM

So maybe soon we'll be back to calling it just plain old "data"...  until some new buzzword strikes everybody's fancy.

