Recent news stories about big data have pumped up the idea that gargantuan datasets guarantee value. In fact, they don't.
Prospective software buyers are often concerned about dealing with large quantities of data. I've heard a lot of questions along the lines of "We have a 12-oodlebyte database; can the software handle that?" No matter how much data you have, there is software to handle it, but if this is the kind of question you're asking, you're doing it wrong.
Very large datasets always demand more resources than small ones. If you collect lots and lots of data, you will certainly need more hardware, pricier software licenses, and additional labor to manage it, let alone analyze it. Is the extra expense worth it? Often, it is not. The problem is that it's easy to get so wound up over the size of the data that you lose perspective.
One area where I have seen this come up time and time again is social media analytics. Sentiment analysis is relatively new and very popular; everyone would like to know what people are saying about brands online. So, many service providers collect tweets and other social media mentions, assess them for sentiment (positive, neutral, or negative), and provide a summary for the client. Many use automated software to analyze every single mention they can get their virtual hands on -- sometimes millions of mentions -- to produce a single summary.
What's wrong with that? Think about the goal first, and then take a step back. The end goal is, say, a pie chart showing how many mentions are positive, how many are neutral, and how many are negative. Sentiment analysis, at its best, is not very precise, so the numbers cannot reasonably be expected to offer anything beyond a rough approximation. Another weakness of this process: The automated tools that select the relevant mentions are less than perfect. You could get your pie chart from a sample of a few hundred cases, and the results would not differ significantly from what you'd get by analyzing millions of mentions. If you insist on using much more data than necessary to meet the goal, you're making the job harder than it needs to be, and you're pouring money down the drain.
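A quick back-of-the-envelope calculation shows why a few hundred cases can be enough. The sketch below (the 30% positive rate is a hypothetical figure, not from any real dataset) computes the standard 95% margin of error for an estimated proportion at two sample sizes:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """95% margin of error for an estimated proportion p from a simple
    random sample of size n (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical example: suppose roughly 30% of mentions are positive.
p = 0.30
for n in (400, 1_000_000):
    print(f"n={n:>9,}: {p:.0%} +/- {margin_of_error(p, n):.1%}")
```

At n = 400 the sampling error is already only about +/-4.5 percentage points, which is smaller than the error the sentiment classifier itself typically introduces; going to a million mentions shrinks it to about +/-0.1 points, a precision the rest of the process can't support.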
Sometimes you really need big data, and sometimes you really don't. When you need to understand the behavior of many people as individuals -- perhaps as a direct marketer preparing personalized product recommendations, or a campaign manager using a similar approach in politics -- you need details about each person in a group of millions, and that adds up. If you are interested only in summarizing the behavior of groups of people or things, a small dataset or sample is enough. Small data has advantages. Since the quantity isn't overwhelming, you can do a much better job of finding and correcting data quality problems early. And it is easier and cheaper to analyze small datasets.
Using small data when it makes sense will help you get the job done and still have some money left for the next project. Do you agree?