What to Do With Unstructured Data


With so much data in the world, it’s time to figure out how much of it is unstructured -- that is, data a human needs to look at in order to understand it fully -- and what to do about it.

IT research firm IDC has estimated that the world will hold 7.9 zettabytes of digital data by 2015, and I think the biggest chunk of it will come from social media generated on mobile platforms and driven from email. Intel estimates that at least 2.5 billion people will be online by 2015, generating more and more data and requiring more resources to store and process it, as reported in InformationWeek. Such outlooks have led evangelist analysts to gush over unstructured data's potential; Google's Avinash Kaushik, for example, publicly claimed to have “orgasms over big, unstructured data.”

How to structure unstructured data
Many of us are only getting started with unstructured data and are still trying to figure out how best to handle it all. In fact, we need to ask ourselves whether we should even bother working with it, as many previous attempts to add structure automatically have been disappointing, to say the least, and fail much of the time. After all, dealing with and automating processes around structured data is tough enough!

Here I've put together a few things you can do with or relative to unstructured data:

  • Distribute the data in the cloud -- just store more of it and hope you can see useful patterns in the data with advanced big-data analytics and predictive analytics platforms.
  • Develop more powerful analytics engines, most of which will be in the cloud, to analyze the data in real time.
  • Transform dark data/dark social and ultraviolet data into usable, structured information from which you can gain insights, as I discussed in my post Putting Analytics Fragmentation Into Perspective.
  • Merge as much data as you can into large data files, a lesson Team Obama learned in preparing for the recent 2012 election; merging several different databases and cleaning the data made developing predictions and gleaning insights easier.
  • Clean the data -- this assumes unstructured data is dirty, or not useful for analysis in its current state. You can purge duplicate information, ensure consistency in the naming of entities, and deal with empty and sparse datasets, for example. Consider checking out Salesforce Data.com's Social Key, which ties customer data records to social media accounts and the online content those accounts post. Perhaps Salesforce is on to something here; the cost of cleaning data might be shared, as the shareable part of the data a company cleans for its own use also goes back to the overall Data.com repository in Salesforce’s cloud.
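To make the cleaning step concrete, here is a minimal Python sketch of the three tasks named above: purging duplicates, making entity names consistent, and dropping records too sparse to use. It is only an illustration under stated assumptions. The field names (`name`, `city`, `twitter`), the suffix list, and the `min_filled` threshold are all hypothetical, and real-world entity matching usually needs far more than this.

```python
import re

def normalize_name(name):
    """Lowercase, replace punctuation with spaces, drop a few company suffixes."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    # Hypothetical suffix list; extend it for your own data.
    cleaned = re.sub(r"\b(inc|corp|co|ltd|llc)\b", " ", cleaned)
    return " ".join(cleaned.split())

def clean_records(records, min_filled=2):
    """Drop sparse records and merge duplicates that share a normalized name."""
    seen = {}
    for record in records:
        # Keep only fields that actually hold a value.
        filled = {k: v.strip() for k, v in record.items() if v and v.strip()}
        if len(filled) < min_filled or "name" not in filled:
            continue  # too sparse to be useful (assumed threshold)
        key = normalize_name(filled["name"])
        if key in seen:
            # Duplicate: fill gaps in the first copy, never overwrite it.
            for field, value in filled.items():
                seen[key].setdefault(field, value)
        else:
            seen[key] = dict(filled)
    return list(seen.values())

raw = [
    {"name": "Acme Corp.", "city": "Dallas", "twitter": ""},
    {"name": "ACME corp", "city": "", "twitter": "@acme"},
    {"name": "Globex, Inc.", "city": "Austin", "twitter": None},
    {"name": "", "city": "Boston", "twitter": ""},
    {"name": "Initech", "city": "", "twitter": ""},
]
cleaned = clean_records(raw)
for row in cleaned:
    print(row)
```

In this sample, the two Acme rows collapse into one record that keeps the city from the first and the Twitter handle from the second, while the nameless row and the name-only row are discarded as too sparse.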

Working with unstructured data won't be easy -- but it will be necessary. What advice do you have for working with unstructured data? Share below.

Marshall Sponder, Web Analytics and SEO/SEM Specialist

Marshall Sponder is a Web analytics and SEO/SEM specialist with expertise in market research, social media, networking, and public relations. As both an in-house team leader and consultant, he has used sophisticated analysis to optimize the social media marketing efforts of companies and brands including IBM, Monster, Porter Novelli, WCG, Gillette, Pfizer, Warner Brothers, Laughing Cow, The New York Times, and Havana Central. Sponder is a board member emeritus at the Web Analytics Association, a member of the Search Engine Marketing Professionals Organization (SEMPO), and a member of the Chartered Institute of Public Relations Social Media Measurement Study Group (CIPR).

Putting Analytics Fragmentation Into Perspective

When the data we don't know is as important as the data we do, our analytics platforms are all but guaranteed to fail us.

3 Musts You'd Find in an Ideal Analytics Platform

Segmentation, multichannel integration, and intelligent dashboard reporting are vital capabilities, yet many business analytics solutions fall short.


Re: cleaning with integrity
  • 10/30/2014 1:10:11 PM

Data searched from several different kinds of unstructured sources (mostly text based at the moment) can be restructured (flattened) quickly and inexpensively via this tool within IRI NextForm.

Re: Graphic
  • 6/1/2014 10:28:55 AM

Hi BritInBigD, 

As far as I can tell, that is taken from a paper from The Data Warehousing Institute. See the corresponding article for more: BI Search and Text Analytics

Graphic
  • 1/21/2013 8:45:31 AM

Hi Marshall,

I was curious to know the source of the graphic that accompanies your post (and specifically the indicative growth rates shown for the different data categories). Is that data from IDC? 

Re: cleaning with integrity
  • 12/31/2012 6:56:50 PM

@Hospice I agree, but I don't know of any standard protocols, because there is so much variability in unstructured data depending on industry, etc. It's still a challenge to get standards.

Re: Yikes - some misconceptions here that need to be addressed
  • 12/31/2012 8:50:05 AM

webmetricsguru, one big challenge I see is the analysis of sentiment-oriented data, even though the area has grown a great deal in the course of this year -- new apps and all. This kind of data, at least for now, may still need closer review, because it's difficult to automate around one given pattern without locking out new sentiments that fall outside the original set. However, as intelligent systems learn this data, it will cut down what we have to look at, even in sentiment analytics.

Re: Yikes - some misconceptions here that need to be addressed
  • 12/30/2012 11:35:11 PM

If I understand you correctly, that's the bane of the PR / Marcom industry - that you can actually look at a bunch of verbatims (maybe that's OK for 10-30), but what happens when you have thousands or more?

Personally, I think a discussion of just what cleaning data is and how best to do it would be good for AllAnalytics.com. I'd like to see what we come up with, and I bet a lot of others would too.

Re: Yikes - some misconceptions here that need to be addressed
  • 12/30/2012 11:21:17 PM

"If we can find the essential information or pattern, we might not need to look at most of it"

I see. I suppose that those hand-written patterns may be domain specific and difficult to generalize. I agree that extracting the most useful patterns might be enough in most cases, as it is difficult to think of all possible patterns. One drawback of such a model is that human-written patterns are often low-recall, even if precision is high.

Re: Step 1
  • 12/30/2012 11:05:50 PM

It is true that some of the points you mentioned are debatable, like "Distribute the data in the cloud". But they are valid points to take into account when dealing with unstructured data. As to the question of how to clean unstructured data, I think that it depends on the shape and the model that has been defined.

Re: Yikes - some misconceptions here that need to be addressed
  • 12/30/2012 11:05:17 PM

What I meant is that currently, people usually end up needing to look at the data to understand it (because it is unstructured information), and attempts to use software to understand it, in my opinion, won't work, at least not today. What you can do, I think, and maybe our friends here can confirm or argue this, is cut down on what we have to look at. If we can find the essential information or pattern, we might not need to look at most of it - and hopefully the software created can help surface that information, and maybe that's the best we can hope for (big data hype or not). At any rate, this is an interesting discussion and I don't have all the answers - but I am wondering just what they are.

Re: Yikes - some misconceptions here that need to be addressed
  • 12/30/2012 10:52:47 PM

@marshall,

"I define it as something a human needs to look at to fully process"

I still don't get it. Do you mean that human intervention is needed to figure out whether the data is unstructured or not? Won't it be time consuming, and practically impossible, for a human to go through all the instances of the data, given its size? Maybe that is not what you mean?
