With so much data in the world, it's time to figure out how much of it is unstructured -- data that a human needs to look at in order to understand it -- and what to do about it.
IT research firm IDC has estimated there will be 7.9 zettabytes of digital data in the world by 2015, and I think the biggest chunk of it will come from social media generated on mobile platforms and driven by email. Intel estimates that at least 2.5 billion people will be online by 2015, generating ever more data and requiring more resources to store and process it, as reported in InformationWeek. Such outlooks have led evangelist analysts to gush over unstructured data's potential; Google's Avinash Kaushik, for example, publicly claimed to have “orgasms over big, unstructured data.”
How to structure unstructured data
Many of us are only getting started with unstructured data, looking for a way in and trying to figure out how best to handle it all. Actually, we should first ask whether it's even worth trying: many previous attempts to add structure automatically have been disappointing, to say the least, and fail much of the time. After all, dealing with and automating processes around structured data is tough enough!
Here I've put together a few things you can do with -- or about -- unstructured data:
Distribute the data in the cloud -- just store more of it and hope you can spot useful patterns with advanced big-data analytics and predictive analytics platforms.
Develop more powerful analytics engines to analyze the data, most of which will be in the cloud, in real time.
Merge as much data as you can into large data files, a lesson learned by Team Obama in preparing for the 2012 election; merging several different databases and cleaning the data made developing predictions and gleaning insights easier.
Clean the data -- this assumes unstructured data is dirty, or not useful for analysis in its current state. You can purge duplicate information, ensure consistency in the naming of entities, and remove empty and sparse datasets, for example. Consider checking out Salesforce Data.com's Social Key, which ties customer data records to social media accounts and online content from those accounts. Perhaps Salesforce is on to something here: the cost of cleaning data might be shared, since the data a company cleans for its own use also goes back (the part that can be shared) to the overall Data.com repository in Salesforce's cloud.
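To make the cleaning steps above concrete, here is a minimal sketch in plain Python. It assumes records arrive as simple dictionaries, and the function name, fields, and the `name_map` of canonical entity names are all invented for illustration; real pipelines would use a data-quality tool or a library like pandas.

```python
def clean(records, name_map, sparsity_threshold=0.5):
    """Dedupe records, normalize entity names, and drop sparse fields."""
    # 1. Purge exact duplicates while preserving order.
    seen, deduped = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(rec)

    # 2. Ensure consistent naming of entities (e.g. "IBM Corp." -> "IBM").
    for rec in deduped:
        if "company" in rec:
            rec["company"] = name_map.get(rec["company"], rec["company"])

    # 3. Drop sparse fields: ones populated in fewer than half the records.
    counts = {}
    for rec in deduped:
        for field, value in rec.items():
            if value not in (None, ""):
                counts[field] = counts.get(field, 0) + 1
    keep = {f for f, c in counts.items() if c / len(deduped) >= sparsity_threshold}
    return [{f: v for f, v in rec.items() if f in keep} for rec in deduped]
```

Each step maps to one of the cleaning tasks described above; the hard part in practice is building the entity-name mapping, which is exactly the kind of shared work a service like Social Key is meant to pool.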
Working with unstructured data won't be easy -- but it will be necessary.
What advice do you have for working with unstructured data? Share below.
webmetricsguru -- One big challenge I see is analytics of sentiment-oriented data, even though the area has grown a great deal over the course of this year -- new apps and all. This kind of data, at least for now, may still need closer review, because it's difficult to automate around one given pattern without locking out new sentiments that fall outside that original set. However, as intelligent systems learn from this data, they will cut down what we have to look at, even in sentiment analytics.
If I understand you correctly, that's the bane of the PR/Marcom industry -- you can actually look at a bunch of verbatims (maybe that's OK for 10-30), but what happens when you have thousands or more?
I think a discussion on just what cleaning data is and how best to do it would be good for AllAnalytics.com, personally. I'd like to see what we come up with, and I bet a lot of others would too.
"If we can find the essential information or pattern, we might not need to look at most of it"
I see. I suppose those hand-written patterns can be domain-specific and will be difficult to generalize. I agree that extracting the most useful patterns might be enough in most cases, since it is difficult to think of all possible patterns. One drawback of such models is that human-written patterns are often low-recall, even if precision is high.
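The low-recall point can be illustrated with a toy rule-based sentiment classifier. The patterns and review texts below are invented for the example: each rule is precise when it fires, but any expression outside the rule set goes unlabeled, which is where the recall is lost.

```python
# Hand-written sentiment patterns: precise when matched, silent otherwise.
NEGATIVE_PATTERNS = ["terrible", "waste of money", "never again"]

def rule_sentiment(text):
    lowered = text.lower()
    if any(p in lowered for p in NEGATIVE_PATTERNS):
        return "negative"
    return "unknown"  # the rules make no claim outside their patterns

reviews = [
    "Terrible battery life.",          # matched by a pattern
    "Never again will I buy this.",    # matched by a pattern
    "Meh, I expected more for $200.",  # negative, but no pattern fires
]
labels = [rule_sentiment(r) for r in reviews]
```

All three reviews are negative, but the rules only catch two: 100 percent precision on what they flag, 67 percent recall overall. Widening the patterns to raise recall is exactly what risks "locking out" the new sentiments mentioned above.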
It is true that some of the points you mentioned are debatable -- like "Distribute the data in the cloud." But they are valid points to take into account when dealing with unstructured data. As to the question of how to clean unstructured data: I think it depends on the shape of the data and the model that has been defined.
What I meant is that currently, people usually end up needing to look at the data to understand it (because it is unstructured information), and attempts to use software to understand it, in my opinion, won't work -- at least not today.
What you can do, I think, and maybe our friends here can confirm or argue this, is cut down on what we have to look at.
If we can find the essential information or pattern, we might not need to look at most of it - and hopefully the software created can help surface that information, and maybe that's the best we can hope for (big data hype or not).
At any rate, this is an interesting discussion and I don't have all the answers - but I am wondering just what they are.
"I define it as something a human needs to look at to fully process"
I still don't get it. Do you mean there is a need for human intervention to figure out whether the data is unstructured or not? Won't that be time-consuming and practically impossible, for a human to go through all the instances of the data, given its size? Or maybe that's not what you mean?
"So I guess, we need to find better search and organizing processes."
I think that is what the cleaning and storage processes are all about. Some data can fit into many categories depending on the search parameters. This may complicate the storage process, as the same data will be duplicated -- sometimes unnecessarily.
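One common way around that duplication is to store each record once and keep a separate category index pointing at it. The sketch below shows the idea; the record shapes and category names are invented for illustration, and a real system would use a database index rather than in-memory dictionaries.

```python
from collections import defaultdict

store = {}                # record id -> record, stored exactly once
index = defaultdict(set)  # category -> ids of records that fit it

def add(record_id, record, categories):
    """Store a record once and index it under every category it fits."""
    store[record_id] = record
    for cat in categories:
        index[cat].add(record_id)

def search(category):
    """Return all records indexed under a category, without copies."""
    return [store[rid] for rid in sorted(index[category])]

add(1, {"text": "Q3 revenue up 12%"}, ["finance", "news"])
add(2, {"text": "New CFO announced"}, ["finance", "people"])
```

A record that fits many categories costs only extra index entries, not extra copies of the data, so the storage process stays simple even when search parameters overlap.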