We see so many stories these days about big data and its vast, unstructured nature that it's easy to forget that just a little organization can reveal new ideas.
Simple tools for cleaning unformatted data can help remind us.
Google Refine, formerly Freebase Gridworks, is one such tool. You download Google Refine and run it on your desktop, though its interface is a standard browser. You load data into a project, then separate the text strings that recur throughout the dataset and place them in separate columns, rows, or any other tabular arrangement. Refine permits this editing across large volumes of data, as you can see in the video below. Google even encourages users to approach Refine differently from a spreadsheet.
The feature I like most is that Refine saves each edit, much as Apple’s Time Machine backup software saves snapshots. Users can step back and forth through that history to compare changes to the table. A user can also reuse a sequence of separation steps in another project by saving it as a JSON file, then importing it.
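To make the idea of a reusable, saved set of edits concrete, here is a minimal Python sketch. It is not Refine's actual JSON format or code. The operation names (`split_column`, `replace_value`), the recipe layout, and the sample row are all hypothetical, chosen only to show how steps recorded as JSON in one project could be replayed on another.

```python
import json


def split_column(rows, column, separator, new_columns):
    """Split one text column into several named columns (hypothetical operation)."""
    result = []
    for row in rows:
        parts = row[column].split(separator)
        new_row = {key: value for key, value in row.items() if key != column}
        for name, part in zip(new_columns, parts):
            new_row[name] = part.strip()
        result.append(new_row)
    return result


def replace_value(rows, column, old, new):
    """Replace an exact cell value in one column (hypothetical operation)."""
    return [
        {**row, column: new} if row.get(column) == old else dict(row)
        for row in rows
    ]


# Maps the "op" name stored in the recipe to the function that performs it.
OPERATIONS = {"split_column": split_column, "replace_value": replace_value}


def apply_recipe(rows, recipe_json):
    """Replay every step of a JSON recipe, in order, on a list of row dicts."""
    for step in json.loads(recipe_json):
        params = dict(step)
        operation = OPERATIONS[params.pop("op")]
        rows = operation(rows, **params)
    return rows


# A recipe saved from one project as JSON text...
recipe = json.dumps([
    {"op": "split_column", "column": "engine", "separator": ",",
     "new_columns": ["size", "layout"]},
    {"op": "replace_value", "column": "layout", "old": "V8", "new": "V-8"},
])

# ...and replayed on the rows of another project.
other_project = [{"model": "Corvette", "engine": "7.0 liter, V8"}]
cleaned = apply_recipe(other_project, recipe)
```

Because the recipe is plain JSON text, it can be stored in a file and shared, which is the property the paragraph above describes.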
That leads to another nice aspect. You don't need expert knowledge of a programming language to use Google Refine. A much more valuable skill is imagination. What matters most is knowing how to refine data based on your understanding of what needs to be separated or combined.
Let's consider, for example, a GM analyst who is matching data related to the Chevrolet Corvette. If he understands that a 7.0-liter V-8 engine and a 427-cubic-inch V-8 engine refer to essentially the same thing, he can group the data around that similarity. Google Refine offers several features that let the user focus on the purpose behind the dataset rather than on how to express the arrangement of the data in a specific technical language.
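The Corvette example can be sketched in a few lines of Python. This is only an illustration of the grouping idea, not how Refine does it. The function names, the regular expressions, and the sample labels are assumptions made for the sketch. The one hard fact it relies on is that a liter is roughly 61.02 cubic inches.

```python
import re
from collections import defaultdict

CUBIC_INCHES_PER_LITER = 61.0237  # 1 liter is about 61.02 cubic inches

# Hypothetical patterns for two common ways of writing engine displacement.
LITER_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*-?\s*(?:liters?|litres?|l)\b")
CUBIC_INCH_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*-?\s*(?:cubic\s*-?\s*inch(?:es)?|ci)\b")


def displacement_liters(label):
    """Return displacement in liters, rounded to one decimal, or None if not found."""
    text = label.lower()
    match = LITER_PATTERN.search(text)
    if match:
        return round(float(match.group(1)), 1)
    match = CUBIC_INCH_PATTERN.search(text)
    if match:
        return round(float(match.group(1)) / CUBIC_INCHES_PER_LITER, 1)
    return None


def group_by_displacement(labels):
    """Group free-text engine labels that describe the same displacement."""
    groups = defaultdict(list)
    for label in labels:
        groups[displacement_liters(label)].append(label)
    return dict(groups)


labels = ["7.0 liter V-8", "427 cubic inches V-8", "6.2L V-8", "350 cubic inch V-8"]
groups = group_by_displacement(labels)
```

Here the two labels from the example land in the same group because both normalize to 7.0 liters. The analyst's domain knowledge is what tells him that the conversion is the right way to compare them.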
Google has recently updated Refine with a new format. The improvements do not radically change the user interface, unlike the updates to Google Analytics and YouTube Analytics, which I discussed in a previous post, Google Spruces Up YouTube Analytics.
Additional tools for refining data include DataWrangler, created by the Visualization Group at Stanford University.
DataWrangler's developers still need to work on some features that aren't fully intuitive yet. For example, it initially appears that you can select only one of three example datasets, until you realize you can edit the text and paste in your own dataset.
Tools like Google Refine and DataWrangler will not eliminate the challenges of finding and arranging data. But they do make the data cleaner when it is imported. As a result, analysts can better envision how a table might be built, and ultimately how a dataset might help the business.
Do you have a data cleansing tool of choice? Share on the message board below.
Exactly right - Refine is a desktop solution viewed through a browser, so given Google's past with its products, usage can seem like it's online (the user downloads the program). Data Wrangler is online, provided via Stanford U.
I am seeing other products that operate with a private cloud aspect - it will be interesting to see how this niche develops. It reminds me of BMW and how it has tried to cover a number of niches within its product lines. We seem to be in the era of addressing a niche, with inspiration on and offline. :-)
You're right. Tools like Data Wrangler and Google Refine can empower. These tools are essentially ad hoc in purpose, so organizations need to communicate with care, ensuring that information discovered through their use is shared.
These tools definitely sound interesting. I know some get antsy about BI-as-a-service being readily available, but I've always believed that more information empowers. It sounds like these technologies are just the thing to place more BI power in the hands of all employees for the benefit of an entire organization.
The files created can be exported, so the resulting table is portable. With Google Refine, the application resides on the computer, even though you are using a browser for the interface, so the application itself is not portable.
Joe, when you suggest there be "one final arbiter 'handling' the data in a uniform way" are you thinking along the lines of creating a master file/single source of truth, as I talk about in this recent blog about master data management? Or did you have something less formalized in mind for this level of data/tool/user?
One article points out that up to 80 percent of data analysis effort is spent on cleaning the data. In light of this, its author proposes data marketplaces as a means of obtaining common data. That is something I agree with, as a way of reducing data cleaning effort as a whole and leaving it to some firms to specialize.
However, when push comes to shove and you really must collect the data yourself from a raw source, tools such as Google Refine come in handy. The positive thing is that the learning curve for Google tools is not too steep, enabling even companies with a limited budget for analytics experts to handle a good chunk of their analytics by themselves.