As I mentioned in my previous post, data quality is critically important to most big-data work, as a look at its impact through the D4, or "data to discovery to delivery to dollars," process shows. While that post zeroed in on the quality implications of data gathering, here I'm focusing on the discovery phase.
The focus in discovery is finding something "truly interesting" in the "potentially interesting" data gathered in step one. The discovery step often starts with descriptive statistics, which provide a summary of what the data tell us about themselves. Let's consider a specific problem: understanding the reaction in social media to different methods of presenting advertisements. I'll look at three descriptions of possible interest and three distributions of data errors.
Average behavior: What is the average, or typical, reaction to an advertisement (e.g., the middle of the distribution)?
Variation: How variable is the reaction (e.g., how much spread is there, what constitutes a "good" versus an "excellent" reaction, are "categories" of advertisements evident, etc.)?
Extreme behavior/rare events: What is the reaction in the tails (e.g., what goes viral)?
Random: The errors are randomly distributed throughout the datasets.
Skewed: The errors are skewed in some fashion. It may be that there are more errors in some important sub-sample, more errors on one side of the overall distribution, or more errors in some datasets than others, for example.
Way out of line: Whether random or skewed, some data errors are extreme.
The following table summarizes the likely implications for summary statistics. The table is color-coded -- green means little impact, yellow means some impact, and red means trouble. I've included a light green for potentially dangerous situations that good analysts should be able to accommodate.
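To make these patterns concrete, here's a toy simulation -- the score scale, error rates, and magnitudes are assumptions for illustration only -- showing how each error distribution can distort the descriptive statistics of a reaction score:

```python
import numpy as np

rng = np.random.default_rng(42)

# "True" reaction scores to an advertisement: roughly bell-shaped.
true_scores = rng.normal(loc=50, scale=10, size=10_000)

def summarize(label, scores):
    """Print the descriptive statistics the discovery step relies on."""
    p50, p99 = np.percentile(scores, [50, 99])
    print(f"{label:<16} mean={scores.mean():7.2f}  sd={scores.std():7.2f}  "
          f"median={p50:6.2f}  99th pct={p99:7.2f}")

summarize("clean data", true_scores)

# Random errors: symmetric noise added to 10% of records.
random_err = true_scores.copy()
idx = rng.choice(len(random_err), size=1_000, replace=False)
random_err[idx] += rng.normal(0, 15, size=1_000)
summarize("random errors", random_err)    # mean barely moves; spread inflates

# Skewed errors: the same 10% of records are all inflated upward.
skewed_err = true_scores.copy()
skewed_err[idx] += rng.exponential(scale=20, size=1_000)
summarize("skewed errors", skewed_err)    # mean and upper tail both shift up

# Way-out-of-line errors: a handful of extreme values (e.g., unit mistakes).
extreme_err = true_scores.copy()
extreme_err[rng.choice(len(extreme_err), size=20, replace=False)] = 5_000
summarize("extreme errors", extreme_err)  # mean and sd wrecked by 0.2% of rows
```

Even in this toy setup, random errors mostly inflate the spread, skewed errors bias the average and the upper tail together, and a handful of extreme errors can dominate every statistic.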
Overall, the picture is pretty dismal. Good analysts need to be aware of and deal with these issues. At a minimum (and as noted in the previous post) they must know the provenance and quality of all data. Over the long term, they must strive to improve the quality of data, getting data to "trusted quality levels." And they must understand the patterns of errors that remain.
In the short term, they must take rigorous steps to identify and eliminate errors. When I was at Bell Labs, we called the process "rinse, wash, scrub." They must make data quality statistics part and parcel of their overall descriptive statistics. And, overall, they must be extremely cautious! At best these steps are time-consuming, expensive, and fraught. They are only suitable for the short term.
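To give a flavor of what folding data quality statistics into the overall descriptive statistics might look like, here's a minimal sketch; the completeness and range checks below are hypothetical examples, not a prescribed rule set:

```python
import numpy as np
import pandas as pd

def describe_with_quality(df: pd.DataFrame, valid_ranges: dict) -> pd.DataFrame:
    """Return standard descriptive statistics plus simple quality metrics.

    valid_ranges maps column name -> (low, high) bounds; values outside the
    bounds are counted as suspect. Both rules are illustrative assumptions.
    """
    stats = df.describe().T
    stats["pct_missing"] = df.isna().mean() * 100
    out_of_range = {}
    for col, (low, high) in valid_ranges.items():
        out_of_range[col] = ((df[col] < low) | (df[col] > high)).mean() * 100
    stats["pct_out_of_range"] = pd.Series(out_of_range)
    return stats

# Hypothetical social-media reaction data with a deliberate bad value.
df = pd.DataFrame({"reaction_score": [42.0, 55.0, np.nan, 61.0, 5000.0]})
print(describe_with_quality(df, {"reaction_score": (0, 100)}))
```

Reporting the quality columns right next to the mean and standard deviation keeps the caveats attached to the numbers they qualify.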
Garbage in, garbage out.
What are you doing to ensure data quality for big-data analytics? Share on the message board below.
Seth, and let's not forget that S&P $2 trillion error from last year! I don't know if we can link that to spreadsheets, but certainly there was a data quality issue there with widespread ramifications.
The potential errors are mind-boggling, especially from companies that should know better. (I.e., a company that specializes in metrics apparently did not realize that Excel truncates numbers after a certain number of digits. Yes, this really happened.)
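For what it's worth, the mechanism is easy to reproduce. Spreadsheet cells store numbers as 64-bit floats, which carry only about 15 significant decimal digits; a quick sketch in Python (which uses the same float format) shows the loss:

```python
# 64-bit floats hold roughly 15-16 significant decimal digits, so a
# 17-digit account number or dollar figure cannot round-trip exactly.
account_number = 12345678901234567      # exact as a Python integer
as_float = float(account_number)        # what a spreadsheet cell stores
print(as_float)                         # 1.2345678901234568e+16
print(int(as_float) == account_number)  # False: the last digit is lost
```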
@Beth, that's a good question. Big data could lead to big errors; the more data there is, the more potential. For example, audits of financial institutions can involve tens of millions of spreadsheets, any one of which could have an error. In October 2011, the German government announced it was €55bn richer after an accountancy error undervalued assets at the state-owned mortgage lender Hypo Real Estate. It was due to a sum being entered twice.
Richard Cuthbert, CEO of UK outsourcing specialist Mouchel, stepped down after a spreadsheet-based accounting error cut Mouchel's full-year profits by more than £8.5 million, to below £6 million. Oops!
With these spreadsheets being connected to various databases, a simple error can spread like wildfire.
Hi Tom. I'm wondering what you see inside companies these days in terms of how they handle data quality, and how that might change as big data comes into play. Do you see distinct data quality groups being set up? If not, do you think they'll be a necessity going forward?
2015 Visual Analytics Interactive Roadshow: SAS® experts are coming to a city near you in a series of live, interactive workshops focused on SAS Visual Analytics, including how to prepare your data for VA, the integration of VA with Office Analytics, and a Visual Statistics demo.
January 22: King of Prussia, PA
February 24: Austin, TX
March 26: Redwood City, CA
April 22: NYC, NY (1st of 2 stops)
May 13: Seattle, WA
June 18: Minneapolis, MN
July 21: Rockville, MD
August 18: Chicago, IL
September 24: Irvine, CA
October 9: Cary, NC (during SAS Championship)
October 21: NYC, NY (2nd of 2 stops)
November 17: Orlando, FL
December 8: Atlanta, GA
LEADERS FROM THE BUSINESS AND IT COMMUNITIES DUEL OVER CRITICAL TECHNOLOGY ISSUES
The Current Discussion
Visual Analytics: Who Carries the Onus? The Issue: Data visualization is an up-and-coming technology for businesses that want to deliver analytical results visually, enabling analysts to spot patterns more easily and business users to absorb the insight at a glance and better understand what questions to ask of the data. But does it make more sense to train everybody to handle the visualization mandate or to bring on visualization expertise? Our experts are divided on the question. The Speakers: Hyoun Park, Principal Analyst, Nucleus Research; Jonathan Schwabish, US Economist & Data Visualizer
The hospitality industry gathers massive amounts of customer data, and mining that data effectively can yield tremendous results in terms of improved CRM, better-targeted marketing spend, and more efficient back-end processes. Roger Ares, vice president of analytics at Hyatt Corp., discusses the ways he and his staff use big data.
Charged with keeping track of travel assets, including employees, iJET International relies on data management best practices and advanced analytics to keep its clients in the know on current and potential world events affecting travel, Rich Murnane, Director of Enterprise Data Operations & Data Architect, told All Analytics in an interview from the 2014 SAS Global Forum Executive Conference.
Jason Dorsey, chief strategy officer for the Center for Generational Kinetics and keynote speaker at last month's SAS Global Forum 2014, describes how Gen Y professionals are enhancing the makeup of multigenerational analytics organizations.
From analytics talent development to the power of visual analytics, All Analytics found a variety of common themes circulating throughout the exhibition floor and session discussions at the 2014 SAS Global Forum and SAS Global Forum Executive Conference events held last month in Washington, DC.
Talking with All Analytics live from the 2014 SAS Global Forum Executive Conference, Eric Helmer, senior manager of campaign design and execution for T-Mobile, discussed the importance of customer data -- starting internally -- in devising the mobile operator's marketing plans.
The big-data analytics market can be a confusing place. Among the vendors vying for your dollars are traditional database management providers, Hadoop startup services, and IT giants. In this video, All Analytics editors Beth Schultz and Michael Steinhart sit down in a Google+ Hangout on Air with Doug Henschen, executive editor of InformationWeek. Henschen discusses use cases for big-data analytics, purchase considerations, and his recent roundup of the top 16 big-data analytics platforms.
At the National Retail Federation BIG Show last month, All Analytics executive editor Michael Steinhart noted a host of solutions for tracking and analyzing customer activity in retail stores. From Bluetooth beacons to RFID tags to NFC connections to video analytics, retailers must find the right combination of tools to help optimize the shopper experience, streamline operations, and boost revenues.
The days when historical shipment trends and gut feelings were enough to forecast retail demand accurately are long over. SAS chief industry consultant Charles Chase outlines the benefits of pulling real-time sales information from point-of-sale and product scanner systems, then flowing that data into dynamic forecasting tools from SAS.
With today's advanced visual analytics tools, you can stream data into memory for real-time processing, provide users the ability to explore and manipulate the data, and bring your data to life for the business.
Dynamic data visualizations let analysts and business users interact with the data, changing variables or drilling down into data points, and see results in a flash. Advance your use of data visualization with tools that support features like auto-charting, explanatory pop-ups, and mobile sharing.
No doubt your enterprise is amassing loads of data for fact-based decision-making. Hand in hand with all that data come big computational requirements. Can traditional IT infrastructure handle the increasing volume and complexity of your analytical work? Probably not, which is why you need a backend rethink. Big data calls for a high-performance analytics infrastructure, as Fern Halper, a partner at the IT consulting and research firm Hurwitz & Associates, discusses here.
Redbox's bright-red DVD kiosks are all but ubiquitous these days, located in more than 28,000 spots across the country. Jayson Tipp, Redbox VP of Analytics and CRM, provides an insider's look at how the company has accomplished its phenomenal nine-year growth.
InterContinental Hotels Group (IHG), a seven-brand global hotelier, has woven analytics into the fabric of its operations. David Schmitt, director of performance strategy and planning, shares IHG's analytics story and his lessons learned.