As I mentioned in my previous post, data quality matters a great deal to most big-data work, as tracing its impact through the D4, or "data to discovery to deliver to dollars," process shows. In that post I zeroed in on the quality implications during data gathering; here I'm focusing on the discovery phase.
The focus in discovery is finding something "truly interesting" in the "potentially interesting" data gathered in step one. The discovery step often starts with descriptive statistics, which provide a summary of what the data tell us about themselves. Let's consider a specific problem: understanding the reaction in social media to different methods of presenting advertisements. I'll consider three descriptions of possible interest and three distributions of data errors.
Descriptions:
Average behavior: What is the average, or typical, reaction to an advertisement (e.g., the middle of the distribution)?
Variation: How variable is the reaction (e.g., how much spread is there, what constitutes a "good" versus an "excellent" reaction, are distinct "categories" of advertisements evident, etc.)?
Extreme behavior/rare events: What is the reaction in the tails (e.g., what goes viral)?
Error patterns:
Random: The errors are randomly distributed throughout the datasets.
Skewed: The errors are skewed in some fashion. It may be that there are more errors in some important sub-sample, more errors on one side of the overall distribution, or more errors in some datasets than others, for example.
Way out of line: Whether random or skewed, some data errors are extreme.
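To make the three descriptions and three error patterns above concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the reaction counts are synthetic (drawn from a log-normal distribution), the 5% error rate, the noise sizes, and the names `summarize` and `corrupt` are hypothetical, and the median, standard deviation, and 99th percentile are just one reasonable choice of stand-ins for "average," "variation," and "tails."

```python
import random
import statistics

def summarize(values):
    """Return one stand-in statistic per description: typical value, spread, tail."""
    ordered = sorted(values)
    return {
        "median": statistics.median(ordered),               # average behavior
        "stdev": statistics.pstdev(ordered),                # variation
        "p99": ordered[int(0.99 * (len(ordered) - 1))],     # extreme behavior
    }

def corrupt(values, pattern, rate=0.05, seed=1):
    """Return a copy of values with a hypothetical error pattern applied to ~rate of them."""
    rng = random.Random(seed)
    out = list(values)
    for i, v in enumerate(out):
        if rng.random() >= rate:
            continue                                # this record is left error-free
        if pattern == "random":
            out[i] = v + rng.gauss(0, 10)           # symmetric noise, no preferred direction
        elif pattern == "skewed":
            out[i] = v + abs(rng.gauss(0, 10))      # errors only ever push values upward
        elif pattern == "extreme":
            out[i] = v * 1000                       # way out of line, e.g. a unit mix-up
    return out

# Synthetic "reactions per advertisement"; not real data.
data_rng = random.Random(0)
clean = [data_rng.lognormvariate(3, 1) for _ in range(10_000)]

results = {"clean": summarize(clean)}
for pattern in ("random", "skewed", "extreme"):
    results[pattern] = summarize(corrupt(clean, pattern))

for name, stats in results.items():
    print(name, {key: round(val, 1) for key, val in stats.items()})
```

In a run like this, the median tends to move little under any of the patterns, while the spread and the tail statistic can be dominated by a small share of extreme errors. The exact numbers depend entirely on the assumed distributions and error rates, so treat the output as a way to build intuition rather than as evidence about any real dataset.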
The following table summarizes the likely implications for summary statistics. The table is color-coded -- green means little impact, yellow means some impact, and red means trouble. I've included a light green for potentially dangerous situations that good analysts should be able to accommodate.
Overall, the picture is pretty dismal. Good analysts need to be aware of and deal with these issues. At a minimum (and as noted in the previous post) they must know the provenance and quality of all data. Over the long term, they must strive to improve the quality of data, getting data to "trusted quality levels." And they must understand the patterns of errors that remain.
In the short term, they must take rigorous steps to identify and eliminate errors. When I was at Bell Labs, we called the process "rinse, wash, scrub." They must make data quality statistics part and parcel of their overall descriptive statistics. And, overall, they must be extremely cautious! At best these steps are time-consuming, expensive, and fraught. They are only suitable for the short term.
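As one illustration of making data quality statistics part of the descriptive statistics, the sketch below reports missing and out-of-range rates in the same summary as the mean and median. The record layout, the field name `shares`, the validity bounds, and the function name `describe_with_quality` are all hypothetical; a real analysis would need its own definition of what counts as an error.

```python
import statistics

def describe_with_quality(records, field, low, high):
    """Summarize one numeric field and report how much of it was unusable."""
    total = len(records)
    missing = 0
    out_of_range = 0
    usable = []
    for rec in records:
        value = rec.get(field)
        if value is None:
            missing += 1                      # no value recorded at all
        elif not (low <= value <= high):
            out_of_range += 1                 # value present but implausible
        else:
            usable.append(value)
    return {
        "n_total": total,
        "n_usable": len(usable),
        "missing_rate": missing / total if total else 0.0,
        "out_of_range_rate": out_of_range / total if total else 0.0,
        "mean": statistics.fmean(usable) if usable else None,
        "median": statistics.median(usable) if usable else None,
    }

# Made-up example records, not real measurements.
reactions = [
    {"ad": "A", "shares": 12},
    {"ad": "B", "shares": None},
    {"ad": "C", "shares": 30},
    {"ad": "D", "shares": -4},
    {"ad": "E", "shares": 18},
]

report = describe_with_quality(reactions, "shares", low=0, high=1_000_000)
print(report)
```

Reporting the error rates next to the mean and median lets a reader see at a glance how much of the data the summary actually rests on.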
Garbage in, garbage out.
What are you doing to ensure data quality for big-data analytics? Share on the message board below.
Seth, and let's not forget that S&P $2 trillion error from last year! I don't know if we can link that to spreadsheets, but certainly there was a data quality issue there with widespread ramifications.
The potential errors are mind-boggling, especially from companies that should know better. (I.e., a company that specializes in metrics that apparently did not realize how Excel truncates numbers after a certain number of digits. Yes, this really happened.)
I'm apprehensive about the number of companies that still rely on spreadsheets -- and the potential errors created by less than adequately trained employees who use them.
@ Beth, that's a good question. Big data could lead to big errors. The more data there is, the more potential for error. For example, audits of financial institutions can involve tens of millions of spreadsheets, any one of which could have an error. In October 2011, the German government announced it was €55bn richer after an accountancy error undervalued assets at the state-owned mortgage lender Hypo Real Estate. It was due to a sum being entered twice.
Richard Cuthbert, CEO of UK outsourcing specialist Mouchel, stepped down after a spreadsheet-based accounting error reduced Mouchel's full-year profits by more than £8.5 million to below £6 million. Ooops!
With these spreadsheets being connected to various databases, a simple error can spread like wildfire.
Hi Tom. I'm wondering what you see internally these days relative to how companies handle data quality and how that might change as big-data comes into play. Do you see distinct data quality groups set up? If not, do you think such would be a necessity going forward?