Regression methods are so familiar that it's easy to overlook vital information hiding in plain sight, so always give some thought to regression to the mean.
Because regression is one of the most common tools in business analytics, managers usually just glance at the p-value for each F-statistic, and the estimate and standard error for each coefficient and intercept, and move on. But hidden in the word "regression" is a clue to other information that may be right there in front of you.
Regression gets its name from a realization that came to Francis Galton, a Victorian polymath who did much of the technique's early development work, while he was graphing children's heights against their parents' heights. Although parental height was a good predictor of a child's height, there was a strong tendency for very tall parents to have children shorter than themselves, and for very short parents to have children taller than themselves. He called the fitted line a regression line, because in repeated samples from the same population, the outliers (points extremely far from the estimated regression line) tended to regress -- move closer to that line -- on the next repetition of the test.
The normal (bell-curve) distribution of errors causes this (in statistics, errors are not mistakes, but simply differences between predicted and observed values). A normal distribution of errors means that small errors are common and large ones are rare. When one of the few large errors happens to occur on an extreme value (i.e., a value close to the minimum or maximum), you get an outlier.
The chances of an extreme value are small, the chances of a big error are small, and the chances of both together are very small, which is why outliers are scarce.
Regression to the mean happens naturally, because the odds of being an outlier twice in a row are even lower. The highest probability is that the next time the process is repeated, the cases that were outliers before will be closer to the predicted line. That's the meaning of regression to the mean: today's star (or biggest loser), tomorrow's Average Joe.
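A quick simulation can make this concrete. The sketch below assumes each performer's score is a stable skill plus fresh, normally distributed luck in each period. The equal weighting of skill and luck, the population size, and the function name are all illustrative assumptions, not anything taken from Galton's data.

```python
import random
import statistics


def simulate_regression_to_mean(n=10_000, top_fraction=0.05, seed=0):
    """Return the stars' average score in period 1 and in period 2.

    Assumption: score = stable skill + fresh luck, both standard normal.
    """
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n)]
    period1 = [s + rng.gauss(0, 1) for s in skill]
    period2 = [s + rng.gauss(0, 1) for s in skill]

    # Pick the period-1 stars, then follow the same people into period 2.
    cutoff = sorted(period1, reverse=True)[int(n * top_fraction) - 1]
    stars = [i for i in range(n) if period1[i] >= cutoff]

    return (
        statistics.mean([period1[i] for i in stars]),
        statistics.mean([period2[i] for i in stars]),
    )


star_p1, star_p2 = simulate_regression_to_mean()
print(f"stars in period 1: {star_p1:.2f}; same people in period 2: {star_p2:.2f}")
```

Under these assumptions the period-1 stars stay above average in period 2, because their skill is real, but they land well below their own period-1 showing, because the luck does not repeat.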
Regression to the mean explains the "sophomore slump" in sports stars: A brilliant rookie was also unusually lucky, racking up great stats in his first year, but is unlikely to be that lucky again in the second season. It's a factor in the sales manager's frustration with the salesperson who sells like mad for his or her first period, then turns average. And it explains why some managers become addicted to penalties and negatives: Because of regression to the mean, top people tend to slip and poor ones tend to improve, whatever the manager does, so it looks like the bonuses to the top people are failing and the punishments for the bottom people are succeeding.
In analytics-driven management, if you are alert for regression to the mean, you can avoid being fooled, spot unusual problems and opportunities, and brace for some likely trouble:
Not being fooled: It's important to sort out talent from luck. Before launching an incentive program, look to see if the "star" and the "goat" performers regress to the mean over time. If they do, the results are mostly luck, and an incentive that rewards or punishes luck will just lower morale.
Unusual problems and opportunities: If most of the population regresses to the mean, but you have a few consistent stars or duds, you're almost certainly looking at a black swan that warrants further investigation. Somebody has a skill they're not sharing, or a few people have a disastrous procedure you want to warn against in training.
Bracing for trouble: Some businesses have sharply spiked distributions for sales, supply costs, orders, etc. One customer might be a third of all orders, one field office might be handling half of all customer complaints, and so on. If that is the case, and you have observed regression to the mean in the overall data, then sometime in the future, you're going to get a perfect storm or a jackpot, because one of the biggest sources of blessings or troubles will turn outlier. Then, numbers that have been stable, steady, predictable, and generally comfy for years or decades will lurch into completely new territory.
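The first two checks above, sorting talent from luck and spotting the few consistent stars, can be sketched in a few lines. This is only one possible approach: the function names, the "top fraction" cutoff, and the sample scores are hypothetical, and real data would need many more performers and periods.

```python
def top_set(scores, fraction):
    """Names of the top `fraction` of performers in one period."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def repeat_rate(periods, fraction=0.1):
    """Share of one period's top performers who are top again next period.

    If results were pure luck, this would hover near `fraction` itself.
    """
    repeats, total = 0, 0
    for earlier, later in zip(periods, periods[1:]):
        stars = top_set(earlier, fraction)
        repeats += len(stars & top_set(later, fraction))
        total += len(stars)
    return repeats / total


def consistent_stars(periods, fraction=0.1):
    """Performers in the top group in every period: worth investigating."""
    return set.intersection(*(top_set(p, fraction) for p in periods))


# Hypothetical scores for five salespeople over three periods.
periods = [
    {"ann": 90, "bob": 85, "cy": 40, "dee": 50, "eve": 45},
    {"ann": 92, "bob": 48, "cy": 51, "dee": 80, "eve": 47},
    {"ann": 88, "bob": 52, "cy": 83, "dee": 49, "eve": 50},
]
rate = repeat_rate(periods, fraction=0.4)
steady = consistent_stars(periods, fraction=0.4)
print(rate, steady)
```

A repeat rate close to the cutoff fraction suggests the stars are mostly lucky; anyone who shows up in `consistent_stars` (or in the same calculation run on the bottom group) is the kind of exception worth a closer look.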
Understanding regression to the mean can be a quick pathway to sort out the accidents from the permanent changes, figure out where the "100-year floods" are bound to hit eventually, and perhaps even have a plan in place for an overabundance of blessings.
Watch your outliers. Regression to the mean happens -- in fact, the basic technique is named after it. If you keep that possibility in mind, you've positioned yourself to be the good kind of outlier.
Louis, I'd phrase it just a little differently, that it's a matter of what field you're in. If you're a sandwich maker, you can be right there on that regression line, or even a bit below it, and still work. But if you're going to be a pro tennis player, you need to be an outlier -- way over to the right and consistently above that regression line. And if you want to be a pop star ... all that, plus being an extreme outlier (i.e. VERY lucky).
Something I just noticed thanks to this excellent blog post (which I found in David Brin's even-more-excellenter blog post) is that although the black swan metaphor is catchy, the book itself suffers from the problem that the author thinks that because you can't predict the specifics, you'll always be surprised by rare events. But often just knowing that the surprise is possible carries a lot of information; we know a massive earthquake, bigger than the California Big One, will happen someday with an epicenter somewhere around New Madrid, Missouri; we know that the use of nuclear EMP to disable machinery around a wide area, which is an extremely likely thing for a government or terrorist organization to do in the next fifty years, also means that an enormous number of financial records (such as bank accounts) could be wiped; we know that if a late-season hurricane, a high plains blizzard, and a large arctic air mass all converge on the East Coast megalopolis .... oh, wait.
Before the Panic of 1907 it was well-known that just a few insurance companies covered the city of San Francisco, and their stock was mostly owned by a few large banks in New York, which in turn were linchpins of that city's (and therefore the country's) banking system (back before the Federal Reserve). How likely was it that suddenly all those insurance companies would have to pay up on all their policies all at once? (Hint: what happened in San Francisco in 1906?)
The extreme outliers are important and worthy of our attention, even if it's just to confirm they are flukes.
Thanks, John, for the refresher on regression. I do think that because it is not a sexy tool, its information often gets ignored. But I like how you have shown how regression to the mean explains a whole host of occurrences (I really like the sophomore slump example).
It seems like everyone is destined to be average except for the occasional outlier, which is I guess how it has always been -- so the question is how to become and stay an outlier against the odds?
Very interesting insight and food for thought - thanks again John.
Seth -- that's very true, though not itself an example of regression to the mean. But an analyst would still use regression to the mean as an alternate hypothesis, to demonstrate the reality of fatigue. Here's how the analyst would reason it out:
In either a genuine regression-to-the-mean case, or in the case of fatigue, nearly all top performers in the first period evaluated would fall down the ladder in some later period, but in a classic regression to the mean, the fall would be overwhelmingly likely to happen between period 1 and period 2. If there were any surviving top performers who had top-performed in both period 1 and period 2, then the fall would be overwhelmingly likely (and with the same probability) between 2 and 3, and so on; a high probability of a top performer taking a fall between any two periods, and that probability would be fairly constant.
But in fatigue, you'd see the probability of a fall start out fairly low and rise with time, probably nonlinearly. You could quickly confirm this by plotting and/or regressing your residuals (statistical errors) against time. Comparing the errors of the two models would quickly reveal that fatigue explained things much better.
So a reasonably sharp analyst, put on the problem, could tell the manager that this was not regression to the mean (which fundamentally can't be fixed) but a situation where the right resources applied correctly (sabbaticals, incentives) could make everyone better off.
That's the beauty of always considering and testing for regression to the mean; whether you find it or not, it always puts you a long step closer to understanding the situation.
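The comparison described above, a roughly constant chance of falling versus one that rises with time, can be sketched as a period-by-period fall rate among the original top performers. The function name, the input layout, and both sets of counts are hypothetical, chosen only to show the two shapes.

```python
def fall_hazard(fall_period, n_periods):
    """Chance of falling at each period among those still on top.

    fall_period[i] is the first period (2..n_periods) in which period-1
    star i dropped out of the top group, or None if they never dropped.
    """
    hazards = []
    at_risk = len(fall_period)
    for t in range(2, n_periods + 1):
        fell = sum(1 for f in fall_period if f == t)
        hazards.append(fell / at_risk if at_risk else float("nan"))
        at_risk -= fell
    return hazards


# Hypothetical: 100 period-1 stars followed through period 4.
luck_like = fall_hazard([2] * 80 + [3] * 16 + [4] * 3 + [None], 4)
fatigue_like = fall_hazard([2] * 10 + [3] * 27 + [4] * 38 + [None] * 25, 4)
print(luck_like)
print(fatigue_like)
```

A flat, high sequence is what plain regression to the mean would produce; a sequence that starts low and climbs looks more like fatigue, which is the case where sabbaticals or incentives might actually help.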
When it comes to employee performance, fatigue must be factored in. Very few people can be top performers for years. I had problems with one boss when I became less productive, yet was still the top producer. You can drive a Ferrari at top speed only for so long before it breaks down.
The strange thing is that some of the hotshots created by regression to the mean believe their own hype; they think there must be something they're doing that's causing the streaks. Addicted gamblers are notoriously that way; they think they had mojo that accounted for that one wonderful time early in their career that everything went so well, and they can spend (and destroy) the rest of their lives trying to get that illusory mojo back. But some businesses, occupations, and situations are just streaky by nature (Claude Shannon, back in the 1940s, worked out why in a purely random process, streakiness is more likely than steadiness) and quiet, persistent, do-it-right-every-time effort only pays out on the average over the very long run, so more "streak addicts" are born every year -- and end up wasting their lives chasing after the "streak magic" that doesn't exist.
John, on the money with your sales and sports stars slumping or falling off a cliff comparisons. I've been involved in sales for years, many of them as a sales manager. I've always looked for performance consistency over hotshots and spikes. Why? The mercurial hotshots or rainmakers are prone to be hot and cold, not steady, even dipping below the mean. And as a sales manager I would never completely tie my wagon to a star sales person or customer for that matter.
There was one guy I know (used to work with him) who was one of these slam-bam type of sales people who would outshine everyone...but only for brief periods, and then he'd sink and his sales would dip. That caused management to start asking questions about commitment and effort. This guy was always ready and would typically respond not by pumping up his effort but by quickly bailing out and going to a new company where they were impressed with his hot sales record. He'd typically last a year at a company and leave before the annual performance review.
Beth, I learned it the hard way, from someone who was not a better analyst than me but who paid attention to the analytics at a time when I didn't. She pointed to the variability in payment times and amounts from various clients, and said, "John, right here is where you'll go broke. You can afford to have all this wobble in the behavior of the minor clients, and let it even out, but the first time a major client does that, you'll be broke."
Unfortunately she was absolutely right; one day the single client that was 40% of my income turned into an outlier for payment time and for commissioning new work. It's not the average wave, but the biggest one, that can sink you.
And it was all perfectly predictable from the fact that clients in that business showed a distinct pattern of regressing to the mean. The rest of you don't need to put your hand on the hot stove to find out it's a bad idea; just sniff my charred fingers!