The emphasis on big data and big data analytics has shifted the predictive analytics discussion toward the overwhelming importance of the data itself, rather than the mathematics.
This is not to say that data was never considered important in building predictive analytics solutions. But prior to the big-data hype, much of the emphasis was on the mathematics and the need to identify the next breakthrough technology. Wearing our marketing science hats, we practitioners would test these new techniques and technologies to determine whether they yielded incremental value.
In most cases, these newer techniques and technologies yielded minimal value over and above traditional multivariate techniques such as logistic regression and multiple regression. But why is this the case if newer techniques and technologies can, in theory, produce powerful results, as seen in academic research and other non-business settings?
Business scenarios differ from those in academia and other settings because of the data environment, in which the so-called random error component is in many cases quite large. Certainly, this random component is much larger than in the more esoteric and pristine data environments of academic research.
In the practical world of business, our ability to explain the actual behavior of our target variable is quite small. The more traditional and simple multivariate techniques work quite well under these scenarios: even with limited explanatory power available, they deliver acceptable results, while the more advanced techniques yield minimal improvement in performance.
This is best demonstrated by the R² of a multiple regression equation in these settings, where values are often well below 10 percent, implying that the predictive analytics solution explains 10 percent or less of the variation in the targeted behavior. With so much unexplained variation, the real risk in employing newer techniques and technologies is that some of this truly random variation appears to be explained by the model, which is to say the model overfits.
The consequence is an overstatement of results during model development, followed by disappointing results once the solution is implemented.
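To make the overstatement risk concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the data is simulated, the names (`make_sample`, `fit_ols`, and so on) are hypothetical, and adding 50 irrelevant inputs to an ordinary regression is only a stand-in for the extra flexibility of a more advanced technique, not an implementation of one. The target has one weak real driver and a large random error component, so the achievable R-squared is below 10 percent.

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variance in y explained by the predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply fitted coefficients (intercept first) to new inputs."""
    design = np.column_stack([np.ones(X.shape[0]), X])
    return design @ coef

def make_sample(rng, n, n_noise=50):
    """Hypothetical business-like data: one weak real driver, mostly noise."""
    driver = rng.normal(size=n)
    irrelevant = rng.normal(size=(n, n_noise))   # unrelated to the target
    target = 0.3 * driver + rng.normal(size=n)   # large random error component
    simple_inputs = driver.reshape(-1, 1)
    flexible_inputs = np.column_stack([driver, irrelevant])
    return simple_inputs, flexible_inputs, target

rng = np.random.default_rng(0)
X_simple_dev, X_flex_dev, y_dev = make_sample(rng, 200)   # development sample
X_simple_new, X_flex_new, y_new = make_sample(rng, 200)   # fresh data at rollout

simple = fit_ols(X_simple_dev, y_dev)
flexible = fit_ols(X_flex_dev, y_dev)

r2 = {
    "simple, development": r_squared(y_dev, predict(simple, X_simple_dev)),
    "flexible, development": r_squared(y_dev, predict(flexible, X_flex_dev)),
    "simple, new data": r_squared(y_new, predict(simple, X_simple_new)),
    "flexible, new data": r_squared(y_new, predict(flexible, X_flex_new)),
}
for name, value in r2.items():
    print(f"{name}: {value:.3f}")
```

On the development sample the flexible model cannot score lower than the simple one, because it contains the simple model as a special case. The number to watch is its R-squared on new data, where the noise it appeared to explain no longer helps.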
Yet, our real opportunity to improve results resides with the data itself. Altering data inputs, or creating new variables, can significantly improve performance.
This is evident from how much model performance and R² vary with the data inputs or variables used in the model. Putting the right eggs in your basket is the key driver in building effective solutions. As a result, practitioners spend more of their time creating the analytical file and the potential data inputs for any model.
It's not unusual for 80 to 90 percent of the practitioner’s time to be spent working on the data, with the remaining time spent running the more advanced routines.
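The effect of creating a new variable can be sketched in the same spirit. The example below is an illustration under stated assumptions, not a description of any real analytical file: the customer data is simulated, the fields `spend` and `visits` and the derived `spend_per_visit` are hypothetical, and the response is constructed to depend on the ratio rather than on either raw field alone. The same simple regression technique is run twice, once on the raw inputs and once with the derived variable added.

```python
import numpy as np

def ols_r_squared(X_fit, y_fit, X_eval, y_eval):
    """Fit OLS with an intercept on one sample; return R-squared on another."""
    fit_design = np.column_stack([np.ones(len(y_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(fit_design, y_fit, rcond=None)
    eval_design = np.column_stack([np.ones(len(y_eval)), X_eval])
    residual = y_eval - eval_design @ coef
    return 1.0 - np.sum(residual ** 2) / np.sum((y_eval - y_eval.mean()) ** 2)

def make_customers(rng, n):
    """Simulated customers whose response depends on spend per visit."""
    spend = rng.uniform(10, 100, size=n)
    visits = rng.integers(1, 11, size=n)          # whole numbers from 1 to 10
    spend_per_visit = spend / visits              # the derived variable
    response = 0.03 * spend_per_visit + rng.normal(size=n)
    raw = np.column_stack([spend, visits])
    engineered = np.column_stack([spend, visits, spend_per_visit])
    return raw, engineered, response

rng = np.random.default_rng(1)
raw_dev, eng_dev, y_dev = make_customers(rng, 5000)      # development sample
raw_hold, eng_hold, y_hold = make_customers(rng, 5000)   # holdout sample

r2_raw_dev = ols_r_squared(raw_dev, y_dev, raw_dev, y_dev)
r2_eng_dev = ols_r_squared(eng_dev, y_dev, eng_dev, y_dev)
r2_raw_hold = ols_r_squared(raw_dev, y_dev, raw_hold, y_hold)
r2_eng_hold = ols_r_squared(eng_dev, y_dev, eng_hold, y_hold)

print(f"raw inputs:      development {r2_raw_dev:.3f}, holdout {r2_raw_hold:.3f}")
print(f"with new input:  development {r2_eng_dev:.3f}, holdout {r2_eng_hold:.3f}")
```

The technique is unchanged between the two runs; only the inputs differ. Checking the holdout sample as well as the development sample is what separates a genuine gain from the overstatement described earlier.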
Most experienced practitioners welcome the discussion and hype surrounding big data. It has reinforced the discipline of data and, more importantly, the process of creating the right data environment. New terms, such as data science, now attest to the importance of data within the predictive analytics process. Describing data as a science implies that practitioners take a rigorous and methodical approach in dealing with data.
But from a practitioner’s standpoint, is any of this really new? Data as a discipline has always represented the bedrock foundation in the development of any predictive analytics solution, after all. What do you think?