If you consider yourself among the new breed of data scientist, somebody great at manipulating data and at applying advanced analytics while living and breathing in the Hadoop ecosystem, then SAS has you in its sights.
Right about now, you may be wondering, What?! SAS is known more among the analytics and statistical traditionalists than the newbie data-science and Hadoop crowd. But the expansion of the targeted SAS user base naturally flows from all the work it's been doing to bring Hadoop and advanced analytics together, as evidenced most recently in the SAS In-Memory Statistics for Hadoop environment the company has been showing off at this week's Strata 2014 in Santa Clara, Calif.
In-Memory Statistics for Hadoop is an analytics programming environment for the Hadoop framework. As the name indicates, it takes advantage of in-memory technology. This is the same in-memory technology that comes out of SAS's work on high-performance analytics and goes into play for products like Visual Analytics, an interactive and highly dynamic data-visualization tool that has been shown to power through a billion rows of data nearly instantaneously.
In-Memory Statistics for Hadoop moves analytical processing away from the "blocking and tackling" of old, where one procedure ends and the next begins. Rather, being able to do the statistical work at the speed allowed by in-memory technology means the ability to string all those processes together as a series of actions, said Mike Ames, director of data science and emerging technology at SAS, whom I talked with by phone yesterday.
"Since we don't have to flush data from memory, we can keep it there for the entire session and intermix the data management and the exploratory analysis with the statistical analysis," he added.
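To make the idea concrete, here is a minimal Python sketch of what "keeping it there for the entire session" might look like. It is not SAS code and does not reflect the product's actual interface; the `InMemorySession` class, its method names, and the sample hotel-style data are all assumptions made up for illustration. The point is only that the data is loaded once and then a data-management step, an exploratory step, and a statistical step run against the same in-memory table.

```python
import statistics


class InMemorySession:
    """Hypothetical stand-in for an in-memory analytics session (not SAS's API)."""

    def __init__(self, rows):
        # Load the data once; every later action reuses this in-memory table.
        self.rows = [dict(r) for r in rows]

    def where(self, predicate):
        # Data management: filter rows without writing anything to disk.
        self.rows = [r for r in self.rows if predicate(r)]
        return self

    def derive(self, name, fn):
        # Data management: add a computed column to every row.
        for r in self.rows:
            r[name] = fn(r)
        return self

    def summary(self, column):
        # Exploratory analysis: basic descriptive statistics for one column.
        values = [r[column] for r in self.rows]
        return {"n": len(values), "mean": statistics.mean(values)}

    def fit_line(self, x, y):
        # Statistical analysis: ordinary least-squares fit of y on x.
        xs = [r[x] for r in self.rows]
        ys = [r[y] for r in self.rows]
        mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
        sxx = sum((v - mean_x) ** 2 for v in xs)
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
        slope = sxy / sxx
        return slope, mean_y - slope * mean_x


# Made-up sample data: one invalid row plus five rows where rate = 100 + 10 * nights.
raw = [{"nights": 0, "rate": 0.0}]
raw += [{"nights": n, "rate": 100.0 + 10.0 * n} for n in range(1, 6)]

session = InMemorySession(raw)
session.where(lambda r: r["nights"] > 0).derive("total", lambda r: r["nights"] * r["rate"])
stats = session.summary("total")                  # exploratory step
slope, intercept = session.fit_line("nights", "rate")  # modeling step, same data
```

Each call works on the table already held in memory, which is the contrast with a procedure-by-procedure workflow that writes results out and reads them back in between steps.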
Wayne Thompson, SAS chief data scientist, was on the call as well. He emphasized SAS's goal of providing a framework for supporting the complete analytical lifecycle -- from the data wrangling (or preparation), to the exploration and the modeling, and then through to deployment. "This provides the ability to not have to use something like MapReduce, but to be able to co-locate and compute across the cluster, never dropping data back down to disk," said Thompson, adding that Hadoop provides a "fire hydrant" of data to consume and feed into predictive models and prescriptive models.
Ames said he believes SAS is way ahead of the competition with its ability to support distributed computing and interactivity in this manner. "Today, you'd have to write most of the code yourself, but with this, you get pre-built libraries of statistical and machine learning methods."
In a press release announcing first-quarter 2014 availability of In-Memory Statistics for Hadoop, SAS noted clustering, regression, generalized linear models, analysis of variance, decision trees, random decision forests, text analytics, and recommendation systems as the statistical and machine-modeling techniques supported. The screen capture below, for example, shows random decision forests.
Screen capture: SAS In-Memory Statistics for Hadoop supports a wide range of statistical algorithms and machine-learning techniques, including random decision forests.
All this built-in machine learning brings us back to those data scientists and Hadoopsters. As Thompson said, "We didn't develop this environment to target traditional inferential statistics."
In-Memory Statistics for Hadoop enables the iterative approach that data scientists thrive on. They can submit models and get models back in a continuous flow, engaging in work that, in a lot of cases, had not previously been computationally feasible, Ames said. All of this is changing the way people work, shifting the focus from individual effort to data science teamwork.
"Wayne can build his models on the same set of data that I'm using and share them with me. We don't need to replicate effort," Ames said.
Of course, as SAS targets the new breed of Hadoop-inspired data scientist, it's not leaving its traditional users behind. Case in point, Thompson talked of one analytics director at a global hospitality company who stopped by at Strata to check out the product's support for hotel load analysis. Like others, Thompson said, "he sees Hadoop as a low-cost and powerful, flexible and scalable storage environment. To have the analytics co-located with that -- well, that gives us the ability to take existing customers like him to the next level."
With In-Memory Statistics for Hadoop, Ames said, analysts can build hundreds if not thousands of models in a concurrent run of the software. That sounds like a game-changer to me. What would you do with that kind of power? Jump in with your ideas below.
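For readers who want a feel for what building hundreds of models in one run involves, the following Python sketch fits one small regression per data segment, all against data already held in memory. It is an illustration under stated assumptions, not SAS's implementation: the `fit_line` helper, the 200 made-up segments, and the use of a thread pool on a single machine are all hypothetical. A real cluster would spread this work across nodes, and Python threads in particular do not give true CPU parallelism for code like this.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_line(points):
    """Least-squares fit of y on x; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: 200 segments, each following y = k * x + 1 exactly.
segments = {k: [(x, k * x + 1.0) for x in range(10)] for k in range(1, 201)}

# Submit every segment's model in one concurrent run; map() preserves input order,
# so results line up with the segment keys.
with ThreadPoolExecutor(max_workers=8) as pool:
    fitted = dict(zip(segments, pool.map(fit_line, segments.values())))
```

The shape of the work is the interesting part: many independent models, one shared copy of the data, and no intermediate trips to disk between them.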
— Beth Schultz, Editor in Chief, AllAnalytics.com