Ask analytics directors whether they need a high-performance analytics infrastructure, and many will hem and haw. But not Mark Pitts, director, data science, solutions & strategy, at UnitedHealthcare.
"I want one," Pitts told a crowd of colleagues at last month's SAS Analytics 2012 conference.
This was no idle chatter or wishful thinking. Pitts has put high-performance analytics (HPA) through its paces and has concluded it's highly desirable for UnitedHealthcare, the health benefits arm of global healthcare company UnitedHealth Group. "We do an excellent job with member services today, but we want to make it the best experience in the industry. This is about taking analytics to the next level," Pitts told me in an interview.
With HPA, UnitedHealthcare can change how it approaches analytics and the types of data it can analyze. Without HPA, Pitts's team is limited in the sample sizes it can use in its models and the number of iterations it can run, which results in "models that aren't as good as we know how to make them." Many datasets and business problems go untouched.
Because HPA enables massively parallel processing (MPP), it can run vastly more cycles simultaneously than is possible on traditional analytics servers. Take this quick example: When Pitts's team launched its HPA proof-of-concept (PoC) testing, it intentionally loaded up the big-data environment with a simulation that was computationally and I/O intensive, with four million rows of data -- a process that took four hours and 15 minutes on traditional infrastructure. When it ran in the HPA environment, results came back almost immediately, as in 10 seconds.
"It came back so fast, we at first thought there was a syntax error," Pitts said.
For the PoC, he stood up an HPA infrastructure comprising SAS High-Performance Analytics software running on Greenplum's big-data analytics platform, the Data Computing Appliance (DCA). I won't go into all the technical nitty-gritty here, but suffice it to say the SAS HPA procedures take advantage of the DCA's MPP capabilities to execute in rapid-fire fashion. The Greenplum DCA architecture comprises two racks with 384 CPU cores, 1.536TB of memory, and 992TB of usable storage, resulting in a data load rate that hits 10TB/hour. In one typical test, for example, Pitts's team loaded 26.8 million rows of data in five minutes and 17 seconds, yielding a measured load rate of 9.2TB/hour.
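The load-test figures can be unpacked a little further. The article gives the row count, the elapsed time, and the measured rate, but not the total volume or the row width, so the sketch below back-calculates those. Treat the derived numbers as rough estimates: it assumes decimal terabytes (10^12 bytes) and reads the 10TB/hour figure as the appliance's top load rate, and all of the names are illustrative.

```python
# Figures quoted in the article.
ROWS = 26_800_000
LOAD_SECONDS = 5 * 60 + 17        # five minutes and 17 seconds
MEASURED_TB_PER_HOUR = 9.2
TOP_TB_PER_HOUR = 10.0

# Assumption (not stated in the article): decimal terabytes.
BYTES_PER_TB = 10**12


def implied_volume_tb(rate_tb_per_hour: float, seconds: int) -> float:
    """Data volume implied by a sustained load rate over a duration."""
    return rate_tb_per_hour * seconds / 3600


volume_tb = implied_volume_tb(MEASURED_TB_PER_HOUR, LOAD_SECONDS)
bytes_per_row = volume_tb * BYTES_PER_TB / ROWS
rows_per_second = ROWS / LOAD_SECONDS
fraction_of_top = MEASURED_TB_PER_HOUR / TOP_TB_PER_HOUR

print(f"Implied volume loaded: {volume_tb:.2f} TB")
print(f"Implied average row:   {bytes_per_row / 1000:.1f} KB")
print(f"Rows per second:       {rows_per_second:,.0f}")
print(f"Share of top rate:     {fraction_of_top:.0%}")
```

Under those assumptions, the test moved a bit less than one terabyte, with rows averaging about 30KB each, and ran at about 92% of the quoted top rate.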
"Before we were limited in the number of rows of unstructured data we could use to train the predictive models -- to maybe a few hundred thousand. With HPA, we hope to be able to use the entire database, which is hundred of millions of rows. We're really excited to let this loose on our entire database," he added.
Even more so, Pitts said he's looking forward to what HPA can do with the unstructured data UnitedHealthcare collects in the form of medical records, case notes and email text, call center transcripts, machine-generated logs, and much more. Toward that end, a second phase of the PoC testing focused on the use of text analytics on the HPA appliance.
That testing went well, too, Pitts said. "Using HPA text miner, we are able to parse millions of rows of text data in a few minutes. The SVD (singular value decomposition -- an important mathematical step in the text mining process) also calculates in MPP mode alongside the database."
Ultimately, Pitts said he envisions a day when all of the traditional symmetric multiprocessing analytics servers disappear and HPA, with in-memory processing, becomes standard fare at UnitedHealthcare. But this big-data analytics stuff is expensive, so Pitts has a plan for getting the most bang for the company's buck, too. IT wants to use the HPA platform to provide data analytics as a service, he said.
"We think we can justify the expense by having it be a shared capability among data scientists across the enterprise. Data scientists could rapidly load large datasets, perform analytic processes quickly, and get out of the way so the next team could load their data and repeat. Sized appropriately, the platform can also support several teams working at once."
HPA makes sense for UnitedHealthcare, as its PoC testing has proven in one scenario after another. Do you see a fit for an ultra-fast big-data analytics environment at your company? Share your thoughts on HPA below.