If Mike Cavaretta, a Ford analyst, had his way, he'd be able to march around the company and pinpoint where every last dataset is stored and replicated.
But that's impossible, as he told attendees during his "Where will Big Data and Data Science take O.R.?" session at this week's INFORMS Analytics Conference in San Antonio. The sheer volume of data is one issue, of course. By one count in 2009, the company had about 830 terabytes of data. It's not hard to imagine that total being well into the petabyte realm now.
Consider sensor data from Ford autos. The Fusion Energi, a plug-in hybrid, generates 25 gigabytes of data hourly, as described in a January Forbes article citing stats from Ford CTO Paul Mascarenas:
The car "has more than 145 actuators, 4,716 signals, and 74 sensors to monitor the perimeter around the car as well as the car's functions and driver responses. These sensors produce more than 25 gigabytes of data hourly from more than 70 on-board computers that analyze it in real-time."
That's a lot of data, but not when compared to this: A Fusion Energi test vehicle outfitted with multiple high-resolution sensors might generate one terabyte of data in a single four-hour test, Cavaretta said.
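To put those two figures side by side, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 terabyte = 1,000 gigabytes) and treats the quoted numbers as round values, even though the article gives them as "more than 25 gigabytes" and "might generate one terabyte." The variable names are illustrative only and do not come from Ford.

```python
# Rough comparison of the two data rates quoted above.
# Assumption: decimal units, 1 TB = 1000 GB.
GB_PER_TB = 1000

production_gb_per_hour = 25   # Fusion Energi figure cited by Forbes
test_total_tb = 1             # test vehicle output, per Cavaretta
test_hours = 4                # length of a single test

# Convert the test vehicle's total into an hourly rate.
test_gb_per_hour = test_total_tb * GB_PER_TB / test_hours

# How many times faster the test vehicle produces data.
ratio = test_gb_per_hour / production_gb_per_hour

print(f"Test vehicle: about {test_gb_per_hour:.0f} GB per hour")
print(f"Roughly {ratio:.0f}x the production car's rate")
```

Under those assumptions, the test vehicle works out to about 250 gigabytes an hour, or roughly 10 times the production car's rate.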
What will Ford do with all that on-board data? Would sending it out for external storage and subsequent analysis even be possible? Would mashing it up with open data and making it available for smart city initiatives be feasible? Ford is pondering such questions, according to Cavaretta, who carries the official title of technical leader, predictive analytics at Ford Research & Advanced Engineering.
Culture comes into play, too. Sometimes certain areas of the company don't really want to share their data. So Ford has lots of "dark data," or data that's hidden away from the enterprise data warehouse in "shadow IT."
But at least that data is there somewhere. Another challenge Cavaretta cited during his presentation is that some datasets just aren't available anymore. "I've run into too many situations where I'd be talking to internal customers, I'd say, 'It'd be great if we had this data,' and they'd say, 'Oh, but we did... but we only save it for 30 days.'"
Drawing on his experiences with such scenarios, he suggested that companies planning to conduct big-data analytics start collecting data at the lowest possible level, and sooner rather than later. Watch the All Analytics video below with Cavaretta to learn more about Ford and its big-data use.
Despite the challenges posed not only by the volume but also by the variety and velocity of big data, he believes big data will flatten out, and the data problem will be solved. "The focus will come back to the analytics, the machine learning, the statistics. The techniques we're good at applying will come back to the fore."
I like that line of thinking. How about you?