Documents received by The Washington Post describe Muscular, an NSA effort that infiltrated Google and Yahoo network traffic. Muscular gave NSA analysts access to millions of emails, attachments, and other web communications each day, including entire Yahoo mailboxes.
According to the documents, the NSA had to develop new filtering and distribution systems to process this data mother lode. Even with these systems, the new data (particularly from Yahoo) proved too much to handle. Yahoo email came to account for approximately 25 percent of the daily data processed by the NSA's main analytics platform for intercepted Internet traffic. Most of the data was more than six months old and virtually useless. Analysts became so frustrated that they requested "partial throttling" of Yahoo data.
"Numerous target offices have complained about this collection 'diluting' their workflow," according to one NSA document. "The sheer volume" of data is unjustified by its "relatively small intelligence value."
Other NSA data mining programs have overwhelmed the agency, as reported elsewhere. When spammers hacked a target Yahoo account last year, the account's address book blew up with irrelevant email addresses. Consequently, the NSA had to limit its address book data collection efforts to only Facebook contacts.
These broad data sweeps have been significantly less successful than the NSA's more targeted operations. In an interview with the Daily Caller, former NSA official William Binney said the NSA's inefficient big data processes crippled its ability to react to a tipoff about Tamerlan Tsarnaev, information that could have curtailed the Boston Marathon bombing.
"They're making themselves dysfunctional by collecting all of this data. They've got so much collection capability but they can't do everything… The basic problem is they can't figure out what they have, so they store it all in the hope that down the road they might figure something out and they can go back and figure out what's happening and what people did."
Still, the White House and other government departments and agencies place the NSA under what The New York Times calls an intense "pressure to get everything" -- a pressure that has spawned a data obsession.
The problem with this obsession is twofold. The first issue is the ROI of gathering haystacks: resources better spent elsewhere are diverted to finding, gathering, filtering, and ultimately throttling and fixing oversized and under-relevant data.
The other issue is one of public relations. The NSA may have assumed that, as a super-secret spy agency, its accountability would always remain limited, but leaks happen. This data gluttony has cost it trust and goodwill from the American public and from foreign powers, just as companies often face public backlash over their customer analytics programs.
The ever-present question for the big data enterprise is this: What are the costs, all the costs, of your data mining efforts? Are they manageable? Will there be backlash or some other loss of goodwill? What other consequences might occur?
Or, more importantly, is there a simpler, better way?