Call me paranoid. But de-identification of information only improves the odds of anonymity. It doesn't guarantee it.
In simplest terms, de-identification is the opposite of identification. Rather than linking data to a particular source, information is stripped out to prevent someone from discovering the source.
It's an awkward word for a concept invented to create an illusion of privacy around the collection, use, and reporting of data. And although some people claim that fears of re-identification have been greatly overblown, it's reasonable to have reservations.
De-identified only means that it would be very difficult to re-identify the people in the original record. It doesn't mean it would be impossible.
It's not even as secure as "anonymized" information, which means all of the links between a person and the person's record have been "irreversibly broken so that it would be virtually impossible to re-establish any of the people in the original record."
So de-identified data is less secure than anonymized data -- and even anonymous information involves risk. Just ask a so-called anonymous source who was inadvertently identified by the police, a prosecutor, or the press.
And then there was the guy who wrote Primary Colors, which was originally published by -- you remember -- "Anonymous." How did that work out for you, Joe Klein?
All de-identified really means is that there is no reasonable basis to believe the data can be used to identify an individual. But is someone obsessed with discovering the source of data reasonable? What about someone determined to crack information out of spite, competition, or malevolent intent?
You don't have to be a conspiracy theorist to understand that it's not only possible but probable to re-identify the de-identified.
A well-publicized study proved it was possible to execute a successful re-identification attack on claims data for 135,000 patients disseminated by the Group Insurance Commission in Massachusetts. The discharge record for the then-governor was re-identified by matching it with simple demographic information found in the Cambridge voter registration list, which someone purchased for $20.
All It Takes Is Three Data Points
Remember, three data points -- gender, birth date, and ZIP code -- can identify 87 percent of Americans.
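For readers who want to see how little it takes, here is a minimal sketch of that kind of matching. Everything in it is made up for illustration: the records, the names, and the helper functions `key` and `link` are hypothetical, not the method used in the Massachusetts study. It simply joins a "de-identified" list to a public one on the three fields above and keeps the matches that are unique on both sides.

```python
from collections import Counter

# Hypothetical "de-identified" claims: names removed, three fields kept.
claims = [
    {"gender": "M", "birth_date": "1950-03-14", "zip": "02139", "diagnosis": "A"},
    {"gender": "F", "birth_date": "1972-11-02", "zip": "02139", "diagnosis": "B"},
    {"gender": "F", "birth_date": "1972-11-02", "zip": "02139", "diagnosis": "C"},
    {"gender": "M", "birth_date": "1988-06-30", "zip": "02141", "diagnosis": "D"},
]

# Hypothetical public voter list: names present, same three fields.
voters = [
    {"name": "Voter One", "gender": "M", "birth_date": "1950-03-14", "zip": "02139"},
    {"name": "Voter Two", "gender": "F", "birth_date": "1972-11-02", "zip": "02139"},
    {"name": "Voter Three", "gender": "M", "birth_date": "1988-06-30", "zip": "02141"},
]

QUASI_IDENTIFIERS = ("gender", "birth_date", "zip")


def key(record):
    """Return the three-field combination used to match records."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(claims, voters):
    """Pair a name with a diagnosis when the combination is unique in both lists."""
    claim_counts = Counter(key(c) for c in claims)
    voter_counts = Counter(key(v) for v in voters)
    names = {key(v): v["name"] for v in voters}
    matches = []
    for claim in claims:
        k = key(claim)
        # An ambiguous combination (shared by two or more people) is skipped.
        if claim_counts[k] == 1 and voter_counts.get(k, 0) == 1:
            matches.append((names[k], claim["diagnosis"]))
    return matches


reidentified = link(claims, voters)
print(reidentified)
```

In this toy data, the two claims that share a gender, birth date, and ZIP code stay ambiguous, while the two people whose combination is unique are matched to a name. No name ever appeared in the claims list.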
Then there's Netflix. The online movie rental service publicly released a dataset containing movie ratings of 500,000 Netflix subscribers. The dataset was intended to be anonymous, and all personally identifying information was allegedly removed. But researchers at the University of Texas at Austin demonstrated that an attacker who knows only a little bit about an individual subscriber can easily identify the subscriber's record, or at the very least, identify a small set of records that include the subscriber's record.
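The idea can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the researchers' actual algorithm: the ratings, the record numbers, and the functions `score` and `candidates` are all hypothetical. The attacker knows roughly how one person rated a handful of movies and looks for the released record that agrees most often.

```python
# Hypothetical "anonymous" release: subscriber names replaced by numbers.
released = {
    101: {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    102: {"Movie A": 5, "Movie B": 4, "Movie E": 3},
    103: {"Movie C": 4, "Movie D": 1, "Movie F": 5},
}

# What the attacker happens to know about one person's tastes.
known = {"Movie A": 5, "Movie C": 4, "Movie D": 1}


def score(record, known, tolerance=1):
    """Count known ratings that the record matches within the tolerance."""
    return sum(
        1
        for title, rating in known.items()
        if title in record and abs(record[title] - rating) <= tolerance
    )


def candidates(released, known):
    """Return the record numbers that agree best with what the attacker knows."""
    scores = {rid: score(rec, known) for rid, rec in released.items()}
    best = max(scores.values())
    return sorted(rid for rid, s in scores.items() if s == best)


matches = candidates(released, known)
print(matches)
```

With only three remembered ratings, a single record stands out. When several records tie, the attacker is still left with a small set that includes the target.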
So much for keeping a low profile, or no profile at all.
The fact is that nothing is ever totally anonymous until every identifier in the data is removed. But as one of my favorite hackers told me, "I feel as though everything is an identifier. People think that replacing a name with a number protects the identity of the name, but that's just not true. The only way to de-identify anything is to remove all of the data that exists.
"Obfuscation of data, randomizing data, and any other technique is never 100 percent anonymous. The bottom line: Any information about you can be used to identify you, on its own, or coupled with other personal information that many would deem 'anonymous' or unimportant."
Given enough perseverance and desire, someone will crack the code. It's only a matter of time. In his autobiography, Kevin Mitnick -- arguably America's most famous ex-hacker -- wrote:
"We're told that our medical records are confidential, shared only when we give specific permission. But the truth is that any federal agent, cop, or prosecutor who can convince a judge he has legitimate reason can walk into your pharmacy and have them print out all of your prescriptions and the date of every refill. Scary.
"We're also told that the records kept on us by government agencies -- Internal Revenue Service, Social Security Administration, the DMV of any particular state, and so on -- are safe from prying eyes. Maybe they're a little safer now than they used to be -- though I doubt it -- but in my day, getting any information I wanted was a pushover."
So call me paranoid if you want. Better yet, just call me realistic.