How useful is de-identification for protecting privacy?
The Information and Privacy Commissioner of Ontario just released an interesting paper about de-identification of sensitive data. This is "Dispelling the Myths Surrounding De-identification: Anonymization Remains a Strong Tool for Protecting Privacy" (PDF), by Ann Cavoukian and Khaled El Emam.
Here's the conclusion that this paper reaches:
The claim that the de-identification of personal data has no value and does not protect privacy due to the ease of re-identification is a myth. If proper de-identification techniques and re-identification risk measurement procedures are used, re-identification remains a relatively difficult task. However, we recognize that this is not a static exercise – it is everchanging. As re-identification techniques become more sophisticated and more personal information becomes available to facilitate re-identification, it is important to reassess and strengthen de-identification and re-identification risk management techniques.
While there may always be a residual risk of re-identification, in the vast majority of cases, de-identification will protect the privacy of individuals, as long as additional safeguards are in place. While de-identification may not be a perfect solution to reduce all privacy risks when personal information is being considered for secondary purposes, it is an important first step that should be used as part of an overall risk assessment framework. We urge you not to abandon your efforts to de-identify personal data in a comprehensive and responsible manner.
It looks to me like this is partially right and partially wrong.
It's probably true that de-identification of sensitive data has some value. But nobody is really claiming that it has absolutely none. What people are claiming, and what's probably true, is that anonymizing sensitive data in a way that makes it hard to re-identify is almost always harder than you might first think. It's easy to do but hard to do right. In that way it's much like encryption: if you try to do it yourself without expert advice, you'll probably do it badly.
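To make the "easy to do but hard to do right" point concrete, here's a minimal sketch of one common way to measure re-identification risk. It uses k-anonymity, which is my choice of illustration and not necessarily the measure the paper's authors have in mind. The records, the field names, and the choice of quasi-identifiers are all made up for the example.

```python
from collections import Counter

# Hypothetical "de-identified" records: names have been removed, but
# quasi-identifiers (fields that can be linked to outside data) remain.
records = [
    {"postal_prefix": "M5V", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"postal_prefix": "M5V", "birth_year": 1975, "sex": "F", "diagnosis": "diabetes"},
    {"postal_prefix": "M4C", "birth_year": 1982, "sex": "M", "diagnosis": "flu"},
    {"postal_prefix": "K1A", "birth_year": 1990, "sex": "F", "diagnosis": "migraine"},
]

# Assumed quasi-identifiers for this example only.
QUASI_IDENTIFIERS = ("postal_prefix", "birth_year", "sex")


def equivalence_class_sizes(rows, keys):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def k_anonymity(rows, keys):
    """Return k: the size of the smallest group of indistinguishable rows."""
    return min(equivalence_class_sizes(rows, keys).values())


def unique_fraction(rows, keys):
    """Return the fraction of rows whose quasi-identifier combination is unique."""
    sizes = equivalence_class_sizes(rows, keys)
    unique = sum(1 for row in rows if sizes[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


print("k =", k_anonymity(records, QUASI_IDENTIFIERS))
print("unique fraction =", unique_fraction(records, QUASI_IDENTIFIERS))
```

On this toy data, k is 1 and half of the rows are unique on just three innocuous-looking fields, even though every name has been stripped. Anyone holding another data set with the same three fields plus names could link those rows back to people, which is exactly the kind of mistake that's easy to make when you do this without expert advice.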
On the other hand, the paper's claim that re-identification is a relatively difficult task is probably not true. There's more than one commercial product that makes it very easy to re-identify data that has been anonymized. From time to time I've talked to people who work on these products, and some of the stories they've told me about how their customers have used the products to find patterns in anonymized data are quite amazing. I've also talked to people who have written their own de-anonymizer applications, and they've told me similar stories. From the paper's discussion of this topic, I'd guess that neither of its authors has tried to re-identify data or used one of the available products that can help you do this.
It sounds like the goal of this paper may be to justify a future government policy that will say that anonymized data doesn't need to be handled with the same care as data that hasn't been anonymized. As a policy goal, that's probably fairly reasonable. But it's going to be very hard to clearly and carefully define what types of anonymization can be used to qualify for a lower level of care.