A recent column by Christopher Soghoian on CNet predicts a decline in companies sharing “anonymized” user data with the academic research community. Along with last year’s AOL data release debacle, Soghoian points to a more recent case where researchers were able to de-anonymize a data set released by Netflix, comprising 100 million movie ratings made by 500,000 subscribers to its online DVD rental service.
As both a privacy advocate and someone who respects the research that information scientists (such as Jim Jansen or Amanda Spink) are able to perform with these datasets, I share Soghoian’s internal dilemma:
As a privacy advocate and end user, I think the shift against sharing anonymized data is probably a good thing. After all, I don’t want some random student browsing through my search history, anonymized or not. However, if I take the end-user hat off, and put on my PhD student hat, then this is a really bad thing. Researchers depend on accurate data in order to do their work. Without the data, we don’t get new exciting research, and thus no new cool technologies. For the research community, this Netflix incident will be the final nail in the coffin of information sharing from the dot-coms.
Soghoian’s final point, that we’ve witnessed the end of the sharing of large data-sets for academic research, is troubling, if true. We need to find a way to properly anonymize data so that valuable academic research is not squelched, while still protecting the privacy and integrity of people’s online intellectual activities.
To that end, I recently attended an NSF-sponsored workshop on data confidentiality which focused on this very issue:
This workshop comes at a time when governments and organizations are struggling to expand research access to statistical and multimedia databases, while at the same time protecting the confidentiality of the individuals whose data are recorded and combating breaches of cyberinfrastructure security, especially those involving unauthorized record linkage and individual identification and harm. There has been a long tradition of confidentiality associated with statistical databases, but the ever-expanding cyberinfrastructure raises new and far more challenging questions about the protection of privacy associated with electronic databases involving individuals, families and other groups, and organizations.
The goal of this workshop is to bring together leading researchers in the area of privacy and confidentiality from diverse intellectual communities to share expertise and map out a broad research agenda to inform funding agencies and organizations responsible for database access and protection. Specific attention will be focused on understanding the tension between privacy/confidentiality and data utility, and understanding the role of auxiliary information (“extra” information known to the adversary) in defeating privacy objectives.
Among those at the workshop working on creating anonymous data-sets were researchers from Web search engine companies themselves, such as Andrew Tomkins and Ravi Kumar from Yahoo! Research, who presented their paper “On Anonymizing Query Logs via Token-based Hashing.” Similar work is being done by members of the PORTIA (Privacy, Obligations and Rights in Technologies of Information Assessment) project, with which I was affiliated.
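For readers unfamiliar with the term, the general idea behind token-based hashing can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the scheme from the Yahoo! paper: the function name `hash_query`, the secret key, and the truncated digest length are all hypothetical choices made for the example.

```python
import hashlib
import hmac


def hash_query(query, secret_key, digest_chars=12):
    """Replace each whitespace-separated token of a query with a keyed hash.

    Hypothetical helper for illustration only: lowercases the query,
    splits on whitespace, and maps every token to a truncated
    HMAC-SHA256 digest computed with secret_key.
    """
    tokens = query.lower().split()
    return [
        hmac.new(secret_key, token.encode("utf-8"), hashlib.sha256)
        .hexdigest()[:digest_chars]
        for token in tokens
    ]


# Assumed toy data: a made-up key and a three-query "log".
key = b"hypothetical-secret-key"
log = ["cheap flights boston", "boston weather", "cheap hotels"]
hashed_log = [hash_query(q, key) for q in log]

for original, hashed in zip(log, hashed_log):
    print(original, "->", " ".join(hashed))
```

Note that the same word always maps to the same hash, so "boston" is recognizably the same token in the first two queries. That consistency is what keeps a hashed log useful for research, and it is also what leaves word frequencies and co-occurrence patterns exposed to anyone trying to reverse the mapping.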
Despite these efforts, many still maintain that truly anonymized data-sets are an impossibility. Unfortunately, they might be right. The work of Latanya Sweeney, for example, reveals that 87 percent of Americans can be uniquely identified from supposedly anonymized records listing only their birth date, gender and five-digit ZIP code. The researchers from Yahoo! also discussed how they could easily overcome the typical attempts to anonymize search records and server logs.
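The reasoning behind a figure like Sweeney's is easy to sketch: count how many records are the only ones with their particular combination of seemingly harmless fields. The Python below is a minimal illustration on made-up records, not Sweeney's method or data; the function name `unique_fraction` and the field names are assumptions chosen for the example.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Return the share of records whose combination of the given
    fields appears exactly once in the data set (hypothetical helper)."""
    keys = [tuple(r[field] for field in quasi_identifiers) for r in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys)


# Fabricated toy records with no names attached.
records = [
    {"birth_date": "1970-01-15", "gender": "F", "zip": "02139"},
    {"birth_date": "1970-01-15", "gender": "M", "zip": "02139"},
    {"birth_date": "1982-06-30", "gender": "F", "zip": "53211"},
    {"birth_date": "1982-06-30", "gender": "F", "zip": "53211"},
    {"birth_date": "1991-11-02", "gender": "M", "zip": "10027"},
]

print(unique_fraction(records, ["gender"]))
print(unique_fraction(records, ["zip"]))
print(unique_fraction(records, ["birth_date", "gender", "zip"]))
```

In this toy set no one is unique by gender alone, one record in five is unique by ZIP code alone, and three in five are unique once all three fields are combined. Anyone holding an outside list that pairs names with those same fields (the "auxiliary information" the workshop description mentions) could re-identify exactly those unique records.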
I am not a computer scientist, so unfortunately there is little concrete I can offer toward a solution to creating truly anonymous data sets of user activities. And certainly, as a privacy advocate, I will always be quick to point out violations of user privacy even when those releasing the data have the best of intentions (as AOL and Netflix did). But I hope we can work towards a solution that benefits both communities.
UPDATE: Bruce Schneier has a related column in Wired, touching on many of these same issues.