A group of computer scientists from the University of Minnesota recently presented a fascinating paper, “You are what you say: privacy risks of public mentions,” in the Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.
Their concern is the ability to identify pseudonymous people by comparing datasets of their “public mentions.” For example, could Amazon re-identify customers on competitors’ websites by comparing their purchase histories against public reviews written on those sites?
Here’s a blurb from the paper’s abstract:
In today’s data-rich networked world, people express many aspects of their lives online. It is common to segregate different aspects in different places: you might write opinionated rants about movies in your blog under a pseudonym while participating in a forum or web site for scholarly discussion of medical ethics under your real name. However, it may be possible to link these separate identities, because the movies, journal articles, or authors you mention are from a sparse relation space whose properties (e.g., many items related to by only a few users) allow reidentification. This re-identification violates people’s intentions to separate aspects of their life and can have negative consequences; it also may allow other privacy violations, such as obtaining a stronger identifier like name and address.
They conclude that user-level misdirection could be a way to decrease the chances of reidentification. With the Amazon example above, users (or automated bots) could leave fake reviews at websites in order to “pollute” the public data used for possible reidentification. From the paper’s conclusion:
Being re-identified using items in a sparse relation space can violate privacy: the items themselves might leave one vulnerable to judgment, or they might be used to get at an identifier or quasi-identifier to get information one wishes to keep private.
We found a substantial re-identification risk to users who mention items in a sparse relation space, and we demonstrated successful algorithms to re-identify. In other words, relationships to items in a sparse relation space can be a quasi-identifier. This is a serious problem, as there are many such spaces, from purchase relations to books on Amazon to friendship relations to people on MySpace. We explored whether dataset owners can protect user privacy by suppressing ratings and found it required impractically extensive changes to the dataset. We also investigated ways users can protect their own privacy. As with dataset owners, suppression seems impractical: it is hard for users to mitigate risk by not mentioning things.
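To make the re-identification idea concrete, here is a minimal sketch (not the paper's actual algorithm) of matching a pseudonymous set of public mentions against a private dataset. The data, names, and IDF-style rarity weighting are all illustrative assumptions; the key point from the paper is that rarely-mentioned items carry most of the identifying signal:

```python
import math
from collections import Counter

# Hypothetical "private" dataset (e.g., purchase histories keyed by real
# identity) and one pseudonymous user's public mentions. All names and
# items here are invented for illustration.
purchase_histories = {
    "alice@example.com": {"popular_novel", "rare_monograph", "obscure_zine"},
    "bob@example.com":   {"popular_novel", "bestseller_2"},
    "carol@example.com": {"popular_novel", "rare_monograph"},
}
public_mentions = {"rare_monograph", "obscure_zine", "popular_novel"}

# Weight each item by rarity (IDF-style): an item related to by only a few
# users is far more identifying than a universally popular one, which is
# what makes a sparse relation space a quasi-identifier.
item_counts = Counter(i for items in purchase_histories.values() for i in items)
n_users = len(purchase_histories)

def idf(item):
    return math.log(n_users / item_counts[item])

def score(candidate_items):
    # Sum rarity weights over the overlap with the public mentions.
    return sum(idf(i) for i in candidate_items & public_mentions)

ranked = sorted(purchase_histories,
                key=lambda u: score(purchase_histories[u]), reverse=True)
print(ranked[0])  # best-matching identity: alice@example.com
```

Here the popular novel contributes nothing (everyone bought it), while the obscure zine alone nearly pins down the match.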
User-level misdirection (mentioning items not rated) provides some anonymity at a fairly low cost. We also found that although rare items are identifying, popular items are more useful for misdirection. This leaves misdirection as a possible privacy protection strategy, possibly with an advisor. However, there are questions about whether users would accept such a strategy or such an advisor.
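A misdirection “advisor” of the kind the authors speculate about might look like the following sketch. The popularity figures and item names are invented; the design choice it illustrates is the paper's finding that popular items make better chaff, since fake mentions of rare items would themselves become identifying:

```python
from collections import Counter

# Hypothetical popularity counts over some public corpus (invented numbers).
item_popularity = Counter({
    "popular_novel": 5000,
    "bestseller_2": 4200,
    "cult_classic": 900,
    "rare_monograph": 12,
    "obscure_zine": 3,
})

def misdirection_advisor(real_mentions, k=2):
    """Suggest k fake items to mention, preferring popular ones.

    Popular items are shared by many users, so mentioning them adds noise
    to an attacker's matching without contributing rare, highly
    identifying signal of its own.
    """
    candidates = [item for item, _ in item_popularity.most_common()
                  if item not in real_mentions]
    return candidates[:k]

real = {"rare_monograph", "obscure_zine"}
print(misdirection_advisor(real))  # ['popular_novel', 'bestseller_2']
```

Whether users would tolerate planting fake reviews on their own behalf is, as the authors note, an open question.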
This idea of misdirection to throw off data mining is similar to the TrackMeNot tool, which sends ghost search queries to search engines in order to obfuscate your actual web search history.
[via Bruce Schneier]