Speaking of the research ethics related to automatically harvesting public social networking data, we are confronted this week with the story of Pete Warden, a former Apple engineer who has spent the last six months harvesting and analyzing data from some 215 million public Facebook profile pages.
According to Warden, he exploited a flaw in Facebook’s architecture to access public profiles without needing to be signed in to a Facebook account, effectively avoiding being bound by Facebook’s Terms of Service preventing such automated harvesting of data. As a result, he amassed a database of names, fan pages, and lists of friends for 215 million public Facebook accounts.
Warden has already done some impressive analysis of this data at an aggregate level, and I know researchers would love to get their hands on it. And like the “Tastes, Ties, and Time” Facebook project, Warden wants to release the dataset to the academic community.
But also like the “Tastes, Ties, and Time” project, Warden would be wrong to do so.
First, similar to our discussion of the ethics of collecting public Twitter streams, just because these Facebook users made their profiles publicly available does not mean they are fair game for scraping for research purposes. Yes, I have limited profile information viewable to the public, and I’ve authorized Facebook to make that information available for search engines to crawl. But the purpose of this public availability is to help people — humans, not bots — find me. The presumption is that my public profile data will only be found and viewed if someone actually searches for “Michael Zimmer” on Facebook or a search engine. In reality, my profile is only “public” if a human being takes specific and conscious action to find me.
Warden’s actions, however, violate this implicit understanding behind making profiles publicly searchable. Rather than trying to find me, Warden systematically sought everyone, letting a script do the work of seeking and harvesting my data. There is no genuine desire to find me, to friend me, and so on. He’s just collecting data. His reasons might be honest and beneficial, but that’s not what’s at issue here. The point is whether the 215 million Facebook users who now have some of their information in Warden’s database contemplated such harvesting and aggregating when they built their profiles and configured their privacy settings. They almost certainly didn’t, which brings into doubt whether this data has been collected with proper consent.
Second, Warden’s release of this dataset, even with the best of intentions, poses a serious privacy threat to the subjects in the dataset, their friends, and perhaps unknown others. Warden claims to be sensitive to the privacy of the subjects in the database, and in response he has removed the identifying URLs that are unique to each profile, but the dataset retains the subjects’ names (really!), locations, Fan page lists, and partial Friends lists (I’m not sure what is meant by a “partial” list of friends).
So, obviously, individuals can be easily identified within the dataset. But that’s not the greatest threat posed by the release of this data. What is most dangerous is its potential use to help re-identify other datasets, ones that might contain much more sensitive or potentially damaging data. Recall the research that showed how trivial it was to re-identify the presumed “anonymized” Netflix database, or the ease of identifying individuals within social networks. The ease of re-identifying these datasets came from having ready access to other large sets of data where the subjects were already known. By overlaying social graphs and applying other intricate data-comparison methods, researchers quickly re-identified the “anonymous” datasets. (See Paul Ohm’s “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization” for excellent coverage of these cases and discussion of consequences for law & policy.)
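To make the overlay idea concrete, here is a minimal sketch of a linkage between a named dataset and a supposedly anonymous one. Everything in it is an assumption made for illustration: the records are invented, the field names (`city`, `fan_pages`, `interests`, `record_id`) and the `reidentify` function are hypothetical, and the exact-match rule is far cruder than the statistical methods used in the published re-identification research. It only shows why a large set of named records is useful to someone trying to put names on another dataset.

```python
# Toy linkage sketch. Every record below is invented for illustration.

# A "named" dataset, similar in shape to a public profile dump.
named_profiles = [
    {"name": "Alice Example", "city": "Milwaukee",
     "fan_pages": {"Jazz", "Cycling", "Chess"}},
    {"name": "Bob Example", "city": "Milwaukee",
     "fan_pages": {"Football", "Grilling"}},
    {"name": "Carol Example", "city": "Madison",
     "fan_pages": {"Jazz", "Cycling", "Chess"}},
]

# An "anonymized" dataset: names removed, other attributes retained.
anonymized_records = [
    {"record_id": "r1", "city": "Milwaukee",
     "interests": {"Chess", "Jazz", "Cycling"}},
    {"record_id": "r2", "city": "Chicago",
     "interests": {"Football"}},
]


def reidentify(anonymized, named):
    """Return {record_id: name} for records that match exactly one named profile.

    A record "matches" a profile when the city and the full set of
    interests are identical. This is a deliberately simple rule.
    """
    matches = {}
    for record in anonymized:
        candidates = [
            profile["name"]
            for profile in named
            if profile["city"] == record["city"]
            and profile["fan_pages"] == record["interests"]
        ]
        # Only a unique candidate counts as a re-identification.
        if len(candidates) == 1:
            matches[record["record_id"]] = candidates[0]
    return matches


print(reidentify(anonymized_records, named_profiles))
```

In this toy case the record `r1` lines up with exactly one named profile, while `r2` matches nobody. The more names, locations, and interests the named dataset holds, the more often a unique match like that becomes possible.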
Warden’s rich dataset of 215 million Facebook users, complete with their names, locations, and social graphs, is just the ammunition needed to fuel a new wave of re-identification of presumed anonymous datasets. It is impossible to predict who might use Warden’s dataset and to what ends, but this threat is real.
It turns out that Facebook has asked Warden to delay releasing this data to the academic community (I’m curious as to what kind of pressure, if any, they exerted to keep him from releasing it this week as originally planned). We will need to keep a close eye on whether the data is actually released, in what form, and whether any steps will be taken to control and track its usage.