On the “Anonymity” of the Facebook Dataset (Updated)

(Updated below with responses to comments by Jason Kaufman, one of the lead researchers on this project)

(Another update: I’m pretty sure the “anonymous, Northeastern university” from where this dataset was derived is Harvard College. Details here)

A group of researchers have released a dataset of Facebook profile information from a group of college students for research purposes, which I know a lot of people will find quite valuable. (Thanks to Fred Stutzman for bringing it to my attention.)

Here is the description from the Berkman Center’s announcement:

The dataset comprises machine-readable files of virtually all the information posted on approximately 1,700 FB profiles by an entire cohort of students at an anonymous, northeastern American university. Profiles were sampled at one-year intervals, beginning in 2006. This first wave covers first-year profiles, and three additional waves of data will be added over time, one for each year of the cohort’s college career.

Though friendships outside the cohort are not part of the data, this snapshot of an entire class over its four years in college, including supplementary information about where students lived on campus, makes it possible to pose diverse questions about the relationships between social networks, online and offline.

Access to the dataset requires the submission of a research statement (which I haven’t yet done), but the codebook is publicly available.

Of course, this sounds like an AOL-search-data-release-style privacy disaster waiting to happen. Recognizing this, the researchers detail some of the steps they’ve taken to try to protect the privacy of the subjects, including:

  • All identifying information was deleted or encoded immediately after the data were downloaded.
  • The roster of student names and identification numbers is maintained on a secure local server accessible only by the authors of this study. This roster will be destroyed immediately after the last wave of data is processed.
  • The complete set of cultural taste labels provides a kind of “cultural fingerprint” for many students, and so these labels will be released only after a substantial delay in order to ensure that students’ identities remain anonymous.
  • In order to access any part of the dataset, prospective users must read and electronically sign the user agreement reproduced below.

Let’s consider each one of these in order:

First, as the AOL debacle taught us, one might think “all identifying information” has been deleted, but random bits of our data trail that alone seem anonymous can often be pieced together, exposing clues to our identity. The fact that the dataset includes each subject’s gender, race, ethnicity, hometown state, and major makes it increasingly possible that individuals could be identified. For example, if the data reveals that student #746 is a white Bulgarian male from Montana majoring in East Asian Studies, there probably aren’t many students who fit that description. Identification might be unlikely, but this protection is hardly bullet-proof.
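This kind of combination attack is easy to check for mechanically. The sketch below (using invented rows, not the actual dataset, whose fields are described only in its codebook) counts how many records share each combination of quasi-identifiers; any combination held by a single record is a potential re-identification point:

```python
from collections import Counter

# Hypothetical coded profile rows, invented for illustration only.
profiles = [
    {"gender": "M", "ethnicity": "Bulgarian", "state": "MT", "major": "East Asian Studies"},
    {"gender": "F", "ethnicity": "White",     "state": "MA", "major": "Economics"},
    {"gender": "F", "ethnicity": "White",     "state": "MA", "major": "Economics"},
    {"gender": "M", "ethnicity": "White",     "state": "NY", "major": "Biology"},
]

def k_anonymity_violations(rows, quasi_identifiers, k=2):
    """Return the attribute combinations shared by fewer than k rows."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return {combo: n for combo, n in combos.items() if n < k}

rare = k_anonymity_violations(profiles, ["gender", "ethnicity", "state", "major"])
# Each combination returned here describes at most one student, so any outside
# knowledge matching it (a hometown news story, a department roster) would
# point to exactly one record in the release.
```

In this toy sample, two of the three combinations are unique, which is precisely the situation the “student #746” example describes.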

Second, the researchers take good measures by keeping the master roster on a secure server and promising to destroy it once all the datasets have been released, in 2011. One hopes this remains secure until then.

Third, the researchers are right to recognize how a person’s unique set of cultural tastes could easily identify her. But merely instituting a “substantial delay” before releasing this personal data does little to mitigate the privacy fears; it only delays them. Researchers routinely rely on datasets for years (some search engine studies are still using datasets from 1997!). Do they think that once a person graduates, she might no longer be harmed by her potential identification in such a dataset? Delaying the release of this data to 2011 is not only arbitrary, but much too short. A better tactic would be to gain the consent of the subjects before releasing the data, or simply not to release it at all, providing the fullest privacy protection for the subjects.

Fourth, requiring a user agreement and terms of use is a nice step. Clearly, the researchers understand the potential harms the dataset represents, since the agreement is full of requirements not to use the data to try to identify individuals (in fact, there are so many points on this that one fears the possibility might be all too real). Unfortunately, however, we’re all too familiar with clickwrap agreements: most users won’t bother to read the terms, and it is uncertain how they would be enforced.

All told, good steps are being taken to address the privacy of the subjects in the dataset, but more could be done. We’ll wait to see if anyone does become identified within the data.

But one more thing

Since I first saw the press release for this dataset, I’ve been bothered by the description of the data as “approximately 1,700 FB profiles by an entire cohort of students at an anonymous, northeastern American university.”

Right off the bat, the source university loses full anonymity since it is identified as being in the northeastern US. Further, according to the codebook, this is a private, co-ed institution, whose class of 2009 initially had 1640 students in it.

A quick search for schools reveals there are only 7 private, co-ed colleges in New England states (CT, ME, MA, NH, RI, VT) with total undergraduate populations between 5000 and 7500 students (a likely range if there were 1640 in the 2006 freshman class): Tufts University, Suffolk University, Yale University, University of Hartford, Quinnipiac University, Brown University, and Harvard College. (The total bumps up to about 18 if we include NY and NJ.)
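The narrowing step above is just a chain of filters over public institutional data. A minimal sketch, using invented school records rather than the actual IPEDS figures behind the original search:

```python
# Hypothetical records; names and enrollment figures are invented for
# illustration, not the real numbers used in the search described above.
schools = [
    {"name": "School A", "region": "New England",  "private": True,  "coed": True,  "undergrads": 6700},
    {"name": "School B", "region": "New England",  "private": True,  "coed": True,  "undergrads": 4100},
    {"name": "School C", "region": "Mid-Atlantic", "private": True,  "coed": True,  "undergrads": 5200},
    {"name": "School D", "region": "New England",  "private": False, "coed": True,  "undergrads": 6000},
]

# Apply each disclosed attribute of the "anonymous" school as a filter.
candidates = [
    s["name"] for s in schools
    if s["region"] == "New England"
    and s["private"] and s["coed"]
    and 5000 <= s["undergrads"] <= 7500
]
```

Each disclosed attribute (region, private, co-ed, cohort size) shrinks the candidate pool, which is why even partial metadata about the source institution undermines its anonymity.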

Is one of these the source?

This might prove easy to discover, given the uniqueness of some of the subjects. Based on the codebook, the dataset includes only one self-identified Albanian, one Iranian, one Malaysian, one Nepali, and other solitary ethnicities. If we can isolate one of these people in the dataset, combined with the subject’s gender, home state, and major, it probably wouldn’t be that hard to discover who it is. (Same with the fact there is only one Folklore major, or one Slavic Studies major, etc.) Keying off these unique data elements will provide a possible path to identifying the school, and potentially many more individuals in the dataset.
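Finding those solitary ethnicities or majors in a released dataset requires nothing more than a frequency count. A sketch over an invented column (the real codebook reports one Albanian, one Iranian, one Malaysian, one Nepali, and so on):

```python
from collections import Counter

# Illustrative coded ethnicity column; values are invented for this example.
ethnicities = ["White", "White", "Albanian", "Chinese", "Iranian", "White", "Chinese"]

counts = Counter(ethnicities)
# Values held by exactly one record are unique footholds for re-identification.
singletons = [value for value, n in counts.items() if n == 1]
# Anyone who knows, say, the one Albanian student in the cohort can now link
# that person to a full record: gender, home state, major, tastes, and more.
```

These singletons are the entry points: identify one such student by outside knowledge, and the rest of that record (and potentially the school itself) follows.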

This would have been much harder had I not known it was a private school in the northeastern US, or had they taken a random sample of the dataset and not told me the actual number of students in the cohort.

Again, time will tell if this gets cracked.

UPDATE: Jason Kaufman, the principal investigator for this research project, was kind enough to read through my concerns and post a thoughtful response in the comments. So did Alex Halavais. Please take a look.

I do feel the need to react to some of the arguments made by Kaufman and Halavais. They both seem to suggest that while the data might lead to the identification of some of the subjects, that these Facebook users don’t have an expectation (or a right) to privacy since they made this information public in the first place.

Kaufman remarks:

What might hackers want to do with this information, assuming they could crack the data and ’see’ these people’s Facebook info? Couldn’t they do this just as easily via Facebook itself? Our dataset contains almost no information that isn’t on Facebook. (Privacy filters obviously aren’t much of an obstacle to those who want to get around them.)

And Halavais notes:

The data is already there, this is merely (!) the collection of that data. Or to put it another way, AOL users presumed that no one was watching, but this is very different from Facebook users who are intending to share with someone (if not the researchers).

We see these kinds of arguments all the time: you have no expectation of privacy with public records, or if you’re on the public roads, you can’t expect privacy, or Facebook’s news feed simply made sharing the information you made public more efficient. All such notions are wrong: they ignore the contextual nature of privacy. Just making something known in one context – even a non-secret context – doesn’t mean “anything goes” in terms of the collection, storage, transmission, or use of that information.

So, let’s look at this Facebook dataset and the claims made above. I take issue with (at least) three points articulated by Kaufman and Halavais.

One, Kaufman’s mention of “hackers” and his focus on what they might “do” with this information reveals a preoccupation with “harms” when it comes to privacy concerns. One doesn’t need to be a victim of hacking, or have a tangible harm take place, in order for there to be concerns over the privacy of one’s personal information. Privacy is about dignity as much as about informational harm by some evil agent. As Halavais points out later in his comment, none of the subjects in this dataset consented to having their personal information used in a research study. Don’t they have a right to some control over their information?

This leads to the second point: just because users post information on Facebook doesn’t mean they intend for it to be scraped, aggregated, coded, dissected, and distributed. Creating a Facebook account and posting information on the social networking site is a decision made with the intent to engage in a social community: to connect with people, share ideas and thoughts, communicate, be human. Just because some of the profile information is publicly available (either consciously made so by the user, or due to a failure to adjust the default privacy settings) doesn’t mean there are no expectations of privacy for the data. This is contextual integrity 101.

Third, both Kaufman and Halavais seem to suggest that the information was already easily available to anyone who cared to look for it. Examining some of the details in the codebook reveals this isn’t necessarily true, and the researchers certainly had much more efficient ways of gathering the information than an average Facebook user. First, the researchers received an official roster of each freshman at the college, along with their university e-mail addresses, which allowed them to easily and systematically search for each student in the cohort. Second, and more importantly, they appear to have used research assistants from that school to access and download the profile information. That means a Facebook user might have set her privacy settings so her profile was viewable only to other users within that school, yet the RA, as a member of that network, was still able to view and download the data. Now that same data, originally meant only for those within the college, has been made available to the entire world, perhaps against the express wishes of the data subject.

Let me repeat that last point: Some Facebook users might have restricted their accounts to only people from their own school. But since the researchers used RAs from that school to access the account information, that restricted data has been published outside of those boundaries.

The researchers even seem to acknowledge this when they state:

In other words, a given student’s information should not be considered objectively “public” or “private” (or even “not on Facebook”)—it should be considered “public” or “private” (or “not on Facebook”) from the perspective of the particular RA that downloaded the given student’s data.

So, if the student’s information should not be considered “objectively public”, then why is it being treated as such in the dataset?

In total, claims that the data was public in the first place simply do not hold up to scrutiny.

Now, don’t get me wrong. I completely see the research value in having this data. But we must be more careful in how we release such personal information to the world, and we must be certain to understand how privacy is contextual, not just based on whether an RA can download a profile.


  1. I am the Principal Investigator on the Facebook project mentioned above. These comments are extremely useful. We’re sociologists, not technologists, so a lot of this is new to us. We thought long and hard about what to do with the unique ‘Favorite’ listings – they do indeed have the potential to compromise subjects (in 2011, when we release them), though they will be enormously useful to researchers interested in taste, culture, etc. Our other option would be to replace taste names with numbers, but then researchers will only know how many tastes people have in common, not what those tastes are. If you and your community have suggestions on better ways to handle this, we would appreciate hearing them.
    In the meantime, I am urging my collaborators to consider removing the information about the region and type of university we sampled. This is good advice. Sociologists generally want to know as much as possible about research subjects. What might hackers want to do with this information, assuming they could crack the data and ‘see’ these people’s Facebook info? Couldn’t they do this just as easily via Facebook itself?
    Our dataset contains almost no information that isn’t on Facebook. (Privacy filters obviously aren’t much of an obstacle to those who want to get around them.)
    Nonetheless, seeing your thought process — how you would attack this dataset — is extremely useful to us.
    Many thanks,
    Jason Kaufman
    Berkman Center

  2. Not Quinnipiac. We’d never do anything that didn’t have our name on it ;).

    I recognize the danger, but I’m afraid I’m with Jason on this. The data is already there, this is merely (!) the collection of that data. Or to put it another way, AOL users presumed that no one was watching, but this is very different from Facebook users who are intending to share with someone (if not the researchers).

    Is there a privacy concern? Of course! But I think the measures in place are strong enough to introduce a kind of “friction” (again something that didn’t exist in the openly downloaded and reposted data set from AOL) that provides a barrier to broad revelations, and this friction mitigates the problem. I presume mitigation is what you are after.

    If Sarah Palin is in the data set, someone will find her and make it open, but at some level, it would be easier for someone to do this with the original data (i.e., Facebook) than go through the hassle of self-identifying to this group.

    All that said, I’m a little surprised this made it through IRB. Consent (via, for example, an opt-in Facebook app), would have alleviated a lot of these problems. Of course, network data sucks when you have missing nodes, and not everyone would opt-in. But then, isn’t that the point: if they wouldn’t opt in, maybe we shouldn’t be including them…

    Jason: Give up on anonymizing the college. I’m with Michael here: cat may not be entirely out of the bag, but he is far enough out that he won’t be rebagged. And while taste data may be used to identify individuals, it can almost certainly be used to infer differences in the aggregate (e.g., sports team favorites, favorite bars, music, etc., are all fairly localized).

  3. Thank you both for your comments, and I’ve provided a detailed response in the post itself.

    A quick question for Jason: did you consult with privacy experts (either at Berkman or elsewhere) when deciding how to parse and release the data? Just curious.

  4. I am aware that Facebook owns the data I post on my FB profile. But I still post stuff, because the whole point of FB is to be in touch with my friends – so FB is a private site (and again, this whole private/ public debate kicks in). I take all available privacy precautions (highest privacy settings for my profile).

    If anyone would do any type of research using my data, I’d feel really offended (notwithstanding the fact that I still know FB owns the data; now, do I believe it is OK? No, in fact, I know something, but I believe something else – FB’s statement of ownership is not legitimate – and I guess many of those students feel the same).

    I think any research involving human beings of some sort (and this includes the poorly made state-backed mass surveys) SHOULD ask for individual consent (I realize though it is hard to achieve, but the researchers could have mass emailed all participants and given them a chance to withdraw from the research project). Data obtained in this project is important, but unethical in my view. I’m disappointed that such a big institution showed such disdain for regular users.

  5. Good argument. My only contribution is that Facebook doesn’t categorize members by race…unless something has changed recently. It’s an often overlooked characteristic of the system.

  6. Michael – We did not consult w/ privacy experts on how to do this, but we did think long and hard about what and how this should be done. Our IRB helped quite a bit as well. It is their job to ensure that subjects’ rights are respected, and we think we have accomplished this.
    On the issue of the ethics of this kind of research — Would you require that someone sitting in a public square, observing individuals and taking notes on their behavior, would have to ask those individuals’ consent in advance? We have not accessed any information not otherwise available on Facebook. We have not interviewed anyone, nor asked them for any information, nor made information about them public (unless, as you all point out, someone goes to the extreme effort of cracking our dataset, which we hope it will be hard to do).
    The race data, btw, is extrapolated from pictures posted by Facebook users, as well as group listings. It is not a perfect measure (neither are self-reported measures, however), but we had multiple coders assess each user profile and they agreed in almost every case.

  7. Jason – thanks for continuing the conversation.

    I would be interested to hear what your IRB would say about your example, taking notes about people walking in the park. If you were compiling detailed information about them (such as gender, ethnicity, hometown state, political views, sexual interests, college major, relational data, and interests), and then publicizing that specific data to the entire world, I wouldn’t be surprised if consent would be required before widespread publication of the raw data.

    My point is, your research is not simply taking notes of observable information about random people in the park on a random day and time. This dataset represents detailed and non-obvious personal information intentionally posted to a social networking site for a specific purpose, something the subjects likely did with the particular context and informational norms of that space in mind. While much of the information might in fact be publicly available, we should consider whether the subjects actually intended for it to be collected, archived, and distributed in such a way that other people could sort it, aggregate it, mine it, and perhaps re-identify individuals within it.

    And, as I noted above, if the research team did in fact use RAs from the same school as the subjects to pull the profile data, it seems quite likely that some profiles that were meant to be seen by only people within that network have been included in the public release. If true, that should be seen as an obvious violation of their expressed privacy interests.

    Do you have any sense as to whether that has occurred? Have you tried pulling the same profiles from a FB account that is not a member of that network? I would be curious as to the results.
