More On the “Anonymity” of the Facebook Dataset – It’s Harvard College (Updated)

(See my update at the bottom of the post, as well as Fred Stutzman’s thoughtful analysis)

As mentioned the other day, a group of researchers from the Berkman Center for Internet & Society at Harvard University released a dataset of Facebook profile information from an entire cohort (the class of 2009) of college students from “an anonymous, northeastern American university.”

(I’ve been engaging with Jason Kaufman, the PI for this research, on a variety of privacy and research ethics issues in this post and the comments section – please check it out.)

Well, I’m pretty sure this “anonymous, northeastern American university” is Harvard College. And I didn’t even have to download the dataset to figure it out. Here’s how.

As I noted here, the press release and the public codebook for the dataset provided many clues to where the data came from: we know it is a northeastern US university, that it is private and co-ed, and that its class of 2009 initially had 1640 students. A quick search reveals only 7 private, co-ed colleges in the New England states (CT, ME, MA, NH, RI, VT) with total undergraduate populations between 5000 and 7500 students (a likely range if there were 1640 in the 2006 freshman class): Tufts University, Suffolk University, Yale University, University of Hartford, Quinnipiac University, Brown University, and Harvard College.
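The narrowing step above amounts to a filter over publicly known attributes. Here is a minimal sketch in Python of that kind of filter. The records are made-up placeholders (not real enrollment figures), and the field names and the `candidates` helper are hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class School:
    name: str
    region: str
    private: bool
    coed: bool
    undergrads: int


# Placeholder records for illustration only; none of these figures are real.
SCHOOLS = [
    School("School A", "New England", True, True, 6600),
    School("School B", "New England", False, True, 6900),
    School("School C", "New England", True, True, 2100),
    School("School D", "Midwest", True, True, 6400),
    School("School E", "New England", True, False, 5200),
    School("School F", "New England", True, True, 5300),
]


def candidates(schools, region, low, high):
    """Keep only the schools consistent with every published clue."""
    return [
        s.name
        for s in schools
        if s.region == region
        and s.private
        and s.coed
        and low <= s.undergrads <= high
    ]


shortlist = candidates(SCHOOLS, "New England", 5000, 7500)
print(shortlist)
```

Each clue looks harmless on its own; it is the intersection of all of them that shrinks the list to a handful of names.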

Next, the codebook mentioned utilizing research assistants (RAs) to access and download the subjects’ Facebook profile information. Specifically, the researchers noted “that both undergraduate and graduate student RAs were employed for downloading data, and that each type of RA may have had a different level of default access based on individual students’ privacy settings.” (See my note regarding how this complicates the argument that “the subjects’ Facebook profiles were already public, so there are no privacy concerns” towards the bottom of this post.) This leads one to believe that the RAs were from the same university as the subjects themselves, since the only relevant privacy setting for a Facebook profile would be whether users in one’s own “network” (a university, for example) can see the profile, or only one’s friends. It further stands to reason (for simplicity and efficiency) that the RAs employed by the researchers for this task would be from the researchers’ own university, Harvard.

Then, the fact that the researchers’ institutional “Committee on the Use of Human Subjects” allowed this research without requiring explicit consent from the subjects leads me to believe that the subjects must be affiliated with that institution. I find it unlikely an IRB would have allowed this data collection and release to be performed in such a way on students from an institution over which it didn’t feel it had some kind of domain.

Finally, and perhaps most convincingly, only Harvard College offers the specific variety of the subjects’ majors that are listed in the codebook. While nearly all universities offer the common majors of “History”, “Chemistry” or “Economics”, one need only search for the more uniquely phrased majors to discover a shared home institution. As far as I could tell, only Harvard College (of the 7 northeast private universities we’ve already narrowed it down to) offers majors with unique titles such as these:

For these reasons, I’m confident this Facebook dataset represents the class of 2009 from Harvard College. I could be wrong, but I don’t think so.

Now, a couple of comments:

One, I didn’t even look at (and haven’t even downloaded) the actual dataset in order to make this discovery. I simply read the press release and public codebook, performed a few Google searches, and did some thinking of my own. I’m not a hacker, I’m not a statistician. This wasn’t hard, and there’s something wrong if the source of an “anonymous” dataset can be de-anonymized (at the institutional level) so easily.

Two, I hesitated over whether to even post this revelation, as it certainly increases the chances that individual subjects’ identities could be exposed, further eroding their privacy. There is no easy calculus to determine what to do in such a case, but I decided to post the results of my investigation as a clarion call to others who might want to release similar datasets. We must take care to ensure, to the best of our abilities, the privacy of our subjects.

Three, what could the researchers have done to prevent my discovery? Besides not giving away so many demographic clues as to the university, they should have coded the specific names of the majors into more generic identifiers. Even without the other “northeast, private university” clues, having the specific majors alone would have been sufficient to (eventually) figure out the identity of the school.
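To make the idea of coding majors into generic identifiers concrete, here is a minimal sketch in Python. It is not the researchers’ method, only one way such coding could work. The `FIELD_OF` mapping, the broad field names, the helper names, and the threshold `k` are all assumptions for illustration:

```python
from collections import Counter

# Hypothetical mapping from common major names to broad fields.
# A real release would need a complete, carefully reviewed mapping.
FIELD_OF = {
    "History": "Humanities",
    "Chemistry": "Natural Sciences",
    "Economics": "Social Sciences",
}


def generalize_major(major, field_of=FIELD_OF):
    """Return a broad field; any unrecognized title becomes 'Other'."""
    return field_of.get(major.strip(), "Other")


def release_safe(majors, k=2):
    """Generalize each major, then suppress categories held by fewer than k subjects."""
    coded = [generalize_major(m) for m in majors]
    counts = Counter(coded)
    return [c if counts[c] >= k else "Suppressed" for c in coded]


raw = [
    "History",
    "Economics",
    "Economics",
    "Some Distinctive Program",  # placeholder for a uniquely phrased title
    "Chemistry",
    "History",
]
print(release_safe(raw))
```

The distinctive title never appears in the released column, and categories held by only one subject are suppressed as well. Generalizing one column does not by itself make a dataset anonymous, since other fields can still be combined, but it would remove the clue that made the identification here so easy.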

Methinks we really need to keep working on a new set of Internet research ethics and methodologies.


UPDATE: Earlier this year, Jason Kaufman (the PI on this Facebook research project) presented some of the findings of the work at the Berkman Center. Here is the video. It provides a great insight into the motivation and results of the research (which I do find fascinating, despite my concerns about the methodology and release of the data).

During his talk, two more tidbits of information are provided that help me feel confident that the dataset is indeed of the class of 2009 from Harvard College.

First, Kaufman makes a throwaway remark (at about minute 35:10) that the dataset is from a “very idiosyncratic campus…a very elite college”. While that alone doesn’t give it away, it does help rule out some of the 7 colleges on my list above that simply aren’t considered “very elite” (sorry, Quinnipiac).

Second, and more telling, Kaufman notes (at about minute 32:52) that “midway through the freshman year, students have to pick between 1 and 7 best friends” that they will essentially live with for the rest of their undergraduate career. This accurately (if not precisely) describes how undergraduate housing works at Harvard. As described here, all freshmen who complete the fall term enter into a lottery, where they can designate a “blocking group” of between 2 and 8 students with whom they would like to be housed in close proximity.

So, again, the lesson learned here is how disparate pieces of seemingly benign information can be pieced together to make an otherwise presumed anonymous piece of data identifiable. I did it here by quickly analyzing the codebook, reading a press release, and watching a video presentation. The New York Times did it with the AOL search data release, and I’m sure someone will do it with this Facebook dataset.

5 comments

  1. Did they really want to keep the institution anonymous? I believe a majority of students can be easily tracked at the individual level (and this is a concern), but your argument focuses on what appears to me to be an unimportant breach.

  2. Yes, they want (and likely are required by Harvard’s IRB) to keep the source institution anonymous. Nearly every publication about the research, both now and in the past, describes it with some variation of “an anonymous northeastern college”. From a research ethics standpoint, keeping this piece of information confidential is vital to protecting the privacy of the subjects. As such, this is a meaningful breach.

  3. Very interesting. And it comes on the heels of the massive security breach:

    http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9116138

    It makes me re-think my facebook settings, and I have the information handed out at One Web Day (which I hope will be an annual event).

    In terms of surveillance, “intelligence gathering”, “market research”, etc. I have to mention the movies.
    It seems the more bizarre and far-fetched a movie idea is, the more likely it is either true now, or will become true in the future.

    Consider the following movies:
    Blue Thunder (1983) – a super stealthy helicopter fights evil. All of the functions were available and either in use or being tested at the time of the movie – 25 years ago; Conspiracy Theory (1997), where the suspect is tracked through his credit card use. (I mention this movie because it was the first I had heard of that – now, it is a weekly thing in many different TV shows – as is the tracking of cell phones); Enemy of the State (1998), where the suspect is tracked through many different technologies; and the entire Jason Bourne series, where he is tracked in many different ways. The list is much longer, but the point is that we, as a nation, are to many degrees living the Big Brother of Orwell’s 1984.

    I am not a conspiracy theorist, but having been in the military, my prints are on file with the US government and I probably have an FBI file, especially because we took the tour in the 1960’s!

    I have tried to tell my kids that Facebook information is not as secure as they think, and this would seem to bear that out. Thanks for a great blog!

  4. Adam: You’re right; they are going to great (but insufficient) lengths to anonymize data that they later (incorrectly) claim isn’t private in the first place. Faulty logic.

    However, to be clear, Facebook isn’t making any such claims. It is a group of (well-intentioned) researchers from Harvard and UCLA.
