(See my update at the bottom of the post, as well as Fred Stutzman’s thoughtful analysis)
As mentioned the other day, a group of researchers from the Berkman Center for Internet & Society at Harvard University released a dataset of Facebook profile information from an entire cohort (the class of 2009) of college students from “an anonymous, northeastern American university.”
(I’ve been engaging with Jason Kaufman, the PI for this research, on a variety of privacy and research ethics issues in this post and the comments section – please check it out.)
Well, I’m pretty sure this “anonymous, northeastern American university” is Harvard College. And I didn’t even have to download the dataset to figure it out. Here’s how.
As I noted here, the press release and the public codebook for the dataset provided many clues to where the data came from: we know the school is a private, co-ed university in the northeastern US whose class of 2009 initially had 1640 students. A quick search for schools reveals there are only 7 private, co-ed colleges in New England states (CT, ME, MA, NH, RI, VT) with total undergraduate populations between 5000 and 7500 students (a likely range if there were 1640 in the 2006 freshman class): Tufts University, Suffolk University, Yale University, University of Hartford, Quinnipiac University, Brown University, and Harvard College.
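The narrowing step above is really just a filter over a handful of attributes. Here is a minimal sketch of that idea in Python. The `School` records are hypothetical placeholders (not real enrollment figures), and the 4 x 1640 estimate assumes a four-year program with roughly constant class sizes, which is my own simplification.

```python
from dataclasses import dataclass

NEW_ENGLAND = {"CT", "ME", "MA", "NH", "RI", "VT"}


@dataclass(frozen=True)
class School:
    name: str
    state: str
    private: bool
    coed: bool
    undergrads: int


def matches_clues(school, low=5000, high=7500):
    """Return True if a school fits every clue from the press release and codebook."""
    return (
        school.state in NEW_ENGLAND
        and school.private
        and school.coed
        and low <= school.undergrads <= high
    )


# Rough size estimate: four classes of about 1640 students each (assumption).
estimated_undergrads = 4 * 1640

# Hypothetical records for illustration only; these are NOT real schools or figures.
schools = [
    School("School A", "MA", True, True, 6600),
    School("School B", "NY", True, True, 6000),   # outside New England
    School("School C", "CT", False, True, 7000),  # public
    School("School D", "VT", True, True, 2400),   # too small
]

candidates = [s.name for s in schools if matches_clues(s)]
print(estimated_undergrads, candidates)
```

Each clue on its own is harmless, but every additional condition in `matches_clues` shrinks the candidate list further.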
Next, the codebook mentioned utilizing research assistants (RAs) to access and download the subjects’ Facebook profile information. Specifically, the researchers noted “that both undergraduate and graduate student RAs were employed for downloading data, and that each type of RA may have had a different level of default access based on individual students’ privacy settings.” (See my note regarding how this complicates the argument that “the subjects’ Facebook profiles were already public, so there are no privacy concerns” towards the bottom of this post.) This leads one to believe that the RAs were from the same university as the subjects themselves, since the only relevant privacy setting for a Facebook profile would be whether users in one’s own “network” (a university, for example) can see the profile, or only one’s friends. It further stands to reason (for simplicity and efficiency) that the RAs employed by the researchers for this task would be from the researchers’ own university, Harvard.
Then, the fact that the researchers’ institutional “Committee on the Use of Human Subjects” allowed this research without requiring explicit consent by the subjects leads me to believe that the subjects must be affiliated with that institution. I find it unlikely an IRB would have allowed this data collection and release to be performed in such a way on students from an institution that they didn’t feel they had some kind of domain over.
Finally, and perhaps most convincingly, only Harvard College offers the specific variety of the subjects’ majors that are listed in the codebook. While nearly all universities offer the common majors of “History”, “Chemistry” or “Economics”, one only needs to search for the more uniquely phrased majors to discover a shared home institution. As far as I could tell, only Harvard College (of the 7 northeast private universities we’ve already narrowed it down to) offers majors with unique titles such as these:
- Near Eastern Languages and Civilizations
- Studies of Women, Gender and Sexuality
- Environmental Science and Public Policy
- Organismic and Evolutionary Biology
- Sanskrit and Indian Studies
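Checking the majors amounts to a subset test: which candidate schools offer every one of the distinctively named majors? A rough sketch in Python follows. The major titles are the ones listed above, but the `catalogs` dictionary is entirely hypothetical; I did my own check by hand with web searches, not with code.

```python
# Distinctively phrased majors taken from the public codebook (listed above).
codebook_majors = {
    "Near Eastern Languages and Civilizations",
    "Studies of Women, Gender and Sexuality",
    "Environmental Science and Public Policy",
    "Organismic and Evolutionary Biology",
    "Sanskrit and Indian Studies",
}

# Hypothetical course catalogs for illustration only; not real school data.
catalogs = {
    "School A": codebook_majors | {"History", "Chemistry", "Economics"},
    "School B": {"History", "Chemistry", "Economics", "Near Eastern Studies"},
    "School C": {"History", "Economics", "Women's Studies", "Biology"},
}


def schools_offering_all(majors, catalogs):
    """Return the schools whose catalog contains every major in `majors`."""
    return sorted(
        name for name, offered in catalogs.items() if majors <= offered
    )


print(schools_offering_all(codebook_majors, catalogs))
```

Common majors like “History” match every catalog and tell us nothing; the unusual titles are what single out one school.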
For these reasons, I’m confident this Facebook dataset represents the class of 2009 from Harvard College. I could be wrong, but I don’t think so.
Now, a couple of comments:
One, I didn’t even look at (and haven’t even downloaded) the actual dataset in order to make this discovery. I simply read the press release and public codebook, performed a few Google searches, and did some thinking of my own. I’m not a hacker, I’m not a statistician. This wasn’t hard, and there’s something wrong if the source of an “anonymous” dataset can be de-anonymized (at the institutional level) so easily.
Two, I hesitated whether to even post this revelation, as it certainly contributes to the chances that individual subjects’ identities could be exposed, further eroding their privacy. There is no easy calculus to determine what to do in such a case, but I decided to post the results of my investigation as a clarion call to others who might want to release similar datasets. We must take care to ensure, to the best of our abilities, the privacy of our subjects.
Three, what could the researchers have done to prevent my discovery? Besides not giving away so many demographic clues as to the university, they should have coded the specific names of the majors into more generic identifiers. Even without the other “northeast, private university” clues, having the specific majors alone would have been sufficient to (eventually) figure out the identity of the school.
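To make the suggestion about generic identifiers concrete, here is one way the recoding could look. This is only a sketch: the `FIELD_OF_MAJOR` mapping and its broad categories are my own illustrative assumptions, not anything the researchers used, and a real release would need a complete, carefully reviewed mapping.

```python
# Hypothetical mapping from specific major titles to broad fields (assumption).
FIELD_OF_MAJOR = {
    "Near Eastern Languages and Civilizations": "Humanities",
    "Sanskrit and Indian Studies": "Humanities",
    "History": "Humanities",
    "Studies of Women, Gender and Sexuality": "Social Sciences",
    "Economics": "Social Sciences",
    "Organismic and Evolutionary Biology": "Natural Sciences",
    "Chemistry": "Natural Sciences",
    "Environmental Science and Public Policy": "Interdisciplinary",
}


def generalize_major(major):
    """Replace a specific major title with a broad field; unknown titles become 'Other'."""
    return FIELD_OF_MAJOR.get(major, "Other")


released = [generalize_major(m) for m in (
    "Sanskrit and Indian Studies",
    "Organismic and Evolutionary Biology",
    "Folklore and Mythology",  # not in the mapping, so it falls back to 'Other'
)]
print(released)
```

The coarser labels still support analysis by field of study, while no longer carrying the exact phrasing that points to a single course catalog.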
UPDATE: Earlier this year, Jason Kaufman (the PI on this Facebook research project) presented some of the findings of the work at the Berkman Center. Here is the video. It provides a great insight into the motivation and results of the research (which I do find fascinating, despite my concerns about the methodology and release of the data).
During his talk, two more tidbits of information are provided that help me feel confident that the dataset is indeed of the class of 2009 from Harvard College.
First, Kaufman makes a throwaway remark (at about minute 35:10) that the dataset is from a “very idiosyncratic campus…a very elite college”. While that alone doesn’t give it away, it does help rule out some of the 7 colleges on my list above that simply aren’t considered “very elite” (sorry, Quinnipiac).
Second, and more telling, Kaufman notes (at about minute 32:52) that “midway through the freshman year, students have to pick between 1 and 7 best friends” that they will essentially live with for the rest of their undergraduate career. This accurately (if not precisely) describes how undergraduate housing works at Harvard. As described here, all freshmen who complete the fall term enter into a lottery, where they can designate a “blocking group” of between 2 and 8 students with whom they would like to be housed in close proximity.
So, again, the lesson learned here is how disparate pieces of seemingly benign information can be pieced together to make an otherwise presumed anonymous piece of data identifiable. I did it here by quickly analyzing the codebook, reading a press release, and watching a video presentation. The New York Times did it with the AOL search data release, and I’m sure someone will do it with this Facebook dataset.