Last week, a Facebook dataset was released by a group of researchers (Amanda L. Traud, Peter J. Mucha, Mason A. Porter) in connection with their paper studying the role of user attributes – gender, class year, major, high school, and residence – on social network formations at various colleges and universities. The dataset — referred to by the researchers as the “Facebook 100” — consists of the complete set of users from the Facebook networks at 100 American schools, and all of the in-network “friendship” links between those users as they existed at a single moment of time in September 2005.
The research paper indicates that the Facebook data was provided to the researchers “in anonymized form by Adam D’Angelo of Facebook.” (D’Angelo was then Facebook’s CTO, and left Facebook in 2008.) Curious as to what precisely was included in the data release, and what steps towards anonymization were taken, I downloaded the data (200 MB zip file) on the morning of February 11.
The data files are separated by institution, and in total include, by my estimation, about 1.2 million user accounts. The content of each institution’s file is described as containing the following:
Each of the school .mat files has an A matrix (sparse) and a “local_info” variable, one row per node: ID, a student/faculty status flag, gender, major, second major/minor (if applicable), dorm/house, year, and high school.
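To make that layout concrete, here is a minimal sketch of one "local_info" row as a labeled Python record. The field names, the `parse_row` helper, and the sample values are illustrative assumptions; only the column order is taken from the description above.

```python
from typing import NamedTuple, Sequence


class LocalInfoRow(NamedTuple):
    """One row of local_info; field names are illustrative, not from the dataset."""
    facebook_id: int    # unique ID column, as listed in the description above
    status_flag: int    # student/faculty status flag
    gender: int
    major: int          # numeric code, not the major's name
    second_major: int   # second major/minor code (0 is ASSUMED to mean "none")
    dorm: int           # dorm/house code
    year: int           # class year
    high_school: int    # high school code


def parse_row(values: Sequence[int]) -> LocalInfoRow:
    """Turn a sequence of eight numbers into a labeled record."""
    if len(values) != 8:
        raise ValueError(f"expected 8 columns, got {len(values)}")
    return LocalInfoRow(*(int(v) for v in values))


# Invented example values, for illustration only.
row = parse_row([1234567, 1, 2, 123, 0, 45, 2006, 9876])
```

Every field except the first is a code or a coarse attribute; the first is the one that points back to a specific account.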
Thus, the datasets include limited demographic information that was posted by users on their individual Facebook pages. The identities of users’ dorms and high schools were obscured by numerical identifiers, but to my surprise, the dataset included each user’s unique Facebook ID number. As a result, while user names and extended profile information were kept out of the data release, a simple query against Facebook’s databases would yield considerable identifiable information for each record. In short, the suggestion that the data has been “anonymized” is seriously flawed.
The consequences of this ease of re-identifying the dataset are numerous.
First, while only limited profile information is within the dataset, there is no indication that any consideration was given to users’ particular privacy settings. Based on the article, all user accounts from each of the 100 networks were provided to the researchers, and as long as the user provided the data to Facebook, it was turned over to the researchers. [Clarification: when I say “all user accounts” were provided, I do not mean full profile information was given to the researchers, just the particular data fields as described above.]
Second, even though the specific data exposure within the dataset is limited, the fact that users can be identified and linked to their in-network social map fosters additional threats to privacy. Previous research (here and here, for example) has shown how “anonymous” datasets can be largely re-identified when there is access to other large sets of data where the subjects are already known. The “Facebook 100” data, with the Facebook IDs intact to guide identification of users, might be useful in similar efforts.
To recap, the suggestion that the “Facebook 100” data has been “anonymized” is seriously flawed, and its release might be putting the information of 1.2 million Facebook users at risk.
Interestingly, a few hours after the initial release of the “Facebook 100” dataset, one of the researchers, Mason Porter, announced that he was pulling the data due to an unspecified “bug”. Later that evening, the data was again made available with a message indicating that the data files were now fixed.
Again, I was curious, so I downloaded and examined the new dataset. The only change I could see was that now the Facebook ID was removed entirely from the data files, and the order of the records in each file was randomized.
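A minimal sketch of what such a "fix" could look like, assuming the records are rows whose first column is the ID and the friendships are pairs of row indices. The function name, the data layout, and the toy values are my own assumptions, not the researchers' actual procedure.

```python
import random
from typing import Dict, List, Optional, Set, Tuple

Record = Tuple[int, ...]   # first element is ASSUMED to be the user ID
Edge = Tuple[int, int]     # pair of row indices into the record list


def strip_ids_and_shuffle(
    records: List[Record],
    edges: Set[Edge],
    seed: Optional[int] = None,
) -> Tuple[List[Record], Set[Edge]]:
    """Drop the ID column and put the rows in a random order.

    The same permutation is applied to the edge list so that the
    friendship graph still lines up with the shuffled rows.
    """
    rng = random.Random(seed)
    order = list(range(len(records)))
    rng.shuffle(order)  # order[new_index] == old_index

    new_index: Dict[int, int] = {old: new for new, old in enumerate(order)}
    shuffled = [records[old][1:] for old in order]  # [1:] removes the ID
    relabeled = {
        (min(new_index[a], new_index[b]), max(new_index[a], new_index[b]))
        for a, b in edges
    }
    return shuffled, relabeled


# Invented toy data: (id, gender, major code, year)
records = [(1001, 1, 123, 2006), (1002, 2, 77, 2007), (1003, 1, 123, 2008)]
edges = {(0, 1), (1, 2)}
clean_records, clean_edges = strip_ids_and_shuffle(records, edges, seed=0)
```

This removes the direct lookup key and the original row order, but it leaves every attribute and the full shape of the friendship graph intact.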
Thus, the “bug” must’ve been that the data was easily re-identifiable, and the “fix” was to take additional steps to anonymize the records. Someone joked on the announcement email list that the “bug” must have something to do with Facebook attorneys, but Porter’s message re-releasing the data joked that no lawyers were involved, and that they “really were fixing the data files!”
To me, however, the language used in these explanations was disingenuous. The data, as far as I could tell, had no bugs that prevented its usefulness for social network analysis. No, the problem with the data was that it contained each user’s unique Facebook ID, thus allowing easy identification. Porter should have been open and honest about why the data was pulled and what was done to correct the situation.
That said, there are still a number of open questions regarding this particular dataset:
To Facebook:
- What kind of internal processes, if any, did D’Angelo follow when releasing the data to these researchers? Was he authorized to do so?
- Was this kind of large data release routine? How many other similar releases have taken place?
To the research team:
- Was the data received from Facebook already obscured with numerical identifiers replacing student majors, minors, and high schools, or did you add those?
- UPDATE: I have received word from one of the researchers, Mason Porter, that the data sent to them by Facebook was indeed already obscured with numerical identifiers in the place of actual student major, minor, and high school information.
- Did your IRB review the data used for the research, and approve the subsequent data release?
- Was there any “bug” in the data, or was the attempt to gain greater anonymization of the data the sole reason to pull it from public access?
Obtaining answers to these questions can help us better understand how unusual this situation is, and put better processes and protections in place to prevent similar releases of data that is wrongly believed to be sufficiently anonymized and respectful of users’ privacy expectations.
I hope Facebook and the researchers are willing to engage in a discussion, and I’ll report back on any communication, as allowed.
UPDATE (Feb 15, 6:00pm): I have been in contact with one of the researchers, Mason Porter, who confirmed that the data sent to them by Facebook was indeed already obscured with numerical identifiers in the place of actual student major, minor, and high school information. I’ve inserted this reply into the question above. I have also made a few minor changes to the main text, clarifying that the email messages reporting the “bug” in the data came from Mason alone, and should not be attributed to the entire research team.
UPDATE 2 (Feb 15, 6:10pm): The link to the full, revised dataset (http://people.maths.ox.ac.uk/~porterm/data/facebook100.zip) is no longer active.
UPDATE 3 (Feb 16, 9am): Added a clarification that when I say “all user accounts” were provided to the researchers, I do not mean full profile information was given, just the particular data fields as described above.
UPDATE 4 (Feb 16, 11am): Mason Porter, one of the authors, has posted an explanatory note on his blog indicating that he’s been in contact with the Facebook Data Team, and per their request, “I have taken down the data, and I will be working with them to eventually post a version of the data set with which both they and I are happy.”
I responded to your e-mail and provided information regarding what data we have, etc. Also, as I stated in my e-mail, please place all blame directed towards researchers on my shoulders alone.
Thank you, Mason, for replying and commenting here.
Second, this is a prime example of retaining unnecessary personal information that is not useful for analytical purposes. The FB IDs could easily have been replaced in the data cleaning process, thereby avoiding a serious mistake. We can argue here whether the retention of the FB IDs is sloppy, but we do not need to look far (e.g., Professor Yankaskas at UNC) to begin to understand the serious and debilitating professional ramifications of these kinds of breaches in research datasets. This case underscores the need for de-identifying data at a very early stage in the research process.
If you know who everyone is, it should be easy to map the numerical identifiers back to student majors, minors, and high school info. You only need one History major who doesn’t hide their info to find out that 123 = History, and so on for the rest of the data. With 1.2 million people, numerical IDs keep you from reading the data only for as long as you choose not to.
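A minimal sketch of that lookup, assuming you have (code, label) pairs for a handful of users whose real profiles are public. The function name and all data below are invented for illustration; a majority vote is used only so that one mistyped profile does not spoil the mapping.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def infer_codebook(known: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Map each numeric code to the label most often seen alongside it.

    `known` holds (code, label) pairs for users whose real profile
    is public, e.g. (123, "History").
    """
    votes: Dict[int, Counter] = defaultdict(Counter)
    for code, label in known:
        votes[code][label] += 1
    return {code: counts.most_common(1)[0][0] for code, counts in votes.items()}


# Invented observations: major code from the dataset, major from a public profile.
observed = [(123, "History"), (123, "History"), (77, "Physics"), (123, "Histroy")]
codebook = infer_codebook(observed)

# Apply the recovered code book to every record, public profile or not.
all_major_codes = [123, 77, 123, 5]
decoded = [codebook.get(code, "unknown") for code in all_major_codes]
```

Codes that no identified user happens to share stay "unknown", but each newly identified user can only shrink that set.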
But even if people’s Facebook IDs had not been included, it would still be a mistake to consider the data set to be anonymous. You can figure out who most people are just based on the shape of the social network. See Narayanan & Shmatikov – http://randomwalker.info/social-networks/ .
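To illustrate the general idea only: the sketch below is not the Narayanan and Shmatikov algorithm, just a much simpler degree-signature match that I am substituting for it. It assumes an attacker holds a second, named copy of the same friendship graph; real auxiliary data would be noisier and matches would be partial. All names and data are invented.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]   # node -> set of neighbours
Signature = Tuple[int, Tuple[int, ...]]


def signature(graph: Graph, node: Hashable) -> Signature:
    """A node's degree plus the sorted degrees of its neighbours."""
    return (len(graph[node]), tuple(sorted(len(graph[n]) for n in graph[node])))


def group_by_signature(graph: Graph) -> Dict[Signature, List[Hashable]]:
    """Collect the nodes that share each signature."""
    groups: Dict[Signature, List[Hashable]] = defaultdict(list)
    for node in graph:
        groups[signature(graph, node)].append(node)
    return groups


def match_unique_signatures(anon: Graph, named: Graph) -> Dict[Hashable, Hashable]:
    """Pair up nodes whose signature occurs exactly once in each graph."""
    anon_groups = group_by_signature(anon)
    named_groups = group_by_signature(named)
    return {
        nodes[0]: named_groups[sig][0]
        for sig, nodes in anon_groups.items()
        if len(nodes) == 1 and len(named_groups.get(sig, [])) == 1
    }


# Invented toy graphs: the same six-person network, once with names
# and once with arbitrary row numbers in place of names.
named = {
    "alice": {"bob"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"bob", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave", "frank"},
    "frank": {"erin"},
}
anon = {3: {0}, 0: {3, 4, 1}, 4: {0, 1}, 1: {0, 4, 5}, 5: {1, 2}, 2: {5}}

matches = match_unique_signatures(anon, named)
```

In this toy case every node has a distinct signature, so all six rows are matched to names without any ID column at all.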
I used the Narayanan and Shmatikov method for De-anonymizing Social Networks to figure out who wrote the response above attributed to “anon”.
Just sayin …
I have to agree that including the FB IDs, although done with good intentions, was an idea with some terrible consequences, and it calls into question the protection of privacy for Facebook users. I understand that the researchers intended to organize and categorize the information with IDs and tags so they could keep track of it for their research. However, the claim that this data is anonymous does not always hold. Any individual with some free time can connect the dots and piece together a particular person’s profile to get an idea of that person’s personal information. Now Facebook is releasing information from 2005. It reminds me of how Google Earth shows pictures of a given address that were taken 3-4 years ago. The information is not up to date, but it is still there, retained in the stored archive.
That’s the feeling I get when researchers take profile information that is anonymous to a certain degree (with numbers, of course) and use it for whatever research purposes. It also raises other questions. How often do these researchers ask Facebook to provide this information? Does Facebook inform its users that their information is being used for research and that the selection is random? Is this itself, to some extent, a violation of Facebook’s own license agreement?
As a Facebook user myself, I know that my profile information is being watched and can be used by other organizations for legal and possibly illegal purposes as well. Remember the pictures of Eve Marie Carson, the UNC Chapel Hill student who was murdered in March 2008. Some of her pictures were used in ads for a study abroad program in India. It turned out that this particular organization had a shady background and connections, and the ads with this student’s face were quickly removed.
That is just one example of how easily certain personal information, in this case pictures, can be obtained on the web and used. Here it is especially terrible because the information belongs to an individual who suffered an early tragedy, and its use is a clear violation of personal and family privacy.
The underlying message, I think, is that the researchers performing this project should be more responsible with the information and more forthcoming about the objectives of the project itself. Making more details and reasons available to the general public is, I guess, the best course of action. However, it goes both ways. We as users should know that any information we send, post, or put on the web can easily be accessed by other parties if we are not careful. It is common sense not to tell too much about yourself and to remind yourself that your information is free for others to see. Maybe all users of social networks should be made more aware of what information others can see on the Internet.
I am a research scholar at IIT. I need Facebook data along with text (posts).
I want data from a Facebook group (with more than 5,000 members). Can you provide me with this type of data?
Can anyone tell me what attributes are available in the dataset so that I can fix my datasets? Thanks.