When it was announced that the Library of Congress was acquiring the entire archive of public Twitter activity since March 2006, I posted a set of open questions regarding the potential privacy implications of this unique arrangement.
I then submitted a Freedom of Information Act request to the LOC asking for a copy of the agreement, any internal policies or documents governing how the Twitter data will be archived and used, and answers to these questions.
It has now been almost 2 weeks and I have not yet received a response from the Library. However, interviews with Library personnel posted at The American Prospect and Ars Technica provide us some insight into the nature of the agreement and the plans for the data.
And, just as I was starting to write this post (yesterday), I received an email from the LOC’s Matt Raymond directing me to this new blog post, which includes a scanned copy of the gift agreement (PDF) between Twitter and the LOC, and provides some additional details about the arrangement.
Piecing all these together, we can discern the following:
There has been little information available about what, precisely, will be included in the dataset provided to the LOC. The agreement simply indicates “public Tweets from the Twitter service”, which leads us to believe that detailed profile and social graph information might not be included. The LOC’s recent blog post provides more specifics:
Private account information and deleted tweets will not be part of the archive. Linked information such as pictures and websites is not part of the archive, and the Library has no plans to collect the linked sites.
Thus, the LOC will apparently only be archiving the tweets, and not expanding shortened URLs, or retrieving and storing any photos or linked websites.
When asked specifically about whether personal information might be included in the archive, the LOC mentioned how research on anonymization might help mitigate privacy concerns in the future, suggesting that they might include it, but would look for a way to anonymize it when providing the data to researchers. In the end, the LOC indicates that it will be up to Twitter to decide what to do about personal information.
I’m not fully comfortable with this position. The LOC should take a much stronger stance regarding the existence of personal content in the dataset. It should either require it purged before receiving it, or come up with specific steps it will take to scrub personal information from the data visible to outsiders.
While the LOC recognizes the research potential of having access to geo-locational data, the Gift Agreement is silent about what specific data will be shared. I suspect that it might fall into the “personal information” category of things that Twitter hasn’t quite figured out. Yet, until this is resolved one way or another, the possible inclusion of geo-locational data poses a significant privacy threat.
The LOC has noted that there will be a 6-month lag before the latest Tweets get sent to them to be archived. They’ve stated the reason for the lag is to give people a chance to delete their tweets and “things like that” before they are sent to the archive.
The LOC also hints that it might enforce an embargo “of several years” on the data, to help force the passage of time between access to the archive and any particular events.
Processing the Archive
The Library is planning to build analytic tools to help “make order” out of the archive, perhaps clustering content based on hashtag, etc; something more than just full-text searching.
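The Library has not described how such tools would work, so the following is only a minimal sketch of what grouping content by hashtag could look like. It assumes tweets are available as plain text strings; the function name `group_by_hashtag` and the sample tweets are hypothetical, not anything the LOC or Twitter has published.

```python
import re
from collections import defaultdict

# Matches a '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")

def group_by_hashtag(tweets):
    """Map each lowercased hashtag to the tweets that contain it."""
    groups = defaultdict(list)
    for text in tweets:
        # Use a set so a tweet that repeats a tag is listed only once under it.
        for tag in sorted({t.lower() for t in HASHTAG.findall(text)}):
            groups[tag].append(text)
    return dict(groups)

# Hypothetical sample data for illustration only.
sample = [
    "Reading the gift agreement #LOC #privacy",
    "Six-month delay before archiving #loc",
    "No hashtags here",
]
clusters = group_by_hashtag(sample)
```

Tweets without hashtags drop out of every group, which is one reason the Library would presumably need something beyond this, alongside full-text search.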
Access & Data Release
The Gift Agreement provides insights into the kind of access that will be provided:
After a period of six months from the date any portion of the materials was first posted to the Twitter service, the Library may display such materials in the collection of its public website or in any other electronic form or successor technology subject to reasonable access limitations such as the use of a robots.txt file. The Library will not provide a substantial portion of the Collection on its public website in a form that may be easily subject to bulk download.
In short, the archive “would not be released as a single public file or exposed through a search engine, but offered as a set only to approved researchers.” There is no indication of what criteria will be used to determine “approved researchers”, but presumably the LOC already has such processes in place.
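For readers unfamiliar with the robots.txt file named in the agreement: it is a plain-text file in which a site asks crawlers to stay out of certain paths. Compliance is voluntary on the crawler's part, so it limits indexing by well-behaved search engines rather than enforcing access control. The sketch below uses Python's standard `urllib.robotparser` to show the idea; the `/twitter-archive/` path, the `example.org` host, and the `ExampleBot` name are all hypothetical, since the Library's actual rules are not described in any of the sources.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content asking all crawlers to skip one directory.
rules = """\
User-agent: *
Disallow: /twitter-archive/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler would check each URL before fetching it.
blocked = parser.can_fetch(
    "ExampleBot", "https://example.org/twitter-archive/2010/04.html"
)
allowed = parser.can_fetch("ExampleBot", "https://example.org/about.html")
```

Here `blocked` is `False` and `allowed` is `True`: a crawler that honors the file would skip the archive pages while still indexing the rest of the site.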
Twitter’s blog post about the acquisition indicates that only “non-commercial” research will be allowed, and the gift agreement confirms this, noting that potential researchers must sign a “notification mutually agreed upon by Donor and the Library prohibiting commercial use and redistribution of all or a substantial part of the Collection.”
[Interesting that the agreement might allow redistribution of a “non-substantial” part of the archive. I’m presuming this relates to selective quoting of items in the collection within one’s research.]
Opt-Out & User Access
A key question centers on whether owners of Twitter accounts might be able to request that materials be removed from the archive, even after the 6-month delay, or opt out entirely. (Note: I’ve pondered whether LOC-owned databases constitute government databases that fall under the Privacy Act, which would require such access.)
There is little new information about whether users can access their archived tweets to make changes, but it appears that isn’t being contemplated. When asked about opt-out, the LOC punted it back to Twitter, saying: “We asked them to deal with the users; the library doesn’t want to mediate that.”
As with the issue of private information within the archive, this stance disappoints me. As a library, the LOC should take a more proactive stance to provide individuals the ability to protect and control their information.
To summarize, I’m pleased that the LOC is providing (the required) transparency regarding this agreement, and that they’re trying to address some of the issues that have been floated about. However, I’m disappointed that they’re leaving many of the “hard” questions up to Twitter, especially the determination of what is considered “public” to include in the dataset. The LOC should take a stronger stance on issues that impact user privacy, control over their information, and access.
It seems that I now need to ask Twitter what its thinking is on various matters: precisely what data is included in the dataset, whether users will be able to opt out, how the data will be sent to the LOC, etc.
More to come….
I’d like to push back – or at least question – one of the things you said here, Michael:
“As a library, the LOC should take a more proactive stance to provide individuals the ability to protect and control their information.”
This is sticky for me, because while I agree that library users ought to have their access records protected, I also think that libraries and repositories have an obligation to provide as full and uncensored a record of culture as they can. On the one hand, we as US citizens are all LOC users (though we also comprise a shrinking portion of all Twitter traffic) and so deserve privacy protections as such. But on the other, I think there’s a potentially problematic precedent that would be set if there’s an unlimited ability of users to scrub their digital records after the fact.
I don’t know exactly how to weigh the benefit/harm ratio, but I definitely wouldn’t want to set up a situation where incidents of, say, corporate malfeasance or public corruption can be erased from the record under the banner of personal privacy. Striking that balance is clearly a difficult issue.
What use is the Twitter archive to researchers without live links? What percentage of each tweet is a link? What percentage of tweets are link-based or link-heavy? In other words, besides hashtags, aren’t links what makes Twitter matter?