Open Questions about Library of Congress Archiving Twitter Streams

(See update below referencing Twitter’s announcement; this post about how your private tweets might end up in the archive; and this post where more details about the agreement have been provided)

The Library of Congress tweeted today that they are acquiring the entire archive of public Twitter activity since March 2006. (The official blog post is down, but a copy is on the LOC’s Facebook page.)

Have you ever sent out a “tweet” on the popular Twitter social media service? Congratulations: Your 140 characters or less will now be housed in the Library of Congress.

That’s right. Every public tweet, ever, since Twitter’s inception in March 2006, will be archived digitally at the Library of Congress.

… We will also be putting out a press release later with even more details and quotes. Expect to see an emphasis on the scholarly and research implications of the acquisition. I’m no Ph.D., but it boggles my mind to think what we might be able to learn about ourselves and the world around us from this wealth of data. And I’m certain we’ll learn things that none of us now can possibly conceive.

This is big. Huge.

And while the LOC stresses that they’re doing this for historical and scholarly reasons, there are major implications regarding the privacy and contextual expectations of Twitter users. Now, suddenly, all their tweets are being archived by the world’s largest library. Yes, the tweets were always public and discoverable, but the searchability and accessibility will increase drastically if/when the LOC processes this archive.

Here are some immediate questions that need to be addressed:

1. Will user profile information also be archived and made accessible? And historical changes to user profile information? If so, can users update the profile information that might be archived at LOC?
2. Will lists of followers and who is followed be included? If so, how will they be updated?
3. Will geo-locational data be included?
4. Will the LOC allow automated scraping of the database (by search engine crawlers or other bots)?
5. Will the LOC allow commercial use of the archive?
6. Will the LOC process the archive in such a way as to create categories of users or tweets? Essentially, are we going to see a Library of Congress Classification scheme for tweets?
7. Currently users can delete tweets from Twitter, which (presumably in a reasonable time) are deleted from Twitter’s logs, and no longer discoverable. Will users have the ability to remove unwanted tweets from the LOC? (I presume not)

I look forward to seeing more information as the day progresses.

UPDATE: Twitter has now posted its own announcement, which provides some further details:

It is our pleasure to donate access to the entire archive of public Tweets to the Library of Congress for preservation and research. It’s very exciting that tweets are becoming part of history. It should be noted that there are some specifics regarding this arrangement. Only after a six-month delay can the Tweets be used for internal library use, for non-commercial research, public display by the library itself, and preservation.

Interestingly, they’re enforcing a six-month delay before public Tweets are made available to the Library of Congress. What remains unclear is whether the LOC is given a live feed and must simply embargo it for six months, or whether Twitter is only providing the LOC the archives after six months have passed. (The latter would give users more opportunity to delete tweets that they might want taken out of public circulation; see item 7 above.)

This is also a bit odd since Google apparently is providing real-time access to all Tweets without any such delay. Why a commercial entity is allowed immediate access, while the Library of Congress must wait, remains a mystery.

Finally, while Twitter notes that the archive can be used only for non-commercial research (good), it remains unclear whether the restriction for “internal library use” means that only the library can use the archive, or whether the “public display” provision also means a searchable database will be made available.

Time to request a copy of the agreement.

5 comments

sealover says:

April 15, 2010 at 10:29 am

What a boon for the social scientists and psychologists but for the rest of us, I see it as a waste of taxpayer money and wasted time for the staff of LC who could be doing better things.
Michael Zimmer says:

April 15, 2010 at 10:45 am

I see it as a boon both for researchers, as you suggest, and for the general public. It is important for libraries to consider contemporary forms of communication & expression when developing archival strategies.

So, I see it as a useful project for the LOC; it just needs to be done correctly.
Will Norton says:

April 15, 2010 at 11:02 am

Google is able to provide real-time access because it paid for that right. Twitter is only providing a limited right for the LOC to use the data for non-commercial, internal research purposes and only after a period of time to preserve Twitter’s ability to sell real-time data for commercial uses. That’s one of its cash cows.
Michael Zimmer says:

April 15, 2010 at 11:06 am

Will – I hadn’t read about payment by Google. If true, then we have a classic example of how for-profit corporations are gaining swifter access to information than a library that acts in the public trust.

Not surprised…
Will Norton says:

April 15, 2010 at 1:54 pm

I can’t see any reason to bemoan the delay. First of all, Twitter has spent a lot of money creating this forum and collecting the data. This is extremely valuable information that is Twitter’s proprietary property. We should be thankful for the fact that they’re willing to share it for the public good at all. Second, Twitter is an awesome service that is provided to the public for free precisely because they can profit from selling the data to others for a short period of time. The six month delay seems like a perfectly reasonable compromise for consumers.

Comments are closed.
