Is it Ethical to Harvest Public Twitter Accounts without Consent?

While participating in the workshop on Revisiting Research Ethics in the Facebook Era: Challenges in Emerging CSCW Research, the question arose as to whether it was ethical for researchers to follow and systematically capture public Twitter streams without first obtaining specific, informed consent from the subjects. Many in the room felt that consent was not necessary since the tweets are public, a conscious choice made by the user to allow the whole world to see her activity. In short, by not restricting access to one’s account, there is no expectation of privacy.

I argued, however, that we cannot be so quick to presume the expectations of potential research subjects. Yes, setting one’s Twitter stream to public does mean that anyone can search for you, follow you, and view your activity. However, there is a reasonable expectation that one’s tweet stream will be “practically obscure” within the thousands (if not millions) of tweets similarly publicly viewable. Yes, the subject has consented to making her tweets visible to those who take the time and energy to seek her out, those who have a genuine interest to connect and view her activity through this social network.

But she did not automatically consent, I argue, to having her tweet stream systematically followed, harvested, archived, and mined by researchers (no matter the positive intent of such research). That is not what is expected when making a Twitter account public, and it is my opinion that researchers should seek consent prior to capturing and using this data.

A healthy debate on this issue followed, and continued in a separate thread on Facebook, which included the following varied positions & responses (edited and condensed):

  1. “…if the account holder tweets to the general public, then it’d seem like there’s no expectation of privacy so no consent would be necessary.”
  2. (me) “But isn’t my expectation that even though my tweets are public, they’re often lost in a sea of hundreds of tweets among my followers, and I never anticipated someone would archive, mine, and perform research on them?”
  3. “If you’re comfortable with your anonymity being guaranteed only by virtue of your public tweets being hidden in plain sight among millions of others, then you’d have to realize that some determined person could follow just yours, archive them, and analyze them. I like my privacy, but I don’t worry about walking around a city or campus even though …”
  4. “…depends on how data are being presented – e.g. in aggregate vs specific “quotes” that could easily be traced.”
  5. “Many IRBs would say yes [consent is needed], or at least would require you to get a waiver–publicizing the extremes to which IRBs go…”
  6. “…IRB application is required. You could request that Informed consent be waived with the argument that you are only analyzing tweets broadcast publicly, and that you de-identify your data to eliminate potential risk to the individual”
  7. “I would say if it is for research and you are dealing only with publicly available documents, then no, you need no consent. you can run that by the irb and get a waiver, but in the end, you are dealing with publicly available documents… not people, subjects. If you are dealing with subjects and not documents, then you will need irb clearance.”
  8. “Tweets are publications. I think it’s absurd to even consider IRB review for anything dealing with things people have published”
  9. “The questions are: 1) Are you conducting research that is intended to be published; 2) Does your research involve human participants; 3) For these human participants, will you gather data through intervention or interaction with the individual; and/or will you gather identifiable private information about them. (45 CFR 46.102(f))
    If these 3 conditions are met, your research must be reviewed by IRB. They will work with you and determine whether or not informed consent is required. In your case, if you are NOT interacting with the individual publishing the tweets, and the tweets are broadcast and searchable as public records (that is, you don’t need access to their account to view tweets posted to a limited audience), then it won’t fall under the definition of research with human subjects.”
  10. “If i download all of Michael’s published papers, blog posts, twitter posts and each one he publishes thereafter… are they the same? or different? I’d argue the same, just for different audiences.”
  11. (me) “What if tomorrow, I decide to take my Tweet stream private. And I delete my blog posts. Does my affirmative action to purge my documents from the “live” web mean that you (researcher) need to treat that previously archived material differently?”
  12. “If the individual changes their intent regarding release of data, then by IRB standards what might previously have been considered publicly available information, then becomes private information, and your collection would likely require BOTH IRB review AND informed consent, b/c the user now has an expectation that their information is protected.”
  13. “Once tweeted, a birdsong is gone forever. No deleting or taking back what’s been broadcast to the world. If someone seeks privacy, they should seek another method of communication. If from the beginning, there was some kind of inherent expectation that tweets were private messages, then the situation might be different. But the whole idea of tweeting is to voluntarily publish or broadcast. It’s different from, say, e-mailing or IMing.”

What we see here are numerous intelligent researchers not in complete agreement about whether consent is necessary, about whether one’s tweets are “publications” not needing IRB review, or whether Twitter-based research is dealing with “human subjects” and so does require strict scrutiny. There’s also some question about how to deal with the fact that users might make information private after an initial release, something our current forms of communication allow more than in the past.

What do you think? If you have had experience with related research ethics issues, and with how your IRB dealt with them, please email me or leave a comment.

Aside: Interestingly, Adam Fish, who I’ve friended on Facebook, saw that discussion and wanted to repost the thread on his blog. Respectful of the delicate nature of re-posting other conversations and moving them from the controlled environs of Facebook to a public blog, he contacted me to ask permission. He didn’t, apparently, contact each of the commenters to ask for their permission. I felt it necessary to get consent from everyone in that thread before authorizing its re-posting. When I asked each of them, all agreed (with some edits), and some took the position that the Facebook conversation was de facto public, even though technically only a certain set of users (friends of the participants) could in reality see the thread.

[image from TPorter2006]

25 comments

  1. It reminds me of the old discussions of IRC and Usenet, where these same questions came up and resulted in the same debates.

  2. Great article, thanks. (apologies if I’m double-posting this comment: bad wireless)

    I find myself much in agreement with comment #8 from the FB discussion: “Tweets are publications. I think it’s absurd to even consider IRB review for anything dealing with things people have published”

    It *is* absurd and, though I appreciate your concern regarding the reasonable expectation of ‘practical obscurity’, I really don’t see that it necessarily implies the need for informed consent. Consider: a fantastically niche fanzine with a circulation of 100 has, within the thousands of other paper publications available, a reasonable expectation of practical obscurity. Nevertheless, that expectation does not stop the content from being *published* content for *public* consumption. Merely because the publisher does not expect much interest in their content does not mean that researchers are forbidden to show interest in it without the publisher’s consent.

    This is a really important issue, though, and I foresee it needlessly clogging up already over-worked IRBs more and more in the future.

  3. Let’s put the expectation of privacy aside for a second — what expectation should researchers have that any of this self-published information is accurate/reliable in the first place?

  4. I believe it really depends on the way you see a tweet. If you see it as a micro *blog*, then we have to treat it the same way we treat blogs. As long as they are public, then we are allowed to follow and systematically capture public Twitter streams without first obtaining specific, informed consent by the subjects. But if you see twitter as a big fat chat room, something similar to the old IRC days, then it’s really hard to decide whether it is ethical or not to base your research on it.

    I myself believe that as long as the content is public, whether it is a blog, tweet, or chat room, then people are allowed to do whatever they want with such content, as long as they preserve people’s right to be acknowledged if needed. Yet I think such a decision – whether it is ethical or not – has to be based on the collective opinion of the social media consumers, and that’s why I’ll wait to see what others will say here in order to make my final decision.

  5. I agree, Tarek, that this comes down to how tweets are conceived in the taxonomy of research source material. And the key challenge, as I see it, is that tapping the “collective opinion of the social media consumers” will be exceedingly difficult, as people tweet (use Facebook, Buzz, etc) for different intents, directed to different audiences, and within different contexts.

    (Which is why Internet Research Ethics poses such a unique challenge currently)

  6. @TD – I had that very conversation with someone over dinner last night – while you can reasonably authenticate both the identity & accuracy of face-to-face interviews, data gathered from social media complicates this. The very nature of status updates & tweets, for many, is performative, and perhaps not fully authentic. A unique challenge….

  7. It’s like a blog. (Originally, Twitter was called ‘the microblogging service’.) You can quote and attribute from blogs, but you can’t pretend it’s your work. You can RT my tweets (this acknowledges i wrote them, and is fine).

    If you copy a sentence from my blog, you need to say it’s a quote and where from – more than a sentence or so, you need to ask me first. Online, it’s easy enough to link back to a post, so the author knows they’re being quoted. Simple. Basic manners, and basic copyright. You need to acknowledge my authorship, and make it clear in the text, that it’s not your work. (I would not normally give permission for a whole post to be used, they can use a quote and link it back to my blog.)

    If you present it as your work, i’m entitled to take legal action against you. I’ve already had to do this twice with blog posts people tried to use in entirety, one linked to me, one i only found after googling the first paragraph of that post – it was stolen wholesale. Anyone who does it with things i’ve posted on twitter is the same, a plagiarist and a thief.

    As for someone deciding to analyse me from my tweets and publish the results – well, not much i can do about the analysis, however, if they re-publish my tweets in bulk, without my permission, (online or off, in academic circles, on their facebook, whether for commercial gain or not) they are breaking existing copyright law – if i write it, it is mine. I don’t have to register it, it’s automatic. Me not knowing they’ve done it doesn’t matter either – the law’s been broken, even if it takes me years to find out. Copyright lasts for my lifetime + 70 years. It’s to protect writers, who often only get successful after they die – at least this way, their heirs can get some benefit from the slog the writer went through.

  8. This is such a pointless debate…

    “By submitting, posting or displaying Content on or through the Services, you grant us a worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content in any and all media or distribution methods (now known or later developed).”

    That’s from ‘Your Rights’ in the twitter ‘Terms of Service’: http://twitter.com/tos
    Which you should read along with: http://twitter.com/privacy before you even start such a “debate”…

    As pointless as asking yourself, after having been hired for and having written a “piece” for some publisher:
    Is it Ethical for the publish-er to publish my work?

    The “contract” that you sign is the same as the ‘tos’ that you agree to when you sign into twitter…

    You’ve already given your ‘Consent’.
    They’re called ‘Public Twitter Accounts’… You have written it yourself!

    On that basis, how can you ask whether it is Ethical or not?

  9. Thanks for the comments, everyone.

    @Sheila & @Paul (nice pseudonym): This issue (as I’m presenting) isn’t as much about copyright protection as it is about informed consent regarding use of communication from human subjects. The fact that users grant Twitter a license to use their tweets (which is necessary for the service to work) means nothing in terms of whether it is ethical for researchers to systematically follow and harvest public tweet streams. Again, just because they are public doesn’t mean the intent was to allow them to be automatically archived & processed. That’s the issue regarding whether additional consent is necessary.

    @Lynda: Like above, the issue isn’t about having individual tweets reposted, but whether it is ethical for researchers to systematically follow and scrape them, without undergoing IRB review or gaining informed consent.

  10. Releasing something into an open flow makes it subject to downstream conditions. A public twitter stream is no less a part of the whole web than michaelzimmer.org/bio. The web is not an environment that supports a reasonable expectation of privacy in public. Unmistakably not. Nor does twitter as a subculture gesture toward such an expectation. (Public tweets on CNN, anyone?)

    We may yet push for markup to let people embed nuanced requests robots.txt-style into some or all of their personal output: “don’t identify me by name,” “buzz off if you don’t know me,” “don’t store past date x,” “don’t republish outside network x,” etc. Or the markup could point to a content license. Absent any of that, public share = shared with everybody on world’s most permissive network = happy harvesting.

    Meanings of actions (as of words) are negotiated at the level of many, not one. No ethic of consent can ignore that.

  11. @TD Accuracy / reliability may not be an issue if the topic of research has to do with linguistic, sociolinguistic, or other behavior that is manifested in the tweets. How things are phrased, how often one tweets and on what topics, retweeting behavior — even if such things express a ‘pose’ that can be a topic of research. From my POV this is more interesting than whether we can verify that tweeters are where they say they are, are doing what they say they’re doing, or even sincerely believe what they’re tweeting — i.e. things of which we might doubt the accuracy or reliability.

  12. “Once tweeted, a birdsong is gone forever. No deleting or taking back what’s been broadcast to the world. If someone seeks privacy, they should seek another method of communication. If from the beginning, there was some kind of inherent expectation that tweets were private messages, then the situation might be different. But the whole idea of tweeting is to voluntarily publish or broadcast. It’s different from, say, e-mailing or IMing.”

    Thinking about making my own twitter public, i first got to:
    Public vs protected accounts
    And unexpectedly arrived here again…

    I invite you to read that one too, as i have invited you to read about twitter on twitter itself, because in this ‘debate’ you can’t decontextualize an account from the ‘provider’, in this case Twitter…

    I start with that quote, and from there i get: any tweets posted while your profile is private will remain private indefinitely, and tweets posted while your account is public will remain public indefinitely, for you to reflect on twitter…

    Your aside about the facebook discussion is indeed interesting; about twitter, i was thinking of RT’s: would you think it is necessary to ask for consent when RT-ing?

    I know this isn’t about copyright, but you need to understand that you’re writing/talking about a Social Media service, which is the very point about the decontextualization that you rely on when ‘debating’ ethics in twitter…

    It’s pretty much as 13. wrote, “they should seek another method of communication”, not only about privacy but about the issue that you’re arguing. i understand that you asked for consent when sharing the Facebook conversation, but again, Facebook too is a Social Media…

    You have set out well the basis for asking whether or not it is ethical to harvest public twitter accounts without consent, a basis from which even i would agree it isn’t ethical, but it’s quite radical to reach that conclusion without considering the context that i have mentioned…

    As described in the twitter ToS, this is possible with any account, no matter if it’s the Pope’s…
    It is as if you signed up for a new e-mail service whose central purpose is to make e-mails public… would you then complain about people reading your personal e-mails?

    Even commenting here we’re exposing ourselves, but again that’s part of it, otherwise we would look for a different way to communicate… I really invite you to read the whole ToS and privacy policy from twitter, and to understand that it is a Social Media provider… We’re always exposed to research while on the net, whether it is as simple as how many people have visited this blog, or from which country, or becoming part of world wide graphs about internet usage…

    As i’ve stated before “This is such a pointless debate…”, if twitter got you to an ethical debate:

    “Once tweeted, a birdsong is gone forever. No deleting or taking back what’s been broadcast to the world. If someone seeks privacy, they should seek another method of communication. If from the beginning, there was some kind of inherent expectation that tweets were private messages, then the situation might be different. But the whole idea of tweeting is to voluntarily publish or broadcast. It’s different from, say, e-mailing or IMing.”


    I must not fear.
    Fear is the mind-killer.
    Fear is the little-death that brings total obliteration.
    I will face my fear.
    I will permit it to pass over me and through me.
    And when it has gone past I will turn the inner eye to see its path.
    Where the fear has gone there will be nothing.
    Only I will remain.

  13. “Like above, the issue isn’t about having individual tweets reposted, but whether it is ethical for researchers to systematically follow and scrape them, without undergoing IRB review or gaining informed consent.”

    Michael, would you consider the DHHS regulations regarding the protection of human subjects in research (45 CFR 46) to be an adequate standard to follow in the above statement?

  14. Hilarious to see the reactionaries in this thread, opposed to technological and practice innovation that might provide privacy in public. The notion that this is hard to do is, well, funny. It’s sad to see them try to rule certain points out of bounds, but I guess that goes with the reactionary mindset.

  15. Funny, as I’m reading this while wearing my Internet Archive cap (literally!).

    My first internal response to the news was, “Wow ….” but I am a former IA geek and care deeply about digital preservation.

    My second was “How will Twitter users react?”

    And my response upon reading your questions was, “How different is this from what IA does with crawling and preserving the web?”

    At first glance, there seem to be similarities:
    1) Only public tweets will be accessed – similar to IA respecting robots.txt in crawling
    2) IA’s crawls were only open to scholars in the beginning
    3) 6 month embargo for tweets mirrors the 6-month embargo for making crawls of news websites available
    4) While people may have/should have known that their websites were public, unless behind some sort of access wall, they may not have contemplated that their sites would be crawled and archived by others to be viewed for years after they were created; likewise with tweets. Perhaps “constructive notice” of preservation can be more strongly construed against Twitter users, since things like Google cache and IA didn’t exist at the start of the web?

    Do you see differences between archiving/preserving Tweets versus websites? I have to admit, the privacy advocate in me is a bit torn, but the IA/Long Now Foundation supporter in me is cheering.

  16. Oh, please, give me a break.
    While I would be the last to argue that everything on the Internet is “public” (in fact, I was one of the first pointing out that it is _not_, see our 2001 paper in the BMJ at http://www.bmj.com/content/323/7321/1103.full), BUT TWITTER IS PUBLIC – NO QUESTION ABOUT IT.

    Tweets (from the public stream) are to be treated like blogs (microblogs) and webpages – PUBLIC. No consent required for analyzing them, unless of course they are DMs (which are like emails – confidential) or sent to your “followers only”. The Twitter privacy policy (https://twitter.com/privacy) – which is part of the terms of service which every user agrees to when he signs up for an account – is VERY clear on this:

    “Our Services are primarily designed to help you share information with the world. Most of the information you provide to us is information you are asking us to make public. This includes not only the messages you Tweet and the metadata provided with Tweets, such as when you Tweeted, but also the lists you create, the people you follow, the Tweets you mark as favorites or Retweet and many other bits of information. Our default is almost always to make the information you provide public but we generally give you settings to make the information more private if you want. Your public information is broadly and instantly disseminated. For example, your public Tweets are searchable by many search engines and are immediately delivered via SMS and our APIs to a wide range of users and services. You should be careful about all information that will be made public by Twitter, not just your Tweets.
    Tip What you say on Twitter may be viewed all around the world instantly.”

    I didn’t count how often the word public appears in this excerpt from the privacy statement. I wonder if any of the “researchers” who suggest otherwise has ever tweeted. You tweet because you want to get your message out, and not only to your friends (ever heard of retweets?).
    This is VERY different from discussion boards, chat rooms, or even Facebook.

    And yes – content analysis of tweets and aggregating/mining them on a large scale is already happening.

    http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0014118

    That’s what the Twitter APIs are made for.

  17. Gunther, I fear you are missing my point. The ethical calculus isn’t a simple public/private binary, but involves a more nuanced balancing of user expectations and right of refusal. First, based on related empirical evidence, a low % of users ever read privacy policies, so that’s not very strong of an argument to justify harvesting ethically. Second, users of social media services typically expect their activities to be visible — or of any interests — to a limited audience. While, technically, anyone can “see” my tweets, only a finite number of active followers actually see them on a daily basis. That is part of my expectation of the visibility and usage of the service, and I feel that expectation should be taken into consideration by researchers (no quotes necessary, please) when a decision is made whether/how to mine such information sources.

  18. Hi Michael,
    I simply dispute that ANYBODY who tweets (regardless of whether he has read the privacy policy or not) does so under the expectation of privacy or having a “limited” audience (if they want to do that, there is a privacy setting for that). Anybody who tweets sees on a daily basis that others are retweeting or quoting their tweets, and that their tweets also appear in search engines and on the twitter homepage itself (as well as on the public timeline http://twitter.com/public_timeline) etc. I wonder if you have ANY empirical data supporting your assertion? Ethics should be grounded in evidence, and I simply don’t see any evidence supporting this claim.
    If you argue that people don’t understand social media or twitter in particular, or do not read the privacy policy (which is perhaps true), then I am not sure what problem you would solve by obtaining formal informed consent, where you can’t be sure if they read or understood your informed consent documents either.
    I also find the statement that “users of social media services typically expect their activities to be visible — or of any interests — to a limited audience” not very helpful – social media is a fuzzy term for a vast array of different services. There are many different forms of “social media”, and different social media lie on different points on a continuum in regards to that “expectation”. I would argue that twitter is on a point in that continuum where it is clear to everybody – by just looking at the twitter homepage and the thousands of websites tapping into the twitter API and aggregating, analyzing, and re-displaying that information – that there is not a “finite number of active followers” seeing or analyzing your tweets. Even the “trending topics” on the twitter homepage derive from tweets (and what we do in our research is not much different from plotting trending topics), so again, where are the data for that “expectation”?

  19. I think the concern is that people don’t read TOS so they don’t really expect it. There’s so much legalese out there that I think the average person must totally ignore it.

    It is by will alone I set my mind in motion. It is by the juice of sapho that thoughts acquire speed, the lips acquire stains, the stains become a warning. It is by will alone I set my mind in motion.

  20. I looked for a discussion on this because I have come across a fansite, open to thousands every day, that is publishing tweets. The tweets themselves are innocuous, about a popular actor; however, the tweeter’s account and address are often also published, so that one can go in and see the tweet about the actor directly. I have seen some really embarrassing private information written below or above the mined tweet.

    These people do not know their information is being disseminated to a widely read fan site. Is it reasonable that they should expect a national fan site might publish their address?

    I really do not know, but I imagine, given the state of privacy today, that the answer is yes. Any legal cases on point?

  21. As someone that interfaces with patients, I have found that patients don’t always fully understand what’s public and what isn’t. A patient @messaged me recently (unsolicited) and thought I was the only person seeing it. Potentially problematic and often not considered.

  22. @ Bonnie: Anything that is posted online should be assumed to be public. I am even cautious about how much I say in private emails, DMs, etc., because I just don’t know how private those forms of communication really are.

    Excellent discussion. Great comments. Fascinating read. Thanks everyone.

  23. What is the contemporary scholarship at the intersection of information privacy, copyrightability and archiving of tweets in light of the Library of Congress’s (LOC) ambitious project of archiving tweets to build a collection to aid researchers? As far as my understanding goes, LOC has not sought the prior informed consent of users for archiving their tweets even though LOC has signed a Deed of Gift with Twitter. I have two specific questions in this regard:
    (i) What is the threshold for determining the number of tweets that could be archived for building a library collection? In light of the Google Books decision, how would the “fair use” exception to copyrightability factor in?
    (ii) For a university library looking to build an archive of tweets, would Twitter’s consent be mandatory considering that it is building the archive purely for research purposes and thus falling under Section 108 of the Copyright Act?

    Any useful insight on these issues will be highly appreciated. Thanks.
