The growth of research projects relying on pervasive data — big datasets about people’s lives and activities that can be collected without them knowing — is testing the ethical frameworks and assumptions traditionally used by researchers and ethical review boards to ensure adequate protection of human subjects.
For more than a decade, researchers interacting with pervasive data have crafted novel projects and methodologies, while generating considerable ethical controversies. For example:
- In 2006, AOL released over 20 million search queries from 658,000 users to the public in an attempt to support academic research on search engine usage, resulting in individual users being re-identified based on an analysis of their search activities.
- In 2008, Harvard researchers released the first wave of their “Tastes, Ties and Time” dataset, comprising four years’ worth of complete Facebook profile data harvested from the accounts of an entire cohort of 1,700 college students, spurring concerns about confidentiality and the lack of consent.
- In 2010, Pete Warden, a former Apple engineer turned independent researcher, exploited a weakness in Facebook’s architecture to amass a database of names, fan pages, and lists of friends for 215 million public Facebook accounts which he planned to make public. Under legal pressure from Facebook, Warden ultimately destroyed the database.
- In 2014, academic researchers, in partnership with Facebook, sparked an uproar when they altered the emotional content within the news feeds of nearly 700,000 Facebook users to study the impact on users’ mood.
- In 2016, a group of Danish researchers was criticized after publicly releasing a dataset of nearly 70,000 users of the online dating site OkCupid, including usernames, age, gender, location, what kind of relationship (or sex) they were interested in, personality traits, and answers to thousands of personal profiling questions.
- In 2018, news of the Facebook–Cambridge Analytica data scandal broke, revealing how personally identifiable information of 87 million Facebook users was improperly harvested in 2014 in order to build psychographic profiles for targeted political advertising.
- And just this week, a study of how teams use collaborative platforms was released, where Dropbox gave researchers access to project-folder-related data over a two-year period from about 400,000 users across 1,000 universities.
In each of these examples, researchers hoped to advance our understanding of a phenomenon by analyzing — and in some cases publicly sharing — large collections of pervasive data they considered freely available for analysis. Yet, in each case, controversies about the ethics behind such pervasive data-based projects quickly surfaced. Many of the basic tenets of research ethics — such as protecting the privacy of subjects, obtaining informed consent, maintaining the confidentiality of any data collected, and minimizing harm — appeared to be deficient in the researchers’ methodological protocols.
Over the course of this same decade, a growing number of scholars and advocates have made strides to address many of these ethical lapses, typically working within the domains of internet research ethics and data ethics. Regulatory authorities responsible for the oversight of human subject research are starting to confront the myriad ethical concerns pervasive data research brings to light, and numerous scholarly associations have drafted ethical guidelines for internet research, including the American Psychological Association’s Advisory Group on Conducting Research on the Internet, the Association of Internet Researchers (AoIR) Ethics Working Group, and the ACM’s SIGCHI Research Ethics Committee.
Yet, even with this increased attention and guidance surrounding research ethics, significant gaps in our understanding and practices persist. Across the research community — as glaringly evident in the examples provided above — there is considerable disagreement over basic research ethics questions and policies, such as what constitutes “public” data, whether informed consent is necessary when dealing with “found data” or when a platform’s terms of service might allow sharing data with researchers, and even at what stage computational research becomes human subjects research requiring particular ethical protection. Given all this, there is great uncertainty among the research community on how to address research ethics in the realm of pervasive data.
This uncertainty, however, does not need to lead to paralysis. In the conclusion of our book, Internet Research Ethics for the Social Age: New Challenges, Cases, and Contexts, Katharina Kinder-Kurlanda and I note how:
ethically-informed research practices come out of processes of deliberation and decision making under great uncertainty, which often may go wrong or seemingly force us towards less-ideal options.
We also argue that ethical decision making in the context of internet research cannot be easily governed by blanket rules. For example, if a rule were established that no tweets should ever be quoted to protect users’ privacy and respect the lack of specific informed consent to research being conducted, this could (paternalistically) ignore users’ carefully crafted public communications while failing to acknowledge their agency and authorship.
Faced with this dual challenge of growing uncertainty and the need for flexibility when addressing research ethics in pervasive data, Helen Nissenbaum’s concept of contextual integrity emerges as a valuable heuristic to guide researchers.
Contextual integrity is a benchmark theory of privacy, a conceptual framework that links the protection of personal information to the norms of personal information flow within specific contexts. Rejecting the traditional dichotomy of public versus private information, the theory of contextual integrity ties adequate privacy protection to the preservation of informational norms within specific contexts, providing a framework for evaluating the flow of personal information between agents to help identify and explain why certain patterns of information flow are acceptable in one context, but viewed as problematic in another.
To aid the application of contextual integrity, Nissenbaum provides a nine-step decision heuristic to analyze the significant points of departure created by a new process, thus determining if the new practice represents a potential violation of privacy:
- Describe the new practice in terms of its information flows.
- Identify the prevailing context in which the practice takes place at a familiar level of generality, which should be suitably broad such that the impacts of any nested contexts might also be considered.
- Identify the information subjects, senders, and recipients.
- Identify the transmission principles: the conditions under which information ought (or ought not) to be shared between parties. These might be social or regulatory constraints, such as the expectation of reciprocity when friends share news, or the obligation for someone with a duty to report illegal activity.
- Detail the applicable entrenched informational norms within the context, and identify any points of departure the new practice introduces.
- Make a prima facie assessment: there may be a violation of contextual integrity if there are discrepancies in the above norms or practices, or if there are incomplete normative structures in the context to support the new practice.
- Evaluation I: Consider the moral and political factors affected by the new practice. How might there be harms or threats to personal freedom or autonomy? Are there impacts on power structures, fairness, justice, or democracy? In some cases, the results might overwhelmingly favor accepting or rejecting the new practice, while in more controversial or difficult cases, further evaluation might be necessary.
- Evaluation II: How does the new practice directly impinge on values, goals, and ends of the particular context? If there are harms or threats to freedom or autonomy, or fairness, justice, or democracy, what do these threats mean in relation to this context?
- Finally, on the basis of this evaluation, a determination can be made as to whether the new process violates contextual integrity in consideration of these wider factors.
The first six steps involve modeling the existing and new contexts, allowing a prima facie judgment to be rendered as to whether the new process significantly violates the entrenched norms of the context. These steps help us identify any immediate “red flags” that violate contextual integrity. The final steps of the heuristic involve a wider examination of the moral and political implications of the process to make a recommendation as to whether the new practice should be allowed or resisted.
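The modeling stage of the heuristic — steps one through six — can be made concrete with a small sketch. The code below is a hypothetical illustration, not part of Nissenbaum’s framework: it represents an information flow by its key parameters (sender, recipient, subject, information type, transmission principle) and flags the parameters on which a new practice departs from the entrenched norms of a context. All names and the OkCupid-style example values are my own assumptions; the moral and political evaluation of the final steps, of course, cannot be automated.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InformationFlow:
    """One flow of personal information: who sends what, about whom,
    to whom, under which transmission principle (steps 1-4)."""
    sender: str
    recipient: str
    subject: str
    info_type: str
    transmission_principle: str

def departures(entrenched: set[InformationFlow],
               new_flow: InformationFlow) -> list[str]:
    """Steps 5-6: compare a new practice against the entrenched norms
    and return the parameters on which it departs, supporting a prima
    facie judgment. An empty list means no immediate red flag; the
    evaluation steps (7-9) still require human deliberation."""
    if new_flow in entrenched:
        return []
    flags = []
    for field in ("sender", "recipient", "subject",
                  "info_type", "transmission_principle"):
        if not any(getattr(norm, field) == getattr(new_flow, field)
                   for norm in entrenched):
            flags.append(field)
    # Every parameter may be individually familiar while the pairing is new.
    return flags or ["combination"]

# Hypothetical entrenched norm on a dating platform: members share
# profile answers with other logged-in members, on the expectation
# the answers stay within the platform.
norm = InformationFlow("user", "site_members", "user",
                       "profile_answers", "visible to logged-in members")

# The new practice: a researcher releases the same answers publicly.
release = InformationFlow("researcher", "public", "user",
                          "profile_answers", "open publication")

print(departures({norm}, release))
# → ['sender', 'recipient', 'transmission_principle']
```

Note that the departures surface precisely where the intuition lies: the information type and subject are unchanged (“the data was already there”), but the sender, the recipients, and the transmission principle have all shifted, which is what a prima facie assessment would flag for further evaluation.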
Nissenbaum’s theory of contextual integrity has been applied in numerous contexts where technological developments have forced conceptualizations of privacy into a state of flux, including vehicle-to-vehicle communication protocols, search engine privacy, the privacy implications of cloud-based storage platforms, smartphone applications, and learning analytics.
Considered in the context of pervasive data, contextual integrity is a useful tool for properly addressing the oft-repeated refrains that “the data was already public” or “the terms of service allowed sharing the data without specific consent” when attempting to justify why pervasive data research does not pose a privacy or ethical concern.
For example, my recent article in Social Media + Society, “Addressing Conceptual Gaps in Big Data Research Ethics: An Application of Contextual Integrity”, uses Nissenbaum’s theory to interrogate the ethics of Emil Kirkegaard’s collection and public release of data scraped from nearly 70,000 users of the OkCupid online dating site.
Kirkegaard, then a graduate student at Aarhus University in Denmark, collected the dataset between November 2014 and March 2015 using a web scraper — an automated tool that extracts data from web pages. After creating an OkCupid profile to gain access to the site, the scraper targeted basic profile information such as username, age, gender, sexual orientation, and location, while also harvesting answers to the 2,600 most popular multiple-choice questions on the site, such as users’ religious and political views, whether they take recreational drugs, if they have been unfaithful to a spouse, or whether users like to be tied up during sex. The resulting database, along with a draft paper analyzing the data, was posted on the Open Science Framework, a web platform that encourages open source science research and collaboration, as well as to the online peer-review forums of Open Differential Psychology, an open-access online journal also run by Kirkegaard.
When asked, via Twitter, whether he attempted to anonymize the dataset, Kirkegaard replied bluntly: “No. Data is already public”, a position expanded on in the accompanying draft paper:
Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form. (Kirkegaard & Bjerrekær, 2016, p. 2)
Kirkegaard further justified the inclusion of usernames in the released data to aid future researchers who might want to “fill in remaining [data]points” such as height, profile text, and even profile photos, which Kirkegaard’s team failed to initially capture due to technical limitations. Based on all available documentation, Kirkegaard did not seek any form of consent — from OkCupid or its users — at any point during the collection, use, and release of the profile data, nor did he obtain any ethics approval or guidance from his institution or related oversight body.
Considering Kirkegaard’s OkCupid study, it might be easy to accept that users made certain profile information available on the online dating platform, and all the Danish researchers did was present the data “in a more useful form”. Yet, an analysis of Kirkegaard’s actions through the lens of contextual integrity provides a very different calculus. As I argue, when considering the transmission principles and informational norms of the context, we can easily determine that the actions taken by Kirkegaard disrupt contextual integrity. And once we evaluate that disruption in terms of the moral and political values of the users, as well as the broader goals of the context itself, we conclude that the impacts of Kirkegaard’s actions are not justifiable. His disruption of the informational norms within the context of OkCupid brought no benefit — directly or indirectly — to the users or the context, and only degraded the values and goals of the community and its members. This conclusion is in striking contrast to Kirkegaard’s assertion that the supposed “publicness” of the data means little pause is necessary before capturing and processing thousands of OkCupid profiles.
By demanding that information collection and transmission be appropriate within a given context, contextual integrity can guide pervasive data researchers’ attentiveness to the normative bounds of how information flows on a particular social network or community under study. Thus, maintaining the contextual integrity of those information flows can help us be attentive to many of the ethical uncertainties that plague pervasive data research ethics.
Rather than simply waving the “but the data is already public” or “consent is implied in terms of service” magic wands to make the ethical concerns disappear, walking through contextual integrity’s decision heuristic can provide a much more nuanced — and contextually sensitive — approach to considering the ethics of a particular action or intervention into a research context. Embracing contextual integrity will undoubtedly guide researchers through similar ethical dilemmas in the growing domain of pervasive data research.