OkCupid and the Ethics of Big Data Research

We clearly have entered the era of big data. Armed with petabytes of transaction data, clickstreams and cookie logs, as well as data from social networks, mobile phones, and the “internet of things,” a wide range of economic interests, including consumer marketing, health care, manufacturing, education, and government, are now in pursuit of the value of data-driven decision-making that big data promises.

At the same time, the big data that increasingly fuels economic decision-making has emerged as a rich terrain for engaging in academic research and experimentation: think of the Facebook emotional contagion experiment of 2014, where the news feeds of nearly 700,000 users were altered to study the impact on mood; or when Harvard researchers released the first wave of their “Tastes, Ties and Time” dataset in 2008, comprising four years’ worth of complete Facebook profile data harvested from the accounts of an entire cohort of 1,700 college students; or in 2006, a decade ago, when AOL released over 20 million search queries from 658,000 of its users to the public in an attempt to support academic research on search engine usage. These big data research activities yielded novel results, while also generating considerable controversy.

This controversy recently caught up with a group of Danish researchers who, led by Aarhus University graduate student Emil O. W. Kirkegaard, publicly released a dataset of nearly 70,000 users of the online dating site OkCupid, including usernames, age, gender, location, what kind of relationship (or sex) they’re interested in, personality traits, and answers to thousands of profiling questions used by the site.

When asked whether the researchers attempted to anonymize the dataset, Kirkegaard replied bluntly: “No. Data is already public.” This position is repeated in the accompanying draft paper, “The OKCupid Dataset: A Very Large Public Dataset of Dating Site Users,” posted to the online peer-review forums of Open Differential Psychology, an open-access online journal also run by Kirkegaard:

Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form.

For those concerned about privacy, research ethics, and the rise of publicly releasing large data sets, this logic of “but the data is already public” is an all-too-familiar refrain used to easily set aside thorny ethical concerns. It was used by the Harvard researchers in the “Tastes, Ties, and Time” study, and it appeared again in 2010, when Pete Warden, a former Apple engineer, exploited a flaw in Facebook’s architecture to amass a database of names, fan pages, and lists of friends for 215 million public Facebook accounts, and announced plans to make his database of over 100 GB of user data publicly available for further academic research.

In each case, including the latest OkCupid controversy, researchers were hoping to advance our understanding of a phenomenon by making publicly available large datasets of user information they considered already in the public domain. One of the bedrocks of research ethics – protecting the privacy of subjects and maintaining the confidentiality of any data collected – appears to these big data researchers as a non-issue. As Kirkegaard stated: “Data is already public.” No harm, no ethical foul.

But in actuality, the relative newness – and rapid expansion – of big data-based research presents us with what the computer ethicist James Moor would call “conceptual muddles”: the inability to properly conceptualize the ethical values and dilemmas at play in a new technological context.

Consider the privacy concerns with big data research and data releases like those described above. Privacy is typically protected within the context of research ethics through a combination of various tactics and practices, including engaging in data collection under controlled or anonymous environments, limiting the personal information gathered, scrubbing data to remove or obscure personally identifiable information, and using access restrictions and related data security methods to prevent unauthorized access and use of the research data itself. The nature and understanding of privacy become muddled, however, in the context of big data research, and as a result, ensuring it is respected and protected in this new domain becomes challenging.

For example, the determination of what constitutes “private information” – and thus triggers particular privacy concerns – becomes difficult within the context of big data research. Distinctions within the regulatory definition of “private information” – namely, that it only applies to information which subjects reasonably expect is not normally monitored or collected and not normally publicly available – become less clearly applicable when considering the data environments and collection practices that typify big data research, such as the wholesale mining of Facebook activity or public OkCupid accounts.

When considered through the lens of the regulatory definition of “private information,” social media postings are often considered public, especially when users take no visible, affirmative steps to restrict access. As a result, big data researchers conclude subjects are not deserving of particular privacy consideration. For example, the Harvard/UCLA researchers argued that subjects do not have a reasonable expectation of privacy with their Facebook information, noting “We have not accessed any information not otherwise available on Facebook,” and equating their collecting of the profile data with “sitting in a public square, observing individuals and taking notes on their behavior.” Similarly, much of the justification for the appropriateness of harvesting and releasing OkCupid profile data centers on the fact that profile information is posted for the very purpose of being visible to other users, and thus no privacy expectations exist. In the words of the OkCupid researchers, “releasing this dataset merely presents [the user profile data] in a more useful form.”

Yet, in reality, the social media platforms frequently used for big data research purposes represent a complex environment of socio-technical interactions, where users fail to fully understand how their social activities might be regularly monitored, harvested, and shared with third parties, where privacy policies and terms of service are not fully understood and change frequently, and where the technical infrastructures and interfaces are designed to make restricting information flows and protecting one’s privacy difficult.

As a result, it is difficult to understand with certainty what a user’s intention was when sharing information on a social media platform, and whether users recognize that providing information in a social environment also opens it up for widespread harvesting and use by researchers. This uncertainty in the intent and expectations of users of social media and internet-based platforms – often fueled by the design of the platforms themselves – creates numerous conceptual muddles in our ability to accept the justifications of “we have not accessed any information not otherwise available” or “data is already public” in order to alleviate potential privacy concerns in big data research.

In my critique of the Harvard/UCLA Facebook study from 2010, I warned:

The…research project might very well be ushering in “a new way of doing social science,” but it is our responsibility as scholars to ensure our research methods and processes remain rooted in long-standing ethical practices. Concerns over consent, privacy and anonymity do not disappear simply because subjects participate in online social networks; rather, they become even more important.

Six years later, with big data again promising a new way of “doing social science,” this warning remains all too true. The OkCupid data release reminds us that the ethical, research, and regulatory communities must engage in collaborative, dedicated, and multi-pronged efforts to address the conceptual muddles present in big data research, reframe the ethical dilemmas inherent in such research projects, expand educational and outreach efforts, and develop policy guidance focused on the unique challenges of big data research ethics. By attending to such concerns, we will be better positioned to understand and address the ethical dimensions of big data research projects, resolve the existing conceptual muddles, and thereby ensure innovative research can take place while protecting the interests of research ethics broadly.

Editor’s note: Dr. Zimmer also explored this topic in an article for Wired magazine published May 14, 2016.

Michael Zimmer, PhD, is an associate professor in the School of Information Studies and director of the Center for Information Policy Research at the University of Wisconsin-Milwaukee.