Research Using Secondary Data: New Challenges and Novel Resources for Ethical Governance


The digitization of everyday life has produced a notable shift for research administrators: the ethical concerns that arise from secondary uses of large and open data now pose a greater challenge for the ethical management of research data than do the conventional challenges of primary data acquisition. As debates over consent forms give way to discussions of differential privacy, it is hard to ignore the new reality that the highest levels of risk and benefit to human participants in research may now arise from secondary data uses.

Secondary data presents myriad sources of useful material for clinical, social, and even agricultural researchers, whether that data is extracted from the application programming interfaces (APIs) of services that millions use daily (e.g., video conferencing data from Zoom, news from the New York Times), accessed through curated research archives (e.g., the Inter-university Consortium for Political and Social Research), or managed through data trusts or cooperatives (e.g., CitizenMe or the Grower Information Services Cooperative).
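To make the API route concrete, here is a minimal sketch of what API-mediated access to secondary data often looks like in practice. The endpoint, key name, query parameters, and response shape are illustrative assumptions, not any particular provider's actual interface; real services such as the New York Times API have their own URLs, authentication schemes, and terms of use that researchers must follow.

```python
import requests

# Hypothetical endpoint and key for illustration only; substitute the real
# provider's documented URL, parameters, and credential scheme.
API_URL = "https://api.example.com/v1/articles"
API_KEY = "YOUR_API_KEY"  # issued by the data provider upon registration

def fetch_records(query: str, page: int = 0) -> list[dict]:
    """Pull one page of records from a provider's REST API."""
    response = requests.get(
        API_URL,
        params={"q": query, "page": page, "api-key": API_KEY},
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP errors instead of silent bad data
    return response.json().get("results", [])

if __name__ == "__main__":
    for record in fetch_records("public health"):
        print(record.get("headline"))
```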

With so many sources of secondary research data, what should research administrators and IRB members do to understand and manage the risks and benefits presented by researchers’ uses of these big and rich data sources?

  1. Know the role of IRBs in review of research using secondary data, including applicable terminology, technologies, and terms of use that are unique to secondary data.
  2. Develop skills to identify and educate regarding the privacy and re-identification risks that arise from uses of secondary data, whether in combination with primary data sources or other secondary data uses.
  3. Hone the ability to communicate effectively with researchers and other stakeholders about the role of the IRB in secondary data use proposals and the risks and benefits that arise from the proposed research.

IRBs, Exemption 4, and Secondary Data Uses

The 2018 Common Rule revision expanded the categories of research considered “exempt from review.” These include an exemption for “secondary research uses of identifiable private information or identifiable biospecimens” when the information is “publicly available,” or when “information… [is] recorded by the investigator in such a manner that the identity of the human subjects cannot readily be ascertained directly or through identifiers linked to the subjects, the investigator does not contact the subjects, and the investigator will not re-identify the subjects” (45 CFR 46.104(d)(4)(i–iii)).

These revisions introduced important new vocabulary, and with it new challenges for reviewing research proposals that turn on the re-identification of identifiable private information. They include:

  • Requirements that administrators know whether secondary data is “publicly available”
  • Requirements that the agencies and departments implementing the revised Common Rule engage experts to “assess whether there are analytic technologies or techniques that should be considered by investigators to generate ‘identifiable private information’”
  • Requirements that research administrators know and track information on re-identification technologies, which represent a new set of tasks when determining if research presents re-identification risks to subjects

Determining if Secondary Data is “Publicly Available”

In some cases, public availability is easy to ascertain: the data exists on a website for easy download in readily accessible formats (e.g., .csv). But beyond such simple examples, the question of public availability becomes knotty.

For example, is The Cancer Imaging Archive (TCIA)—whose data is accessed through specific software and bespoke file types—considered public? Are “limited” datasets available from repositories like TCIA also public, even though they are truncated in some ways? Are data available behind API authorizations (e.g., OAuth) also public? Is such data public in the same way as data controlled by researchers and accessed only through explicit sharing by those researchers, after completion of a Google Form or an emailed request? And is data shared via data trusts public in the same sense as data on well-known repositories like data.gov?
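A short sketch may help illustrate the distinction. Both paths below retrieve nominally “public” data, but the second requires registration and an OAuth 2.0 access token before any record can be read. The URLs, grant type, and credentials are hypothetical placeholders, not any real repository's interface.

```python
import requests

# Path 1: open download -- a CSV anyone can fetch with no credentials.
open_csv = requests.get("https://data.example.gov/dataset.csv", timeout=30)
open_csv.raise_for_status()

# Path 2: an API gated by an OAuth 2.0 client-credentials grant. The data may
# still be described as "public," but access requires registration, a client
# secret, and a bearer token -- a meaningfully different kind of availability.
token_resp = requests.post(
    "https://auth.example.org/oauth/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "REGISTERED_CLIENT_ID",
        "client_secret": "REGISTERED_CLIENT_SECRET",
    },
    timeout=30,
)
token_resp.raise_for_status()
access_token = token_resp.json()["access_token"]

gated = requests.get(
    "https://api.example.org/v1/records",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=30,
)
gated.raise_for_status()
```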

Nuances in the definition of “public” create challenges for research administrators striving to determine whether a proposed project falls under applicable exemptions or not.

De-identification, Re-identification, and the Pursuit of Anonymization

Long-serving research administrators will recognize researchers’ habit of promising subjects that their data will be kept anonymous. Research administrators following debates in data protection and data management will also recognize that such promises are strained by advances in data mining, machine learning, and re-identification attacks, and by the increasing recombination of data sources—all of which raise the risk of unintended re-identification of subjects.
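A toy example shows how little recombination it can take. In the sketch below, neither invented table alone ties a name to a diagnosis, but joining them on shared quasi-identifiers (ZIP code, birth date, sex) re-identifies the “de-identified” health records; this is the same basic linkage mechanism at work in well-known re-identification attacks. All values are fabricated for illustration.

```python
import pandas as pd

# A "de-identified" health dataset: no names, but quasi-identifiers remain.
health_records = pd.DataFrame({
    "zip": ["24060", "80203"],
    "birth_date": ["1985-03-02", "1990-07-19"],
    "sex": ["F", "M"],
    "diagnosis": ["asthma", "diabetes"],
})

# A second, nominally harmless dataset (e.g., a public roll) with names.
public_roll = pd.DataFrame({
    "name": ["A. Smith", "B. Jones"],
    "zip": ["24060", "80203"],
    "birth_date": ["1985-03-02", "1990-07-19"],
    "sex": ["F", "M"],
})

# Joining on the quasi-identifiers attaches names to diagnoses.
linked = health_records.merge(public_roll, on=["zip", "birth_date", "sex"])
print(linked[["name", "diagnosis"]])
```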

Knowing the technical and ethical debates around re-identification risks is an important component of research administrators’ growing knowledge base. And there may be an additional challenge for IRBs engaged in researcher education: researchers rarely receive instruction in data protection for their disciplines or for the research data domain. That leaves research administrators and IRB reviewers with the task of teaching researchers proactive methods for de-identifying data, effective Privacy Enhancing Technologies (PETs) such as partially or fully homomorphic encryption, and even how to create strong hash identifiers, as sketched below.
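On that last point, a brief sketch of a common hashing pitfall and one stronger alternative may be useful in researcher education. This is one widely recommended pattern, not a complete de-identification strategy: an unkeyed hash of a low-entropy identifier (an email address, a national ID) can be reversed by enumerating candidate inputs, while a keyed hash (HMAC) with a secret “pepper” stored apart from the shared data resists that dictionary attack.

```python
import hashlib
import hmac
import secrets

# Secret pepper: generate once and store separately from the dataset
# (e.g., in a key vault), never alongside the pseudonymized records.
PEPPER = secrets.token_bytes(32)

def weak_pseudonym(identifier: str) -> str:
    # Pitfall: anyone can hash all plausible identifiers and match results.
    return hashlib.sha256(identifier.encode()).hexdigest()

def stronger_pseudonym(identifier: str) -> str:
    # Keyed hash: useless to an attacker who does not hold the pepper.
    return hmac.new(PEPPER, identifier.encode(), hashlib.sha256).hexdigest()

print(stronger_pseudonym("jane.doe@example.edu"))
```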

The Future of Privacy Forum (FPF), a global privacy and data protection non-profit located in Washington DC, produced a “Visual Guide to Practical Data De-Identification” that can serve as a “primer on how to distinguish different categories of data.”

Getting Research Teams the Review they Need

What should research administrators do when a secondary data use proposal falls outside the regulatory remit of the IRB? Exemption allows research to proceed without IRB review, but exemption from regulation is insufficient reassurance for a public increasingly concerned about privacy intrusions and bias arising from uses of their data by university researchers and corporate R&D departments.

Since 2020, state legislatures from Virginia to Colorado have passed privacy bills requiring that researchers’ secondary uses of data be reviewed by an ethics review board or other appropriate oversight committee. University and hospital IRBs are often not appropriate review bodies for such secondary uses, both because of the regulatory exemptions described above and because of limits on the uses of their already constrained resources.

However, other review structures are scarce. This places IRBs, research compliance teams, and university legal departments in a difficult position where complying with new privacy legislation may invite scope creep and over-extension of IRBs into data use areas that are otherwise considered exempt.

FPF has established an “Ethical Data Use Committee” (EDUC) to review secondary uses of data for research. The EDUC reviews three general types of secondary data use:

  1. Sharing of data for research purposes, specifically where that data is shared from corporate data holdings to researchers pursuing research questions that contribute to generalizable knowledge
  2. Uses of secondary data to train machine learning and artificial intelligence (AI/ML) applications
  3. Uses of corporate or academic data to research and develop products or processes that introduce novel ethical concerns

Unlike IRBs, which have regulatory authority to approve or deny research projects, the EDUC is a multi-expert review committee that offers recommendations to researchers and other organizations striving to improve the ethical profile of their proposed uses of data. Employing administrative and expert review practices similar to those of IRBs, the EDUC reviews proposed data use protocols through a consensus-based assessment of key ethical principles in research and data protection: accountability, non-harm, respect for persons, justice, beneficence, transparency, confidentiality, lawfulness, privacy, sustainability, and attention to emergency uses of data. The committee draws on in-house and external expertise in privacy, data protection, re-identification, and artificial intelligence to help organizations manage the risks of secondary uses of data.

For IRBs struggling to guide researchers in the management of secondary data analysis risks, the EDUC is available to review such protocols and to work collaboratively to give organizations the reviews they need to feel more confident in the ethical management and stewardship of their data uses.

Sara R. Jordan, PhD, is Senior Counsel, Artificial Intelligence and Ethics at the Future of Privacy Forum. Her portfolio includes the privacy implications of data sharing, data and AI review boards, privacy analysis of artificial intelligence and machine learning (AI/ML) technologies, and analysis of the ethics challenges of AI/ML. Sara is an active member of the IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. Prior to working at FPF, Sara was faculty in the Center for Public Administration and Policy at Virginia Tech (2014-2020) and in the Department of Politics and Public Administration at the University of Hong Kong (2007-2013). She is a graduate of Texas A&M University and the University of South Florida.

Guest contributors are valued members of our community willing to share their insights. The views expressed in their posts do not necessarily reflect those of PRIM&R or its employees.