Transparently Unclear: Data Sharing in Machine Learning Research

There’s a growing trend in Social, Behavioral, and Education Research (SBER): machine learning research, in which investigators often request to obtain, through direct interaction and intervention, various sets of data on human subjects, including their physiological data (i.e., data obtained through either invasive or non-invasive means) and/or biometric data (e.g., audio/visual recordings). The research as originally conceived may or may not have been considered human subjects research, but its ultimate purpose is to teach machines how to think, draw conclusions, and process information in much the same way humans do. Additionally, the research seeks to enable these machines to identify new patterns of human behavior that a human typically would not be able to recognize (indeed, an interesting thing about machine learning technology in SBER is that it requires human subjects in order to understand when/how/why we humans do things, and uses biometric data to draw conclusions about what that all means). This requires large sets of data from, on, or about humans, and future secondary use of that data is quite valuable.

What happens when investigators wish to collect additional external data to enhance their datasets or share their data with others? In reviewing such requests, the first places an IRB typically looks are the consent form and original IRB application. What did the participant initially agree to? What were the confidentiality limitations? The consent form (and sometimes a data use agreement) usually contains language about what data will be shared, how it will be used, and with whom it will be shared.

While sharing research data is nothing new, and is indeed a necessary aspect of advancing research, advancing technology (such as the aforementioned machine learning research in SBER) has led to new applications in human subjects research that extend far beyond the initial research goals.

Whether conducting biomedical research or SBER, investigators have always been obligated to obtain, maintain, and share research data responsibly and ethically. Transparency is key. What happens if we don’t know what we will find, or how we may want to use the data in the future as our machines evolve and identify new pathways in research? Some institutions have adopted a Broad Consent policy, while others, often for good reason, have decided not to.

As our technology advances, however, and data can be used in ways far beyond what we had originally envisioned or anticipated, can we still use “good data” even if participants did not consent to that use, or if the use exceeds what the consent form originally allowed? If biometric or physiological data were obtained to study facial expressions and emotion in relation to children’s ability to cooperate in small groups, can that data then be used in unrelated research, for example, to identify signs of schizophrenia? Some PIs (and even some IRBs) argue that this type of research does not involve human subjects because the goal is only to “teach machines” how to identify these characteristics.

What’s interesting about machine learning technology in SBER is that it requires human subjects in order to understand when/how/why we humans do things, and it uses biometric data such as tone of voice, tension in the body or face, rise in blood pressure, how we breathe, whether we maintain eye contact, etc., to draw conclusions about what that all means.

In PRIM&R’s May 1 webinar, Data Sharing in SBER: Balancing Transparency and Human Research Protections, presenters considered the current regulatory framework around data sharing and examined how the IRB protects the rights and interests of individuals and their personal information. This becomes quite the challenge in the rapidly evolving field of data use and sharing, especially in machine learning research.

Such research is evolving quite rapidly and can be used in various ways and for numerous purposes. In education, by using audio and video recordings of teachers teaching young children, we can develop speech analysis and recognition technology that is then coded and “taught” to computers. The audio is then fed to the computers, which are now able to identify ways in which teachers can help or hinder childhood development.

Other forms of this technology can be used to record individuals acting aggressively. The behavior is again coded, taught to computers, and analyzed for aggressive/hostile behavior by measuring physiological and biometric data from a distance. This technology has dramatic implications; it can be used by health care services, insurance providers, employers, and even law enforcement in determining whom to provide health care or insurance to, whom to hire, and whom to classify as a likely criminal.

Ultimately, data sharing is what will allow this type of research to continue to evolve, and limiting it could threaten that potential. However, these evolving trends and the potential uses for the new technologies raise new questions: How do you define a human subject in these cases? How do you define research in these cases? Do we have the proper technological expertise on our current Review Board(s) to adequately make these decisions? How do we move responsibly into the future, and still share large sets of data ethically and in compliance?

Tamiko Eto was a member of PRIM&R’s Blog Squad for the webinar Data Sharing in SBER: Balancing Transparency and Human Research Protections. To learn how you can participate in the Blog Squad, visit our Blog Squad page.