When Big Data Isn’t Big Enough

Throw the phrase big data out at Thanksgiving dinner and you’re guaranteed a livelier conversation. Your nervous uncle is terrified of the Orwellian possibilities that our current data collection abilities may usher in; your techie sister is thrilled with the new information and revelations we have already uncovered and those on the brink of discovery. Many people likely feel a mix of both these sentiments: enthusiasm tempered by apprehension.

Big data is bringing about change in a variety of sectors, and the health research field is one area where the potential, both positive and negative, is the most dramatic. The recent event Big Data, Health Law, and Bioethics, hosted by the Petrie-Flom Center at Harvard Law School, convened leading scholars to discuss what we have to gain, what could go wrong, and how we should prepare for a future where big data health research is the norm.

One of the most fascinating panels of the day was titled Overcoming the Downsides of Big Data. Efthimios Parasidis, Sharona Hoffman, Sarah Malanga, and Carmel Shachar spoke, Glenn Cohen moderated, and we, the audience, began siding more and more with that nervous uncle.

Earlier that day, Nicolas Terry, a self-proclaimed pro-privacy curmudgeon, set the stage for this group by noting that “health law is incredibly poor at keeping up with technology: we are always in a reactive position.” As our data collection and analysis capabilities grow, so grows the potential for harm, making it more vital than ever to think and work proactively to solve problems before they arise. Fortunately, this is just what the speakers on this panel are doing.

Sarah Malanga and her team saw the temptation to take big data’s merits too far and asked the question “Who might be missing from our largest data sets?” The merit of big data lies in its sheer magnitude. With data on increasingly larger populations, researchers have less need to extrapolate, less need to make assumptions about what may or may not translate from a single data set to the broader population, less concern about bias and non-representative samples.

The massive data sets currently being used by researchers in the health science space come from a variety of sources: medical records, insurance claims, biospecimens, wearable fitness trackers, smartphones, social media, and more. Taken together, these sources would include most Americans. But mistaking ‘most’ for ‘all’ would be an error. It is a well-documented fact that minority populations in the U.S. tend to have more health problems and less access to health resources than the white middle class. This is an infuriating problem on its own, but a reliance on data from the above sources has the potential to compound it. The populations currently facing unjust difficulties in accessing healthcare are the same ones underrepresented in the sources from which big data is drawn. For example, African American households are less likely to have internet access than white households in the United States, potentially skewing data from social media.1 Hispanic individuals are the least likely to be covered by health insurance, meaning the data from those sources is non-representative as well.1 Malanga walked through each of these sources of data and demonstrated that the populations most in need of the advances big data may bring about are the same populations that the data cannot adequately represent.

The past decades have seen major strides in ensuring inclusion as a feature of quality research, not least of these being NIH’s 1993 mandate that projects should be inclusive of women and minority populations. That decision is now decades old, and yet the problem persists. Because big data threatens to exacerbate the issue by giving researchers false confidence in the universal applicability of their results, it will become an ethical imperative to foster awareness of these limitations and to find ways to include these populations.

While the size of its data sets is big data’s greatest boon, this may prove to be an ethical bane as well. Efthimios Parasidis discussed some of the disheartening history of pharmaceutical companies manipulating data to market drugs with questionable efficacy. With bigger data sets, he argued, it will become easier to manipulate data in deceptive ways. Within the legal system, there is dissension over whether data manipulation of this sort constitutes healthcare fraud, meaning entities can continue these practices with few, if any, consequences. Whether or not Parasidis’s predictions of heightened manipulability of these data sets prove correct, asking these questions now is certainly better than waiting until harm has already been done.

Myth, legend, and history, from Babel to the Titanic, tell us that the “too big to fail” mentality is a dangerous one. Research with big data could fall into the same trap without adequate protective and preemptive measures. Big data has its own set of limitations, and a failure to recognize these could end up harming the populations most in need of quality care and introducing new, as-yet-unimagined harms. This doesn’t mean we should (or, arguably, could) abandon this new direction in research. But it does mean we should approach it cautiously.


  1. Sarah Malanga (James E. Rogers College of Law, University of Arizona), with Jonathan D. Loe and Christopher T. Robertson, “Big Data Neglects Populations Most in Need of Medical and Public Health Research and Interventions.”