Engineering Biology: Big Data—A Path Forward

Jacob Oppenheim, PhD

October 27, 2023

The social contract around big data and information technology is broken.

The combination of years of “Big Data” hype and obviously flawed inferences, of overpromising and under-delivering, has led to pervasive online tracking and a miasma of distrust. It is simultaneously too difficult to deploy novel consumer-facing information technology and too difficult to avoid the sale, or at least the use, of one's personal information. Few bulk datasets are made public anymore, while those that are shared are frequently scrambled in the name of privacy. All the while, the number of spam emails and unsolicited marketing phone calls continues to increase.

At the core of this mess lies a misunderstanding about the appropriate uses of big data and a privacy regime that prioritizes the wrong things. Current frameworks are written as if:

  • Any sufficiently large heap of data is ipso facto powerful, useful, and dangerous.
  • Any reidentification risk, no matter how small or theoretical, is too much.


The consequences are concrete:

  • The 2020 census results were locally confounded in response to an article theorizing a way to re-identify a tiny fraction of respondents. In practice, this meant moving respondents within set areas, regardless of neighborhood or jurisdictional boundaries, making the results next to useless for academic, governmental (at times imperiling redistricting), or commercial purposes.
  • It is increasingly difficult to build new consumer web and software applications that adapt to user behavior, especially outside the US. Heightened regulatory scrutiny and a thicket of regulations bound the accumulation, storage, and use of data without a consistent theory of what should and should not be allowed.

This precautionary focus misses the harms that are actually being foisted upon us today: the panopticon of web tracking, the liquid market in personal phone numbers and email addresses, and the continuous growth of dubiously consented aggregation of personal data (fingerprints, faces, and now retinal scans). We need a new framework for regulation that extends beyond reidentification alone.

The focus of privacy needs to be around risk and effect.

Consider the Netflix Prize dataset. The putative risk that led to the takedown of the dataset was the chance of identifying a Netflix customer’s IMDb profile. But what was the cost of that link being established? On an individual basis, nothing. On a population basis, also nothing, because that risk is rare. We are overly protective against low-likelihood privacy risks, while ignoring the liquid market in personal data that produces the bulk of the harms¹.

This example suggests a novel division of datasets: Population versus Individual.

Population datasets are valuable for giving a picture of groups of individuals. They capture characteristics of people without identifying information (or with such information clearly scrubbed). We know from a decade of big data hype that these data do not pose a risk on a personal level². They are, however, incredibly important for demographics, epidemiology, governance³, research, and business. Commerce matters. The benefits of technological and practical innovation can only accrue if producers can be connected with markets. Statistical information is part of this and should not be obfuscated.

We need a safe harbor for population datasets that defines acceptable levels of reidentification risk (that are not 0!) given the potential harms of the data at hand. In general, these risks arise where a record is an outlier in many domains simultaneously. By definition these cases are rare and, more importantly, non-generalizable. Statistical inference can only be done on multiple records: n must be greater than one. We can set the minimum unit threshold however we desire, but cases rare enough to be reidentified are by definition not useful! Scrambling entire datasets to avoid them is unnecessary, especially when a sufficiently small query can be blocked in software.
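Blocking small queries in software can be sketched in a few lines. The following is a minimal illustration, not a vetted disclosure-control system: the function names, the threshold of 5, and the toy records are all assumptions made for the example, and it does not defend against more elaborate attacks such as differencing two overlapping queries.

```python
from collections import Counter

# Hypothetical threshold; the right value depends on the data at hand.
MIN_GROUP_SIZE = 5


def guarded_count(records, predicate, min_group_size=MIN_GROUP_SIZE):
    """Count matching records, or return None if the group is too small to release."""
    n = sum(1 for r in records if predicate(r))
    return n if n >= min_group_size else None


def guarded_group_counts(records, key, min_group_size=MIN_GROUP_SIZE):
    """Count records per group, dropping every group below the threshold."""
    counts = Counter(key(r) for r in records)
    return {group: n for group, n in counts.items() if n >= min_group_size}


# Toy data, invented for illustration only.
records = (
    [{"zip3": "021", "age_band": "30-39"}] * 12
    + [{"zip3": "021", "age_band": "90+"}] * 1
    + [{"zip3": "100", "age_band": "30-39"}] * 7
)

# A broad query is answered; a query that isolates one person is blocked.
print(guarded_count(records, lambda r: r["zip3"] == "021"))
print(guarded_count(records, lambda r: r["age_band"] == "90+"))
print(guarded_group_counts(records, lambda r: (r["zip3"], r["age_band"])))
```

The underlying records stay unscrambled; only the answers that would expose a tiny group are withheld.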

Thresholds for re-identification risk should be adaptive. The cost of identification of an IMDb profile is next to nothing; that of someone’s health records is considerably higher. A prudential regulatory regime must weigh reidentification risk against the cost thereof. However these lines are drawn, it cannot be forgotten that the map is not the territory. Collections of records are at best a poor approximation to the complexities of human life and should not be treated as oracles. By providing a safe harbor, we can encourage a return to the open sharing of proprietary datasets for research and exploratory purposes, as was common until the early 2010s.

Individual datasets provide useful information about people even without much depth: for instance, phone numbers or fingerprints. These should be regulated tightly. Certain classes of data, when misused, are troublesome regardless of reidentification. In the panopticon, marketing can bleed into harassment. The availability of PII (Personally Identifiable Information) enables scammers. Here we need regulation, starting perhaps with stronger penalties for phone and email scammers, to destroy the underlying market, and a strong right to be forgotten.

The accumulation of personal data created by using a given technological service needs to be more carefully regulated. Its persistence in an age of hacking and leaks is troublesome. Is GDPR the right approach? Perhaps not, but at least it provides a theory for evaluating and dealing with these harms. The right to be forgotten may cost too much but is a fit-for-purpose solution.

Between these two cases lies a muddled gray zone exemplified by the accumulation of data for research purposes. Rare disease datasets accumulate information about patients: demographics, genomics, healthcare records, and more, which can shade into re-identifiability even if each data type is de-identified. Each additional datum adds information but also adds risk. We need to be able to identify when a population dataset becomes something that is deeply personal or highly re-identifiable, and we need rules for how to handle it.
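One rough way to see a dataset drift from population toward individual is to count how many records become unique as attributes are added. The sketch below is an illustration under stated assumptions: the helper names, column names, and the eight toy records are invented, and the smallest group size it reports is the k of k-anonymity, used here as a simple proxy rather than a full measure of re-identification risk.

```python
from collections import Counter


def group_sizes(records, columns):
    """Count how many records share each combination of values in `columns`."""
    return Counter(tuple(r[c] for c in columns) for r in records)


def smallest_group(records, columns):
    """Size of the smallest group (the k in k-anonymity) for these columns."""
    return min(group_sizes(records, columns).values())


def unique_fraction(records, columns):
    """Fraction of records that are the only member of their group."""
    sizes = group_sizes(records, columns)
    return sum(1 for n in sizes.values() if n == 1) / len(records)


# Toy records, invented for illustration only.
records = [
    {"sex": "F", "age_band": "30-39", "region": "NE", "variant": "A"},
    {"sex": "F", "age_band": "30-39", "region": "NE", "variant": "B"},
    {"sex": "F", "age_band": "30-39", "region": "SW", "variant": "A"},
    {"sex": "F", "age_band": "40-49", "region": "NE", "variant": "A"},
    {"sex": "M", "age_band": "30-39", "region": "NE", "variant": "A"},
    {"sex": "M", "age_band": "30-39", "region": "SW", "variant": "B"},
    {"sex": "M", "age_band": "40-49", "region": "SW", "variant": "A"},
    {"sex": "M", "age_band": "40-49", "region": "SW", "variant": "A"},
]

# Add one attribute at a time and watch the unique fraction climb.
all_columns = ["sex", "age_band", "region", "variant"]
for i in range(1, len(all_columns) + 1):
    cols = all_columns[:i]
    print(cols, smallest_group(records, cols), unique_fraction(records, cols))
```

In this toy example no record is unique on sex alone, while three quarters of the records are unique once all four attributes are combined, even though no single attribute identifies anyone.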

Population data matter even if the promises of big data were vastly overblown. In our reaction to over a decade of hype, we should act prudently and not cripple the ability of researchers, government, and, yes, businesses to better understand the vast diversity of mankind. We must focus on the harms of the panopticon and tracking, while allowing large datasets to grow safely, secure in the knowledge that there is no oracle of “big data.” The map is not the territory: all the more reason to build better maps.


¹ E.g., spam emails, robocalls, phishing risk, and data breaches.

² Arguably, most of these data are useless at predicting personal characteristics. My friends in big tech are convinced they are vital for ad targeting (perhaps) and user customization (more likely).

³ There are many good reasons why the US Government has a constitutional obligation to conduct a census and has been publishing statistical information for centuries.
