Engineering Biology: ML in Bio—Supervised Learning is Core IP

Jacob Oppenheim, PhD

April 26, 2023

The hope with Machine Learning has long been that we can replace complex, slow, and expensive physical processes with accurate predictions inferred directly from data. As I’ve written previously, the complexity and scarcity of data make supervised learning like this less relevant in problems of biology. Unlike internet companies, which generate reams of labeled data daily, we have experimental throughput that is orders of magnitude lower and data modalities that are considerably more complex.

This is not to say, though, that our goal should not be exploiting supervised learning to enhance our ability to engineer biology. Rather, we need to take it seriously and orient the production of data and scientific knowledge to enable it. Supervised learning is an enabling technology tied directly into core IP, scaling the ability to discover, requiring a consistency of process and focus throughout an organization.

It’s too easy to say “use Machine Learning to predict the results of X” without reflecting on what the real use case is. In practice, we can identify several, including:

When labeled examples are plentiful, we can automate and expand them. This is the simplest case: generate a lot of consistent data and label it. Allow models to slowly replace the labeling process. We can do this at all levels of information generation, but as we get further up the “stack,” it becomes considerably more complicated. Supervised learning requires consistent data generation and (mostly) consistent labeling. The more complex the underlying data generating process — in reality, the series of dependent processes for data generation — the harder this becomes.

Typically, a series of observations, calculations, and scientific models are applied by experts in and out of the lab to make decisions based on data. These steps are slow and error-prone, and they exhibit person-to-person bias. Often, we can build supervised models to predict their results and debias individual raters. In my experience, this has included both inferring the health of a plant seedling based on images and automating predictions of payer and patient cost for drugs based on prices. Heuristics tend to be easier to automate as they sit close to the bottom of the information generation “stack”: people are using them to transform raw data into something more refined that forms the basis of their (scientific) knowledge work.
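To make the debiasing step concrete, here is a minimal sketch in Python. It assumes several raters have scored the same items (say, seedling health on a numeric scale) and that each rater's bias is a constant offset from the per-item consensus. The function names and the toy scores are hypothetical, and a constant offset is only the simplest possible bias model. The corrected scores would then serve as training labels for a supervised model.

```python
from collections import defaultdict
from statistics import mean

def rater_offsets(scores):
    """scores: list of (item_id, rater_id, score) tuples.

    Returns each rater's mean deviation from the per-item consensus.
    """
    by_item = defaultdict(list)
    for item, _, score in scores:
        by_item[item].append(score)
    # Consensus is the plain average over everyone who rated the item.
    consensus = {item: mean(vals) for item, vals in by_item.items()}

    deviations = defaultdict(list)
    for item, rater, score in scores:
        deviations[rater].append(score - consensus[item])
    return {rater: mean(devs) for rater, devs in deviations.items()}

def debias(scores):
    """Subtract each rater's estimated offset from their scores."""
    offsets = rater_offsets(scores)
    return [(item, rater, score - offsets[rater]) for item, rater, score in scores]

# Toy data (made up): rater B scores every seedling one point higher than A.
raw = [
    ("s1", "A", 3.0), ("s1", "B", 4.0),
    ("s2", "A", 5.0), ("s2", "B", 6.0),
    ("s3", "A", 4.0), ("s3", "B", 5.0),
]
print(rater_offsets(raw))
print(debias(raw))
```

In this toy case the two raters agree exactly once their offsets are removed. Real rater bias is rarely this tidy, which is one reason to keep experts in the loop while the model is being built.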

Anomaly Detection
With enough data generated, we can start to identify anomalous results and flag them for review (or experiments for repeats). We may not even need that much data to do this: just a series of similar enough experiments that we can use to fit a model of what expected results should look like. We can quantify the “surprise” in a result through model uncertainty and inform our thresholds and even underlying model parameters using scientific expertise as a strong prior.
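As a minimal sketch of this idea in Python: assume repeated runs of a similar experiment yield one numeric readout each, and model them as roughly normal. Expert knowledge enters as a prior on the expected value, weighted like a handful of pseudo-observations. The function names, the prior weight, the threshold of 3, and the readout values are all illustrative assumptions, not a recommended configuration.

```python
from math import sqrt
from statistics import mean, stdev

def surprise_score(history, new_value, prior_mean, prior_weight=2.0):
    """How many predictive standard deviations new_value sits from expectation."""
    n = len(history)
    # Shrink the sample mean toward the expert's prior; prior_weight
    # behaves like that many extra observations at prior_mean.
    expected = (prior_weight * prior_mean + n * mean(history)) / (prior_weight + n)
    # Predictive spread: observed noise plus leftover uncertainty in the mean.
    predictive_sd = stdev(history) * sqrt(1.0 + 1.0 / (prior_weight + n))
    return abs(new_value - expected) / predictive_sd

def flag_for_review(history, new_value, prior_mean, threshold=3.0):
    """True when the result is surprising enough to review or repeat."""
    return surprise_score(history, new_value, prior_mean) > threshold

# Made-up normalized readouts from six earlier runs of the same assay.
history = [0.98, 1.02, 1.01, 0.97, 1.03, 0.99]
print(flag_for_review(history, 1.00, prior_mean=1.0))  # in line with past runs
print(flag_for_review(history, 1.60, prior_mean=1.0))  # far outside past runs
```

With only six prior runs the model is crude, but it is enough to separate a typical result from one that deserves a second look. The threshold is exactly the kind of parameter that scientific expertise should set.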

As a Replacement for Computation
Many scientific fields rely on detailed, accurate, and extremely complex fundamentals-based computational models. Machine Learning can speed up these models by training on a set of previous results that are expensive to simulate in detail (fluid mechanics), to calculate with perfect precision (astronomy), or to generate over a fine grid of experiments (molecular dynamics). We can then identify regions where predictions seem dubious or have low accuracy and run the full computational model there. This is a fruitful and rapidly growing area of science.
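The pattern of a cheap surrogate with a fallback to the full model can be sketched in a few lines of Python. Everything here is an illustrative assumption: expensive_simulation is a stand-in function rather than a real physics code, the surrogate is a simple nearest-neighbor interpolator rather than a trained neural network, and "dubious" is defined as neighbors that disagree or a query far from any cached result.

```python
import math
from statistics import pstdev

def expensive_simulation(x):
    """Stand-in for a slow first-principles model (hypothetical)."""
    return math.sin(3.0 * x) + 0.5 * x

class Surrogate:
    def __init__(self, k=3, tolerance=0.15):
        self.k = k
        self.tolerance = tolerance
        self.cache = []      # (input, simulated output) pairs seen so far
        self.full_runs = 0   # how often we paid for the full model

    def run_full_model(self, x):
        y = expensive_simulation(x)
        self.cache.append((x, y))
        self.full_runs += 1
        return y

    def predict(self, x):
        """Return (value, used_full_model)."""
        nearest = sorted(self.cache, key=lambda p: abs(p[0] - x))[: self.k]
        if len(nearest) < self.k:
            return self.run_full_model(x), True
        ys = [y for _, y in nearest]
        # Doubt grows when neighbors disagree or the query is far from data.
        doubt = pstdev(ys) + abs(nearest[0][0] - x)
        if doubt > self.tolerance:
            return self.run_full_model(x), True
        # Otherwise interpolate with inverse-distance weights.
        weights = [1.0 / (abs(xi - x) + 1e-9) for xi, _ in nearest]
        estimate = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
        return estimate, False

model = Surrogate()
for i in range(21):                # seed with a coarse grid on [0, 1]
    model.run_full_model(i * 0.05)

print(model.predict(0.52))  # inside the sampled region
print(model.predict(2.0))   # far outside it
```

Queries inside the sampled region are answered from cached results, while a query far outside it triggers a full run whose result is added to the cache. Each fallback therefore improves the surrogate exactly where it was weakest.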

There are a number of conclusions we can draw from these examples. First, there’s lots of room at the bottom of the information generating “stack”: the processing of data, the running of base computational models, and the identification of outliers and anomalies. Data here are relatively plentiful, the processes are simplest (and most annoying to those performing them), and the cost of a poor prediction is low. As we rise up the “stack” from raw data to analyzed data to more complex assays and experiments and advancement decisions, we concatenate processes and build something entirely more complex. Supervised learning in practice does not look like “tell me which molecule will be the drug,” but rather “automate the scoring of our initial binding assays.” It’s telling that the most successful and interesting ML x Wet Lab biotech companies are working at this level (e.g., Insitro, Recursion, Octant).

Second, there’s a need to build trust with experts whose work you are augmenting and simplifying. Most of these use cases involve replacing someone else’s work with that of a (frequently) black box model. If they do not trust the results you are generating, organizational frustration and duplication of efforts will be the inevitable result. Building trust in model(s) and data science teams is not an overnight process: it requires consistent engagement, transparent displays of results and predictions, and honesty when things go wrong. Starting from simplicity and low stakes work is critical.

Third, there’s a need to focus on where you can generate enough data for ML and build consistency of process and purpose across an organization. We will not be able to do this everywhere, especially at the beginning. Assays will change as scientists figure out the best way to run them. Scientific leads will be pursued that go nowhere. The key is to identify early on where consistent data can be generated effectively and where using ML will lead to gains in efficiency and increase the production of scientific knowledge. In essence, this means bringing an engineering focus.

Lastly, supervised learning is technology meant to last: it is part of your core IP. It is a key piece of your organization’s scientific, technical, and process moat. That moat runs from generating consistent data, to organizational processes and teams that trust models and work well with them, to models that consistently improve with increased data volumes and as you explore a wider diversity of biological and chemical space. The accumulation of technological prowess, together with the organizational skill to leverage it, is core IP and should be treated as such. Plan it wisely from the beginning, build cross-functional buy-in, and target it appropriately.
