Engineering Biology: The Ladder of Computational Sophistication

Jacob Oppenheim, PhD

March 15, 2023

We don’t know what the hard problems are going to be.

Most of us were trained as academic scientists in a culture of finding winding paths through the dark forest of the unknown. Today, we are much closer to engineers — using data and computation to industrialize the production of knowledge. Biology presents an endless series of learning and inference problems for us to solve. In practice, especially in industry, we should not be solving most of them. Some problems are of basic scientific interest, but so far from therapeutics or technology development as to be irrelevant. Others are infeasible because we cannot collect enough data under our practical constraints.

Utility and feasibility are a good starting point for defining which computational problems to solve, but they are still not enough. We can only pick the right problems by working directly with scientists and experts. Just as building product requires user-centric design, good data science requires user-focused solutions. Why is this?

Analog Science is effective — your coworkers are smart. Humans are good at solving many complex computational problems through scientific intuition, reasoning, and complex heuristics. They are limited not by reasoning but by data. Serving data to experts in natural, intuitive tools that enable exploration can accelerate and debias their reasoning. Frequently it helps to automate the calculation of expert-defined heuristics, or to apply them more broadly across the data, so a scientist can see the breadth of possibilities rather than trying to learn a new heuristic de novo.

For example, in small molecule drug development, medicinal chemists don’t need to be told what to make, but are empowered by tools that elaborate the diversity of possibilities. The most useful tools in ML/AI chemistry enable the exploration of molecular space at greater scale, augmented by simple heuristic predictions, helping to debias the chemist. Notably this is NOT algorithmic design of molecules, wherein lies a graveyard of partial solutions.
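As a toy illustration of automating an expert-defined heuristic across many candidates, the sketch below applies Lipinski's rule of five (a classic medicinal-chemistry drug-likeness heuristic) to a batch of molecules. The compound names and descriptor values are invented for illustration; in practice the descriptors would come from a chemistry toolkit such as RDKit.

```python
# Toy sketch: applying an expert-defined heuristic (Lipinski's rule of five)
# across a batch of candidate molecules. All data here are hypothetical.

def passes_rule_of_five(mw, logp, h_donors, h_acceptors):
    """Return True if the molecule violates at most one Lipinski criterion."""
    violations = sum([
        mw > 500,          # molecular weight over 500 Da
        logp > 5,          # octanol-water partition coefficient over 5
        h_donors > 5,      # more than 5 hydrogen-bond donors
        h_acceptors > 10,  # more than 10 hydrogen-bond acceptors
    ])
    return violations <= 1

# Hypothetical candidates: (name, MW, logP, HBD, HBA)
candidates = [
    ("cmpd-001", 342.4, 2.1, 2, 5),
    ("cmpd-002", 612.8, 6.3, 4, 11),
    ("cmpd-003", 487.5, 5.4, 1, 7),
]

drug_like = [name for name, *props in candidates if passes_rule_of_five(*props)]
print(drug_like)  # cmpd-002 violates three criteria and is filtered out
```

The point, as above, is that a tool like this widens what the chemist can survey; the heuristic flags and filters, while the chemist still decides what to make.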

Data are limited — we cannot generate or acquire enough data to solve the problem computationally. This is all too common in biology, where the relevant set of experiments touches only a small corner of the data needed to build an effective model. This doesn’t stop people from attempting to build ML models that will be, by construction, overconfident and underpowered. In short, a dangerous waste of time. It is better to give people all the information they need and pause there. If computation is needed, data generation should be planned carefully and will, by necessity, include experiments spanning a wider range of parameters than a biologist or chemist would typically consider — a topic I’d like to return to in the future.

We have the wrong data — the information we can collect cannot solve the problem at hand computationally. There are many examples where we can collect correlates or signals of underlying trends we’d like to model, but we lack sufficient samples of the true outcome of interest to learn computationally. In these cases, it is better to give information and aid in analog reasoning and heuristic creation.

A classic example here is clinical trial recruitment. Large insurance claims databases, sold by numerous third parties, capture a large share of the medical and pharmaceutical transactions across the US. They allow us to understand epidemiology, market sizes and trends, patient trajectories and volumes, and so on. In theory, this seems like a perfect way to identify doctors and hospitals for running clinical trials. In practice, the opposite is true — doctors who run many trials aren’t well represented in claims databases, because they are primarily researchers. Trial sites tend to be specialty clinics that are poorly connected to hospital systems and poorly annotated in databases, except at such a high level as to be meaningless (e.g., MGH Brigham in MA). Computation here solves the wrong problem — the underlying claims data can guide clinical trial designers in discussions with sites and CROs, but cannot pick the sites for them.

Biology solves the problem for us — running experiments can be faster, cheaper, and more effective than a data-driven solution. The classic example here is antibodies — traditional antibody screening identifies a vast diversity of binders, wider than a computational algorithm would find, with numerous platforms that only enhance that diversity. While computational antibody design is tempting and may be the right solution in some cases, it’s worth treading carefully.

So how do we develop computational sophistication?

Start with product — show people the data + intuitive visualizations. Understand what they are doing with it and using it for. Identify the heuristics they are using and what’s stopping them from doing their job better.

Build up computational tools to support these endeavors and increase sophistication step by step. Imagine climbing a ladder of solutions, with each step guided by new needs and previous successes. Incrementally add features and measure their value. Plan and execute new analytics projects on a regular cadence. Data science becomes predictable, both to those doing the work and to end users. Tradeoffs become clear, allowing effective prioritization within and across projects.

Scale horizontally — do this across problems, teams, and departments. Identify commonalities. One ladder of analytical sophistication becomes a row of parallel ladders. Prioritize where to climb based on greatest value and lowest lift.

Knit together the threads — solve the problems that address multiple needs and use cases. Weave a web from the ladders; climb it to computational sophistication. Sophisticated ML tooling is technology that ideally powers multiple endeavors. Think of it like software: given the complexity of developing it right, it should empower as many people as possible. The right ML problem to solve will have numerous uses, tying together how multiple teams use data.

One final example to close. When I joined EQRx, there was a clear sense that we should use EHR data to improve clinical trial design and our understanding of drug efficacy. While there is considerable academic effort in this area, many of the models and results are not compelling. It is hard to see the utility of a model of disease progression premised on having continuous laboratory data from the ICU. Similarly, attempting to estimate precise statistical effects of treatment efficacy from heavily confounded observational data is difficult and unlikely to engender physician trust.

We took a user-centric approach, starting with the physicians designing our clinical trials and the medical affairs team communicating our scientific results. Initially, this meant exposing the characteristics of patient populations, both to encourage trust in the data and to reveal surprises in patterns of care, first in one indication, then multiple. This grew to include temporal patterns of treatment and concomitant meds, as well as causal calculations to understand the effects of demographic and medical covariates. We eventually built two best-in-class ML models for latent phenotyping of disease, designed to augment these focused solutions by better understanding disease severity, comorbidities, treatments, and costs (for our commercial teams). These models are fundamental technology for our approach to patient populations across departments and use cases, standing on the metaphorical shoulders of user-centered data science.