Engineering Biology: How to Build Data-Centric Biotech

Jacob Oppenheim, PhD

June 14, 2023

"If you don't have novel and effective science in the first place, no amount of data science will save you."

The past weeks have seen a flurry of articles debating the efficacy (and proof thereof) of “AI” in drug discovery and biotech writ large, kicked off by large layoffs at Benevolent, an “AI” drug developer. I would argue the lesson of the recent AI boom in biopharma is a simple one: if you don't have novel and effective science in the first place, no amount of data science will save you. Data Science and Machine Learning* (hereafter ds/ml) will be most successful in biology where they sit atop transformative science that needs no special analytics.

The main barrier right now is data: we simply lack enough high-quality, well-annotated, consistent, and relevant data to solve meaningful problems in drug development. Where ds/ml have been efficacious in biology, it is on top of decades of focused data curation across numerous research groups (e.g., UniProt + PDB for proteins, or the UK Biobank for genomes). We are looking at powerful process efficiencies, but without great scientific processes, they will be a road to nowhere.

There are two strategies for solving this problem in drug development:
• Better tools and systems for data collection and management
• Building around novel high throughput biology

Tools and Systems
The former begins by recognizing the lack of standardization that plagues internal experiments, datasets, and processes. Our ability to engineer biology effectively is limited by our inability to understand last year's experiments, due to missing context and incomplete metadata, and by our lack of overall process visibility into the laboratory. What biotech CEO can tell you how many experiments were done in the lab last week, let alone the consistency of the controls, which design cycle they were part of, or how much was learned in the past month? If we cannot report accurately on the state of the laboratory, how are we supposed to build computational solutions on top of the data generated therein? Better tools and systems can solve this problem while giving us the base layer for application of computation and ds/ml.
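To make the "base layer" concrete, here is a minimal sketch of the kind of structured experiment record that would let those questions be answered by a query rather than a hallway poll. The field names (assay, design_cycle, control_ids) are hypothetical illustrations, not a prescribed standard; a real system would follow the organization's own ontology inside an ELN/LIMS.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical minimal schema for a single laboratory experiment.
# Field names are illustrative only.
@dataclass
class ExperimentRecord:
    experiment_id: str
    assay: str                   # which assay was run
    design_cycle: int            # which design-make-test cycle this belongs to
    run_date: date
    operator: str
    control_ids: list = field(default_factory=list)  # controls included in the run
    metadata: dict = field(default_factory=dict)     # protocol version, instrument, etc.

def experiments_last_week(records, today):
    """The CEO's question: which experiments ran in the past 7 days?"""
    return [r for r in records if 0 <= (today - r.run_date).days < 7]

records = [
    ExperimentRecord("EXP-001", "binding ELISA", 3, date(2023, 6, 12), "ops-a",
                     control_ids=["CTRL-pos-1", "CTRL-neg-1"]),
    ExperimentRecord("EXP-002", "binding ELISA", 3, date(2023, 5, 30), "ops-b"),
]

recent = experiments_last_week(records, today=date(2023, 6, 14))
print(len(recent))  # only EXP-001 falls inside the window
```

Once records like these exist consistently, questions about control consistency or learning per design cycle reduce to filters and aggregations over the same table.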

Standardizing processes first is a natural bootstrapping approach. Small advantages accrue from every experiment and datum placed in order: information is retrievable, work is clearer, better decisions are made. Computational scientists can work directly with lab scientists to identify places where decisions can be automated or where data resist easy interpretation due to high dimensionality. Over time, the added value from applying ds/ml in drug discovery can be reduced to specific efficiencies: each incremental step can be weighed and valued, showing the value of continued investment. Processes generalize better than individual experiments, and data models generalize better than machine learning models. A digitized, organized biotech can scale more efficiently and build a portfolio approach to discovery with greater clarity and less spin than competitors, accruing further advantages.

Getting this backwards causes endless trouble: you don’t know what algorithms were trained on, whether they were effective, or why decisions were made. And you still need to go to the bench to test the molecules, a longer and more expensive process regardless.

High-Throughput Biology
A second strategy is to reconstruct drug development around high-throughput biology. Think of AbCellera scaling antibody development through microfluidics, Terray scaling small molecule discovery through next-generation combinatorial chemistry, or the competitive AAV capsid design space. All of these innovations rely, though, on novel high-throughput science capable of delivering improvements even without ds/ml. Computation is a superpower on top of a well-constructed scientific platform.

Doing this well is hard: you cannot wave your hand and immediately create a novel high-throughput biology platform. The biology and the hardware are moats themselves. Most assays don't scale well or consistently enough to be used as a screen day in and day out. They need to be repeatable and automatable, with low intrinsic variability (different runs under the same conditions with the same independent variables give similar results) and low extrinsic variability (batch to batch, day to day, sample prep to sample prep, the same result comes back). There's an art to getting this right that requires fine-tuning.
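The intrinsic/extrinsic distinction can be made quantitative with a simple variance decomposition over replicate readouts. The sketch below uses made-up numbers for a hypothetical assay (no real screen is implied): within-batch variance captures intrinsic noise, while variance of the batch means captures day-to-day or prep-to-prep drift.

```python
from statistics import mean, pvariance

# Illustrative replicate readouts from the same assay, grouped by batch/day.
# Values are invented for the sketch; a real screen would pull these from
# the data layer described above.
batches = {
    "day1": [1.02, 0.98, 1.01],
    "day2": [1.10, 1.08, 1.12],
    "day3": [0.95, 0.97, 0.93],
}

# Intrinsic (within-batch) variability: average variance of replicates
# run under identical conditions.
within = mean(pvariance(vals) for vals in batches.values())

# Extrinsic (between-batch) variability: variance of the batch means,
# i.e. day-to-day / prep-to-prep drift.
between = pvariance([mean(vals) for vals in batches.values()])

print(f"within-batch variance:  {within:.5f}")
print(f"between-batch variance: {between:.5f}")
```

In this toy example the between-batch term dominates: the replicates are tight, but the whole assay drifts between days, which is exactly the failure mode that disqualifies an assay as a daily screen.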

Regardless of modality, lead molecules need detailed preclinical characterization and optimization for solubility, clearance, toxicity, metabolism, immunogenicity, PK/PD, and more. Many of these steps, too, can benefit from computational approaches, after the initial measurement and screening technology and processes are built out. You simply cannot integrate out biology and medicine by speeding up the front of the discovery funnel.

Systems that work at low throughput don't necessarily work at high throughput. It is frustratingly difficult, for example, to get in vivo plant assays to scale, as seedlings refuse to grow in standard ways. Physical and material transformations are required. Innovations in hardware underlie our ability to measure biology at high throughput. Meaningful biological assays, well-engineered hardware, and clever algorithms do not naturally go together, yet insight in all three is requisite.

Whichever route is chosen, computational and high-throughput systems give you candidates, but those are still a world away from a drug that can be tested in humans. Consider antibodies. It's easy to get a hit these days through a variety of technologies. Computation may get us there, too, someday soon. But you need it to be non-immunogenic, non-aggregating, manufacturable to high titer, ideally suitable for subcutaneous administration, easy to store, etc. These drug-like properties are hard for either biology-first or computation-first strategies.

The leap to computation everywhere thus obscures the complexities inherent in drug development and sets the stage for future failure. ds/ml should be used to catalyze core IP, not as a substitute for getting the hard parts of biomedicine correct. If you have a great high-throughput screen, use it for data generation; don't jump to zero-shot learning, which will by definition perform worse. Use computational advances where they are likely to succeed (e.g., protein LLMs for drug-likeness, but not antibody binding), coupled tightly to efficient scientific processes that will build a lineage of data you can learn from in the future.

Whether built on traditional biology with effective digital systems or on novel high-throughput science, ds/ml are process efficiencies. To be effective, they require order and clarity in scientific process, data capture, and organization. If your processes are bad, you won't benefit from small improvements at every step (or a major speedup in a handful of them). Computation begins with getting your scientific house in order.

* I’m not going to use the vague term “AI” here; frankly, it’s not clear what it means at all, and “machine learning” better captures the automated inference algorithms being discussed.