August 15, 2023
I was on a panel about digitization and the data revolution at the annual Academy of Management meeting last weekend. My co-panelist and I were there to give an operational perspective on how data are used in biopharma for everything from R+D to commercialization, and on how that practice compares with the empirical studies from a variety of industries presented earlier in the session.
We touched on a number of the themes I’ve written about in the past, including:
• Data Science (DS) is not a separate interest group to be traded off with biology and chemistry—rather, it is a foundation that empowers all internal experts.
• DS and Machine Learning (ML) are process efficiencies—organizations need clear processes to be able to effectively use quantitative information and computational methods.
• Public data and algorithms trained on them are not your moat—rather, the data you generate yourself, build computational tools with, and translate into biology are what creates competitive advantage.
• More is different—our ability to see further with data and novel ML is going to change what we do and how we do it in fundamentally unpredictable ways.
I’d like to flesh out some thoughts on the last question I was asked on the panel: Why now? What has changed? My thesis has been that we are in a new phase of engineering biology, brought about by dramatic changes in our ability to manipulate living systems in the past 20-30 years. Even if we had wanted to digitize processes and operations, the clear value drivers were not there until ~10 years ago.
Previously, biopharma was in what might be uncharitably called an “alchemical” phase. A sharper analogy would be to compare with the history of thermodynamics. Building on initial experiments by Boyle, Gay-Lussac, Watt and others, we learned to effectively manipulate thermal systems (e.g., practical steam engines) in the 18th century and could model many of their macroscopic properties. Meanwhile, fundamental, first principles physics could not explain these results or generate accurate predictions essentially until the age of Gibbs and Planck at the end of the 19th century. Over 100 years passed with a yawning gap between effective experimental and engineering science and our ability to understand much of any of it*.
This gap between a fundamental understanding of how to manipulate systems to get a desired result and the practical ability to do so based on intuition, analogy, and hard-won experience is a common one in science**. For generations, we developed lifesaving drugs and dramatically boosted crop yields despite having, at best, a weak understanding of what linked cause and effect. Our tools for manipulating biology looked very different from those used today.
In pharma, this meant phenotypic screens*** of interesting molecules, often found by bioprospecting, followed by reasoning by analogy from what was observed to determine where a molecule might be efficacious in vivo and how to develop it. In agriculture, it meant selective breeding of corn and other crops, producing compounding yield increases and climatic robustness year after year. Science like this is hard to standardize: inputs cannot be characterized well enough to standardize them (e.g., extracts from a tropical plant or fungal culture, whence many of our antibiotics), and outputs rely on qualitative observation. These issues only compound as experiments are chained together based on analogy and intuition.
One might argue that the central dogma of molecular biology provided a stronger theoretical foundation than early 19th century thermodynamics had. However, in biology, we could not touch the level of scientific theory with empirical experiments except with incredible effort and luck. We knew of the gene, but not how to consistently measure, characterize, or type it until the ’90s. Even when we clearly knew how we wished to follow up on an experiment, most of the time we lacked efficacious probes to perform it, a problem that has all but vanished today. Marvelous discoveries were due to random chance: for example, the gene for Huntington’s disease was located because it lies close to a restriction enzyme recognition site, which made it accessible to the crude gene finding methods of the early ’80s.
By the year 2000, this had changed: RNAi, effective antibody engineering, versatile fluorescent probes, and fast sequencing enabled us to elucidate the chain of causation from genotype to phenotype in incredible detail, enabling today’s biotechnological revolution. In the time since, we have mapped enough of the moving parts that we can connect experiment with theory. We know enough chains of causation that we can (i) target anything that comes out of academic R+D with not just one modality but several, and (ii) know where to measure it upstream and downstream. We can think about standardizing our assays, our downstream development tools, and our ways of measurement; we can (and should) reuse them and learn from them across experiments.
The new wave of bioprospecting for antibiotic development provides a clear example of this shift. Where before we tested entire cultures and plant extracts in observational screens, today we can (i) sequence an entire mix of organisms, (ii) isolate individual ones of interest, confirming their identity, (iii) express individual gene clusters of interest, (iv) test what they produce in standardized high throughput screens, and (v) clone them for production in yeast or another organism of interest. This is a fundamentally different approach: it is engineering biology.
Engineering biology enables digitization. We have (i) repeated assays, (ii) performed at scale, (iii) converging on standardized paradigms within an organization (of which there will be many per modality), and (iv) with clear tasks and desiderata. These steps are ones for which supporting software can be built, quantitation will add value, and statistical learning can be performed within and across experiments. In the future, we may think of these as “APIs for biology”: standard inputs and outputs, able to be run with clear expectations across systems and organizations.
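To make the “APIs for biology” idea a little more concrete, here is a minimal sketch in Python of what a standardized assay interface might look like. Everything in it is hypothetical and for illustration only: the names `AssayRequest`, `AssayResult`, `Assay`, `MockViabilityAssay`, and `run_panel` are my own, and the mock readout is an invented formula, not real biology or any existing system.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class AssayRequest:
    """Standard input: what to test and at what dose (hypothetical schema)."""
    sample_id: str
    concentration_um: float  # micromolar


@dataclass(frozen=True)
class AssayResult:
    """Standard output: a quantitative readout with explicit units."""
    sample_id: str
    readout: float
    units: str


class Assay(Protocol):
    """Anything with a name and a run() method that maps request to result."""
    name: str

    def run(self, request: AssayRequest) -> AssayResult: ...


class MockViabilityAssay:
    """Stand-in assay. The readout formula is made up for the example."""
    name = "mock_viability"

    def run(self, request: AssayRequest) -> AssayResult:
        # Invented dose-response curve: half of the cells survive at 10 uM.
        fraction = 1.0 / (1.0 + request.concentration_um / 10.0)
        return AssayResult(request.sample_id, fraction, "fraction_viable")


def run_panel(assay: Assay, requests: list[AssayRequest]) -> list[AssayResult]:
    """Run the same assay over many inputs and return comparable results."""
    return [assay.run(r) for r in requests]


if __name__ == "__main__":
    panel = [AssayRequest("cmpd-1", 1.0), AssayRequest("cmpd-2", 10.0)]
    for result in run_panel(MockViabilityAssay(), panel):
        print(result)
```

The point of the sketch is only that once inputs and outputs share a schema, the code calling `run_panel` does not need to know which assay, instrument, or site produced the numbers, and results can be pooled across experiments.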
A final note of caution: engineering biology is not always the most efficient process. Traditional corn breeding increases yields ~1% yearly. A genetic engineering project thus has a high hurdle rate to beat: if it takes ~10 years (as with Bt corn) to get the right modifications in the right place and test them in the field, it needs to be considerably more than ~10% better to matter, since conventional breeding will have compounded to roughly that gain in the meantime. Thus far, genetic engineering in agriculture has not been targeted at overall crop yield, but rather at herbicide resistance and natural insecticides, to deliver clear value. Clearly, there is more to do.
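The hurdle-rate arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch using the round numbers above (~1% per year, ~10 years). The function names are mine, and it makes one simplifying assumption: that the engineered trait is built on a line whose yield is frozen at the start of the project. In practice a trait can be bred into current elite lines, which would lower the hurdle.

```python
def breeding_baseline(annual_gain: float, years: int) -> float:
    """Cumulative fractional yield gain from compounding annual breeding gains."""
    return (1.0 + annual_gain) ** years - 1.0


def net_advantage(engineered_gain: float, annual_gain: float, years: int) -> float:
    """Yield of the engineered line relative to a conventionally bred line
    after `years`, assuming the engineered line's background is frozen at
    project start (a simplification, see the text above)."""
    return (1.0 + engineered_gain) / (1.0 + annual_gain) ** years - 1.0


if __name__ == "__main__":
    hurdle = breeding_baseline(annual_gain=0.01, years=10)
    print(f"Breeding alone after 10 years: {hurdle:.2%}")

    for gain in (0.10, 0.15, 0.25):
        edge = net_advantage(gain, annual_gain=0.01, years=10)
        print(f"Engineered gain {gain:.0%} -> net advantage {edge:+.2%}")
```

Under these assumptions, a one-off 10% improvement delivered after a decade ends up slightly behind conventional breeding, which is why the project needs to be considerably better than 10% to matter.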
Achieving the potential of engineering biology requires us to take different approaches to software and systems, computational tools, laboratory operations, and organizational design. In the coming weeks, I will be diving deeper into these questions: the interrelations between software, hardware, process management, and the generation of scientific information.
* A classic example here is Carnot (1820s) and later Clausius, who realized that the waste heat produced by inefficient steam engines represented a key system property (entropy) and that, at absolute best, it would remain constant. Entropy was not explained until Boltzmann in the late 1870s, with widespread acceptance only three decades later.
** Renaissance and Early Modern alchemy versus Enlightenment-era chemistry is another example, with the latter path only becoming dominant with Lavoisier, as scientific methods showed clear results.
*** In essence, putting a molecule into a biological system and observing what changes compared with standard controls and other molecules of known effect.