Digitalis Press researches and publishes on the science and technologies that underpin our work in Investments, Proving Grounds, and Commons.
Accurately identifying and defining the most critical unmet needs in health is the first and most fundamental step in deriving solutions that positively impact health at scale. A meaningful understanding of such needs requires a broad view, one that embraces how questions of science and technology are tied inextricably to economic, policy, and social circumstances and histories.
My focus in writing over the past three months has been the interplay between powerful new computational methods, digital technologies, and operational processes. It began with the observation that successful Machine Learning (ML)-integrated biopharma companies have a moat in data generation and the scientific application of computation to these data, not in machine learning itself. Operational excellence is requisite for these companies, not merely a nice-to-have.
Engineering Biology
I was on a panel about digitization and the data revolution at the annual Academy of Management meeting last weekend. My co-panelist and I were there to give an operational perspective on how data are used in biopharma for everything from R&D to commercialization, and on how that use compares to the empirical studies from a variety of industries presented earlier in the session.
Engineering Biology
The integration of Machine Learning (ML) into scientific work exists on a continuum between wholesale replacement of human processes and providing inputs to complement the judgment of a human arbiter. As I’ve argued previously, current models are insufficient at best for fully substituting human knowledge in biology for all but base-level tasks…
Engineering Biology
The past weeks have seen a flurry of articles debating the efficacy (and proof thereof) of “AI” in drug discovery and biotech writ large, kicked off by a large layoff at Benevolent, an “AI” drug developer. I would argue the lesson of the recent AI boom in biopharma is a simple one: if you don't have novel and effective science in the first place, no amount of data science will save you. Data Science and Machine Learning* (hereafter ds/ml) will be most successful in biology where they sit atop transformative science that needs no special analytics.
Engineering Biology
Back in 2017, when I was just starting to build out data science at Indigo, Tristan Bepler joined us as a summer intern. We had a large and growing amount of sequencing data from microbial communities: both their composition from marker genes and whole genomes of organisms of interest. Both of these datasets resisted conventional methods. The mathematical modeling of microbial communities remains underdeveloped, with heuristic methods that produce nonsense and potentially more correct ones that are difficult to implement.
Engineering Biology
The hope with Machine Learning has long been that we can eliminate complex, slow, and expensive physical processes with accurate predictions, inferred directly from data. As I’ve written previously, the complexity and scarcity of data make supervised learning like this less relevant in problems of biology. Unlike internet companies, which generate reams of labeled data daily, we have experimental throughput that is orders of magnitude lower and data modalities that are considerably more complex.
Engineering Biology
Where does Machine Learning belong in biology? Nearly all successful efforts fall into one of three categories. Exploration: summarizing large, complex datasets that cannot be fathomed by the human mind (gene sequences, chemical structures, images, etc.) and enabling scientists to explore them. Scaling: automating, standardizing, and debiasing heuristics and calculations. Prediction: estimating how a complex process will perform on a new element.
Engineering Biology
We don’t know what the hard problems are going to be. Most of us were trained as academic scientists in a culture of finding winding paths through the dark forest of the unknown. Today, we are much closer to engineers — using data and computation to industrialize the production of knowledge. Biology presents an endless series of learning and inference problems for us to solve.
Engineering Biology
How does data science fit into the biopharma tech stack? The analytical operations involved are certainly more complex than the transformation and aggregation of data. This might suggest that data science is an artisanal, intellectual operation built off of the core data repository; in essence, an extension of laboratory science to computation. While tempting, this pattern only leads to confusion, frustration, and a misuse of human and silicon capital. Just as we are industrializing biological discovery and drug development, so must we with data science.
Engineering Biology
Technology in a biopharma company tends to grow by accretion rather than design. Tools and systems are brought in-house as functions are brought online. LIMS comes with the establishment of a lab, a compound registry with the first experiments with small molecules, a cheminformatics tool when it’s time to start digging into SAR. Growth reflects staffing and capabilities: much as you don’t hire a medicinal chemist until it’s time to design small molecules, you don’t bring in the systems they would use until the function is present.
Engineering Biology
Capturing and recording all relevant data is only half the battle. We then need to make it useful. In practice, we will have a deluge of information, much of which will be hard to parse without the relevant context: high throughput instrumental recordings, metadata tables, and the tracing of samples throughout laboratory workflows.
Engineering Biology
What do we need to generate data at scale? Practically, we need tools to allow us to run experiments: laboratory informatics, automation, and data capture. More is needed in order to always be performing the key experiment. We need to be able to design new experiments based on results as they come in, not laying out ten thousand in advance and waiting a month. These are fundamentally data problems, yet we do not have systems designed to enable their solution.
Engineering Biology
In mid-2017, my data science team was tasked with building out a new genome assembly and annotation pipeline that could cover the vast expanse of fungal and bacterial diversity to support our development of novel microbial products. Our company was engaged in bioprospecting microbes from sites across the US. Back in the lab, we were isolating, identifying, and then assaying a previously unmeasured wealth of biological diversity.
Engineering Biology
So where do we begin? With a hypothesis and a key set of experiments. From there, we must process the data, analyze them, and make decisions. If we are lucky enough to have seized on a real insight, there will be the immediate paired questions of replication and scale. How do we confirm these results and generalize beyond? To properly modify a system, as in drug development, we will need to move from science to engineering, and work with a myriad of slightly different experiments to arrive at the one we can use to, say, improve human health.
Engineering Biology
There’s been an exceptional amount of talk about and investment in learning from data in biology, especially with the advent of effective ML systems. The ability to quantitatively model and learn from data at scale is real: look at the continued progress in protein structure prediction in CASP. Every biopharma company now has a data science org with diverse operational models from centralized to distributed, and there is continual talk of innovation and AI.
Engineering Biology