January 3, 2023
So where do we begin? With a hypothesis and a key set of experiments. From there, we must process the data, analyze them, and make decisions. If we are lucky enough to have seized on a real insight, there will be the immediate paired questions of replication and scale. How do we confirm these results and generalize beyond? To properly modify a system, as in drug development, we will need to move from science to engineering, and work with a myriad of slightly different experiments to arrive at the one we can use to, say, improve human health. Notably, this same process is exactly what we’d need to generate data for a learning algorithm, too.
Breaking this down, we can identify six steps: (1) Run Experiments; (2) Process Data; (3) Analyze Data; (4) Generate Data at Scale; (5) Generate the Right Data; (6) Machine Learning. The first three of these steps are just the traditional scientific method, cast in data-scientific language. The last three reflect the challenge of scale in modern biopharma.
While many of these problems are being solved today, the results are scattershot and do not span this entire pipeline. Two of these problems have long been fundamental to biopharma:
• Running Experiments (1) is at the base of everything a company does. Laboratory informatics, sample tracking, and automation have made research teams more efficient, through workflows designed to ease their day-to-day work.
• Analyzing Data (3) is the classic use of statistics in industry and biology at large. New discoveries require new paradigms, and new tools require new methods. This area is fast-evolving and never finished: every couple of months sees a new theory paper on how to correct for the noise in single-cell RNA-seq data.
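To make the single-cell RNA-seq point concrete, here is a minimal sketch of one standard first-pass correction: library-size normalization followed by a log transform. This is an illustrative toy, not any particular paper's method; the function name and the cells-by-genes matrix layout are my own assumptions, and real pipelines layer far more sophisticated noise models on top.

```python
import numpy as np

def lognormalize(counts, scale=1e4):
    """Library-size normalize a cells-x-genes count matrix, then log1p.

    Corrects for the fact that two identical cells sequenced at
    different depths yield different raw counts.
    """
    counts = np.asarray(counts, dtype=float)
    libsize = counts.sum(axis=1, keepdims=True)  # total counts per cell
    return np.log1p(counts / libsize * scale)

# Two cells with the same expression profile, sequenced at 10x different depth
raw = np.array([[10, 90],
                [100, 900]])
norm = lognormalize(raw)
# After normalization the two cells' profiles agree
assert np.allclose(norm[0], norm[1])
```

The point of the example is the shape of the problem: the raw data encode a nuisance variable (sequencing depth) that must be modeled away before any biology can be read out, and the field keeps refining how.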
We see considerable investment in a couple of others:
• Data Processing pipelines (2), while often rough-hewn, are critical to our ability to use data from modern biology experiments, especially in sequencing and imaging. The world of bioinformatics tools and pipelines continues to grow and scale, as it addresses a fundamental need.
• Machine Learning for biology (6) is trendy, even as people mostly work on public datasets curated through decades of consortium effort. These results both evidence future potential and provide a foundation for other scientists to build on. The structures predicted by AlphaFold are not novel drugs, but they give the medicinal chemist a place to start.
What’s missing are two key problems that look like scaled versions of the traditional scientific workflow:
• Generating Data at Scale (4) is similar to running experiments, just at much higher quantity, frequency, and fidelity. Much as a physicist is not an automotive engineer, this effort takes different skills than designing and running boutique experiments. Oftentimes a deeper look at what you’d want to do here identifies issues in steps 1–3. To run experiments at scale, you need not only to automate and digitize laboratory workflows, but also to collect and aggregate the resulting data so that performance and fidelity can be continuously monitored and assessed. It requires tracking every datum and metadatum so that results can be (re)assessed.
• Generating the Right Data (5) captures the fundamental need of biopharma R&D: to always be performing the key experiment. To do this, we must know which experiments we have run so far, what their results were, and how they compare. More is not simply better; it is often costly.
• In parallel, we need to run more than just the experiments a biologist would dream of, in order to help train algorithms. Models must find the bounds of what is and is not known, and how to connect the places where their predictions hold. Oftentimes novel science is found here: think of disordered protein regions, which historically were justifiably not a priority for structure determination, yet appear to be very important in aggregation and dynamics, as well as critical for evaluating algorithmic structure determination.
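What might "tracking every datum and metadatum" from step (4) look like in practice? Here is one toy sketch: each measurement carries the metadata needed to reassess it later, plus a stable fingerprint so downstream results can be traced back to raw data. Every field and name here is an illustrative assumption, not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib
import json

@dataclass(frozen=True)
class Measurement:
    """One datum plus the metadata needed to (re)assess it later."""
    experiment_id: str
    instrument: str
    protocol_version: str
    value: float
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def fingerprint(self) -> str:
        """Stable content hash, so a result can always be traced to its data.

        Deliberately excludes the timestamp: the same observation
        recorded twice should fingerprint identically.
        """
        payload = json.dumps(
            {"exp": self.experiment_id, "inst": self.instrument,
             "proto": self.protocol_version, "value": self.value},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

m = Measurement("EXP-001", "plate-reader-3", "v2.1", 0.42)
assert len(m.fingerprint()) == 64  # sha256 hex digest
```

The design choice worth noting is that provenance is attached at the moment of capture, not reconstructed afterward; that is what makes continuous monitoring of performance and fidelity possible at scale.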
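The "always be performing the key experiment" goal of step (5) has the shape of an active-learning loop: compare candidates against what has already been run, and pick the most informative next one. Below is a deliberately tiny uncertainty-sampling sketch; the function, condition names, and scores are all hypothetical, and real experiment-selection systems weigh cost, batch effects, and much more.

```python
def next_experiment(candidates, completed, uncertainty):
    """Pick the most informative experiment not yet run.

    A toy uncertainty-sampling heuristic: skip conditions already
    completed, then choose the one where the model is least certain.
    """
    remaining = [c for c in candidates if c not in completed]
    return max(remaining, key=lambda c: uncertainty[c])

candidates = ["dose_1uM", "dose_10uM", "dose_100uM"]
completed = {"dose_1uM"}
uncertainty = {"dose_1uM": 0.1, "dose_10uM": 0.8, "dose_100uM": 0.3}

assert next_experiment(candidates, completed, uncertainty) == "dose_10uM"
```

However simple, the sketch makes the prerequisite visible: none of this is possible without a reliable record of which experiments have been run and with what results, which is exactly why steps (4) and (5) depend on each other.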
In the next couple of weeks, I’d like to take a deeper look at these problems and how their lack of solution impedes the dramatic progress made in experimental technologies, data processing, and machine learning for biology. Ultimately, getting them right will enable all of these steps to flow as a continuous, modern process, rather than a series of artisanal steps.