Engineering Biology: ML as Process Efficiency

Jacob Oppenheim, PhD

July 31, 2023

The integration of Machine Learning (ML) into scientific work exists on a continuum between wholesale replacement of human processes and providing inputs to complement the judgment of a human arbiter. As I’ve argued previously, current models are insufficient at best for fully substituting human knowledge in biology for all but base-level tasks. The same is true in many other fields. Consider cars: full self-driving is far from ready for prime time, but lane tracking and collision avoidance are now standard (and useful!) features that augment human drivers. Nearly all ML systems lack the robustness needed for solo operation: they cope poorly with the changes in input data and problem definitions that accumulate over time.

Pairing Machine Learning systems with human-in-the-loop processes can correct for their failings. This deeply practical path requires carefully constructed models that match the contours of the work being automated, repeatable playbooks for human-performed steps, and clear interfaces where the two meet. The models we build today are not oracular, but they offer advantages to those who can use them effectively.

When I look at biopharma, I see processes as the biggest gap in implementing augmented intelligence. Operations frequently look like a scaled-up postdoc project, not engineering biology. To succeed in improving scientific discovery and therapeutics development, we need processes that optimally pair humans and machines.

Traditional small molecule drug development is a good case study. We have immense amounts of data about the structures of known and synthesizable molecules and good heuristics about what makes them drug-like. We have libraries that include tens of millions of molecules (and billions of potential ones), far more than a human can comprehend. Unfortunately, we have precious little data on the properties of these molecules in biological contexts.

We have small numbers of data points from disparate assays for classic development experiments in ADME* and Toxicology, compared to the vast diversity of chemical space. We have next to no data whatsoever on binding to a target of interest, a number that falls further if you try to control for binding site, or look for in vivo results. We can design and generate molecules with a computer, but we are not good at predicting their properties in advance, nor at de novo (or zero-shot) design.

Is ML valuable, then? The space of potential scaffolds, interpolations between them, side chain decorations, and classic medicinal chemistry alterations is exponentially large. Enumerating and sampling amongst them is better work for a computer than a human. A well-trained chemist, though, is needed to verify the proposals, add specific compounds of interest, and prevent algorithmic selection from exploring nonsense. A human is needed in the loop.

The global inability to accurately predict binding and molecular properties does not imply that we cannot do so locally. For more than 40 years, chemists have used simple regression techniques on designed chemical fingerprints to infer relationships between molecular structure and function (QSAR). By testing a number of molecules similar in structure to hits, we can begin to infer which transformations of those hits are likely to produce even better candidates. This local optimization problem is well-posed computationally: ML excels at optimizing across numerous covariates, quantifying exploration-exploitation trade-offs, and rapidly converging on a solution set. Since we’re looking at structurally similar molecules, we needn’t generate massive amounts of data to get decently performing models.
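
To make the local QSAR idea concrete, here is a minimal sketch, assuming binary fingerprints and a single measured activity per tested molecule. The fingerprints, activities, and function names (fit_ridge, rank_candidates) are hypothetical and synthetic, and ridge regression stands in for whichever simple regression technique a team actually prefers.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean  # center features so the intercept is not penalized
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b

def rank_candidates(X_candidates, w, b, top_k=3):
    """Score untested analogues and return the indices of the top_k predictions."""
    scores = X_candidates @ w + b
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Synthetic stand-in data (an assumption, not real assay results):
# 40 tested analogues of a hit, each described by a 16-bit fingerprint.
rng = np.random.default_rng(0)
n_bits = 16
true_effect = np.zeros(n_bits)
true_effect[[2, 5]] = [1.5, -1.0]  # bit 2 helps activity, bit 5 hurts it
X_tested = rng.integers(0, 2, size=(40, n_bits)).astype(float)
y_tested = X_tested @ true_effect + rng.normal(0, 0.1, size=40)

w, b = fit_ridge(X_tested, y_tested, alpha=0.5)

# Rank 10 untested analogues for the next round of synthesis.
X_new = rng.integers(0, 2, size=(10, n_bits)).astype(float)
top_idx, top_scores = rank_candidates(X_new, w, b)
```

The fitted weights indicate which structural features the model associates with better or worse activity, which is exactly the kind of output a chemist can inspect, challenge, and override before the next design cycle.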

Drug discovery is a funnel: a vast number of potential candidates or fragments go in, and a handful of lead molecules come out. By designing ML tools to enhance the operations of that funnel, removing clear negatives, and helping scientists avoid false positives, we can increase throughput substantially. In both of these cases, however, Machine Learning is a process efficiency. It gives a more open and less biased view of molecular space. It can help pick more diverse and mutually exclusive design cycles. It can shorten the time to converge on good candidates while assuring you sufficient exploration. To exploit these efficiencies, we must have good scientific processes for designing and running the experiments and operational processes for sharing their results and making decisions. We cannot outsource everything but the ML.
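
One way to picture "more diverse design cycles" is a greedy max-min selection: start from the best-scored candidate, then repeatedly add the candidate least similar to anything already chosen. The sketch below assumes binary fingerprints and Tanimoto similarity; the fingerprints, scores, and the pick_diverse helper are all hypothetical.

```python
import numpy as np

def tanimoto_similarity(a, b):
    """Tanimoto (Jaccard) similarity between two boolean fingerprint vectors."""
    both = np.sum(a & b)
    either = np.sum(a | b)
    return 1.0 if either == 0 else both / either

def pick_diverse(fingerprints, scores, batch_size):
    """Greedy max-min pick: seed with the top score, then maximize the
    distance to the nearest already-picked molecule at each step."""
    picked = [int(np.argmax(scores))]
    while len(picked) < min(batch_size, len(fingerprints)):
        best_idx, best_gap = None, -1.0
        for i in range(len(fingerprints)):
            if i in picked:
                continue
            nearest = max(
                tanimoto_similarity(fingerprints[i], fingerprints[j])
                for j in picked
            )
            gap = 1.0 - nearest
            if gap > best_gap:
                best_idx, best_gap = i, gap
        picked.append(best_idx)
    return picked

# Five made-up candidates: two near-duplicate pairs and one outlier.
fps = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 0, 1],
], dtype=bool)
scores = np.array([0.9, 0.8, 0.4, 0.5, 0.3])

batch = pick_diverse(fps, scores, batch_size=3)  # [0, 2, 4]
```

Note that the second-best-scored molecule (index 1) is skipped because it is nearly identical to the first. Whether that trade-off is right for a given cycle is a judgment call for the chemist, not the algorithm.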

To get this right, we need a tight connection between the laboratory, data operations, and machine learning. We need data pipes to rapidly extract experimental results, analyze them, and feed them into new models. We need predictions shown immediately and natively to chemists, so they can engage with, modify, and act on them. We need scientific processes that stage and sequence assays to give enough information for us to actually build effective models. If we don’t start taking some ADME data early, we will be picking molecules solely on binding and will likely end up with very un-drug-like molecules and low-quality ADME predictions. Part of the solution is data and computational infrastructure that goes beyond Excel sheets and the Mac Automator, and part is detailed operational focus, which is arguably harder. The only moat is execution.

The same lessons are clear in the clinic, too. Automated ML selection of patients for clinical trials from medical records has failed numerous times due to lack of data completeness and fidelity. However, UPMC has demonstrated dramatic reductions in patient identification and screening time by leveraging a human-in-the-loop system that does not attempt to evaluate every single inclusion and exclusion criterion for a trial from data, only those that are easily inferable. We see a similar success with the second wave of EHR to EDC** companies. They do not try to crack the currently insoluble issues of inferring specific patient details directly from unstructured medical notes, but instead focus on what can be reproducibly mapped. ML is a process efficiency: narrowing the funnel computationally is powerful, if paired with well-designed processes.
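
The shape of such a human-in-the-loop screen can be sketched in a few lines. This is an illustration only, not a description of UPMC's system: the record fields, criteria, and the triage function are hypothetical. The point is that the code rules out only the clear negatives and hands every remaining patient, along with the unresolved criteria, to a human reviewer.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Criterion:
    name: str
    # Returns True or False when structured data settles the question,
    # and None when it cannot be inferred and a human must decide.
    check: Callable[[dict], Optional[bool]]

def triage(patient, criteria):
    """Exclude only on clear failures; never auto-enroll.
    Everyone else goes to a human with the list of open questions."""
    failed = [c.name for c in criteria if c.check(patient) is False]
    if failed:
        return "excluded", failed
    unresolved = [c.name for c in criteria if c.check(patient) is None]
    return "needs_review", unresolved

# Hypothetical criteria: two readable from structured fields, one that
# lives only in free-text notes and is deliberately left to the reviewer.
criteria = [
    Criterion(
        "age_at_least_18",
        lambda p: None if p.get("age") is None else p["age"] >= 18,
    ),
    Criterion(
        "has_target_diagnosis",
        lambda p: None if "diagnosis_codes" not in p
        else "DX-EXAMPLE" in p["diagnosis_codes"],
    ),
    Criterion("performance_status_from_notes", lambda p: None),
]

minor = {"age": 16, "diagnosis_codes": ["DX-EXAMPLE"]}
adult = {"age": 40, "diagnosis_codes": ["DX-EXAMPLE"]}

minor_result = triage(minor, criteria)
adult_result = triage(adult, criteria)
```

Here the first patient is excluded on age alone, while the second is passed to a reviewer with one open question, so the human spends time only where the data cannot answer.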

This framework applies, too, to Large Language Models (LLMs), currently the hottest area in ML development. These models perform the seemingly miraculous when connecting unstructured text to structured data, be it generating tables, laying out code frameworks, or writing texts given a set of defined inputs. Without this structure, the models are prone to “hallucination”: creating meaningless or fact-free text with the correct form***. The key innovation behind ChatGPT was not the improved language model, but rather the chat interface that enabled facile human interaction. LLMs, too, are a process efficiency, not an oracle.

We have many tasks, though, where we need a tool like this: nonstandard transformations of data, as is common in billing and accounting; transferring information between business systems; and information extraction from standardized documents and forms. The past decade has seen considerable success from firms automating pieces of these processes, using less advanced ML techniques, under the heading of Robotic Process Automation (RPA). These tools learn from standardized processes to infer what must be done in the next case. LLMs greatly extend this ability, provided, of course, there is a real process for them to build on.

In the coming weeks, I’ll be writing more about the connection between technology (not just Machine Learning, but Data Analytics and Software more generally) and effective human processes. Bridging this gap is critical for our ability to engineer biology.

*Absorption, Distribution, Metabolism, and Excretion, four overarching categories for in vitro assays to predict what would happen to an ingested small molecule.

**Electronic Health Record to Electronic Data Capture (the database system used in clinical trials) potentially eliminates immense amounts of data duplication, manual review, and reconciliation of information in clinical trials by copying items that already exist in medical records directly to clinical systems.

***A biological example: when you ask what binds to protein X, the answer is a random compound, because the model’s training data has many variations of “compound Y binds to protein X”, all of which look the same to the model.