Engineering Biology: ML in Bio—There’s No Labeled Data to Fit

Jacob Oppenheim, PhD

April 11, 2023

Where does Machine Learning belong in biology? Nearly all successful efforts fall into one of three categories:

Summarizing large, complex datasets that cannot be fathomed by the human mind (gene sequences, chemical structures, images, etc.) and enabling scientists to explore them.

Automating, standardizing, and debiasing heuristics and calculations.

Estimating how a complex process will perform on a new element.

Exploration is simultaneously the most common of these and the most difficult to do well. Indeed, the list above is ordered by frequency of use, which runs opposite to its order of complexity: a fundamental tension in biology. The ML problems of interest are most often those of unsupervised learning, that is, extracting meaning from complex, heterogeneous datasets.

In practice, the relatively simpler realm of supervised learning, where we infer models connecting input data to outcomes, is rare in biopharma: our data are limited, disparate, and discordant. Even when we can build supervised models, we often have to restructure and simplify our data first to be able to learn from them. In biology, nothing is simple.

The key problems in ML for biology are unsupervised: how do we deal with complex, high-dimensional spaces that obey their own grammar and rules, which we barely understand? Examples include:

DNA sequences
A four character alphabet, but also with chemical modifications that change in time; arranged linearly, but also with non-trivial 3d structure; transcribed and translated to proteins, but also encoding the kinetics of its own creation, regulation, and destruction.

Protein structures
A limited alphabet of 20 amino acids, but with non-trivial physics in the spatial interactions between them; initially formed as a chain, but requiring rapid, high fidelity folding to a 3d structure; interactions that are fundamentally understandable chemical physics, but too large to mentally comprehend or simulate well.

Small molecules
A simple enough graph encodes chemical structures, but variation is sparse and highly nonlinear; effect is a function of structure, but the rules that govern how and where a small molecule will interact with a biological system are manifold, context dependent, and mostly still a matter of intuition.

High content imaging
A matrix of defined data from scattering and fluorescence events, but myriad underlying events that could be causing them; a uniform spatial scale of output data, but a reflection of biological processes occurring at wildly different scales: phenotypic screening with high content imaging has yet to achieve its full potential.

Medical Records
A linear chain of events (visits), but with each varying dramatically in complexity, contents, and volume; a set vocabulary you can use and encode, but one that is hierarchical, redundant, and lacking in many requisite details while containing thousands of terms; an underlying concept of disease and diagnosis, but one carried over time amidst the vagaries of human health and medical hypotheses.

We need to summarize these data to make them useful to people; no one thinks in such high-dimensional spaces. At the same time, we need to obey the complex data-generating and scientific processes underneath them. The simplest techniques (think PCA) are bound to fail, but can give intuition. What hurts is when we use complex, trendy techniques but don't adapt them to the science at hand. The saga of tSNE and UMAP in biology is a classic example.
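To make the "simplest techniques" concrete: a minimal PCA sketch, here implemented with a plain SVD in NumPy on synthetic data (the data, shapes, and function name are illustrative, not from the original post). Even when PCA misses nonlinear structure, the projection it gives is a quick first look at a high-dimensional dataset.

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)            # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T    # scores in the reduced space

# Toy data: 100 "samples" with 50 "features" (e.g. expression values)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
Z = pca(X, n_components=2)             # 100 points in a 2d plane
```

Components come out ordered by explained variance, so the first axis of `Z` captures the largest piece of linear variation, which is often enough to spot batch effects or major groupings before reaching for anything fancier.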

We’re not going to generate enough data for supervised learning most of the time. But we can use small amounts of data with outcomes on top of unsupervised learning to help guide knowledge creation.

In a previous position, we attempted to map the vast diversity of bacterial and fungal microbes to help pick strains to test for symbiotic effects on plants. Our sampling was uneven and our rate of testing orders of magnitude below what was needed for supervised prediction. Instead, we built an unsupervised map of microbial diversity from marker gene sequences* where distances were an estimate of evolutionary time**. In these spaces, we could highlight microbes tested to have positive and negative effects, those isolated from plants and regions of interest, etc., allowing scientists to pick a diverse group of candidates to test, learning from the sparsely labeled data.

This approach is general: small amounts of labeled data allow us to think about how to explore this space and move around. We get regions of predictivity where we have done lots of experiments. Traditional methods in computational chemistry rely on this: QSAR*** models, relating molecular structure to function, are traditionally built in “local” regions of similar molecules. With slightly more data, we can recommend the ideal way to “actively learn” using Bayesian Optimization, trading off exploration vs exploitation of areas of known interest.
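The exploration-vs-exploitation trade-off can be made concrete with the expected-improvement acquisition commonly used in Bayesian optimization. A minimal sketch, assuming we already have a surrogate model's predicted mean and uncertainty at each candidate (the inputs here are toy numbers, not real assay data):

```python
import math

def expected_improvement(mu, sigma, best, xi=0.01):
    """Expected improvement (for maximization) per candidate.

    High mu exploits regions predicted to be good; high sigma rewards
    exploring regions where the model is uncertain.
    """
    scores = []
    for m, s in zip(mu, sigma):
        if s == 0.0:
            scores.append(0.0)
            continue
        z = (m - best - xi) / s
        cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))     # normal CDF
        pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        scores.append((m - best - xi) * cdf + s * pdf)
    return scores

# Candidate 0: near the current best, low uncertainty (pure exploitation).
# Candidate 1: predicted worse, but highly uncertain (exploration).
ei = expected_improvement(mu=[0.9, 0.5], sigma=[0.01, 0.5], best=0.9)
```

With these toy numbers the uncertain candidate scores higher: the acquisition function formalizes when it is worth spending an experiment on an unexplored region rather than refining a known-good one.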

We can sometimes do even better, moving to the realm of weakly (or semi-) supervised learning. We often have some labeled data from years of basic research that tell us about scientific properties other than the ones of interest. For small molecules, these could be dissociation rates in water + solvent; for proteins, structural and functional domains, etc. These data convey useful information that may help us solve a problem, by starting to relate complexity to a phenotype, giving us better biological/scientific priors. Often putting more of this into our ML models makes them perform better: see this review for an example in protein function prediction.

The complexity of use cases and paucity of data are the science in data science, and why ML is not a series of cookbook operations. We need to understand the problem and where the pain points are for experts. We need to choose the right basis and ancillary sources of information to improve our models. We need to know what our modeling assumptions are and where they will fail. We need to deliver value and continuous improvement. ML is computational *technology* and must be tied to core scientific IP, not just throwing complex mathematics at a problem and seeing what sticks. Next time, we’ll turn to the cases where we can use supervised learning and how it fits the same underlying themes.

*Technically, k-mer frequencies within marker genes both for computational efficiency and a history in the literature of using them to model molecular evolution.
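A minimal sketch of the k-mer featurization this footnote describes: slide a window of length k along a sequence, count each k-mer, and normalize to frequencies (the toy sequence and function name are illustrative):

```python
from collections import Counter

def kmer_frequencies(seq, k=3):
    """Return {kmer: frequency} over all overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

# "ATGATGCC" yields 6 overlapping 3-mers: ATG, TGA, GAT, ATG, TGC, GCC
freqs = kmer_frequencies("ATGATGCC", k=3)
```

Stacking these frequency vectors across sequences gives a fixed-width numeric representation of variable-length marker genes, which is what makes the downstream embedding computationally tractable.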

**Viewing each marker gene as a “text” of k-mer “words,” we could use topic models that mimic evolutionary processes to build a latent, lower-dimensional space. The axes (“topics”) captured major pieces of correlated variation, with the most important dividing large phyla and lower weighted ones separating classes etc. Another valid approach would have been to use a hyperbolic embedding, to capture the assumption of treelike evolutionary structure.
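One way to sketch the topic-model embedding this footnote describes is with scikit-learn's LatentDirichletAllocation on a count matrix of k-mer "words" per marker-gene "document". The counts below are synthetic and the dimensions arbitrary; this shows only the shape of the approach, not the original pipeline:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy count matrix: 30 "sequences" x 64 possible 3-mer "words"
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(30, 64))

lda = LatentDirichletAllocation(n_components=5, random_state=0)
Z = lda.fit_transform(X)   # per-sequence topic weights, rows sum to 1
```

Each row of `Z` places a sequence in a 5-dimensional latent space; nearby rows share k-mer composition, which is the raw material for the evolutionary-distance interpretation described above.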

***Quantitative Structure Activity Relationship
