Engineering Biology: Learning Biology from Data—Focus on Simplicity

Jacob Oppenheim, PhD

March 28, 2023

How do we build models under resource constraints? We almost never have enough data to adjust for every possible confounding factor, nor do we know what all those factors are.

In a previous position, I was working with a group of plant biologists and ag scientists to determine the best ways to analyze greenhouse experiments. Greenhouses are imperfect laboratories: the climate control in rooms large enough to grow plants shows slight inconsistencies over space and time, light intensity varies, opening and closing doors creates microclimates, and the rows of plants exhibit numerous edge effects. All of these factors could be accounted for: temperature fields, light intensity gradients, row and column effects, etc. Simple, right? When your experiments comprise a couple hundred plants at most, it becomes considerably more difficult. There's the natural noise of biology and your experimental factors to account for, too. And what if your factors interact? Suddenly, your idealized statistical model can't identify any factor with any degree of precision.
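
The problem can be sketched numerically. Below is a minimal, hypothetical simulation (the grid size, effect size, and noise level are all invented for illustration): a 60-plant trial analyzed once with a simple treatment-only model and once with one dummy variable per row and column. Each nuisance parameter burns a degree of freedom, widening the uncertainty on the one effect we actually care about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small greenhouse trial: 60 plants in a 6 x 10 grid,
# half treated, with possible row/column microclimate effects.
n_rows, n_cols = 6, 10
n = n_rows * n_cols
rows = np.repeat(np.arange(n_rows), n_cols)
cols = np.tile(np.arange(n_cols), n_rows)
treat = rng.permutation(np.tile([0, 1], n // 2))

# Assumed data-generating process: a modest treatment effect plus noise.
y = 0.5 * treat + rng.normal(0.0, 1.0, n)

def treatment_se(X, y):
    """OLS via least squares; return the std. error of the treatment
    coefficient (column 1 of the design matrix)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return np.sqrt(cov[1, 1])

# Model 1: intercept + treatment only (2 parameters).
X_simple = np.column_stack([np.ones(n), treat])

# Model 2: also one dummy per row and column (14 extra parameters).
row_d = (rows[:, None] == np.arange(1, n_rows)).astype(float)
col_d = (cols[:, None] == np.arange(1, n_cols)).astype(float)
X_full = np.column_stack([np.ones(n), treat, row_d, col_d])

se_simple = treatment_se(X_simple, y)
se_full = treatment_se(X_full, y)
print(se_simple, se_full)
```

With only 60 plants, the full model spends a quarter of its degrees of freedom on spatial nuisance terms; add interactions and the design matrix quickly stops being identifiable at all.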

The same paucity of data that makes it hard to draw unambiguous conclusions makes it hard to distinguish among competing models. Rather than striving for complexity and modeling every known parameter, it is frequently better simply to generate more data. In the case of greenhouse experiments, that meant simplifying our experiments by eliminating factors or concentrating them in space and time.

Traditionally, statisticians performed power analyses to determine the necessary size of experiments given previously observed effect sizes and variability. Mathematically, these tend to be tenuous: parameter estimates come from previous successful experiments, thus overestimating actual effect sizes and underestimating variability. The baseline statistical model is often far from capturing the decidedly non-Gaussian noise we observe in biological experiments. Outside of highly regulated environments, such as clinical trials, power analyses tend to be ignored, as they prescribe experiment sizes so large that no one wants to perform them.
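
To see why the prescriptions come out so large, here is the standard normal-approximation sample-size formula for comparing two means, using only the Python standard library. The effect sizes below are illustrative, not from any real experiment:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per arm for a two-sample
    comparison of means; effect_size is Cohen's d
    (difference in means / common SD)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # two-sided test at level alpha
    z_beta = z(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# A "large" effect (d = 0.8) needs ~25 samples per arm; halve the
# effect size and the requirement roughly quadruples.
print(n_per_group(0.8), n_per_group(0.4))  # -> 25 99
```

The inverse-square dependence on effect size is the crux: an effect size overestimated by a factor of two from past successes means a study underpowered by a factor of four.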

If science is so underpowered, how do we end up with novel therapeutics that work? The key is that we don’t perform a single screen and immediately rush to the clinic. We repeat experiments in different settings, run confirmatory assays, and follow up in more realistic conditions. In practice, we are Bayesians, accumulating evidence of efficacy with each sequential experiment, saving the most rigorous designs for the final step, when we feel confident that we will be successful.
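
That sequential accumulation of evidence can be made concrete with a toy conjugate model. This is a sketch, not a real screening pipeline: the assay results below are invented, and a beta-binomial model of "hit probability" stands in for whatever efficacy measure a real program would track.

```python
# Uniform Beta(1, 1) prior on the probability that the candidate works.
alpha, beta = 1.0, 1.0

# Hypothetical sequence of assays, each reporting (hits, misses).
experiments = [(3, 7), (5, 5), (8, 2)]

for hits, misses in experiments:
    # Conjugate update: each experiment folds into the running posterior.
    alpha += hits
    beta += misses
    mean = alpha / (alpha + beta)
    print(f"posterior mean after this assay: {mean:.3f}")
```

No single underpowered assay is decisive, but the posterior sharpens with each one, which is exactly the logic of saving the most rigorous design for last.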

This still does not eliminate the need to model our data to analyze results, identify strong effects, and adjust for confounding in experiments. Mathematically, we can trade off the additional parameters in a model against the increased accuracy they provide. Before the recent deep learning boom, this was in some ways an easier task, as you could compare models by an information criterion (e.g., BIC). We could quantify the value of sparsity and the increase in predictive accuracy versus model complexity as a function of data. It's a powerful idea, but unfortunately a poorly extensible one.

These approaches are limited to situations where we can easily quantify the number of parameters and their effect. The extension to Bayesian nonparametrics, like Dirichlet processes, and kernel methods, like SVMs, is unclear at best. They make no sense at all for overparametrized models like neural networks.

Simplicity, or not modeling more than you can reasonably approximate from the data you have, is not a matter of calculation: it is one of procedural rigor and scientific focus. A more complex model must be tied to a scientific hypothesis and evaluated against it. Clever mathematics can neither fix poorly designed experiments nor can it help you escape the need to do good science.

Single Cell RNAseq (scRNAseq) gives us another example of this phenomenon. Here, experiments measure the quantity of various RNAs transcribed from a single cell, rather than in a bulk sample, through amplification, reverse transcription, and sequencing. From the beginning, these results appeared to be far noisier than expected. Typically, we assume that RNAs are distributed in Poisson fashion: on a short enough timescale, each individual transcription event is rare, with its own probability of occurrence. Over a long enough time, we will find a Poisson distribution of counts for each transcript. But there seemed to be more variability (and more zeroes) than that.
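
It is worth noting how many zeroes a plain Poisson model already predicts at realistic capture rates. A quick simulation (the mean of 0.5 molecules per cell is an assumed, illustrative figure for a low-expression gene):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical low-expression gene: mean of 0.5 captured molecules/cell.
lam = 0.5
counts = rng.poisson(lam=lam, size=100_000)

# Poisson already predicts P(0) = exp(-lam) ~ 0.607 zeros at this mean;
# abundant zeros alone are not evidence of a separate dropout process.
frac_zero = np.mean(counts == 0)
print(frac_zero, np.exp(-lam))
```

In other words, a sparse count matrix is exactly what the simplest model predicts when mean expression is low.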

And so an entire cottage industry arose: how to adjust for the extra variability? One assumption was that there was a probability of "dropout," or simply not finding a transcript regardless of its quantity in a cell. This naturally led to a zero-inflated distribution of RNA counts. Another view posited that, regardless of the origin of the extra variability, it seemed to correlate with transcript abundance, so moving to a variance-inflated model, such as a negative binomial distribution, was warranted.
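
The variance-inflated view can be simulated directly: a negative binomial arises as a gamma-Poisson mixture, where cell-to-cell variation in the underlying rate inflates the variance above the mean. The mean and dispersion values here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, size = 5.0, 100_000

# Pure Poisson: variance equals the mean.
pois = rng.poisson(mu, size)

# Gamma-Poisson mixture (= negative binomial): each cell draws its own
# rate from a Gamma(r, mu/r), inflating variance to mu + mu^2/r.
r = 2.0  # assumed dispersion parameter
rates = rng.gamma(shape=r, scale=mu / r, size=size)
nb = rng.poisson(rates)

print(pois.var(), nb.var())  # ~5 vs ~5 + 25/2 = 17.5
```

The mixture changes only the variance structure; whether the data actually demand that extra parameter is precisely the question the field revisited.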

With more experiments, standardization of technique, and improved data collection, it has become clear that dropout does not exist beyond the Poisson model, and variance inflation has been shown to be considerably smaller than previously thought. Notably, neither of these more complex models proposed testable biology; rather, they added epicycles to better fit noisy data. Even if they had been more accurate, the increased accuracy would not have revealed additional scientific information. Recent work focuses on measuring precise differences in distribution that have specific implications for RNA biology. Occam's razor was sharper than many expected ten years prior.

The implication for data science in industry is clear: focus on your key experiment and get clean data. Generate enough data to be able to identify the effects of interest. Focus on the data you need to learn the model of the biology you wish to learn. Adding complexity won't save you from having too little data, nor will it correct for experimental flaws post hoc. There is no substitute for procedural and scientific rigor: if you cannot trace all of your data and use it seamlessly to design and perform the key experiment, no amount of Machine Learning will save you.