January 2, 2023
There’s been an exceptional amount of talk about and investment in learning from data in biology, especially with the advent of effective ML systems. The ability to quantitatively model and learn from data at scale is real: look at the continued progress in protein structure prediction in CASP. Every biopharma company now has a data science org with diverse operational models from centralized to distributed, and there is continual talk of innovation and AI.
Yet, it does not feel like we are learning much. Improved sequence processing here, better imaging there, perhaps a touch of automated chemical design. The successes so far have been confined to narrow domains. Biology is diverse, and everyone’s problem is different. Biotech especially flourishes on distinct scientific insights, each of which requires subtly different training sets and objectives. Viral capsid optimization for gene therapy and nanobody design for therapeutics are both protein engineering from sequence, but with decidedly different contours and desiderata.
Why is this? It’s tempting, but incorrect, to argue there are two problems: not enough data and poorly performing models. This clean-cut division does not reflect the scientific process, which is more of a loop. One must generate data, fit a model, make predictions, evaluate them, and then adjust the underlying model. Without enough data, it’s hard to run this loop; with the wrong data, you never learn enough from it; and in either case, your model never gets good enough.
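That loop can be made concrete as a toy active-learning cycle. Everything in this sketch is an illustrative assumption, not a prescription: the hidden assay is a stand-in for a real experiment, the candidate pool and the farthest-from-measured acquisition rule are the simplest choices that show the design–execute–refit–evaluate rhythm.

```python
# Hidden ground truth that "experiments" probe (a hypothetical stand-in for a lab assay).
def run_experiment(x):
    return 3.0 * x + 1.0  # unknown to the model

candidates = [0.0, 1.0, 2.0, 5.0, 10.0]  # assumed pool of runnable experiments

data = []        # (x, y) observations collected so far
a, b = 0.0, 0.0  # current linear model: y ≈ a*x + b

for _ in range(4):
    # 1. Design: pick the candidate the current data says least about —
    #    here, simply the point farthest from anything measured so far.
    if data:
        x_next = max(candidates, key=lambda x: min(abs(x - xi) for xi, _ in data))
    else:
        x_next = candidates[0]
    # 2. Execute the experiment and record the result.
    y_next = run_experiment(x_next)
    data.append((x_next, y_next))
    # 3. Adjust the model: least-squares line through all observations.
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    var = sum((x - mx) ** 2 for x, _ in data)
    if var > 0:
        a = sum((x - mx) * (y - my) for x, y in data) / var
        b = my - a * mx
    # 4. Evaluate predictions on the remaining candidates, then loop.
```

With the right data the loop converges in a couple of rounds; sampling only near one end of the candidate range would fit the observed points just as well while learning far less, which is the "wrong data" failure mode in miniature.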
Statistical learning in that sense mirrors scientific learning, though that alone is still not enough. The struggle is not to generate more data, but to generate the right data. Success comes not from improving the fit of a model, but from repeatedly identifying your key experiment and being able to design and execute it rapidly. The same focus is critical to building relevant models that can effectively speed scientific discovery.