June 2, 2023
Back in 2017, when I was just starting to build out data science at Indigo, Tristan Bepler joined us as a summer intern. We had a large and growing amount of sequencing data from microbial communities: both their composition, from marker genes, and whole genomes of organisms of interest. Both of these datasets resisted conventional methods. The mathematical modeling of microbial communities remains underdeveloped, with heuristic methods that produce nonsense and potentially more correct ones that are difficult to implement. Meanwhile, at most 50% of the ORFs in a fungal genome were annotatable, and frequently considerably fewer, rendering the sequencing data only marginally interpretable.
Tristan was remarkably successful with both of these problems: we ended up with a NeurIPS workshop paper on building topic models to model microbial diversity across communities and discover shared axes of variation, and he started formulating some of the ideas that today are embedded in OpenProtein.ai. At the most basic level, we needed a better method than BLAST (and its successors) to identify gene homologs and functional domains across an immense diversity of life. Fuzzy k-mer matching, which underlies tools like BLAST, cannot cope with high levels of diversity (e.g., when ~20% sequence identity is a sign of conservation), nor can it tell you anything about the underlying biology. We needed methods that were both more error-tolerant and more biologically grounded.
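To make the k-mer limitation concrete, here is a minimal Python sketch. It is a deliberate simplification: it uses exact k-mer matching rather than BLAST's scored word neighborhoods, and the two sequences are made up for illustration, constructed so that every fifth position is conserved. The function names are my own and not from any library.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k=3):
    """Jaccard similarity between the exact k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def percent_identity(a, b):
    """Fraction of matching positions for two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences: only positions 0, 5, 10 and 15 are conserved.
reference = "MKTAYIAKQRQISFVKSHFS"
diverged = "MLSGWIGREKQLTYIKTNYT"

identity = percent_identity(reference, diverged)
shared = kmer_jaccard(reference, diverged, k=3)
print(f"identity={identity:.2f}, shared 3-mer Jaccard={shared:.2f}")
```

Because the conserved positions are scattered, the two sequences share 20% identity but not a single exact 3-mer, so a method that seeds on shared words has nothing to extend from. Real tools soften this with substitution matrices, but the underlying problem is the same.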
Tristan ended up running with some of these ideas and over time revolutionized the field of protein ML. The key insight was to incorporate biological priors directly in the learning algorithms to capture what we *knew*, even when that alone could not give us the answer. The place to start was to take an idea from Natural Language Processing and learn from the vast diversity of protein sequences known from all living organisms, not just specific ones of interest. This evolutionary prior tells you a remarkable amount about how sequences vary and change over evolutionary history and the general rules of their grammar. The first successful deep protein models were based on the same principles underlying today’s Large Language Models, in a biological context.
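As a rough illustration of what "learning the grammar" means, here is a toy sketch in Python. It is not how deep protein models are built: it swaps in a simple bigram (next-residue) count model with add-one smoothing in place of a neural network, and the training sequences are invented. The shared idea is only that the model is fit to unlabeled sequences and then scores how plausible a new sequence looks.

```python
import math
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_bigram(seqs, alpha=1.0):
    """Fit smoothed next-residue probabilities from unlabeled sequences.

    Returns a function log_prob(prev, nxt) giving log P(nxt | prev).
    """
    counts = defaultdict(Counter)
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1

    def log_prob(prev, nxt):
        # Add-alpha smoothing so unseen transitions get non-zero probability.
        total = sum(counts[prev].values()) + alpha * len(AMINO_ACIDS)
        return math.log((counts[prev][nxt] + alpha) / total)

    return log_prob


def score(seq, log_prob):
    """Average per-transition log-likelihood of a sequence under the model."""
    pairs = list(zip(seq, seq[1:]))
    return sum(log_prob(p, n) for p, n in pairs) / len(pairs)


# Hypothetical "corpus" with an obvious pattern, standing in for a sequence database.
model = train_bigram(["ACACAC", "ACACACAC"])
print(score("ACAC", model), score("WWWW", model))
```

A sequence that follows the patterns seen in training scores higher than one that does not. Deep models replace the count table with learned representations that capture much longer-range dependencies, which is where the biology comes in.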
On top of this foundation, including additional training tasks to fine-tune the model is key. However, obvious candidates, such as predicting function from databases like SwissProt, turn out to be poor choices: nearly all of the annotated functions come from BLAST homology search and not experimental confirmation. Tristan’s key insight was to leverage structural categories and folds, embedded in ontologies like SCOP, to improve model fits. This idea works incredibly well, because protein function is downstream of structure, which is in turn controlled by sequence. By leveraging structural priors, you can get at better functional relatedness and clustering of proteins.
These efforts have continued, adding in contact predictions (themselves another, finer form of structure), tuning datasets to accurately reflect biological variation (and not overtrain on duplicate sequences), and, most recently, attending across proteins to find functional domain similarity that may sit at vastly different locations in related proteins. This last piece is important: proteins can be composed of numerous copies of different functional units (think transmembrane domains) with a key unit of similarity, e.g. a signaling domain, in a vastly different place in each family member. We need methods flexible enough to find commonality, even when evolution has driven large-scale modification elsewhere in a protein.
Beyond incorporating strong biological priors to build better models, OpenProtein has a different concept of product than others. At its core is the belief that scientists can design the right proteins/antibodies/etc using ML tools to see further themselves. They are not an expensive CRO, handing you back a couple of sequences to build and test after paying a rich fee. Neither are they trying to back their way into drug development. Rather, they are empowering biologists with computational models trained on the vast diversity of known proteins to do their own jobs better.
There are a number of reasons why this is the right paradigm for ML in biology:
• Models need to solve relevant tasks for drug development. Much of the recent progress in ML for biology has yielded wonders that, while advancing science, do not fundamentally alter the hard problems of drug discovery and development. AlphaFold structures are a fantastic resource, but not nearly accurate enough to even use for docking. ML that will make a difference requires solving scientists’ day-to-day problems.
• Models need refinement from specific, relevant data. Culturally and legally, it’s hard enough to send your data to a third party. Practically, it’s not easy to integrate well at scale. Giving customers a platform to work with, where they can value different datasets differently, and continue to update them over time is empowering.
• Integration into the scientific process matters. ML does not work well when you’re “throwing data over the wall.” Only extremely rarely does a model perform so well on its own that it can be applied and left to run untouched. Rather, iterative cycles of design / test / build and model improvement lead to success. You’re not going to iterate quickly, correctly, and efficiently with a separate third party doing computation or with processes blind to the laboratory scientists.
• Science is a Bayesian endeavor. We almost never get things right the first time and are constantly refining our hypotheses, integrating new information from a variety of sources. You can’t trade off computational optimizations vs hard-learned heuristics vs ad hoc feasibility (let alone exploration vs exploitation) with a list of sequence targets. Scientists must be able to see the holistic picture and work from there.
• Lastly, by providing an ML service that others can directly integrate into their workflows, OpenProtein allows biotechs to focus on their Core IP. Building foundational protein language models should not be a requirement for using them: it’s a huge cost in time, money, personnel, and compute. Rather, biotechs should be focusing on Core IP and prowess. If they’re building ML, it should be on top of these models tied to their specific tasks. OpenProtein is a huge step in that direction.
Proof of value is going to come from using ML tools to discover new science and better engineer biology. The diversity of sequence space, the complexity of modeling biological constraints, and the exponential vastness of potential experiments require novel computational tools. While the theoretical case is clear, the practical benefit has yet to be achieved.
Too many tools have focused on computational validation in a couple of toy problems rather than supporting novel biological research. There’s a lot we can do here to show novel powers while enhancing the research community. I think of large-scale evolutionary datasets: ML is a good way to encapsulate the differences in proteins over mammalian and human evolution, for instance, and it can be applied proteome-wide. Instead of yet another supposedly optimized antibody after a well-worn target that will never be tested in the lab, these methods should prove themselves over the diversity of biological research.
It’s time for us to be better biologists using the tools of machine learning and to help those in the laboratory see further with computation. OpenProtein is a big step in that direction and I’m delighted to be part of the journey.