January 10, 2023
In mid-2017, my data science team was tasked with building out a new genome assembly and annotation pipeline that could cover the vast expanse of fungal and bacterial diversity to support our development of novel microbial products. Our company was engaged in bioprospecting of microbes from sites across the US. Back in the lab, we were isolating, identifying, and then assaying a previously unmeasured wealth of biological diversity. Sequencing was a core piece of getting that right. The analytical tools the company had first implemented as an early-stage startup were producing low-quality genomes with far too few annotations: we weren’t learning much.
We ended up building a set of state-of-the-art genomic assembly, annotation, and comparison tools on a cloud-based microservices architecture. While it was excellent technology, it was not what a biotech startup needed its data scientists and software engineers working on. Because of a lack of tools and infrastructure, our data scientists were not focused on core scientific IP and our software engineers were writing standard, commoditized code. The lack of off-the-shelf software and systems slowed the pace of scientific research and development. We were spending our time as “digital plumbers” rather than contributing to core IP.
What did we need to solve? Raw genomic data, as it comes off the sequencer, isn’t particularly useful. A series of complicated steps is needed to extract meaningful biological insights from it: genome assembly, gene identification, and functional annotation. For each of these steps there exist sophisticated algorithms that make different statistical and biological assumptions, and those assumptions can produce dramatically different results. Every step must be independently validated and QC’d, and each one can fail for different reasons, from low-quality input data to model misspecification.
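The shape of the problem can be sketched in a few lines of Python. Everything here is a hypothetical stand-in, not our actual tools: each stage consumes the previous stage’s output, runs its own QC, and can fail for its own reasons.

```python
from dataclasses import dataclass

@dataclass
class StageResult:
    stage: str   # which pipeline step produced this verdict
    ok: bool     # did the step pass its own QC?
    detail: str  # human-readable note for the QC log

def assemble(reads):
    """Toy stand-in for genome assembly, with a QC check on input depth."""
    if len(reads) < 3:
        return StageResult("assembly", False, "too few reads"), None
    contigs = ["".join(reads)]  # pretend assembly
    return StageResult("assembly", True, f"{len(contigs)} contig(s)"), contigs

def call_genes(contigs):
    """Toy stand-in for gene identification, with a QC check on contig length."""
    if len(contigs[0]) < 6:
        return StageResult("gene_calling", False, "contigs too short"), None
    genes = [contigs[0][i:i + 3] for i in range(0, len(contigs[0]), 3)]
    return StageResult("gene_calling", True, f"{len(genes)} genes"), genes

def annotate(genes):
    """Toy stand-in for functional annotation."""
    annotations = {g: "hypothetical protein" for g in genes}
    return StageResult("annotation", True, f"{len(annotations)} annotated"), annotations

def run_pipeline(reads):
    """Run the stages in order, stopping at the first QC failure."""
    qc, data = [], reads
    for step in (assemble, call_genes, annotate):
        result, data = step(data)
        qc.append(result)
        if not result.ok:
            break
    return qc, data
```

The point is not the toy logic but the structure: every step carries its own validation and its own failure modes, and the pipeline has to surface which step failed and why.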
It took us nearly a year of work to get a high-quality system into production. Design, research, and comparison between tools, while closely related to our core IP, took only a small fraction of that time; implementation took considerably longer. Why was this? We had a brilliant computational biologist designing and testing the tools and excellent software engineers to put them into a microservices architecture. But the software engineers needed to learn the biological problem, the computational biologist needed to figure out which steps to modularize and what the inputs, outputs, and states would be, and everything had to be translated into AWS and Python.
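Much of that translation work was making implicit contracts explicit. A minimal sketch of what each step had to be pinned down to before it could become a microservice (all names, file types, and states invented for illustration):

```python
# Hypothetical step contracts, invented for illustration: before any step
# could become a microservice, someone had to decide its inputs, its
# outputs, and the states it could report (including logged failure modes).
STEP_SPECS = {
    "assembly": {
        "inputs": ["raw_reads.fastq"],
        "outputs": ["contigs.fasta"],
        "states": ["queued", "running", "succeeded", "failed:low_coverage"],
    },
    "annotation": {
        "inputs": ["contigs.fasta"],
        "outputs": ["annotations.gff"],
        "states": ["queued", "running", "succeeded", "failed:no_genes_found"],
    },
}

def compatible(upstream, downstream):
    """Check that one step's outputs satisfy the next step's inputs."""
    return set(STEP_SPECS[downstream]["inputs"]) <= set(STEP_SPECS[upstream]["outputs"])
```

None of this is biology; it is interface design, and it is exactly the kind of commoditized work that consumed our engineers’ time.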
Far too much standard bioinformatics code is spread among disparate languages (Perl, R, Python, etc.) and inefficient implementations. While we built great software, almost none of this work should have been necessary, beyond picking the genome annotators and perhaps testing the assemblers. Certainly, there should have been solutions available (as there are starting to be now) that do not require a squad of software engineers to implement genomics pipelines in the cloud.
What should they have been doing instead? The company was interested in sequencing a diverse collection of novel bacteria and fungi. Our core questions were about how to understand the taxonomic and functional diversity of these organisms and how microbial genotype relates to phenotype. These are questions of genomic annotation beyond the standard tools (e.g., PFAM, COGs), phylogenetic inference, and comparative genomics. With proper cloud-based tools, we could have hit the ground running on those questions, rather than identifying the best way to put an assembler on AWS and how we would log failure states.
Data and software teams are expensive: they should be focused on core IP and enabling the next key experiment. Frequently, this will involve building pipelines and extending software systems. These should not be for standard tasks in data science or comp bio, but for novel methods tied to the company’s fundamental technology. Scientific progress in all areas of biopharma, not just genomics, will depend on having reusable, standard infrastructure we can use and build on. If we’re going to learn at scale from data, we need better systems from the beginning.