Engineering Biology: Generating Data at Scale—The Organization of Information

Jacob Oppenheim, PhD

January 24, 2023

Capturing and recording all relevant data is only half the battle. We then need to make it useful. In practice, we will have a deluge of information, much of which will be hard to parse without the relevant context: high throughput instrumental recordings, metadata tables, and the tracing of samples throughout laboratory workflows. These data need to be transformed, aggregated, analyzed, and visualized to be useful to all but data experts. If the right analytics are created, scientists can focus on performing the next key experiment, while operations can understand the throughput and value of laboratory effort. In short, you can have an efficient biopharma.

Unlike the problems posed by new laboratory systems, these are largely industry-agnostic, yet solving them is not common practice in biopharma. There are two reasons for this: one practical, one cultural.

As compute and storage, especially in the cloud, have fallen dramatically in price, the data processing, storage, and analytics pipelines needed to use data without losing information have gone from wildly expensive to simple and feasible. Technically, we talk about moving from an ETL (Extract-Transform-Load) paradigm to ELT (Extract-Load-Transform). In layman’s terms, we once had to aggregate, reduce, and simplify data before loading it into a database, to avoid immense storage costs, infeasible analytics operations, and enormous engineering teams; and even if you had all the data in a database, you couldn’t work with it all at once. Cloud data warehouses such as Snowflake and BigQuery have solved the first problem, and cheap cloud compute on AWS, GCP, and Azure has solved the second. A world where you don’t need to preallocate uses for data, or throw away what is not immediately needed, enables novel and divergent approaches to using data. In a future post, I’ll dive deeper into some of these technologies and how they allow teams to operate at scale with dramatically fewer engineers.
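To make the ETL-versus-ELT distinction concrete, here is a minimal sketch in Python, using an in-memory SQLite database as a stand-in for a cloud warehouse. The table, compound IDs, and assay values are invented for illustration; the point is the shape of the workflow, not the specifics.

```python
import sqlite3

# ELT sketch: load raw assay readings untouched (the "EL"), then
# transform inside the warehouse with SQL (the "T"). SQLite stands in
# for a cloud warehouse like Snowflake or BigQuery; names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_readings (compound_id TEXT, assay TEXT, value REAL)")

# Load everything as-is: no upfront aggregation, nothing thrown away.
raw = [
    ("CMPD-001", "IC50", 12.5),
    ("CMPD-001", "IC50", 11.5),
    ("CMPD-002", "IC50", 48.0),
]
conn.executemany("INSERT INTO raw_readings VALUES (?, ?, ?)", raw)

# Transform on demand: any number of downstream views can be derived
# later from the same raw table, since storage and compute are cheap.
summary = conn.execute(
    "SELECT compound_id, AVG(value) AS mean_value, COUNT(*) AS n "
    "FROM raw_readings GROUP BY compound_id ORDER BY compound_id"
).fetchall()
print(summary)  # [('CMPD-001', 12.0, 2), ('CMPD-002', 48.0, 1)]
```

Under the old ETL regime, only the per-compound averages would have survived; here the raw rows remain available for whatever question comes next.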

On a cultural level, processing data has never been “sexy” the way machine learning or novel computational tools have been. Getting data operations right requires detailed process mapping, to ensure all relevant information is captured and the right analytics are served. It also requires the right keys, or master data: the unique identifiers that ensure the same thing is referred to in the same way every time it occurs. Think of a small molecule as it moves from design by a medicinal chemist through synthesis, laboratory assays, preclinical experimentation, clinical trials, manufacturing development and scale-up, and finally commercialization. Every time it’s referenced, you want to be sure it links back to the same entity; otherwise you will have a heap of messy, disparate data. Each reference also requires the proper metadata, so that you know all relevant parameters: when it was synthesized, what the purity was, the lot, and so on. This type of work can feel like hacking through weeds, but it’s vital to get right at the beginning, and now we have the tools to make it easier and more effective.
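The payoff of a single canonical key can be sketched in a few lines. This is a toy example with invented identifiers and fields, not a real compound-registration system: because every record carries the same `compound_id`, assembling a molecule’s full history is a trivial group-by rather than a fuzzy-matching exercise.

```python
from collections import defaultdict

# Hypothetical master-data sketch: one canonical identifier per
# compound, carried through every workflow stage. All names and
# fields are illustrative.
records = [
    {"compound_id": "CMPD-001", "stage": "synthesis", "lot": "L-7", "purity": 0.98},
    {"compound_id": "CMPD-001", "stage": "assay", "ic50_nM": 12.0},
    {"compound_id": "CMPD-001", "stage": "preclinical", "species": "mouse"},
    {"compound_id": "CMPD-002", "stage": "synthesis", "lot": "L-9", "purity": 0.95},
]

# With a shared key, a compound's history falls out of a simple group-by.
history = defaultdict(list)
for rec in records:
    history[rec["compound_id"]].append(rec["stage"])

print(history["CMPD-001"])  # ['synthesis', 'assay', 'preclinical']
```

Without that shared key, each of these records would name the molecule slightly differently, and reconstructing the same history would mean error-prone matching on names, structures, or dates.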

It’s relatively common for organizations to pay lip service to these needs. Many pharmas have data management teams ostensibly devoted to these efforts, but they are rarely funded or supported well enough to make much headway. Decades of data disorganization will not be solved overnight, but that is no excuse not to do better on current efforts. The fruits of proper data flows appear quickly, if properly cultivated.

With all of your data properly organized in a cloud warehouse, it is now easy to drop it into a dashboard and give live, accurate, up-to-the-minute information to every stakeholder in your organization. Not only do they no longer need to rely on possibly out-of-date reports lost amid email or file clutter, they can be assured that they are looking at the same views of the data as every other stakeholder. Operationally, this removes a considerable amount of spin*, redundancy, and disorganization from a company. Building great (data) tools and products is part of this, too, which we’ll talk about in the future.

This “omnichannel approach” need not end with simple data visualizations and aggregations. We can put complex analytical pipelines into our cloud workflows and give any stakeholder the power of a data science team’s models. Not only does this free data scientists to focus on core IP: modeling data, building better models, and helping to design the next experiment. It also allows lab scientists to operate independently and focus on getting to the next key experiment. The virtuous circle of learning loops accelerates.

Similarly, an organization can be strategic through online data monitoring. Executives and operations can track laboratory throughput, consistency, and progress: questions that are essentially unanswerable at most companies today. I have yet to meet a biopharma exec who knows exactly how many experiments were run last week and whether they were an efficient use of resources, let alone exactly how much was learned. These questions are answerable with today’s cloud-based tools; what is needed is a continual focus on data operations, master data management, and rigor. Not only will this free your data teams from working as expensive (data) janitors, it will enable everyone to focus on core IP and productivity.
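The exec’s question above (“how many experiments ran last week?”) becomes a one-line query once experiments are logged in a warehouse. Here is a hedged sketch, again with SQLite standing in for a cloud warehouse; the table, teams, and dates are invented for illustration.

```python
import sqlite3

# Hypothetical throughput query: with experiment runs logged centrally,
# weekly counts per team are one GROUP BY away, ready to feed a live
# dashboard. Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiments (run_id TEXT, team TEXT, run_date TEXT)")
conn.executemany(
    "INSERT INTO experiments VALUES (?, ?, ?)",
    [
        ("R-1", "biology", "2023-01-16"),
        ("R-2", "biology", "2023-01-17"),
        ("R-3", "chemistry", "2023-01-18"),
    ],
)

# Count last week's runs per team (ISO dates sort lexicographically,
# so BETWEEN works on the text column).
counts = conn.execute(
    "SELECT team, COUNT(*) FROM experiments "
    "WHERE run_date BETWEEN '2023-01-16' AND '2023-01-22' "
    "GROUP BY team ORDER BY team"
).fetchall()
print(counts)  # [('biology', 2), ('chemistry', 1)]
```

The hard part is not the query; it is the data operations and master data discipline that get every run logged with consistent identifiers in the first place.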

*To give an example, take health care providers: doctors, hospitals, etc. At EQRx, we needed to know who they were, what their patient volumes and technological characteristics were, and which payers they were connected with, among many other details. Corporate strategy wanted to plan engagement with the providers who saw the most patients, medical affairs with the most influential, commercial wanted to find ones connected with payers we had partnerships with, clinical operations wanted to find whom to run trials with (or whom to refer patients to for trials), and so on. Traditionally, each team would have its own analysts or consultants creating reports, potentially off of different data sources or versions thereof. With a modern data stack, we could build interactive dashboards for all of these teams based on the exact same information, ensuring a comprehensive, cohesive strategy.