Engineering Biology: Generating Data at Scale—Tools and Systems

Jacob Oppenheim, PhD

January 17, 2023

What do we need to generate data at scale? Practically, we need tools that allow us to run experiments: laboratory informatics, automation, and data capture. More is needed, however, if we are always to be performing the key experiment: we must be able to design new experiments based on results as they come in, not lay out ten thousand in advance and wait a month. These are fundamentally data problems, yet we do not have systems designed to enable their solution.

Whenever I discuss this issue with scientists and industry leaders, the first question is usually “What about [LIMS / ELN / Instrument Data]?” While these systems all rely on data as an enabling technology, they do not share a common operational thesis. Each was designed for a specific workflow and set of use cases. All are ill-equipped to provide real-time analytics that monitor operations and results and immediately inform the design of new experiments. By analogy, stoves and ovens both use heat to cook food. Cooking well requires both, not a hybrid of the two. Similarly, success will require us to put these pieces together into a seamless system, not to shoehorn extra functionality into any one of them.

LIMS, or Laboratory Information Management Systems, are designed from the perspective of an experimental sample moving through a series of industrialized processes. A compound is synthesized, or a biological specimen is pulled from the freezer. Some operations occur, be they chemical reactions, measurements, or transformations. The LIMS is designed to log the results of these steps over the lifetime of the sample. It is built for scale, not flexibility: each step of the process must be individually configured and cannot easily be changed. The scientific process, however, is messier. A classic failure mode for LIMS is assay development, where conditions and controls are changed frequently to determine the right experimental parameters. While any experimental design can in theory be programmed into your LIMS, each change requires a new configuration of the system. With a ballooning number of therapeutic modalities, from cell and gene therapies to drugging molecular condensates, the presets don’t work. Processes today rely on data-generating modalities that did not exist a decade ago, and these entirely new workflows must be programmed into a LIMS to make it useful. Unsurprisingly, this process is painful, error-prone, and a source of considerable customer frustration.

A subtler issue concerns how we think about data: we are interested in experiments, not samples. We run highly multiplexed assays across hundreds of samples. We run replicates on separate days that we wish to compare. There are precise experimental conditions we wish to record that change frequently but are common across samples. By keying to the sample rather than the experiment, a LIMS must be shoehorned into capturing, recording, and standardizing the data you need for every single experiment. Frequently, this means companies miss key experimental parameters and metadata, find replicates difficult to aggregate, record, and visualize, and lack any overall insight into the operations of the laboratory itself.
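To make the contrast concrete, here is a minimal sketch of an experiment-keyed data model. Every class and field name is hypothetical, not drawn from any particular LIMS; the point is only where the key sits.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical schema: the experiment, not the sample, is the primary key.
@dataclass
class Experiment:
    experiment_id: str
    assay: str
    conditions: dict   # free-form parameters that change run to run
    run_date: str

@dataclass
class Measurement:
    experiment_id: str  # foreign key back to the experiment
    sample_id: str
    replicate: int
    value: float

def replicate_summary(measurements, experiment_id):
    """Aggregate all replicates belonging to one experiment."""
    values = [m.value for m in measurements if m.experiment_id == experiment_id]
    return {"n": len(values), "mean": mean(values)}
```

Keyed this way, frequently changing conditions live on the experiment record, and summarizing replicates is a one-line filter rather than a reconfiguration of the system.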

ELNs, or Electronic Lab Notebooks, are designed from the perspective of a scientist exploring the dark forest of the unknown. By giving a place to capture the information from a series of designs and experiments, they provide a record of how a scientist reached a specific point. In some laboratory functions this has proved very useful through simple comp bio integrations, e.g. for scientists designing CRISPR guide RNAs. For others, like medicinal chemistry and drug design, they’ve been a hassle. Fundamentally, though, they do not address the problem of aggregating across scientists, compounds, and experiments: you can trace one line of thinking, when what we want to understand is all of the data generated across scientists and across lines of scientific inquiry. ELNs are modern tools for artisans; what we need are modern tools for industry.

So why can’t we just go directly to the machines performing the experiments? Machine data are designed to be logged continuously, measuring the state of the machine and what it can detect at every point in time. In many cases, these data exist to provide a record of the machine’s performance so that breakdowns can be diagnosed. Many of them are redundant (e.g. temperature measured every minute of every day when you only need it at the time of your experiment), irrelevant (days since last restart), or too raw (light intensities on a sensor) to use. A considerable amount of aggregation, transformation, and analysis is needed to get usable results from one of these machines. This gap in levels of abstraction is a major hindrance to biomanufacturing, where extant data systems and standards miss the proverbial forest for the trees. Aggregate reporting must be done by hand and lives in motley Excel sheets and PDFs.
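As a toy illustration of that abstraction gap, the sketch below reduces a continuous, mostly redundant temperature log to the one number an experiment actually needs. The instrument and its logging format are invented for the example.

```python
from statistics import mean

# Invented example: a plate reader logs temperature every minute, all day.
# Only the window of the actual run matters; the rest is redundant telemetry.
def experiment_temperature(log, start_min, end_min):
    """Reduce a continuous (minute, temp_C) log to one usable number:
    the mean temperature during the experiment's time window."""
    relevant = [temp for minute, temp in log if start_min <= minute <= end_min]
    return mean(relevant)

# 24 hours of raw readings, almost all of them irrelevant to any one run.
raw_log = [(m, 36.5 + 0.1 * (m % 3)) for m in range(24 * 60)]
```

Real instrument integrations involve far messier transformations than this windowed mean, but the shape of the problem is the same: raw telemetry in, experiment-level results out.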

What we need are data systems designed from the perspective of an organization industrializing discovery. They must be keyed to the laboratory and the process, not the sample, the scientist, or the machine. How many experiments were performed in the past day? What do the results look like? How do controls compare day to day? How do the results differ by condition for every replicate of every experiment? Not only do we need all the data, we need to be able to filter it to monitor consistency over time.

We can start to see how such a system would be built: it would aggregate over the samples in a LIMS, but with the flexibility to add new columns and tables; it would add the depth of the machine data; and it would situate everything in the scientific context of the ELN. We need to pivot and aggregate all of these sources to accumulate the data from the entire lab, properly analyze the results, and place them into dashboards and apps whose workflows mirror scientific and industrial processes. Modern cloud-native data systems allow us to do this while exposing the results to all in a democratized fashion. We’ll dive into how in the coming weeks.
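A cartoon of the join at the heart of such a system might look like the following. Every identifier here is hypothetical, and the sketch's biggest assumption is the hard part in practice: that all three sources agree on a shared experiment key.

```python
def lab_view(lims_rows, eln_notes, machine_summaries):
    """Pivot sample-keyed LIMS rows up to the experiment level, then
    join in ELN context and instrument summaries. All three inputs
    sharing an experiment_id is this sketch's big assumption --
    today's systems rarely agree on such a key."""
    view = {}
    for row in lims_rows:
        exp = view.setdefault(row["experiment_id"], {"samples": []})
        exp["samples"].append(row["sample_id"])
    for exp_id, note in eln_notes.items():
        view.setdefault(exp_id, {"samples": []})["context"] = note
    for exp_id, summary in machine_summaries.items():
        view.setdefault(exp_id, {"samples": []})["instrument"] = summary
    return view
```

In a real deployment this pivot would live in a cloud data warehouse feeding dashboards, not a Python dict, but the logical operation is the same: one experiment-level record assembled from three sample-, scientist-, and machine-keyed sources.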