February 27, 2023
How does data science fit into the biopharma tech stack? The analytical operations involved are certainly more complex than the transformation and aggregation of data. This might suggest that data science is an artisanal, intellectual operation built off of the core data repository; in essence, an extension of laboratory science to computation. While tempting, this pattern only leads to confusion, frustration, and a misuse of human and silicon capital. Just as we are industrializing biological discovery and drug development, so must we with data science.
Complex analytics, machine learning, and the like should be viewed as more involved data transformations. Fundamentally, they share the core characteristics of any transformation: take a data table, perform some operations, and return a data table. A focus on charts and decisions to be made can obscure this similarity, but underlying the visuals and recommendations lies a data table or three. Once we view analytics as a “pop in” module and more complex form of transformation, how we scale data science becomes clearer.
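To make the table-in, table-out framing concrete, here is a minimal sketch in Python with pandas. The function and column names (summarize_plates, score_compounds, "compound", "readout") are illustrative assumptions, not a real pipeline; the point is that a plain aggregation and a mathier scoring step honor the same contract:

```python
import pandas as pd


def summarize_plates(df: pd.DataFrame) -> pd.DataFrame:
    """Plain aggregation: mean readout per compound."""
    return df.groupby("compound", as_index=False)["readout"].mean()


def score_compounds(df: pd.DataFrame) -> pd.DataFrame:
    """A 'mathier' step, but the same contract: table in, table out."""
    out = df.copy()
    out["z_score"] = (out["readout"] - out["readout"].mean()) / out["readout"].std()
    return out


def run_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    # Either step can later be swapped for a more sophisticated model,
    # because each honors the same table-in, table-out contract.
    return score_compounds(summarize_plates(raw))
```

Because every module looks the same from the outside, a fitted model drops into the pipeline exactly where a group-by once sat.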
This is a bold claim and there’s reason to be skeptical. Data science evolved from in-house statistical teams designing and analyzing experiments, and exploring results. These efforts were limited by computational power and data storage, bounding their breadth and throughput. Consider, though, analyzing a series of replicates from the lab. The simplest analysis is just to take the mean, nothing more than a data aggregation or a simple map-reduce step. Perhaps the results are messy and we want to apply robust estimators, or fit additional parameters to capture variation. From there, we might want to correct for factors of interest and learn the uncertainty in them. We soon arrive at a complex statistical model that might be inferred with machine learning. At the root of the task, though, is the same question: what are the right quantities that summarize a set of results?
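As a toy illustration of that escalation, assuming made-up replicate data with a readout column and a batch factor (and statsmodels for the last step), the progression from a mean to a robust center to a model with uncertainty might look like:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "readout": rng.normal(10.0, 1.0, 12),
    "batch": ["A"] * 6 + ["B"] * 6,
})
df.loc[3, "readout"] = 25.0                 # one wild replicate

simple_mean = df["readout"].mean()          # step 1: plain aggregation
robust_center = df["readout"].median()      # step 2: robust to the outlier

# step 3: correct for a batch effect and report uncertainty on the estimates
fit = smf.ols("readout ~ C(batch)", data=df).fit()
print(simple_mean, robust_center)
print(fit.params, fit.bse)
```

Each step answers the same underlying question with progressively more machinery.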
What does this look like in practice? Analyses are mathier transformations in a different language, with results written back to the database. They are run every time new data come in, either on a schedule or when triggered by an event. Results are product, not analytical exploration. By connecting a dashboard or BI tool to the analyzed results table, anyone, at any time, can get the freshest results and explore them. There are no longer delays waiting on report runs or data scientists’ schedules. A company no longer merely uses data but is instead powered by data.
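A minimal sketch of such an analysis-as-product step, assuming a hypothetical raw_assays table and a SQLite file standing in for the warehouse (the names and schema are placeholders): the function reads the newest data, recomputes the summary, and writes it back to a results table that a dashboard or BI tool queries directly.

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///assay_results.db")


def refresh_results() -> None:
    raw = pd.read_sql("SELECT compound, readout FROM raw_assays", engine)
    summary = raw.groupby("compound", as_index=False).agg(
        mean_readout=("readout", "mean"),
        sd_readout=("readout", "std"),
    )
    # Overwrite the analyzed table so the BI tool always sees the freshest results.
    summary.to_sql("analyzed_assays", engine, if_exists="replace", index=False)


# Invoked by a scheduler (cron, Airflow, etc.) or by an ingestion event hook:
# refresh_results()
```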
For a data science team, this is liberating. There’s no going back and tweaking the fonts and axes in charts. The style sheet is agreed on upfront and every user can explore the data. If changes and upgrades are needed, they can be accumulated, prioritized, and built for scheduled releases. Data scientists are no longer constantly making changes to static reports and can instead focus on Core IP.
Analytical improvements are easier. Training datasets for ML are automatically accumulated. New models can be swapped in and out of pipelines that run in the cloud and compared both retrospectively and prospectively. By saving computational results, multiple models can be maintained for a reasonable length of time, making it possible to understand both forward-looking performance and how decisions would have been different had an updated model been used in the past. Nothing is lost. Model improvements can be traded off against other tasks, as their performance and decision-making implications are tracked. Sometimes a better model simply doesn’t matter, because it would not alter downstream actions. Development cycles are shortened and Core IP accumulates faster.
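One way to keep multiple models comparable, sketched here with hypothetical model versions and an illustrative decision threshold: score every run with both the production and candidate models and persist the results side by side, so retrospective “would the decision have changed?” questions stay cheap to answer.

```python
import pandas as pd


def score_v1(df: pd.DataFrame) -> pd.Series:
    return df["readout"]                        # current production model


def score_v2(df: pd.DataFrame) -> pd.Series:
    return df["readout"] - df["background"]     # candidate replacement


def score_all(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["score_v1"] = score_v1(df)
    out["score_v2"] = score_v2(df)
    # Persisting both scores per run means nothing is lost: later we can ask
    # how often the candidate would have changed the downstream call.
    out["decision_changed"] = (out["score_v1"] > 1.0) != (out["score_v2"] > 1.0)
    return out
```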
Culturally, this approach improves relations between data scientists and their internal customers. Scientists (and others) are smart, just data-limited. By providing reusable tools that are constantly improving, data science empowers users, rather than seeming like a bunch of hotshot nerds who take the hard work of others and tell them what to do with it. Data scientists focus on product: analyses feeding tools that others can use to make decisions. Standard release and update cycles let customers know what they can expect. Rather than chasing analytical problems that pique data scientists’ interest, computational sophistication can be tied to questions that need to be answered, processes that need to be automated, and the accumulation of Core IP. With results freely available to all, the next analysis to run becomes an open, transparent discussion with clear parameters instead of an endless, opaque debate.
Conversations between data science and users are much more fruitful. With standard, reusable product, the focus is on the analyses and what the data mean. Are we capturing everything we should? How else could we cut these results? What other factors should we consider? Modular analytics means talking less about DS model choices and more about their scientific implications. Too often, these conversations are diverted by tangents on how analyses are run, which data scientist is doing what, what the tech stack looks like, etc. By building off of a standard set of tools and “popping in” new analytics, these distractions can be minimized and the focus can return to where it belongs: collaboration on Core IP.
In this paradigm, it’s worth asking where in an organization data science should live. Traditionally part of R&D, data science can’t be traded against biology, chemistry, or even target product profiles. Data are data; if you’re arguing with them, you’re losing. Rather, data science should be informing every one of these groups: the biologists, the chemists, the program and product managers, the medical teams, etc. If more data or improved analytics are needed, that should be driven by the needs and requirements of one of those teams, not arise sui generis from data science. In this way, data science is more a core operational capacity than another scientific team.
This may be too glib. There are activities where genuine discovery collaborations exist and real tradeoffs must be made between data science and laboratory (and other) teams: experimental design, for one; the scale of data collection needed to inform statistical learning and model building, for another. Data science, too, must help guide stakeholders so that they can even understand what they can ask for analytically. It’s a fundamental tension, and I’m curious as to other people’s thoughts.