Engineering Biology: ML + Medicine—A Hammer in Search of Nails

Jacob Oppenheim, PhD

June 29, 2023

We have heard stories about how computation and Machine Learning (ML) are poised to change medicine for well over a decade now. Conferences, press reports, promising results—yet little has changed. Brash company launches have turned into biannual reorgs and strategic redirections. Algorithms have been deployed and quickly yanked back; Geoff Hinton famously predicted the imminent end of radiologists almost a decade ago, yet the demand for them grows yearly.

Why is this? Healthcare is technically complex, highly regulated, and full of large, often sclerotic businesses with unexpected incentive structures. Technically, the problems are difficult to get right; operationally, integration of successful solutions must contend with numerous, opinionated players; and logistically, the business is often extremely difficult to make work, given clashing incentive structures, and complex reimbursement arrangements among the various actors.

At heart, I believe we have a hammer in search of nails. We are throwing machine learning models at medicine because it is a data rich domain and a huge share of the economy, not because we have a clear sense of what to do or how to do it.

I’ve written the piece below as an opinionated, technical view of the landscape including its common pitfalls and where we see people escaping them to create real progress. I’m purposefully avoiding, for now, the questions of business models and deployment to focus on the underlying technical hurdles.

Small, unrealistically dense datasets
For over a decade, the standard toy dataset has been one generation or another of MIMIC (Medical Information Mart for Intensive Care), which consists of Electronic Health/Medical Records (EHR/EMR) from a smallish sample of very sick patients in the ICU at one academic medical center in Massachusetts. The data are clean, there’s frequent sampling of laboratory values, and the population is now in the low tens of thousands. Unsurprisingly, it’s very popular to try to build clinical decision support systems on top of it.

While an impressive endeavor, MIMIC does not reflect the general condition of EHR/EMR data, which are sparse, unevenly sampled, and often lacking standardized lab measurements, when they have them at all. Patients in the ICU are highly unrepresentative, even of the ill.  By over-indexing on ICU datasets such as MIMIC, researchers herd into a handful of niche tasks; for example, the number of sepsis detection algorithms built is astounding, especially considering few, if any, generalize beyond MIMIC.

Recently, more academic medical centers have been building analytics capabilities, interventional models, and, in effect, their own, real-time MIMIC-like datasets. Erich Huang at Duke led many of the first such efforts; David Scheinker at Stanford and Karandeep Singh at Michigan and their collaborators are doing incredible work today. Most notably, Suchi Saria’s multi-year effort to build clinically relevant decision support algorithms has met with some demonstrable success. I’m hopeful that more of these efforts, combining on-line prediction and close engagement with practitioners, will bear fruit over time.

Data volume over quality
The opposite problem exists, too. Where large datasets can be assembled from medical claims and snippets of EHR/EMR data, you’ll find a myriad of vendors peddling algorithms for everything from clinical trial patient identification to, yet again, sepsis prediction. These datasets lack key covariates for meaningful predictions and oftentimes lack accurate information about what was to be predicted (e.g., eligibility for a clinical trial, which mostly depends on factors not captured in an EHR/EMR). After Epic’s disastrous rollout of a sepsis prediction algorithm that could be charitably described as relying on correlates of correlates of correlates, there has been increased recognition of data quality pitfalls, yet there remains a long way to go.

Where the data generation is better understood, there have been real successes in this space, especially for the inference of underlying diagnoses. Notably, many of these papers are small, hard-fought wins, emphasizing the importance of practitioner engagement and human-in-the-loop validation. The best data sources tend to be those where the vendor has skin in the game themselves, using what they are selling for their main line of business.

Trusting in the Notes
If the solution is not in data depth or volume, perhaps it is in the unstructured, clinical notes that accompany the EHR/EMR. Moderately successful businesses have been built structuring these notes for research in oncology, yet despite a decade plus of effort, data volumes remain limited to the low hundreds of thousands, prices are high, the transcription work is manual, and the extension to other disease areas has not occurred. A fundamental question thus remains as to whether there is consistently meaningful data in these notes after all. Shorthand remarks for future reference by the clinician or their colleagues may not be generally tractable for data analysis. The diversity of potential stakeholders beyond clinical medicine (e.g., billing, legal, patient transparency) further presents complexities beyond typical natural language processing.

Real World Evidence
A fourth, arguably older line of research is Real World Evidence (RWE), which attempts to draw precise statistical inferences on health outcomes, as you would in a randomized clinical trial, but from observed EHR data. While the approach is putatively simple, because it relies on the data that are available rather than on carefully defined experiments, its estimates are confounded by definition. The mathematical machinery to correct for confounding bias improves yearly, yet relies on our ability to measure the confounders. Unfortunately, we don’t know all the confounding processes and don’t capture all of the details that we do know could lead to confounding (e.g., socioeconomic status). Even when constructed rigorously, high-quality RWE datasets are designed for specific comparisons and not data mining, where confounding will creep back in.
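
To make the dependence on measured confounders concrete, here is a minimal simulation sketch. Everything in it is an assumption invented for illustration: the binary confounder `severe`, the treatment probabilities, and the effect sizes are hypothetical and not drawn from any real dataset. Sicker patients are both more likely to be treated and more likely to do poorly, so a naive comparison of treated and untreated patients makes a beneficial treatment look harmful. Stratifying on the confounder recovers the true effect, but only because the simulation records it.

```python
import random

def simulate(n=20000, true_effect=1.0, seed=0):
    """Generate (severe, treated, outcome) rows with a built-in confounder.

    All numbers are invented for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5            # hypothetical confounder
        p_treat = 0.8 if severe else 0.2       # sicker patients get treated more often
        treated = rng.random() < p_treat
        # Treatment helps (+true_effect); severity hurts (-3.0); plus noise.
        outcome = true_effect * treated - 3.0 * severe + rng.gauss(0, 1)
        rows.append((severe, treated, outcome))
    return rows

def mean(xs):
    return sum(xs) / len(xs)

def naive_effect(rows):
    """Difference in mean outcome, treated minus untreated, ignoring severity."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)

def stratified_effect(rows):
    """Average the within-stratum differences, weighted by stratum size."""
    total = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        total += naive_effect(stratum) * len(stratum) / len(rows)
    return total

if __name__ == "__main__":
    rows = simulate()
    print("naive estimate:     ", round(naive_effect(rows), 2))
    print("stratified estimate:", round(stratified_effect(rows), 2))
```

With these made-up parameters the naive estimate comes out negative while the stratified one lands near the true effect of 1.0. If the `severe` column were dropped, as unmeasured confounders are in practice, no amount of downstream machinery could recover it from the remaining two columns.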

The exercise of comparing similar patients across clinical experiments and observational datasets is an important one, but the leap to precise statistical estimates of treatment effects and comparisons amongst them is a stretch at best and too often undermined by poor methodology. Where we have seen successes, it has often been in rare disease, where small numbers of well-studied patients can elucidate confounders and provide grounds for effective synthetic controls and efficacy comparisons. Jumping straight to complex statistical machinery without first exploiting the power of understanding the data in all its peculiarities is at best questionable data scientific practice.

Medical Imaging
The application of ML to analyzing scans and other forms of medical imaging is logical and well posed in a way that inference from EHR/EMR data is not. It is, however, hampered by small data volumes, nonstandardized machinery, conditions, and covariates, and the ease with which neural nets tend to overfit. These barriers are logical, but we keep running into inconsistencies, poor annotations, and diverse conditions that throw results into question. At Emory, Judy Gichoya’s group has shown that even when you standardize equipment and covariates and train models on a diverse population, they still retain the (relatively) inexplicable ability to accurately predict the patient’s race. The naive selection of tasks, too, continues to produce seemingly impressive results on clinically meaningless problems. Clearly, more work needs to be done here.

Across these disparate areas there are consistent themes: We are throwing tools at problems and not respecting data generating processes and the complexities of medical practice. Data are presumed to be clean, dense, and complete when they rarely are in practice. The modeled tasks are unrepresentative of medical practice. While there are numerous excellent researchers dedicated to particular niches, they are all too frequently buried by the hype cycles. We need to focus on integration with processes and the data available to clinicians, not on throwing ML at every possible dataset.

The introduction of ChatGPT has started the hype cycle anew. The ability of very large language models, trained upon medical exams, to score highly on such exams has been heralded as a triumph of Machine Learning, rather than yet another form of database retrieval. Of course, physicians have been using a more effective electronic system, UpToDate, for 30 years now*, suggesting even the database retrieval aspect is overhyped.  We see, too, many rushing to put LLMs in the clinic, application unknown. Another hammer in search of a nail.

But, even in LLMs, there are green shoots of progress. Abridge, founded well before the ChatGPT hype cycle and led by a physician, integrated into an academic medical center, working on a relevant and seemingly soluble problem (visit transcription and summarization), seems to have found a meaningful opportunity for ML technology. As in most cases with ML, I expect we will find that the successes come not from throwing standard models at extant data, but instead from applying the technology where it fits the contours of physician practice and data generation. More is different and we should exult in it.

*The first half of this article has a good history of UpToDate followed by speculations on the next generation of similar technologies
