February 29, 2020
Question: If a test to detect a disease whose prevalence is 1/1000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming you know nothing about the person’s symptoms or signs?
This seemingly simple problem, when posed to attending physicians, house officers and medical students, yielded concerning results in two studies conducted 35 years apart. In both the original paper by Casscells et al. published in the New England Journal of Medicine in 1978 and a follow-up study published in 2014 in JAMA Internal Medicine by Manrai et al., the answer most commonly given by respondents was 95%.
The correct answer, however, is less than 2% (about 1.96%, assuming the test detects every true case).
The correct positive predictive value can be calculated using Bayes’ theorem, which applies when prior knowledge (in this case, the prevalence of the disease in the population) bears on the event in question (here, the positive result). It is worth noting, however, the common sense approach to the correct solution laid out in the original paper:
"Only one of 1000 people studied will … have the disease, and 5% of the others (0.05 x 999), or roughly 50 persons (from the 1000 tested), will yield (falsely) positive results. Thus only one of 51 positive results will be truly positive, and the chance that any one positive result represents a person with the disease is one in 51, or less than 2 per cent."
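The same figure falls out of Bayes’ theorem directly. The short Python sketch below is illustrative only: the function name and its parameters are our own, and it assumes, as the quoted reasoning implicitly does, that the test catches every true case (a sensitivity of 100%), which the question itself does not state.

```python
def positive_predictive_value(prevalence, false_positive_rate, sensitivity=1.0):
    """Probability of disease given a positive test, via Bayes' theorem.

    sensitivity=1.0 is an assumption: the original question does not
    give the test's sensitivity.
    """
    # P(positive and diseased)
    true_positives = sensitivity * prevalence
    # P(positive and healthy)
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


ppv = positive_predictive_value(prevalence=1 / 1000, false_positive_rate=0.05)
print(f"{ppv:.2%}")  # prints 1.96%
```

Lowering the assumed sensitivity only lowers the result further, so "less than 2%" holds either way.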
That the general public cannot calculate, or has poor intuition for, statistics such as positive predictive value is one thing; more serious concerns about public health and safety arise when the main clinical actors in the medical profession are less than proficient with basic statistical tools.
– Jonathan Friedlander, PhD & Geoffrey W. Smith
Digitalis Commons is a non-profit that partners with groups and individuals striving to address complex health problems by building solutions that are frontier-advancing, open-access, and scalable.
Technical note: Synthesis.bio
In 2019, Digitalis Commons began exploring the mining of public data sets to identify patterns of value in biotech and related domains. Starting with the OpenFuego open source project that came out of the Nieman Journalism Lab, the Commons team moved on to create an entirely new application that pulls public data from Twitter APIs and scores mentioned URLs to identify resources of interest to a given community. That work has been running at synthesis.bio, an automatically curated list of the most interesting web pages among a community of Twitter users with a strong interest in biotech.
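To make the idea of scoring mentioned URLs concrete, here is a minimal sketch. It is not the Synthesis.bio implementation, whose scoring method is not described here. It assumes a hypothetical input of (user, URL) pairs and a deliberately simple hypothetical score: the number of distinct community members who shared each URL.

```python
from collections import defaultdict


def score_urls(mentions):
    """Rank URLs by how many distinct users mentioned them.

    `mentions` is an iterable of (user, url) pairs. Both the input shape
    and the scoring rule are assumptions made for illustration.
    """
    users_by_url = defaultdict(set)
    for user, url in mentions:
        # A set ignores repeat shares of the same URL by the same user.
        users_by_url[url].add(user)
    scores = {url: len(users) for url, users in users_by_url.items()}
    # Highest score first; ties broken alphabetically by URL.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample data, not real tweets.
mentions = [
    ("alice", "https://example.org/a"),
    ("bob", "https://example.org/a"),
    ("alice", "https://example.org/a"),
    ("carol", "https://example.org/b"),
]
ranked = score_urls(mentions)
print(ranked)
```

Counting distinct users rather than raw mentions keeps one prolific account from dominating the list; a real system would likely also weigh recency and other signals.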
The project is maturing rapidly and Digitalis Commons plans to release it as an open source project later this year, after adding further capabilities. The project is written in Python and makes use of the rich open source ecosystem for data projects, including Pandas, the ubiquitous toolset for manipulating columnar data in memory at high speed. The project runs in Docker containers deployed on Amazon's serverless cloud computing infrastructure and operates at remarkably low cost, illustrating the extraordinary potential for the creation and operation of next-generation data analytics platforms in the cloud: cheap, fast... and good.
Watch this space for more about the project as it grows and goes public. The team welcomes feedback and contributions of help, ideas, and beer. Contact the team at email@example.com.
Apply for a quick, targeted $3,000 grant to develop a public good for better health. Application details at digitaliscommons.org/dart-grants/.