Clinical prediction models should be validated before implementation in clinical practice. But is favorable performance at internal validation or one external validation sufficient to claim that a prediction model works well in the intended clinical context?
We argue to the contrary: because (1) patient populations vary, (2) measurement procedures and definitions of predictors and outcomes vary across settings, and (3) populations and measurements change over time, we have to expect heterogeneity in model performance between locations and settings.
It follows that prediction models are never truly validated
This does not imply that validation is not important; rather, the current focus on developing new models should shift to a focus on more extensive and well-reported validation studies of promising models.
Principled validation strategies are needed to understand and quantify heterogeneity
and update prediction models when appropriate
Such strategies will help to ensure that prediction models stay up-to-date and safe to support clinical decision-making
Whereas internal validation focuses on reproducibility and overfitting
external validation focuses on transportability
Although assessing transportability of model performance is vital
an external validation with favorable performance does not prove universal applicability and does not justify the claim that the model is ‘externally valid’
the aim should be to assess performance across many locations and over time
in order to maximize the understanding of model transportability
we argue that it is impossible to definitively claim that a model is ‘externally valid’
and that such terminology should be avoided
We discuss three reasons for this argument
Distribution of patient age in the 9 largest centers from the ovarian cancer study. Histograms, density estimates, and mean (standard deviation) are given per center
Distribution of maximum lesion diameter in the 9 largest centers from the ovarian cancer study. Histograms, density estimates, and median (interquartile range) are given per center
Median cohort size was 283 (range 25 to 25,056)
mean patient age varied between 45 and 71 years
the percentage of male patients varied between 45 and 74%
Pooled performance estimates were 0.77 for the c-statistic and 0.65 for the observed over expected (O:E) ratio. The O:E ratio < 1 suggests that the model tends to overestimate the risk of in-hospital mortality. The calibration slope < 1 suggests that risk estimates also tend to be too extreme (i.e., too low for low-risk patients and too high for high-risk patients)
Large heterogeneity in performance was observed
with 95% prediction intervals of 0.63 to 0.87 for the c-statistic
and of 0.34 to 0.66 for the calibration slope
95% prediction intervals indicate the performance that can be expected when evaluating the model in new clusters
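To make this concrete, the minimal R sketch below (with hypothetical per-center c-statistics and standard errors; rma and predict are from the metafor package) pools performance on the logit scale with a random-effects model and reports the pooled estimate, its 95% confidence interval, and the 95% prediction interval for a new center. All numbers are illustrative, not those of the study.

library(metafor)

# Illustrative per-center c-statistics and standard errors (hypothetical values)
c_stat <- c(0.74, 0.80, 0.77, 0.69, 0.83, 0.79)
se     <- c(0.030, 0.020, 0.040, 0.050, 0.030, 0.025)

# Pool on the logit scale so pooled values and intervals stay within (0, 1)
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))
yi  <- logit(c_stat)
sei <- se / (c_stat * (1 - c_stat))     # delta-method SE on the logit scale

fit  <- rma(yi = yi, sei = sei, method = "REML")  # random-effects meta-analysis
pred <- predict(fit)

# Pooled c-statistic with 95% CI and 95% prediction interval for a new center
round(inv_logit(c(pooled = pred$pred,
                  ci.lb  = pred$ci.lb, ci.ub = pred$ci.ub,
                  pi.lb  = pred$pi.lb, pi.ub = pred$pi.ub)), 2)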
When adjusting for differences in patient characteristics, part of the drop in performance disappeared: about one third of the decrease in discrimination at external validation appeared to be due to more homogeneous patient samples, which is plausible given that clinical trial datasets, often containing more homogeneous samples than observational datasets, were used for external validation
Such measurements are increasingly used in prediction modeling studies based on electronic health records
The c-statistic on the test set (5970 radiographs from 2256 patients; random train-test split) was 0.78
When non-fracture and fracture test set cases were matched on patient variables (such as age), the c-statistic for hip fracture decreased to 0.67. When matching also included hospital process variables (including scanner model), the c-statistic decreased further. This suggests that variables such as the type of scanner can inflate predictions for hip fracture
Reported methods included the Confusion Assessment Method (CAM). The frequency of assessment varied from once to more than once per day
These images were randomly selected after stratification by the classification given by a deep learning model (50 images labeled as positive for pneumonia, the others as negative)
There was a complete agreement for 52 cases
Pairwise kappa statistics varied between 0.38 and 0.80
Each patient was examined by one of 40 different clinicians across 19 hospitals
The researchers calculated the proportion of the variance in the measurements that is attributable to systematic differences between clinicians
For the binary variable indicating whether the patient was using hormonal therapy
the analysis suggested that 20% of the variability was attributed to the clinician doing the assessment
The percentage of patients reporting the use of hormonal therapy roughly varied between 0 and 20%
A subsequent survey among clinicians revealed that clinicians reporting high rates of hormonal therapy had assessed this more thoroughly, and that there was disagreement about the definition of hormonal therapy
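A minimal R sketch of this type of analysis (with simulated, hypothetical data; glmer is from the lme4 package) estimates the latent-scale intraclass correlation for the clinician from a random-intercept logistic model. The 20% figure from the study is not reproduced here.

library(lme4)

# Simulate illustrative data: 40 clinicians, 50 patients each, with
# clinician-specific tendencies to record "yes" for hormonal therapy use
set.seed(1)
n_clin  <- 40
n_pat   <- 50
clin    <- factor(rep(seq_len(n_clin), each = n_pat))
clin_fx <- rnorm(n_clin, mean = qlogis(0.10), sd = 0.9)  # between-clinician spread
y       <- rbinom(n_clin * n_pat, 1, plogis(clin_fx[as.integer(clin)]))

# Random-intercept logistic model with clinician as the clustering factor
fit <- glmer(y ~ 1 + (1 | clin), family = binomial)

# Latent-scale intraclass correlation: clinician variance / total variance,
# with pi^2 / 3 as the residual variance of the logistic distribution
var_clin <- as.numeric(VarCorr(fit)$clin)
round(var_clin / (var_clin + pi^2 / 3), 2)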
the radiologists evaluated the risk of MVI on a five-point scale (from definitely negative to definitely positive). Kappa values were between 0.42 and 0.47 for the features, and the c-statistic of the risk for MVI (with histopathology as the reference standard) also varied between radiologists
but the validity of predictions may be distorted
The models were developed using different algorithms (e.g., regression-based and machine learning methods) and were validated over time using similar data from patients admitted up to and including 2012
Although discrimination remained fairly stable
there was clear evidence of calibration drift for all models: the risk of the event became increasingly overestimated over time
Accompanying shifts in the patient population were noted: for example
the incidence of the event steadily decreased from 7.7 to 6.2%
the proportion of patients with a history of cancer or diabetes increased
and the use of various medications increased
observed mortality was 4.1% whereas EuroSCORE had an average estimated risk of 5.6%
observed mortality was 2.8% but the average estimated risk was 7.6%
The c-statistic showed no systematic deterioration
temporal changes were observed for several predictors (e.g., average age and prevalence of recent myocardial infarction increased) and surgical procedures (e.g., fewer isolated coronary artery bypass graft procedures). The authors further stated that surgeons may have been more willing to operate on patients due to improvements in anesthetic and surgical care
Such criteria lack scientific underpinning
We question the requirement from some journals that model development studies should include “an external validation”
this requirement may induce selective reporting of a favorable result in a single setting
Imagine a model that has been externally validated in tens of locations
Discrimination and calibration results were good
with limited heterogeneity between locations
This would obviously be an important and reassuring finding
there is still no 100% guarantee that the prediction model will also perform well in a new location, and it remains unclear how populations will change in the future
Such strategies help to ensure that prediction models stay up-to-date to support medical decision-making
Information from all examples (except the example on ovarian cancer) was based on information available in published manuscripts
The data used for the ovarian cancer example were not generated in the context of this study and were reused to describe differences between populations from different centers
The data cannot be shared for ethical/privacy reasons
Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis
Prognosis and prognostic research: validating a prognostic model
Prediction models need appropriate internal, internal-external, and external validation
Predictive analytics in health care: how can we know it works
Internal validation of predictive models: efficiency of some procedures for logistic regression analysis
Assessing the generalizability of prognostic information
What do we mean by validating a prognostic model
The myth of generalisability in clinical research and machine learning in health care
and outcomes in patients with traumatic brain injury in CENTER-TBI: a European prospective
Calibration: the Achilles heel of predictive analytics
External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges
Calibration of risk prediction models: impact on decision-analytic performance
Generalizability of Cardiovascular Disease Clinical Prediction Models: 158 Independent External Validations of 104 Unique Models
Validation of models to diagnose ovarian cancer in patients managed surgically or conservatively: multicentre cohort study
Clinical prediction models for mortality in patients with covid-19: external validation and individual participant data meta-analysis
Verification of the harmonization of human epididymis protein 4 assays
The heterogeneity of concentrated prescribing behavior: Theory and evidence from antipsychotics
Biases in electronic health record data due to processes within the healthcare system: retrospective observational study
Changing predictor measurement procedures affected the performance of prediction models in clinical examples
Impact of predictor measurement heterogeneity across settings on the performance of prediction models: A measurement error perspective
Deep learning predicts hip fracture using confounding patient and healthcare variables
Derivation of a simple clinical model to categorize patients probability of pulmonary embolism: increasing the models utility with the SimpliRED D-dimer
Critical issues in the evaluation and management of adult patients presenting to the emergency department with suspected pulmonary embolism
Clinical experience and pre-test probability scores in the diagnosis of pulmonary embolism
Systematic review of prediction models for delirium in the older adult inpatient
Accurate auto-labeling of chest X-ray images based on quantitative similarity to an explainable AI model
Screening for data clustering in multicenter studies: the residual intraclass correlation
Interobserver Variability and Diagnostic Performance of Gadoxetic Acid-enhanced MRI for Predicting Microvascular Invasion in Hepatocellular Carcinoma
Reynard C, Jenkins D, Martin GP, Kontopantelis E, Body R. Is your clinical prediction model past its sell by date? Emerg Med J. 2022. https://doi.org/10.1136/emermed-2021-212224
Feature robustness in non-stationary health records: caveats to deployable model performance in common clinical machine learning tasks
Detection of calibration drift in clinical prediction models to inform model updating
Continual updating and monitoring of clinical prediction models: time for dynamic prediction systems
Prediction models will be victims of their own success
Informative missingness in electronic health record systems: the curse of knowing
Calibration drift in regression and machine learning models for acute kidney injury
Dynamic trends in cardiac surgery: why the logistic EuroSCORE is no longer suitable for contemporary cardiac surgery and implications for future risk models
A clinical prediction model for outcome and therapy delivery in transplant-ineligible patients with myeloma (UK Myeloma Research Alliance Risk Profile): a development and validation study
Understanding receiver operating characteristic (ROC) curves
Assessment of heterogeneity in an individual participant data meta-analysis of prediction models: an overview and illustration
and evaluating clinical prediction models in an individual participant data meta-analysis
A framework for meta-analysis of prediction model studies with binary and time-to-event outcomes
Does ignoring clustering in multicenter data influence the performance of prediction models
Geographic and temporal validity of prediction models: different approaches were useful to examine model performance
Validation of prediction models: examining temporal and geographic stability of baseline risk and estimated covariate effects
Untapped potential of multicenter studies: a review of cardiovascular risk prediction models revealed inappropriate analyses and wide variation in reporting
Internal-external cross-validation helped to evaluate the generalizability of prediction models in large clustered datasets
Multicentre prospective validation of use of the Canadian C-Spine Rule by triage nurses in the emergency department
Minimum sample size for external validation of a clinical prediction model with a binary outcome
Transparent reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement
Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration
Transparent reporting of multivariable prediction models developed or validated using clustered data: TRIPOD-Cluster checklist
Transparent reporting of multivariable prediction models developed or validated using clustered data (TRIPOD-Cluster): explanation and elaboration
Methodological guidance for the evaluation and updating of clinical prediction models: a systematic review
Sample size considerations for the external validation of a multivariable prognostic model: a resampling study
A calibration hierarchy for risk models was defined: from utopia to empirical data
BVC was funded by the Research Foundation – Flanders (FWO; grant G097322N)
Internal Funds KU Leuven (grant C24M/20/064)
and University Hospitals Leuven (grant COPREDICT)
and MvS reviewed and edited the manuscript
All authors agree to take accountability for this work
The reuse of the ovarian cancer data for methodological purposes was approved by the Research Ethics Committee UZ / KU Leuven (number S64709)
and the need for individual information letters was waived
The authors declare that they have no competing interests
The assessment of calibration performance of risk prediction models based on regression or more flexible machine learning algorithms receives little attention
we argue that this needs to change immediately because poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making
We summarize how to avoid poor calibration at algorithm development and how to assess calibration at algorithm validation
emphasizing balance between model complexity and the available sample size
calibration curves require sufficiently large samples
Algorithm updating should be considered for appropriate support of clinical practice
Efforts are required to avoid poor calibration when developing prediction models
to evaluate calibration when validating models
The ultimate aim is to optimize the utility of predictive analytics for shared decision-making and patient counseling
These predictions may support clinical decision-making and better inform patients
Algorithms (or risk prediction models) should give higher risk estimates for patients with the event than for patients without the event (‘discrimination’)
discrimination is quantified using the area under the receiver operating characteristic curve (AUROC or AUC)
also known as the concordance statistic or c-statistic
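For illustration, the following minimal R sketch (with simulated, hypothetical data) computes the c-statistic via its rank-based (Mann–Whitney) formulation: the probability that a randomly selected patient with the event receives a higher predicted risk than a randomly selected patient without the event.

set.seed(7)
n     <- 1000
lp    <- rnorm(n)                  # hypothetical linear predictor
y     <- rbinom(n, 1, plogis(lp))  # simulated binary outcome
p_hat <- plogis(lp)                # predicted risks

c_statistic <- function(y, p_hat) {
  r  <- rank(p_hat)                # Wilcoxon / Mann-Whitney formulation
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
c_statistic(y, p_hat)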
In addition, it may be desirable to present classification performance at one or more risk thresholds, such as sensitivity, specificity, and predictive values. Calibration is another key aspect of performance that is often overlooked. In this paper, we explain the relevance of calibration, summarize how it can be assessed, and suggest solutions to prevent or correct poor calibration and thus make predictive algorithms more clinically relevant
Irrespective of how well the models can discriminate between treatments that end in live birth versus those that do not
it is clear that strong over- or underestimation of the chance of a live birth makes the algorithms clinically unacceptable
a strong overestimation of the chance of live birth after IVF would give false hope to couples going through an already stressful and emotional experience
Conversely, recommending treatment to a woman who actually has a favorable prognosis exposes the woman unnecessarily to possible harmful side effects
When using the traditional risk threshold of 20% to identify high-risk patients for intervention
QRISK2–2011 would select 110 per 1000 men aged between 35 and 74 years
NICE Framingham would select almost twice as many (206 per 1000 men) because a predicted risk of 20% based on this model actually corresponded to a lower event rate
This example illustrates that overestimation of risk leads to overtreatment
Illustrations of different types of miscalibration
Illustrations are based on an outcome with a 25% event rate and a model with an area under the ROC curve (AUC or c-statistic) of 0.71
Calibration intercept and slope are indicated for each illustrative curve
a General over- or underestimation of predicted risks
b Predicted risks that are too extreme or not extreme enough
we recommend against using the Hosmer–Lemeshow test to assess calibration
when the available data are very limited, it is reasonable for a model not to be developed at all
internal validation procedures can quantify the calibration slope
On the development data, calibration-in-the-large is irrelevant since the average of predicted risks will match the event rate. In contrast, calibration-in-the-large is highly relevant at external validation, where we often note a mismatch between the predicted and observed risks
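As a minimal sketch (in R, with simulated data standing in for an external validation set; all names and numbers are illustrative), calibration-in-the-large and the calibration slope can be estimated with logistic recalibration models, and a flexible calibration curve can be drawn with a loess smoother.

set.seed(2)
n     <- 2000
lp    <- rnorm(n, -2, 1.2)                     # linear predictor of an existing model
p_hat <- plogis(lp)                            # its predicted risks
y     <- rbinom(n, 1, plogis(0.4 + 0.7 * lp))  # outcomes from a miscalibrated truth

# Calibration-in-the-large: intercept with the linear predictor as an offset
cal_in_large <- glm(y ~ offset(lp), family = binomial)
# Calibration slope: coefficient of the linear predictor
cal_slope    <- glm(y ~ lp, family = binomial)

coef(cal_in_large)[1]   # ~0 means no systematic over- or underestimation
coef(cal_slope)["lp"]   # ~1 means risk estimates are not too extreme

# Flexible calibration curve via a loess smoother
fit_loess <- loess(y ~ p_hat, span = 0.75)
ord <- order(p_hat)
plot(p_hat[ord], predict(fit_loess)[ord], type = "l",
     xlab = "Predicted risk", ylab = "Observed proportion")
abline(0, 1, lty = 2)   # the diagonal represents perfect calibration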
Tufts PACE clinical predictive model registry: update 1990 through 2015
Prognostic models in obstetrics: available
A calibration hierarchy for risk models was defined: from utopia to empirical data
External validation of multivariable prediction models: a systematic review of methodological conduct and reporting
A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models
Reporting and methods in clinical prediction research: a systematic review
A spline-based tool to assess and visualize the calibration of multiclass risk predictions
Modern modeling techniques had limited external validity in predicting mortality from traumatic brain injury
Big data and predictive analytics: recalibrating expectations
A deep learning mammography-based model for improved breast cancer risk prediction
Predicting the chance of live birth for women undergoing IVF: a novel pretreatment counselling tool
Predicting the 10 year risk of cardiovascular disease in the United Kingdom: independent and external validation of an updated version of QRISK2
Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries
Strategies to diagnose ovarian cancer: new evidence from phase 3 of the multicentre international IOTA study
Prediction of indolent prostate cancer: validation and updating of a prognostic nomogram
Prospective validation of the good outcome following attempted resuscitation (GO-FAR) score for in-hospital cardiac arrest prognosis
Predicting the cumulative chance of live birth over multiple complete cycles of in vitro fertilization: an external validation study
Comparison of two models predicting IVF success; the effect of time trends on model performance
Poor performance of clinical prediction models: the harm of commonly applied methods
Clinical impact of prostate specific antigen (PSA) inter-assay variability on management of prostate cancer
Impact of predictor measurement heterogeneity across settings on performance of prediction models: a measurement error perspective
A novel multiple marker bioassay utilizing HE4 and CA125 for the prediction of ovarian cancer in patients with a pelvic mass
Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers
Sample size for binary logistic prediction models: beyond events per variable criteria
Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes
Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example
Van Calster B, van Smeden M, Steyerberg EW. On the variability of regression shrinkage methods for clinical prediction models: simulation study on predictive performance. arXiv. 2019; https://arxiv.org/abs/1907.11493
Validation and updating of predictive logistic regression models: a study on sample size and shrinkage
A review of statistical updating methods for clinical prediction models
Dynamic prediction modeling approaches for cardiac surgery
Prediction model to estimate presence of coronary artery disease: retrospective pooled analysis of existing cohorts
External validation and extension of a diagnostic model for obstructive coronary artery disease: a cross-sectional predictive evaluation in 4888 patients of the Austrian Coronary Artery disease Risk Determination In Innsbruck by diaGnostic ANgiography (CARDIIGAN) cohort
This work was developed as part of the international STRengthening Analytical Thinking for Observational Studies (STRATOS) initiative. The objective of STRATOS is to provide accessible and accurate guidance in the design and analysis of observational studies (http://stratos-initiative.org/)
Members of the STRATOS Topic Group ‘Evaluating diagnostic tests and prediction models’ are (alphabetically) Patrick Bossuyt
This work was funded by the Research Foundation – Flanders (FWO; grant G0B4716N) and Internal Funds KU Leuven (grant C24/15/037)
http://www.stratos-initiative.org
All authors reviewed and edited the manuscript and approved the final version
Detailed illustration of the assessment of calibration and model updating: the ROMA logistic regression model
Machine learning is increasingly being used to predict clinical outcomes
Most comparisons of different methods have been based on empirical analyses in specific datasets
We used Monte Carlo simulations to determine when machine learning methods perform better than statistical learning methods in a specific setting
We evaluated six learning methods: stochastic gradient boosting machines using trees as the base learners (boosted trees), random forests, neural networks, the lasso, ridge regression, and linear regression estimated using ordinary least squares (OLS)
Our simulations were informed by empirical analyses in patients with acute myocardial infarction (AMI) and congestive heart failure (CHF) and used six data-generating processes
each based on one of the six learning methods
to simulate continuous outcomes in the derivation and validation samples
The outcome was systolic blood pressure at hospital discharge
We applied the six learning methods in each of the simulated derivation samples and evaluated performance in the simulated validation samples
The primary observation was that neural networks tended to result in estimates with worse predictive accuracy than the other five methods in both disease samples and across all six data-generating processes
Boosted trees and OLS regression tended to perform well across a range of scenarios
Three of the above four studies focused on binary outcomes
while that of Shin and colleagues considered both binary and time-to-event outcomes
The relative performance of ML methods and conventional statistical methods for predicting continuous outcomes has received substantially less attention
In the current study we focus on prediction of a specific continuous outcome important in clinical medicine: systolic blood pressure
we summarize our findings and place them in the context of the existing literature
We conducted a set of empirical analyses to compare the performance of different machine and statistical learning methods in two different disease groups: patients hospitalized with acute myocardial infarction (AMI) and patients hospitalized with congestive heart failure (CHF)
In each disease group we examined the ability of different methods to predict a patient’s systolic blood pressure at hospital discharge
Model performance was assessed using independent validation samples
the derivation sample consisted of 8145 patients discharged alive from hospital between April 1
while the validation sample consisted of 4444 patients discharged alive from hospital between April 1
the derivation sample consisted of 7156 patients discharged alive from hospital between April 1
while the validation sample consisted of 6818 patients discharged alive from hospital between April 1
the derivation and validation samples came from distinct time periods
Vital signs and physical examination findings at presentation, and results of laboratory tests were collected for these samples
the outcome was a continuous variable denoting the patient’s systolic blood pressure at the time of hospital discharge
Differences in covariates between derivation and validation samples were tested using a t-test for continuous covariates and a Chi-squared test for binary variables
The use of the data in this project is authorized under Section 45 of Ontario’s Personal Health Information Protection Act (PHIPA) and does not require review by a Research Ethics Board
All research was performed in accordance with relevant guidelines and regulations
the grid searches resulted in the following values for the hyper-parameters: boosted trees (interaction depth: 4; shrinkage/learning rate: 0.065)
random forests (number of randomly sampled variables: 6; minimum terminal node size: 20)
neural networks (5 neurons in the hidden layer
from a grid search that considered the number of neurons ranging from 2 to 15 in increments of 1; weight decay parameter: 0.05)
random forests (number of randomly sampled variables: 8; minimum terminal node size: 20)
neural networks (6 neurons in the hidden layer
from a grid search that considered the number of neurons ranging from 2 to 15 in increments of 1; weight decay parameter: 0)
R2 was computed as the square of the Pearson correlation coefficient between observed and predicted discharge blood pressure
while MSE and MAE were estimated as \(\frac{1}{N}\sum\limits_{i = 1}^{N} {(Y_{i} - \hat{Y}_{i} )^{2} }\) and \(\frac{1}{N}\sum\limits_{i = 1}^{N} {|Y_{i} - \hat{Y}_{i} |}\)
where \(Y\) denotes the observed blood pressure and \(\hat{Y}\) denotes the estimated blood pressure
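For reference, a minimal R sketch of the three metrics as defined above (with simulated observed and predicted values; the function name perf_metrics is ours, not the study's):

perf_metrics <- function(y, y_hat) {
  c(R2  = cor(y, y_hat)^2,        # squared Pearson correlation
    MSE = mean((y - y_hat)^2),    # mean squared error
    MAE = mean(abs(y - y_hat)))   # mean absolute error
}

# Illustrative use with simulated discharge blood pressures
set.seed(3)
y     <- rnorm(1000, 130, 20)
y_hat <- y + rnorm(1000, 0, 15)
round(perf_metrics(y, y_hat), 3)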
we used implementations available in R statistical software (R version 3.6.1
For random forests we used the randomForest function from the randomForest package (version 4.6-14)
The number of trees (500) was the default in this implementation
For boosted trees we used the gbm function from the gbm package (version 2.5.1)
The number of trees (100) was the default in this implementation
We used the ols and rcs functions from the rms package (version 5.1-3.1) to estimate the OLS regression model incorporating restricted cubic regression splines
Feed-forward (or multilayer perceptron) neural networks with a single hidden layer were fit using the nnet package (version 7.3-12) with a linear activation function
Ridge regression and the lasso were implemented using the functions cv.glmnet (for estimating the λ parameter using tenfold cross-validation) and glmnet from the glmnet package (version 2.0-18)
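A minimal, hypothetical sketch of fitting the six methods with the packages named above is given below; the data frame, predictor names, and hyper-parameter values are illustrative and do not reproduce the study's tuned settings.

library(randomForest); library(gbm); library(rms); library(nnet); library(glmnet)

set.seed(4)
dat <- data.frame(sbp = rnorm(500, 130, 20),
                  x1 = rnorm(500), x2 = rnorm(500), x3 = rnorm(500),
                  x4 = rnorm(500), x5 = rnorm(500))
X <- as.matrix(dat[, -1])

rf_fit    <- randomForest(sbp ~ ., data = dat, ntree = 500, mtry = 3, nodesize = 20)
gbm_fit   <- gbm(sbp ~ ., data = dat, distribution = "gaussian",
                 n.trees = 100, interaction.depth = 4, shrinkage = 0.065)
ols_fit   <- ols(sbp ~ rcs(x1, 4) + rcs(x2, 4) + x3 + x4 + x5, data = dat)
nnet_fit  <- nnet(sbp ~ ., data = dat, size = 5, decay = 0.05,
                  linout = TRUE, trace = FALSE)   # linear output activation
lasso_fit <- cv.glmnet(X, dat$sbp, alpha = 1)     # lasso, lambda by 10-fold CV
ridge_fit <- cv.glmnet(X, dat$sbp, alpha = 0)     # ridge regression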
Performance in validation sample (Case study)
random forests resulted in predictions with the highest R2 (23.7%); however
differences between five of the six methods were minimal again (range: 22.2 to 23.7%)
Random forests resulted in estimates with the lowest MSE
while boosted trees resulted in estimates with the lowest MAE
MAE did not vary meaningfully across five of the six methods (range: 15.0 to 15.2)
the neural network had substantially worse performance than the other five methods across all three metrics
When comparing the three linear model-based approaches
neither of the two penalized approaches (lasso and ridge regression) had an advantage over conventional OLS regression in either disease sample
the lasso and ridge regression had very similar performance to each other
a tree-based machine learning method (either boosted trees or random forest) tended to result in estimates with the greatest predictive accuracy in the validation samples
differences between five of the methods were minimal
Neural networks resulted in estimates with substantially worse performance compared to the other five methods
We considered six different data-generating processes for each of the two diseases (AMI and CHF)
We describe the approach in detail for the AMI sample
An identical approach was used with the CHF sample
We used the derivation and validation samples described in the empirical analyses above
We made one modification to the validation samples described above
The validation sample used above consisted of 4444 subjects (AMI validation sample) and 6818 (CHF validation sample)
In order to remove variation in external performance due to small sample sizes
we sampled with replacement from each validation sample to create validation samples consisting of 100,000 subjects
the method was fit in the derivation sample
The fitted model was then applied to both the derivation sample and the validation sample
Using the model/algorithm fit in the derivation sample
a predicted outcome (discharge systolic blood pressure) was obtained for each subject in each of the two datasets (derivation and validation samples)
we proceeded as follows: Using these predicted blood pressures at discharge
a continuous outcome was simulated for each subject as follows
a residual or prediction error was computed as the difference between the true observed discharge blood pressure and the estimated blood pressure obtained from the fitted model
a residual was drawn with replacement from the empirical distribution of residuals estimated in the previous step
the sampled residual was added to the estimated discharge blood pressure
This quantity is the simulated outcome for the given patient
This process was then repeated in the validation sample to obtain a simulated outcome for each subject in the validation sample
Note that the given prediction model was only fit once (in the derivation sample) but was then applied in both the derivation and validation samples to obtain estimated values of discharge blood pressure
These simulated outcomes were then used as the ‘true’ outcomes in all subsequent analyses
The above process was used when the data-generating process was based on random forests
When the data-generating process was based on OLS regression
we used a modified version of this process
Instead of sampling from the empirical distribution of residuals
we sampled residuals from a normal distribution with mean zero and standard deviation equal to that estimated for error distribution from the OLS model
These sampled residuals were then added to the estimated discharge blood pressure to produce simulated continuous outcomes
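A compact sketch of this outcome-simulation step (in R, assuming hypothetical objects fit_dgp, deriv, and valid with an outcome column sbp; the function simulate_outcome is ours) could look as follows:

simulate_outcome <- function(fit_dgp, deriv, valid, gaussian_residuals = FALSE) {
  # Predicted discharge blood pressure in both samples from the single fit
  pred_d <- predict(fit_dgp, newdata = deriv)
  pred_v <- predict(fit_dgp, newdata = valid)

  # Residuals estimated in the derivation sample only
  res_d <- deriv$sbp - pred_d

  draw <- function(n) {
    if (gaussian_residuals) rnorm(n, 0, sd(res_d))  # OLS-based process
    else sample(res_d, n, replace = TRUE)           # empirical residuals
  }

  list(deriv_sim = pred_d + draw(nrow(deriv)),
       valid_sim = pred_v + draw(nrow(valid)))
}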
For a given pair of derivation and validation samples
we fit each of the six statistical/machine learning methods (boosted trees
and OLS regression) in the derivation sample and then applied the fitted model to the validation sample
an estimated discharge blood pressure for each of the six prediction methods
The performance of the predictions obtained using each method was assessed using the three metrics described above (R2
for a given data-generating process and a given prediction method we obtained 1000 values of R2
when outcomes were simulated in the derivation and validation samples using random forests
we assessed the predictive accuracy of boosted trees
This process was repeated using the datasets in which outcomes were simulated using the five other data-generating processes
Performance in AMI sample (External validation)
Across all six data-generating processes and across all three performance metrics
the use of neural networks tended to result in predictions with the lowest accuracy
Even when outcomes were simulated using a neural network
the other five methods tended to result in predictions with higher accuracy than did the use of neural networks
The difference in performance between neural networks and that of the other five methods was substantially greater than the differences amongst the other five methods
When outcomes were generated using boosted trees
the use of boosted trees tended to result in estimates with the highest R2
while OLS regression tended to result in estimates with comparable performance
When outcomes were generated using an OLS regression model
the use of OLS regression tended to result in estimates with the highest R2
The performance of OLS regression was followed by that of boosted trees and the two penalized regression methods
When outcomes were generated using a penalized regression method
the three linear regression models tended to result in estimates with the highest R2
when outcomes were generated using random forests
the use of boosted trees and random forests tended to result in estimates with the highest R2
When considering the three linear regression-based approaches
there was no advantage to using a penalized regression approach compared to using OLS regression
the differences between the five non-neural network approaches tended to be minimal
regardless of the data-generating processes
the use of OLS regression tended to perform well
and there were no meaningful benefits to using a different approach
MSE and MAE of estimates obtained using neural networks displayed high variability across the 1000 simulation replicates
Performance in CHF sample (External validation)
when outcomes were simulated using random forests
the use of random forests tended to result in estimates with the highest R2
although the performance of boosted trees was comparable
When outcomes were generated using a linear regression-based approach
then the three linear regression-based approaches tended to result in estimates with the highest R2
Similar results were observed when MSE and MAE were used to assess performance accuracy
when considering the three linear regression-based estimation methods
there were rarely meaningful benefits to using a penalized estimation method compared to using OLS regression
There is a growing interest in comparing the relative performance of different machine and statistical learning methods for predicting patient outcomes
To better understand differences in the relative performance of competing learning methods for predicting continuous outcomes
we used two empirical comparisons and Monte Carlo simulations using six different data-generating processes
each based upon a different learning method
These simulations enabled us to examine the performance of methods different from those under which the data were generated compared to the method that was used to generate the data
In both of the empirical analyses and in all six sets of Monte Carlo simulations
the performance of neural networks was substantially poorer than that of the other five learning methods
the number of subjects in both of our derivation samples and in both of our validation samples was substantially higher than in these previous studies
An advantage to the current study was its use of simulations to compare the relative performance of different learning methods for predicting blood pressure
A strength of the design of these simulations is that they were based on two real data sets
each with a realistic correlation structure between predictors and with realistic associations between predictors and outcomes
we were able to simulate datasets reflective of those that would be seen in specific clinical contexts
both the sizes of the simulated dataset and the number of predictors that we considered are reflective of what is often encountered in clinical research
Some might argue that the number of predictors (33 and 28 in the AMI and CHF studies respectively) is relatively high for conventional regression modeling
and relatively low for modern machine learning techniques
the use of boosting resulted in improved performance
The objective of the current study was not to develop a new learning method nor was it to improve existing learning methods17
Our objective was to compare the relative performance of different learning methods for predicting a continuous outcome
while there is a growing number of studies comparing different learning methods
the large majority of these studies rely on empirical comparisons using a single dataset
A strength of the current study is its use of Monte Carlo simulations to conduct these comparisons systematically
A methodological contribution of the current study is providing a framework for Monte Carlo simulations that allows for a more informed comparison of different learning methods
Because we knew which learning method was the true model that generated the outcomes
the performance of each of the other five methods could be compared to that of the true method
we demonstrated that when outcomes were generated using boosted trees
the use of OLS regression had performance comparable to that of boosted trees for predicting blood pressure (in the AMI sample)
we found that a default implementation of a neural network had substantially poorer performance compared to five other learning methods for predicting discharge systolic blood pressure in patients hospitalized with heart disease
This finding was observed both in two sets of empirical analyses and in six sets of Monte Carlo simulations
We also observed that there was no meaningful advantage to the use of penalized linear models (i.e.
the lasso or ridge regression) compared to using OLS regression
Boosted trees tended to have the best performance of the different machine learning methods for the number of covariates studied
Investigators interested in predicting blood pressure may often be able to limit their attention to OLS regression and boosted trees and select the method that performs best in their specific context
We encourage researchers to apply our simulation framework to other diseases and other empirical datasets to examine whether our findings persist across different settings and diseases
The use of data in this project was authorized under Section 45 of Ontario’s Personal Health Information Protection Act
which does not require review by a Research Ethics Board
This study did not include experiments involving human subjects or tissue samples
The data sets used for this study were held securely in a linked, de-identified form and analysed at ICES. While data sharing agreements prohibit ICES from making the data set publicly available, access may be granted to those who meet pre-specified criteria for confidential access, available at www.ices.on.ca/DAS
If you are interested in requesting ICES Data & Analytic Services
please contact ICES DAS (e-mail: das@ices.on.ca or at 1-888-480-1327)
Random forest versus logistic regression: A large-scale benchmark experiment
Comparison of artificial neural network and logistic regression models for prediction of outcomes in trauma patients: A systematic review and meta-analysis
conventional statistical models for predicting heart failure readmission and mortality
Effectiveness of public report cards for improving the quality of cardiac care: the EFFECT study: a randomized trial
Regression trees for predicting mortality in patients with cardiovascular disease: what improvement is achieved by using ensemble-based methods?
Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N, small p” setting
and multivariate adaptive regression splines for predicting AMI mortality
In Machine Learning: Proceedings of the Thirteenth International Conference 148–156 (Morgan Kaufmann, 1996)
Additive logistic regression: A statistical view of boosting (with discussion)
Propensity score estimation with boosted regression for evaluating causal effects in observational studies
The Elements of Statistical Learning 2nd edn
Machine learning compared with conventional statistical models for predicting myocardial infarction readmission and mortality: A systematic review
A plea for neutral comparison studies in computational sciences
Ten quick tips for machine learning in computational biology
Introduction to Neural Networks with Java 2nd edn
Predicting increased blood pressure using machine learning
Predicting hypertension using machine learning: Findings from Qatar Biobank Study
Predicting systolic blood pressure using machine learning
In 7th International Conference on Information and Automation for Sustainability 1–6 (2014)
Predicting blood pressure from physiological index data using the SVR algorithm
Modern modelling techniques are data hungry: A simulation study for predicting dichotomous endpoints
Random Forest vs Logistic Regression: Binary classification for heterogeneous datasets
A comparison of machine learning techniques for customer churn prediction
Predictive analytics in health care: How can we know it works?
This study was supported by ICES, which is funded by an annual grant from the Ontario Ministry of Health (MOH) and the Ministry of Long-Term Care (MLTC)
As a prescribed entity under Ontario’s privacy legislation
ICES is authorized to collect and use health care data for the purposes of health system analysis
Secure access to these data is governed by policies and procedures that are approved by the Information and Privacy Commissioner of Ontario
This research was supported by an operating grant from the Canadian Institutes of Health Research (CIHR) (PJT-166161)
Austin is supported in part by Mid-Career Investigator awards from the Heart and Stroke Foundation
Harrell's work on this paper was supported by CTSA award No. UL1 TR002243 from the National Center for Advancing Translational Sciences
Its contents are solely the responsibility of the authors and do not necessarily represent official views of the National Center for Advancing Translational Sciences or the National Institutes of Health
The use of data in this project was authorized under section 45 of Ontario’s Personal Health Information Protection Act
coded the simulations and wrote the first draft of the manuscript
contributed to the design of the simulations
provided clinical expertise and revised the manuscript
The authors declare no competing interests
Artificial Intelligence and Statistics: Just the Old Wine in New Wineskins
A Commentary on Artificial Intelligence and Statistics: Just the Old Wine in New Wineskins?
by Faes, L., Sim, D. A., van Smeden, M., Held, U., Bossuyt, P. M., and Bachmann, L. M. (2022). Front. Digit. Health 4:833912. doi: 10.3389/fdgth.2022.833912
We write to expand on Faes et al.'s recent publication “Artificial intelligence and statistics: Just the old wine in new wineskins?” (1)
The authors rightly address a lack of consensus regarding terminology between the statistics and machine learning fields
Guidance is needed to provide a more unified way of reporting and comparing study results between the different fields
Major differences can be observed in the measures commonly used across these axes to evaluate predictive performance in the statistics and machine learning fields
We here highlight key measures focusing on discriminative ability and clinical utility [or effectiveness (6)]. Table 1 provides a non-exhaustive overview
All measures relate to the evaluation of probability predictions for binary outcomes
They are derived from the 2 × 2 confusion matrix for specific or consecutive decision thresholds
Evaluation measures from statistics and machine learning fields
sensitivity (the true positive rate) and specificity (the true negative rate) can be considered independent of the event rate
Some measures are considered outdated in the classic statistical learning field
while still popular in the machine learning field
One such measure is the crude accuracy (the fraction of correct classifications): for example, in a setting with a 1% event rate, simply classifying all subjects as “low risk” already yields 99% accuracy
Decision analytical approaches move away from pure discrimination and toward clinical utility. Net benefit is the most popular among some recently proposed measures for clinical utility (4, 5). It is derived from a decision analytical framework and weighs sensitivity and specificity by clinical consequences. Net benefit has a clear interpretation when compared to treat-all and treat-none strategies (4, 5)
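As an illustration, the R sketch below (with simulated, hypothetical data) computes net benefit at a chosen threshold and compares it with treat-all and treat-none strategies; the threshold-dependent weight reflects the assumed harm-to-benefit ratio.

net_benefit <- function(y, p_hat, threshold) {
  n  <- length(y)
  tp <- sum(y == 1 & p_hat >= threshold)   # true positives at this threshold
  fp <- sum(y == 0 & p_hat >= threshold)   # false positives
  w  <- threshold / (1 - threshold)        # weight reflecting harm-to-benefit ratio
  tp / n - w * fp / n
}

set.seed(5)
n     <- 5000
lp    <- rnorm(n, qlogis(0.10), 1)   # hypothetical linear predictor
y     <- rbinom(n, 1, plogis(lp))    # simulated outcomes
p_hat <- plogis(lp)                  # calibrated predicted risks

t <- 0.15
c(model      = net_benefit(y, p_hat, t),
  treat_all  = net_benefit(y, rep(1, n), t),  # classify everyone as high risk
  treat_none = 0)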
We recommend that the aim of the evaluation of a model should determine the focus on clinical performance (discrimination, calibration, or clinical utility), with quantification by appropriate measures
All authors contributed to the article and approved the submitted version
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations
Any product that may be evaluated in this article
or claim that may be made by its manufacturer
is not guaranteed or endorsed by the publisher
Artificial intelligence and statistics: just the old wine in new wineskins
Measures to summarize and compare the predictive capacity of markers
Calibration: the achilles heel of predictive analytics
Decision curve analysis: a novel method for evaluating prediction models
Net benefit approaches to the evaluation of prediction models
From biomarkers to medical tests: the changing landscape of test evaluation
The need for reorientation toward cost-effective prediction: comments on 'Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond' by M. J. Pencina et al.
Using relative utility curves to evaluate risk prediction
The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets
The relationship between Precision-Recall and ROC curves
In: Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, PA: Association for Computing Machinery (2006)
de Hond AAH, van Calster B and Steyerberg EW (2022) Commentary: Artificial Intelligence and Statistics: Just the Old Wine in New Wineskins?
Received: 19 April 2022; Accepted: 03 May 2022; Published: 20 May 2022
Copyright © 2022 de Hond, van Calster and Steyerberg. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms
*Correspondence: Anne A. H. de Hond, a.a.h.de_hond@lumc.nl
The use of evidence from clinical trials to support decisions for individual patients is a form of “reference class forecasting”: implicit predictions for an individual are made on the basis of outcomes in a reference class of “similar” patients treated with alternative therapies
Evidence based medicine has generally emphasized the broad reference class of patients qualifying for a trial
Yet patients in a trial (and in clinical practice) differ from one another in many ways that can affect the outcome of interest and the potential for benefit
is to narrow the reference class to yield more patient specific effect estimates to support more individualized clinical decision making
This article will review fundamental conceptual problems with the prediction of outcome risk and heterogeneity of treatment effect (HTE)
as well as the limitations of conventional (one-variable-at-a-time) subgroup analysis
It will also discuss several regression based approaches to “predictive” heterogeneity of treatment effect analysis
including analyses based on “risk modeling” (such as stratifying trial populations by their risk of the primary outcome or their risk of serious treatment-related harms) and analysis based on “effect modeling” (which incorporates modifiers of relative effect)
It will illustrate these approaches with clinical examples and discuss their respective strengths and vulnerabilities
Series explanation: State of the Art Reviews are commissioned on the basis of their relevance to academics and specialists in the US and internationally
For this reason they are written predominantly by US authors
Contributors: The concepts of this manuscript were discussed among all authors
DMK prepared the initial draft of the manuscript
Substantial revisions were made by all authors
Funding: This work was partially supported through two Patient-Centered Outcomes Research Institute (PCORI) grants (the Predictive Analytics Resource Center (PARC) (SA.Tufts.PARC.OSCO.2018.01.25) and Methods Award (ME-1606-35555))
as well as by the National Institutes of Health (U01NS086294)
Competing interests: All authors have read and understood BMJ policy on declaration of interests and declare no competing interests
Provenance and peer review: Commissioned; externally peer reviewed
Disclosures: All statements in this report
are solely those of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI)
Early detection of severe asthma exacerbations through home monitoring data in patients with stable mild-to-moderate chronic asthma could help to timely adjust medication
We evaluated the potential of machine learning methods compared to a clinical rule and logistic regression to predict severe exacerbations
We used daily home monitoring data from two studies in asthma patients (development: n = 165 and validation: n = 101 patients)
Machine learning models (XGBoost and one class SVM) and a logistic regression model provided predictions based on peak expiratory flow and asthma symptoms
These models were compared with an asthma action plan rule
Severe exacerbations occurred in 0.2% of all daily measurements in the development (154/92,787 days) and validation cohorts (94/40,185 days)
In the validation cohort, the AUC was 0.85 (0.82–0.87) for the best performing XGBoost model and 0.88 (0.86–0.90) for logistic regression
The XGBoost model provided overly extreme risk estimates
whereas the logistic regression underestimated predicted risks
Sensitivity and specificity were better overall for XGBoost and logistic regression compared to one class SVM and the clinical rule
We conclude that ML models did not beat logistic regression in predicting short-term severe asthma exacerbations based on home monitoring data
Clinical application remains challenging in settings with low event incidence and high false alarm rates with high sensitivity
and even fewer have been externally validated
Figure: panel (c) shows use of \(\upbeta \)2 reliever (No M&E = not used in morning and evening; Yes M&E = used in both morning and evening) over time for three example patients. The case with no exacerbations (top figure) is most prevalent in the data
ROC-curve for predictions from XGBoost and the logistic regression model
The sensitivity and specificity of the one class SVM and clinical prediction rule are also plotted on the left curve
On the left the points corresponding to the 0.001 (‘t = 0.001’) and 0.002 (‘t = 0.002’) probability thresholds are plotted for the XGBoost and logistic regression model
On the right the points corresponding to the thresholds resulting in 138 positive predictions (‘t for 138 pos pred’
equaling the clinical rule positive predictions) are plotted for the XGBoost and logistic regression model
For the 0.2% threshold, the XGBoost model obtained a sensitivity of 0.59, a specificity of 0.89, a positive predictive value (PPV) of 0.02, and a negative predictive value (NPV) of 1 (Table 3)
With 138 positive predictions as for the clinical rule
the XGBoost and logistic regression models again had a higher sensitivity and PPV
The differences between the AUCs of the best performing logistic regression model with one lag and the XGBoost model with five lags were still significant (p = 0.02)
we aimed to assess the performance of ML techniques and classic models for short-term prediction of severe asthma exacerbations based on home monitoring data
ML and logistic regression both reached higher discriminative performance than a previously proposed simple clinical rule
Logistic regression provided slightly better discriminative performance than the XGBoost algorithm
logistic regression still produced many false positives at high levels of sensitivity
This finding may be explained by the (lack of) complexity of the data that was studied
An advantage of ML techniques is the natural flexibility they offer to model complex (e.g., non-linear) associations, whereas logistic regression techniques have the advantage of being easily interpretable
Our findings illustrate that the flexibility provided by ML models may not always be needed to arrive at the best performing prediction model for medical data
The benefits of ML methods may differ between settings and should be further investigated
Improvement in discriminative ability may be achieved by reducing the noise in the exacerbation event at the time of data collection
the recording of severe exacerbations in our dataset might have been incomplete or there might have been a delay between the recording of the exacerbations and their true onset
better predicting variables of exacerbations may be needed
Our findings form a counterexample by showing that inherently interpretable techniques such as logistic regression may outperform ML for certain application types and clinical settings
Interpretability is especially relevant for clinical settings
as physicians often prefer interpretable models to assist in clinical decision making
Our findings therefore contribute to answering the question when and how to apply ML methods safely and effectively
the data used in this study contained few missing values
The quality of the data was therefore high
This implies that the registration method is unlikely to affect our conclusions
ML models may not outperform classical regression prediction models in predicting short-term asthma exacerbations based on home monitoring data. A simple regression model outperforms a simple rule, but clinical application remains challenging due to the high false alarm rate associated with the low probability thresholds required for high sensitivity
All patients had stable mild-to-moderate chronic asthma
Both studies were conducted in an asthma clinic in New Zealand on patients referred by their general practitioners
patients recorded their peak expiratory flow and use of \(\upbeta \)2-reliever (yes/no) in the morning and evening of every trial day in diaries
Nocturnal awakening (yes/no) was recorded in the morning (see below)
All predictors were measured or calculated daily
the average of morning and evening peak expiratory flow (PEF, measured in liters per minute) and the use of \(\upbeta \)2-reliever in morning and evening (used in both morning and evening/used in morning or evening/not used in morning and evening) were considered as potential predictors. Over a rolling window, we also calculated summary statistics such as the maximum and minimum and added these as predictors
This rolling window consisted of the current day and all 6 preceding days
The PEF personal best was determined per patient during a run-in period of 4 weeks and added to the models
we constructed and added first differences (the difference in today’s measurement with respect to yesterday’s measurement) and lags (yesterday’s measurement) for PEF
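To make the predictor construction concrete, the sketch below shows one way to derive these daily features per patient with pandas; the diary column names ('pef_am', 'pef_pm', 'reliever_am', 'reliever_pm') and the run-in length are illustrative assumptions, not the authors' actual code.

```python
# Sketch of the per-patient feature construction described above (assumed column names).
import pandas as pd

def build_features(df: pd.DataFrame, run_in_days: int = 28) -> pd.DataFrame:
    """df: one patient's diary, one row per day, in chronological order."""
    out = df.copy()
    out["pef_mean"] = out[["pef_am", "pef_pm"]].mean(axis=1)

    # Combined reliever use: 2 = used both morning and evening, 1 = either, 0 = neither
    out["reliever_use"] = out["reliever_am"].astype(int) + out["reliever_pm"].astype(int)

    # Rolling window over the current day and the 6 preceding days
    roll = out["pef_mean"].rolling(window=7, min_periods=1)
    out["pef_roll_max"] = roll.max()
    out["pef_roll_min"] = roll.min()

    # Personal best PEF, determined during the 4-week run-in period
    out["pef_personal_best"] = out["pef_mean"].iloc[:run_in_days].max()

    # Lag (yesterday's value) and first difference (today minus yesterday)
    out["pef_lag1"] = out["pef_mean"].shift(1)
    out["pef_diff1"] = out["pef_mean"] - out["pef_lag1"]
    return out
```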
Demographics and descriptive statistics of predictors (i.e.
PEF and use of β2-reliever) were calculated for each individual patient over their respective observational periods
The XGBoost model estimates many decision-trees sequentially
These decision tree predictions are combined into an ensemble model to arrive at the final predictions
The sequential training makes the XGBoost model faster and more efficient than other tree-based algorithms
however, tuning an XGBoost model may become increasingly difficult, which is less of an issue with other tree-based models such as random forest
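As a minimal sketch of how such a boosted tree ensemble compares with a plain logistic regression, the example below fits both on the same synthetic data; the hyperparameter values and the simulated features are illustrative assumptions, not the tuned settings used in the study.

```python
# Illustrative comparison of a gradient-boosted tree ensemble and logistic regression.
import numpy as np
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 4))          # e.g. mean PEF, PEF lag, reliever use, personal best (assumed)
y = rng.binomial(1, 0.02, size=n)    # rare outcome: exacerbation within 2 days

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

xgb = XGBClassifier(
    n_estimators=200,      # number of sequentially added trees
    max_depth=3,           # depth of each tree
    learning_rate=0.05,    # shrinkage applied to each tree's contribution
    eval_metric="logloss",
)
xgb.fit(X_tr, y_tr)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("XGBoost AUC:", roc_auc_score(y_te, xgb.predict_proba(X_te)[:, 1]))
print("Logistic AUC:", roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1]))
```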
Second, we trained an outlier detection model (one-class SVM with a radial basis function kernel)34
The one class SVM aims to find a frontier that delimits the contours of the original distribution
it can identify whether a new data point falls outside of the original distribution and should therefore be classified as ‘irregular’
An advantage of this model is that it is particularly apt at dealing with the low event rate in the asthma data
A downside of this model is that it does not provide probability estimates like a regular support vector machine and we therefore must base its predictive performance on its classification metrics only (see below)
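A minimal sketch of this second model, assuming scikit-learn's OneClassSVM stands in for the one-class SVM with a radial basis kernel: the model is fitted on days without an exacerbation and then labels new days as inside (+1) or outside (-1) the learned distribution, without producing probabilities.

```python
# One-class SVM outlier detection sketch (assumed feature layout and nu value).
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_regular = rng.normal(loc=0.0, size=(2000, 4))        # monitoring features on days without exacerbation
X_new = rng.normal(loc=0.0, scale=2.0, size=(10, 4))   # new days to classify

scaler = StandardScaler().fit(X_regular)
ocsvm = OneClassSVM(kernel="rbf", nu=0.02, gamma="scale")  # nu ~ expected outlier fraction (assumption)
ocsvm.fit(scaler.transform(X_regular))

# +1 = inside the learned frontier ("regular"), -1 = outside ("irregular", i.e. predicted exacerbation)
labels = ocsvm.predict(scaler.transform(X_new))
print(labels)
```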
it may however not provide the level of complexity needed to adequately model certain prediction problems
which comes at a cost of the interpretability of these methods
Confidence intervals were obtained through bootstrapping (based on 1000 iterations)
Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated for all models at the following probability thresholds (the cut-off point at which probabilities are converted into binary outcomes): 0.1% and 0.2%
These were chosen as they bracket the prevalence rate of the outcome in our data
For a fair comparison with the clinical rule
we also calculated these performance metrics (sensitivity
etc.) for the XGBoost and logistic regression models at the probability thresholds producing the same number of positive predictions as produced by the one class SVM and the clinical rule
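The sketch below illustrates this threshold-based evaluation: classification metrics at a fixed probability cut-off (for example the 0.2% threshold) with bootstrap confidence intervals. It is a generic illustration; variable names such as `y_test` and `model_probs` are placeholders, not objects from the study's code.

```python
# Threshold-based classification metrics with bootstrap confidence intervals.
import numpy as np

def threshold_metrics(y_true, p_hat, threshold):
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(p_hat) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else np.nan,
        "specificity": tn / (tn + fp) if (tn + fp) else np.nan,
        "ppv": tp / (tp + fp) if (tp + fp) else np.nan,
        "npv": tn / (tn + fn) if (tn + fn) else np.nan,
    }

def bootstrap_ci(y_true, p_hat, threshold, metric, n_boot=1000, seed=0):
    y_true, p_hat = np.asarray(y_true), np.asarray(p_hat)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)   # resample observations with replacement
        stats.append(threshold_metrics(y_true[idx], p_hat[idx], threshold)[metric])
    return np.nanpercentile(stats, [2.5, 97.5])

# Example at the 0.2% threshold (placeholders):
# metrics = threshold_metrics(y_test, model_probs, 0.002)
# ci = bootstrap_ci(y_test, model_probs, 0.002, "sensitivity")
```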
We performed a sensitivity analysis for predicting exacerbations within 4 and 8 days as opposed to 2 days (Table 4)
This enabled us to study the effect of a variation in the length of the outcome window on the models’ discrimination and calibration capacities
we performed a sensitivity analysis to assess the effect of the number of lags on model performance
we varied the number of lags from 1 to 5 for the models predicting exacerbations within 2 days
For the XGBoost and logistic regression model
All analyses were performed in Python 3.8.0, with R 3.6.3 plug-ins to obtain calibration results. The key functions and libraries can be found in Additional file 2
Ethics approval was obtained for the original data collection
These studies were conducted in accordance with the principles of the Declaration of Helsinki on biomedical research
The protocols were approved by the Otago and Canterbury ethics committees and all patients gave written informed consent prior to participation
The datasets analyzed during the current study are not publicly available due to privacy restrictions
but are available to reviewers on reasonable request
Remote patient monitoring: A comprehensive study
Honkoop, P. J., Taylor, D. R., Smith, A. D., Snoeck-Stroband, J. B. & Sont, J. K. Early detection of asthma exacerbations by using action points in self-management plans. Eur. Respir. J. 41, 53–59. https://doi.org/10.1183/09031936.00205911 (2013)
Fine, M. J. et al. A prediction rule to identify low-risk patients with community-acquired pneumonia. N. Engl. J. Med. 336, 243–250. https://doi.org/10.1056/NEJM199701233360402 (1997)
Derivation of a simple clinical model to categorize patients' probability of pulmonary embolism: Increasing the model's utility with the SimpliRED d-dimer
British Thoracic Society. British Guideline on the Management of Asthma. https://doi.org/10.1136/thx.2008.097741 (2019)
Mak, R. H. et al. Use of crowd innovation to develop an artificial intelligence-based solution for radiation therapy targeting. JAMA Oncol. 5, 654–661. https://doi.org/10.1001/jamaoncol.2019.0159 (2019)
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118. https://doi.org/10.1038/nature21056 (2017)
McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94. https://doi.org/10.1038/s41586-019-1799-6 (2020)
Cearns, M., Hahn, T. & Baune, B. T. Recommendations and future directions for supervised machine learning in psychiatry. Transl. Psychiatry 9, 271. https://doi.org/10.1038/s41398-019-0607-2 (2019)
Neuhaus, A. H. & Popescu, F. C. Sample size, model robustness, and classification accuracy in diagnostic multivariate neuroimaging analyses. Biol. Psychiatry 84, e81–e82. https://doi.org/10.1016/j.biopsych.2017.09.032 (2018)
Chen, P.-H.C., Liu, Y. & Peng, L. How to develop machine learning models for healthcare. Nat. Mater. 18, 410–414. https://doi.org/10.1038/s41563-019-0345-0 (2019)
Altman, D. G., Vergouwe, Y., Royston, P. & Moons, K. G. M. Prognosis and prognostic research: Validating a prognostic model. BMJ 338, b605. https://doi.org/10.1136/bmj.b605 (2009)
Wynants, L., Smits, L. J. M. & Van Calster, B. Demystifying AI in healthcare. BMJ 370, m3505. https://doi.org/10.1136/bmj.m3505 (2020)
In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)
Christodoulou, E. et al. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J. Clin. Epidemiol. 110, 12–22. https://doi.org/10.1016/j.jclinepi.2019.02.004 (2019)
Gravesteijn, B. Y. et al. Machine learning algorithms performed no better than regression models for prognostication in traumatic brain injury. J. Clin. Epidemiol. 122, 95–107. https://doi.org/10.1016/j.jclinepi.2020.03.005 (2020)
Nusinovici, S. et al. Logistic regression was as good as machine learning for predicting major chronic diseases. J. Clin. Epidemiol. 122, 56–69. https://doi.org/10.1016/j.jclinepi.2020.03.002 (2020)
Martin, A. et al. Development and validation of an asthma exacerbation prediction model using electronic health record (EHR) data. J. Asthma 57, 1339–1346. https://doi.org/10.1080/02770903.2019.1648505 (2020)
Sanders, S., Doust, J. & Glasziou, P. A systematic review of studies comparing diagnostic clinical prediction rules with clinical judgment. PLoS ONE 10, e0128233. https://doi.org/10.1371/journal.pone.0128233 (2015)
Satici, C. et al. Performance of pneumonia severity index and CURB-65 in predicting 30-day mortality in patients with COVID-19. Int. J. Infect. Dis. 98, 84–89. https://doi.org/10.1016/j.ijid.2020.06.038 (2020)
Obradović, D. et al. Correlation between the Wells score and the Quanadli index in patients with pulmonary embolism. Clin. Respir. J. 10, 784–790. https://doi.org/10.1111/crj.12291 (2016)
Winters, B. D. et al. Technological distractions (Part 2): A summary of approaches to manage clinical alarms with intent to reduce alarm fatigue. Crit. Care Med. 46, 130–137. https://doi.org/10.1097/ccm.0000000000002803 (2018)
Mori, T. & Uchihira, N. Balancing the trade-off between accuracy and interpretability in software defect prediction. Empir. Softw. Eng. 24, 779–825. https://doi.org/10.1007/s10664-018-9638-1 (2019)
Johansson, U., Sönströd, C., Norinder, U. & Boström, H. Trade-off between accuracy and interpretability for predictive in silico modeling. Future Med. Chem. 3, 647–663. https://doi.org/10.4155/fmc.11.23 (2011)
Wallace, B. C. & Dahabreh, I. J. Improving class probability estimates for imbalanced data. Knowl. Inf. Syst. 41, 33–52. https://doi.org/10.1007/s10115-013-0670-6 (2014)
Van Calster, B. et al. Calibration: The Achilles heel of predictive analytics. BMC Med. 17, 230. https://doi.org/10.1186/s12916-019-1466-7 (2019)
Honkoop, P. J. et al. MyAirCoach: The use of home-monitoring and mHealth systems to predict deterioration in asthma control and the occurrence of asthma exacerbations; study protocol of an observational study. BMJ Open 7, e013935. https://doi.org/10.1136/bmjopen-2016-013935 (2017)
Finkelstein, J. & Jeong, I. C. Machine learning approaches to personalize early prediction of asthma exacerbations. Ann. N. Y. Acad. Sci. 1387, 153–165. https://doi.org/10.1111/nyas.13218 (2017)
Sanchez-Morillo, D., Fernandez-Granero, M. A. & Leon-Jimenez, A. Use of predictive algorithms in-home monitoring of chronic obstructive pulmonary disease and asthma: A systematic review. Chron. Respir. Dis. 13, 264–283. https://doi.org/10.1177/1479972316642365 (2016)
Smith, A. D., Cowan, J. O., Brassett, K. P., Herbison, G. P. & Taylor, D. R. Use of exhaled nitric oxide measurements to guide treatment in chronic asthma. N. Engl. J. Med. 352, 2163–2173. https://doi.org/10.1056/NEJMoa043596 (2005)
Taylor, D. R. et al. Asthma control during long-term treatment with regular inhaled salbutamol and salmeterol. Thorax 53, 744–752. https://doi.org/10.1136/thx.53.9.744 (1998)
Smith, A. E., Nugent, C. D. & McClean, S. I. Evaluation of inherent performance of intelligent medical decision support systems: Utilising neural networks as an example. Artif. Intell. Med. 27, 1–27. https://doi.org/10.1016/s0933-3657(02)00088-x (2003)
Tree boosting with XGBoost: why does XGBoost win "every" machine learning competition?
In Proceedings of the International Joint Conference on Neural Networks
Schober, P. & Vetter, T. R. Logistic regression in medical research. Anesth. Analg. 132, 365–366. https://doi.org/10.1213/ANE.0000000000005247 (2021)
Clinical Prediction Models (Springer Nature
Taylor for contributing to the data collection
Department of Information Technology and Digital Innovation
Clinical AI Implementation and Research Lab
analyzed the data and drafted the manuscript
All authors read and approved the final manuscript
DOI: https://doi.org/10.1038/s41598-022-24909-9
Clinical prediction models are often not evaluated properly in specific settings or updated
These key steps are needed such that models are fit for purpose and remain relevant in the long-term
We aimed to present an overview of methodological guidance for the evaluation (i.e.
validation and impact assessment) and updating of clinical prediction models
We systematically searched nine databases from January 2000 to January 2022 for articles in English with methodological recommendations for the post-derivation stages of interest
Qualitative analysis was used to summarize the 70 selected guidance papers
Key aspects for validation are the assessment of statistical performance using measures for discrimination (e.g., the c-statistic) and calibration (e.g.,
calibration-in-the-large and calibration slope)
For assessing impact or usefulness in clinical decision-making
recent papers advise using decision-analytic measures (e.g.
the Net Benefit) over simplistic classification measures that ignore clinical consequences (e.g.
Commonly recommended methods for model updating are recalibration (i.e., adjustment of the intercept or baseline hazard and/or slope) and revision (i.e., re-estimation of individual predictor effects)
Additional methodological guidance is needed for newer types of updating (e.g.
meta-model and dynamic updating) and machine learning-based models
Substantial guidance was found for model evaluation and more conventional updating of regression-based models
An important development in model evaluation is the introduction of a decision-analytic framework for assessing clinical usefulness
Consensus is emerging on methods for model updating
Framework from derivation to implementation of clinical prediction models
The focus of this systematic review is on model evaluation (validation and impact assessment) and updating
Further clarification of terminologies and methods for model evaluation may benefit applied researchers
We therefore aim to provide an overview of methodological guidance for the post-derivation stages of clinical prediction models
we focus on methods for examining an existing model’s validity in specific settings
we outline consensus on definitions to support the methodological discussion
and we highlight gaps that require further research
Articles were included if they (1) provided methodological “guidance” (i.e., recommendations) on model validation, impact assessment,
or model updating; (2) were written in English; and (3) were published between January 2000 and January 2022
Also excluded were papers that discussed only one statistical technique or provided guidance not generalizable outside of a specific disease area
Initial selection based on title and abstract was conducted independently by two researchers (M.A.E.B
and any discrepancies were resolved through consensus meetings
methodological topic(s) discussed) were extracted
and thematic analysis was used for summarization
Full text assessment and data extraction were performed by one researcher (M.A.E.B.)
The results were reviewed by three researchers (E.W.S.
Ethics approval was not required for this review
A summary of methodological guidance for model validation
Internal validation is the minimum requirement for clinical prediction models
External validation is recommended to evaluate model generalizability in different but plausibly related settings
Designs for validation studies differ in strength (e.g.,
temporal validation is a weak form of validation)
Examination of two validation aspects (discrimination and calibration) is recommended for assessing statistical performance irrespective of the type of validation
Clinical usefulness is a common area between validation and impact assessment
and its examination is advised for assessing the clinical performance of models intended to be used for medical decision-making
Several performance aspects can be examined in a validation study, with various measures proposed for each (see Additional file 3 for a more complete list):
The minimum threshold for useful models can only be defined by examining decision-analytic measures (e.g., the Net Benefit)
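As an illustration of the measures named above, the sketch below computes the c-statistic, calibration-in-the-large, the calibration slope, and the Net Benefit at a chosen risk threshold for a validation sample, assuming scikit-learn and statsmodels are available; it is a generic example, not code from any of the reviewed guidance papers.

```python
# Common validation measures for a binary-outcome prediction model.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def validation_measures(y, p, threshold=0.1):
    """y: observed binary outcome; p: predicted risk; threshold: decision threshold (assumed)."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-8, 1 - 1e-8)
    lp = np.log(p / (1 - p))   # linear predictor (logit of predicted risk)

    c_stat = roc_auc_score(y, p)

    # Calibration-in-the-large: intercept of a logistic model with the linear predictor as offset
    citl = sm.GLM(y, np.ones((len(y), 1)), family=sm.families.Binomial(), offset=lp).fit().params[0]

    # Calibration slope: slope of a logistic model with the linear predictor as covariate
    slope = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit().params[1]

    # Net Benefit at the chosen threshold (decision-analytic measure)
    treat = p >= threshold
    tp = np.mean(treat & (y == 1))
    fp = np.mean(treat & (y == 0))
    net_benefit = tp - fp * threshold / (1 - threshold)

    return {"c_statistic": c_stat, "calibration_in_the_large": citl,
            "calibration_slope": slope, "net_benefit": net_benefit}
```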
A summary of methodological guidance for the assessment of a model’s impact
Potential impact can be examined through clinical performance measures (e.g.
Decision Curve Analysis) or health economic analysis (e.g.
Assessing actual impact requires comparative empirical studies
such as cluster randomized trials or other designs (e.g.
The literature distinguishes four types of model updating for regression-based models (Fig. 4). Updating methods for more computationally-intensive models (e.g., deep neural networks) were not identified.
A summary of methodological guidance for model updating
Simple updating (e.g., recalibration) is often sufficient when the differences between the derivation and new data are minimal; when differences are larger, more extensive updating (e.g., partial to full revision) may be appropriate
Model extension allows the incorporation of new markers in a model
Existing models can also be aggregated to develop a meta-model that can be further updated for a new dataset
Updating can also be done periodically or continuously
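For a regression-based model, the two simplest updating methods mentioned above can be sketched as follows, assuming the original model's predicted risks are available for the new data; this is an illustrative implementation with statsmodels, not guidance-endorsed code.

```python
# Recalibration of a logistic regression model in a new setting:
# (1) re-estimate the intercept only; (2) re-estimate intercept and slope of the linear predictor.
import numpy as np
import statsmodels.api as sm

def recalibrate(y_new, p_old):
    """y_new: observed outcomes in the new setting; p_old: original model's predicted risks."""
    y = np.asarray(y_new, dtype=float)
    p = np.clip(np.asarray(p_old, dtype=float), 1e-8, 1 - 1e-8)
    lp = np.log(p / (1 - p))   # original linear predictor

    # (1) Update the intercept only, keeping the original linear predictor as an offset
    intercept_only = sm.GLM(y, np.ones((len(y), 1)),
                            family=sm.families.Binomial(), offset=lp).fit()
    p_intercept = 1 / (1 + np.exp(-(intercept_only.params[0] + lp)))

    # (2) Update intercept and slope of the linear predictor (logistic recalibration)
    full_recal = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
    a, b = full_recal.params
    p_recal = 1 / (1 + np.exp(-(a + b * lp)))
    return p_intercept, p_recal
```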
Clinical prediction models are evidence-based tools that can aid in personalized medical decision-making
their applicability and usefulness are ideally evaluated prior to their clinical adoption
Suboptimal performance may be improved by model adjustment or re-specification
to incorporate additional information from a specific setting or to include new markers
We aimed to provide a summary of contemporary methodological guidance for the evaluation (validation and impact assessment) and updating of clinical prediction models
this is the first comprehensive review of guidance for these post-derivation stages
Guidance for updating is limited to regression-based models only
the validation of dynamic prediction models
We did not identify caveats for model updating when the clinical setting is not ideal (e.g.
very effective treatments are used for high-risk patients defined by the prediction model)
We also did not identify methods for retiring or replacing predictors that may have lost their clinical significance over time
Further research and additional guidance are necessary in these areas
which may help standardize concepts and methods
The post-derivation stages of clinical prediction models are important for optimizing model performance in new settings that may be contextually different from or beyond the scope of the initial model development
Substantial methodological guidance is available for model evaluation (validation and impact assessment) and updating
we found that performance measures based on decision analysis provide additional practical insight beyond statistical performance (discrimination and calibration) measures
we identified various methods including recalibration
Additional guidance is necessary for machine learning-based models and relatively new types of updating
Our summary can be used as a starting point for researchers who want to perform post-derivation research or critique published studies of similar nature
All data generated and analyzed during this review are included in the manuscript and its additional files
AUC: Area Under the Receiver Operating Characteristic curve
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
PREDICT: a new UK prognostic model that predicts survival following surgery for invasive breast cancer
NABON. Dutch Guideline Breast Cancer (Landelijke richtlijn borstkanker). [Available from: https://richtlijnendatabase.nl/richtlijn/borstkanker/adjuvante_systemische_therapie.html]
Prediction models for cardiovascular disease risk in the general population: systematic review
Shared decision making: really putting patients at the centre of healthcare
The predictive accuracy of PREDICT: a personalized decision-making tool for southeast Asian women with breast cancer
Impact of provision of cardiovascular disease risk estimates to healthcare professionals and patients: a systematic review
revision and combination of prognostic survival models
Incorporating progesterone receptor expression into the PREDICT breast prognostic model
Inclusion of KI67 significantly improves performance of the PREDICT prognostication and prediction model for early breast cancer
PREDICT plus: development and validation of a prognostic model for early breast cancer that includes HER2
Prognosis research strategy (PROGRESS) 3: prognostic model research
Prognostic models for breast cancer: a systematic review
Prediction models for patients with esophageal or gastric cancer: a systematic review and meta-analysis
Reporting of prognostic clinical prediction models based on machine learning methods in oncology needs to be improved
TRIPOD statement: a preliminary pre-post analysis of reporting and methods of prediction models
Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement
A systematic review finds prediction models for chronic kidney disease were poorly reported and often developed using inappropriate methods
Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement
Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal
Net reclassification improvement: computation, interpretation,
and controversies: a literature review and clinician's guide
The net reclassification index (NRI): a misleading measure of prediction improvement even with independent test data sets
Net risk reclassification p values: valid or misleading
A scoping review of interactive and personalized web-based clinical tools to support treatment decision making in breast cancer
Moorthie S. What is clinical utility?: PHG Foundation - University of Cambridge. [Available from: https://www.phgfoundation.org/explainer/clinical-utility]
step-by-step guide to interpreting decision curve analysis
Prediction models for the risk of gestational diabetes: a systematic review
Meta-analysis and aggregation of multiple published prediction models
Validity of prediction models: when is a model clinically useful
External validation is necessary in prediction research: a clinical example
On criteria for evaluating models of absolute risk
updating and impact of clinical prediction rules: a review
Prognosis and prognostic research: application and impact of prognostic models in clinical practice
Evaluating the prognostic value of new cardiovascular biomarkers
Prognostic models: a methodological framework and review of models for breast cancer
Traditional statistical methods for evaluating prediction models are uninformative as to clinical value: towards a decision analytic framework
Assessing the performance of prediction models: a framework for traditional and novel measures
Everything you always wanted to know about evaluating prediction models (but were too afraid to ask)
and assessing the incremental value of a new (bio)marker
Towards better clinical prediction models: seven steps for development and an ABCD for validation
Risk prediction models: a framework for assessment
External validation of a Cox prognostic model: principles and methods
A new framework to enhance the interpretation of external validation studies of clinical prediction models
Con: Most clinical risk scores are useless
calibration: the Achilles heel of predictive analytics
Key steps and common pitfalls in developing and validating risk models
Methodological standards for the development and evaluation of clinical prediction rules: a review of the literature
A framework for the evaluation of statistical prediction models
Minimum sample size for external validation of a clinical prediction model with a continuous outcome
External validation of prognostic models: what
Minimum sample size calculations for external validation of a clinical prediction model with a time-to-event outcome
Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review
Translating clinical research into clinical practice: impact of using prediction rules to make decisions
Assessing the incremental value of diagnostic and prognostic markers: a review and illustration
Added predictive value of high-throughput molecular data to clinical data and its validation
Assessing new biomarkers and predictive models for use in clinical practice: a clinician's guide
A framework for quantifying net benefits of alternative prognostic models
Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond
Assessing the incremental predictive performance of novel biomarkers over standard predictors
Framework for the impact analysis and implementation of clinical prediction rules (CPRs)
Beyond diagnostic accuracy: the clinical utility of diagnostic tests
Evaluating the impact of prediction models: lessons learned
Criteria for evaluation of novel markers of cardiovascular risk: a scientific statement from the American Heart Association
Ten steps towards improving prognosis research
Good practice guidelines for the use of statistical regression models in economic evaluations
A simple framework to identify optimal cost-effective risk thresholds for a single screen: comparison to decision curve analysis
Updating methods improved the performance of a clinical prediction model in new patients
Aggregating published prediction models with individual participant data: a comparison of different approaches
A closed testing procedure to select an appropriate method for updating prediction models
Updating risk prediction tools: a case study in prostate cancer
Methods for updating a risk prediction model for cardiac surgery: a statistical primer
Validation and updating of risk models based on multinomial logistic regression
Improving prediction models with new markers: a comparison of updating strategies
Individual participant data (IPD) meta-analyses of diagnostic and prognostic modeling studies: guidance on their use
Improved prediction by dynamic modeling: an exploratory study in the adult cardiac surgery database of the Netherlands Association for Cardio-Thoracic Surgery
Adaptation of clinical prediction models for application in local settings
Dynamic models to predict health outcomes: current status and methodological challenges
Updating clinical prediction models: an illustrative case study
Comparison of dynamic updating strategies for clinical prediction models
Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers
The relationship between precision-recall and ROC curves; 2006
Evaluating a new marker for risk prediction using the test tradeoff: an update
The summary test tradeoff: a new measure of the value of an additional risk prediction marker
Evaluating a new marker for risk prediction: decision analysis to the rescue
Two further applications of a model for binary regression
Conditional logit analysis of qualitative choice behavior
Regression modelling strategies for improved prognostic prediction
A unified inference procedure for a class of measures to assess improvement in risk prediction systems with survival data
Concordance probability and discriminatory power in proportional hazards regression
Time-dependent ROC curves for censored survival data and a diagnostic marker
Net reclassification improvement and integrated discrimination improvement require calibrated models: relevance from a marker and model perspective
A note on the evaluation of novel biomarkers: do not rely on integrated discrimination improvement and net reclassification index
Can machine-learning improve cardiovascular risk prediction using routine clinical data
Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence
This work was financially supported by Health~Holland grant number LSHM19121 (https://www.health-holland.com) received by MKS
the Netherlands Cancer Institute – Antoni van Leeuwenhoek Hospital
Division of Psychosocial Research and Epidemiology
The Netherlands Cancer Institute – Antoni van Leeuwenhoek Hospital
critical revision of the manuscript; WS: design
critical revision of the manuscript; EWS: design
Description of methods and related information
Overview of selected articles included in the review
Summary of performance measures from the selected methodological literature
DOI: https://doi.org/10.1186/s12874-022-01801-8
Clinical prediction models (CPMs) are tools that compute the risk of an outcome given a set of patient characteristics and are routinely used to inform patients
Although much hope has been placed on CPMs to mitigate human biases
CPMs may potentially contribute to racial disparities in decision-making and resource allocation
and scholars have called for eliminating race as a variable from CPMs
others raise concerns that excluding race may exacerbate healthcare disparities and this controversy remains unresolved
The Guidance for Unbiased predictive Information for healthcare Decision-making and Equity (GUIDE) provides expert guidelines for model developers and health system administrators on the transparent use of race in CPMs and mitigation of algorithmic bias across contexts developed through a 5-round
modified Delphi process from a diverse 14-person technical expert panel (TEP)
Deliberations affirmed that race is a social construct and that the goals of prediction are distinct from those of causal inference
and emphasized: the importance of decisional context (e.g.
shared decision-making versus healthcare rationing); the conflicting nature of different anti-discrimination principles (e.g.
anticlassification versus antisubordination principles); and the importance of identifying and balancing trade-offs in achieving equity-related goals with race-aware versus race-unaware CPMs for conditions where racial identity is prognostically informative
The GUIDE comprises 31 key items in the development and use of CPMs in healthcare
and offers guidance for examining subgroup invalidity and using race as a variable in CPMs
This GUIDE presents a living document that supports appraisal and reporting of bias in CPMs to support best practice in CPM development and use
no direct guidance clarifies how prediction modelers should approach race as a candidate variable in CPMs
nor how health systems and clinicians should consider the role of race in choosing and using CPMs
either with individual patients or at the population level
The purpose of this Guidance for Unbiased predictive Information for healthcare Decision-making and Equity (GUIDE) is to offer a set of practical recommendations to evaluate and address algorithmic bias (here defined as differential accuracy of CPMs across racial groups) and algorithmic fairness (here defined as clinical decision-making that does not systematically favor members of one protected class over another)
with special attention to potential harms that may result from including or omitting race
We approach this with a shared understanding that race is a social construct
as well as an appreciation of the profound injuries that interpersonal and structural racism cause to individual and population health
This guidance is meant to be responsive to widespread differences in health by race that are historically and structurally rooted
which have been exacerbated by racial bias embedded in the U.S
and offer a starting point for the development of best practices
we provide consensus-based: (1) recommendations regarding the use of race in CPMs
(2) guidance for model developers on identifying and addressing algorithmic bias (differential accuracy of CPMs by race)
and (3) guidance for model developers and policy-makers on recognizing and mitigating algorithmic unfairness (differential access to care by race)
Given the widespread impact of CPMs in healthcare
the GUIDE is intended to provide a first step to assist CPM developers
regulatory agencies and professional medical societies who share responsibility for use and implementation of CPMs
Since different considerations apply where CPMs are either directly used to allocate scarce resources or used to align decisions with a patient’s own values and preferences
separate guidelines were developed for these different contexts
predictive effects of variables within a valid CPM may even have the opposite sign as the true causal effect
“risk factors” measured in observational studies may associate with health outcomes for many reasons aside from direct causation
Valid prediction only requires these associations are stable across other similarly selected population samples
not that they correspond to causal effects
causal modeling typically requires specification of a primary exposure variable-of-interest and a set of (often unverifiable) causal assumptions based on content knowledge external to the data
we affirm that race is a social construct and
can only cause outcomes indirectly through the health effects of racism
it may be correlated with many unknown or poorly-measured variables that affect health outcomes (e.g.
genetic ancestry) and might account for differences in outcomes in groups defined by self-identified race
being an indirect cause of health outcomes via racism or acting as a proxy for other unknown/unmeasured causes of health outcomes)
race is often empirically observed to be an important predictor of health outcomes
Inappropriately racializes medicine: Herein
for which there is now broad interdisciplinary consensus
there is a long tradition of pseudoscientific biological determinism and racial essentialism that connects race to inherited biological distinctions—explaining or justifying differences in medical outcomes
This perspective is seen as damaging to a decent
and just society—creating a broad taboo against any use of ‘race’ that might be misconstrued to provide even indirect or accidental support for these racist notions
Using race in CPMs may also serve to further entrench racialization and conflict with the goal of a post-racial future
Though we know of no direct evidence of this
incorporation of ‘race’ may undermine trust not only in prediction itself but more broadly in the medical system for patients of all races
There is broad agreement that individuals with similar outcome risks should be treated similarly regardless of race
We call this principle “equal treatment for equal risk.” When race has no prognostic information independent of relevant clinical characteristics
it would simply not be included in the model, since only characteristics contributing to prognosis are included in CPMs
Controversy arises only when race is predictive of differences in outcome risk
despite clinical characteristics that appear similar
Omitting race systematically under-estimates diabetes risk in Black patients
deprioritizing their care compared to Whites at similar risk
Including race better aligns predicted with observed risks in Black patients
supporting similar treatment for similar risk
models restricted from using any prognostic candidate variable won’t be more accurate than models considering all available information
the race-aware model may also be disparity-reducing compared to the race-unaware model
If one were offering a lifestyle modification program to the top risk-quarter (>~10% diabetes-risk threshold)
Black patients would comprise 31% of the treatment-prioritized group with a race-unaware model
The race-unaware model would prioritize lower-risk White ahead of higher-risk Black patients
While the causes of excess risk in some minorities may be unclear
this excess risk is no less important for decision-making than the risk associated with other variables in the model
when Black people are found to be at higher-risk than White people
leaving race out of risk calculations does not treat Black and White people equally—it systematically ignores those (unknown/unmeasured) causes of greater risk that are more common in Black than White people
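This point can be illustrated with a generic simulation (arbitrary parameter values, not the diabetes example's data): when one group carries excess risk that the other predictors do not capture, a model that omits the group indicator under-estimates risk in that group and selects fewer of its members when a risk threshold is used to prioritise care.

```python
# Generic sketch: group-aware vs group-unaware logistic models under excess group risk.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 100_000
group = rng.binomial(1, 0.2, n)                      # 1 = group carrying excess risk (assumed)
x = rng.normal(size=n)                               # other measured clinical predictors
p_true = 1 / (1 + np.exp(-(-2.5 + 0.8 * x + 0.9 * group)))
y = rng.binomial(1, p_true)

aware = LogisticRegression().fit(np.column_stack([x, group]), y)
unaware = LogisticRegression().fit(x.reshape(-1, 1), y)
p_aware = aware.predict_proba(np.column_stack([x, group]))[:, 1]
p_unaware = unaware.predict_proba(x.reshape(-1, 1))[:, 1]

for name, p in [("group-aware", p_aware), ("group-unaware", p_unaware)]:
    selected = p >= np.quantile(p, 0.75)             # prioritise each model's top risk-quarter
    share = group[selected].mean()                   # share of the higher-risk group among those prioritised
    calib = p[group == 1].mean() / y[group == 1].mean()  # mean predicted vs observed risk in that group
    print(f"{name}: higher-risk group share of prioritised = {share:.2f}, "
          f"predicted:observed risk in that group = {calib:.2f}")
```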
Label bias is a particular concern because this bias is not detectable with the conventional set of performance metrics that attend to model fit
Table 4 provides recommendations for the use of race in CPMs in non-polar (shared decision-making) contexts
where predictive accuracy is the paramount modeling priority
The TEP considered how to balance anticlassification principles (which preclude use of race) and antisubordination principles (which may require use of race to prevent minoritized groups from being disadvantaged in some circumstances)
Given the importance of accurate predictions to enabling patient autonomy in decision-making (Item 16)
the TEP found that inclusion of race may be justified when the predictive effects are statistically robust
and go beyond other ascertainable attributes (Item 17)
The precise threshold where the statistical benefits of improved calibration will outweigh anticlassification principles may differ across clinical contexts
This ‘prevalence-sensitivity’ can be shown in simulations using prediction models that are known to have no model invalidity (i.e.
they correspond exactly to the data-generating process)
The Table below provides an illustration where the predictive performance of the data-generating model (i.e.
a model with no model invalidity) is measured across two groups with different burdens of the same risk factors
we propose that good calibration is a more appropriate and useful measure of subgroup invalidity
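The prevalence-sensitivity can be sketched with a small simulation: outcomes are generated from a known model, and that same (true) model is then evaluated in two groups that differ in their distribution (burden) of the risk factor. Discrimination differs between the groups even though the model is correctly specified and well calibrated in both; the parameter values are arbitrary assumptions.

```python
# Simulation sketch: discrimination is case-mix dependent even for the true model.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 200_000

def simulate_group(mean_x, sd_x):
    x = rng.normal(loc=mean_x, scale=sd_x, size=n)   # risk-factor distribution in this group
    p_true = 1 / (1 + np.exp(-(-2.0 + 1.0 * x)))     # the data-generating model itself
    y = rng.binomial(1, p_true)
    return p_true, y

for label, mean_x, sd_x in [("group A", 0.0, 1.0), ("group B (higher burden)", 1.5, 0.7)]:
    p, y = simulate_group(mean_x, sd_x)
    print(f"{label}: c-statistic = {roc_auc_score(y, p):.2f}, "
          f"O:E ratio = {y.mean() / p.mean():.2f}")  # O:E near 1 indicates good calibration-in-the-large
```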
This was shown to result in predictions that systematically under-estimate need in Black versus White patients
minoritized communities (including Black and Asian patients) are at higher-risk for diabetes and some cancers than White patients (e.g.
even when the causes of a risk difference or disparity are incompletely understood
it is often implausible to attribute this difference in risk to label bias
the plausible direction of any label bias is in the opposite direction of the disparity and there are many other potential explanations available for the observed risk differences
and other key issues such as biased training data
and the unique concerns of different decision-making contexts
Our GUIDE targets these gaps with a set of consensual premises and actionable recommendations
these are fundamentally causal definitions of fairness
which are challenging to satisfy in practice because causality is generally unidentifiable in observational data (without strong unverifiable assumptions)
and because race might be inadvertently reconstructed through proxies even when not explicitly encoded
particularly when high-dimensional machine learning approaches are applied
we offer a pragmatic approach based on an assessment of observable outcomes that seeks to maximize benefits for the population (utilitarianism) and at the same time to reduce disparities (egalitarianism)
Future work should encourage more routine use of variables for which race may be a proxy—such as social determinants of health or genetic ancestry; better collection of more representative training data; and evolution in how health systems populate electronic health records and other healthcare databases to ensure these data consistently reflect self-reporting
We note that CMS is putting regulatory pressure behind the collection of data on social drivers of health
with quality measures that require screening in five domains: food insecurity
the GUIDE provides a framework for identifying
understanding and deliberating about the trade-offs inherent in these issues when developing CPMs
We present it to support those developing or implementing CPMs in their goal of providing unbiased predictions to support fair decision-making
and for the broader community to better understand these issues
The project was approved by the Tufts Health Sciences Institutional Review Board
Informed consent was obtained from participants
Experts were invited based on professional expertise
a TEP co-chair (JKP or KL) alongside a professional facilitator (Mark Adkins
PhD) moderated consensus-building and voting
KL) presented the topic with illustrative cases uniquely developed for that meeting
Using a 5-point scale (1 = Strongly Disagree to 5 = Strongly Agree),
experts were asked to rate their level of agreement with the item’s importance and feasibility of assessment
the vote (rating) was carried out anonymously using the MeetingSphere software
after which ratings and comments were shared with the TEP in real time
Deliberation and discussion followed the first round of voting at each meeting
ratings on items had to show “broad agreement”, defined as meeting
or exceeding the pre-specified supermajority threshold of 75% of the TEP endorsing the item as “agree” or “strongly agree” (4 or 5)
This supermajority threshold was required to prevent the majority from eroding the influence of minority voices
without requiring strict unanimity for all items
Items without broad agreement were always discussed and revised
and TEP members could nominate additional items to be considered
and improvement of items; dissenting views were acknowledged and incorporated where possible
Revised items were then voted on a second time
experts had the opportunity to refine items and revise their judgments prior to subsequent meetings where re-rating occurred
All analyses of item scores and comments were performed independently by the professional facilitator using MeetingSphere
discussed and agreed to the content and final wording of the guidelines
The final GUIDE represents points of convergence across the TEP who held diverse opinions and approaches
especially to mitigating bias in shared decision-making contexts
and these data provided insight into the values and reasoning underlying the opinions of patient stakeholders pertaining to inclusion of race in CPMs
Patient stakeholder feedback was presented to the TEP for incorporation in the final GUIDE during the final meeting
Further information on research design is available in the Nature Research Reporting Summary linked to this article
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study
New creatinine- and cystatin C-based equations to estimate GFR without race
Reconsidering the consequences of using race to estimate kidney function
Assessment of adherence to reporting guidelines by commonly used clinical prediction models from a single vendor: a systematic review
Race and genetic ancestry in medicine—a time for reckoning with racism
Will precision medicine move us beyond race
How to act upon racism-not race-as a risk factor
Racial disparities in low-risk prostate cancer
Geographic distribution of racial differences in prostate cancer mortality
Diabetes screening by race and ethnicity in the United States: equivalent body mass index and age thresholds
Racial and ethnic bias in risk prediction models for colorectal cancer recurrence when race and ethnicity are omitted as predictors
Projecting individualized absolute invasive breast cancer risk in African American women
Race adjustments in clinical algorithms can help correct for racial disparities in data quality
Clinical implications of removing race-corrected pulmonary function tests for African American patients requiring surgery for lung cancer
Methods for using race and ethnicity in prediction models for lung cancer screening eligibility
Using prediction-models to reduce persistent racial/ethnic disparities in draft 2020 USPSTF lung-cancer screening guidelines
Patient-centered appraisal of race-free clinical risk assessment
Adding a coefficient for race to the 4K score improves calibration for black men
Using measures of race to make clinical predictions: decision making
Equity in essence: a call for operationalising fairness in machine learning for healthcare
Addressing racism in preventive services: a methods project for the U.S
(Agency for Healthcare Research and Quality
Research Protocol: impact of healthcare algorithms on racial and ethnic disparities in health and healthcare
Agency for Healthcare Research and Quality
Assessing Algorithmic Bias and Fairness in Clinical Prediction Models for Preventive Services: A Health Equity Methods Project for the U.S
Office for Civil Rights) (Office for Civil Rights
Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities
Dissecting racial bias in an algorithm used to manage the health of populations
(Center for Applied Artificial Intelligence
Statement on principles for responsible algorithmic systems
How to regulate evolving AI health algorithms
CMS Innovation Center Tackles Implicit Bias
In Health Affairs Forefront (Health Affairs
Leveraging Affordable Care Act Section 1557 to address racism in clinical algorithms
in Health Affairs Forefront (Health Affairs
HHS proposes revised ACA anti-discrimination rule
Prevention of bias and discrimination in clinical practice algorithms
a new company forms to vet models and root out weaknesses
The Supreme Court’s rulings on race neutrality threaten progress in medicine and health
National health care leaders will develop AI code of conduct
Centers for Medicare & Medicaid Services
Office of the Secretary & Department of Health and Human Services
Nondiscrimination in health programs and activities
Blueprint for an AI bill of rights: making automated systems work for the American people
White House Office of Science and Technology Policy) (United States Government
Conceptualising fairness: three pillars for medical algorithms and health equity
Reporting guideline for the early stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI
Evaluation and mitigation of racial bias in clinical machine learning models: scoping review
Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare
Implications of predicting race variables from medical images
Health Data, Technology, and Interoperability: Certification Program Updates, Algorithm Transparency, and Information Sharing (HTI-1) Proposed Rule (Office of the National Coordinator for Health Information Technology)
Removing structural racism in pulmonary function testing-why nothing is ever easy
When personalization harms performance: reconsidering the use of group attributes in prediction
Reevaluating the role of race and ethnicity in diabetes screening
and the equitable allocation of scarce COVID-19 treatments
Equal treatment for equal risk: should race be included in allocation algorithms for Covid-19 therapies
Reassessment of the role of race in calculating the risk for urinary tract infection: a systematic review and meta-analysis
Prediction of vaginal birth after cesarean delivery in term gestations: a calculator without race and ethnicity
Flawed racial assumptions in eGFR have care implications in CKD
In Am J Manag Care (The American Journal of Managed Care
Implications of race adjustment in lung-function equations
An ethical analysis of clinical triage protocols and decision-making frameworks: what do the principles of justice
and a disability rights approach demand of us
PROBAST: a tool to assess the risk of bias and applicability of prediction model studies
Guidance for developers of health research reporting guidelines
An experimental application of the delphi method to the use of experts
Potential biases in machine learning algorithms using electronic health record data
Comparison of methods to reduce bias from clinical prediction models of postpartum depression
Ensuring fairness in machine learning to advance health equity
Implementing machine learning in health care - addressing ethical challenges
Addressing bias in artificial intelligence in health care
Updated guidance on the reporting of race and ethnicity in medical and science journals
Qualitative Research & Evaluation Methods (SAGE Publications
The Coding Manual for Qualitative Researchers (SAGE Publications
Office of Information and Regulatory Affairs & Executive Office of the President. Revisions to OMB's statistical policy directive no. 15: standards for maintaining, collecting,
and presenting federal data on race and ethnicity
Fair prediction with disparate impact: a study of bias in recidivism prediction instruments
Inherent trade-offs in the fair determination of risk scores
The American civil rights tradition: anticlassification or antisubordination
Reflection on modern methods: generalized linear models for prognosis and intervention-theory
practice and implications for machine learning
Prediction or causality? A scoping review of their conflation within current observational research
The Table 2 fallacy: presenting and interpreting confounder and modifier coefficients
Differences in the patterns of health care system distrust between blacks and whites
Perspectives on racism in health care among black veterans with chronic kidney disease
Prior experiences of racial discrimination and racial differences in health care system distrust
An electronic health record-compatible model to predict personalized treatment effects from the diabetes prevention program: a cross-evidence synthesis approach using clinical trial and real-world data
Racial and ethnic disparities in diagnosis and treatment: a review of the evidence and a consideration of causes
in Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care (eds
R.) 417–454 (National Academies Press (US)
Income and cancer overdiagnosis—when too much care is harmful
Research reported in this publication was funded through a “Making a Difference” and Presidential Supplement Awards from The Greenwall Foundation (PI Kent)
The views presented in this publication are solely the responsibility of the authors and do not necessarily represent the views of the Greenwall Foundation
The Greenwall Foundation was not involved in the design of the study; the collection,
analysis, and interpretation of the data; or the decision to approve publication of the finished manuscript
Departments of Occupational Therapy and Community Health
Predictive Analytics and Comparative Effectiveness Center
Center for Individualized Medicine Bioethics
Tufts Clinical and Translational Science Institute
D.M.K.) contributed to development of conclusions
and reviewed and contributed significantly to the final manuscript
The authors declare the following competing interests: Dr
Duru declares no Competing Financial Interests but the following Competing Non-Financial Interests as a consultant for ExactCare Pharmacy®
research funding from the Patient Centered Outcomes Research Institute (PCORI)
the Centers for Disease Control and Prevention (CDC)
and the National Institutes of Health (NIH)
Kent declares no Competing Financial Interests but Competing Non-Financial Interests in research funding from the Greenwall Foundation
Patient Centered Outcomes Research Institute (PCORI)
Ladin declares no Competing Financial Interests but Competing Non-Financial Interests in research funding from Paul Teschan Research Fund #2021-08
Steyerberg declares no Competing Financial Interests but Competing Non-Financial Interests in funding from the EU Horizon program (4D Picture project
Ustun declares no Competing Financial Interests but Competing Non-Financial Interests in research funding from the National Science Foundation IIS 2040880
the NIH Bridge2AI Center Grant U54HG012510
All other authors declare no Competing Financial or Non-Financial Interests
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations
DOI: https://doi.org/10.1038/s41746-024-01245-y
Clinical prediction models are widely used in health and medical research
The area under the receiver operating characteristic curve (AUC) is a frequently used estimate to describe the discriminatory ability of a clinical prediction model
The AUC is often interpreted relative to thresholds
with “good” or “excellent” models defined at 0.7
These thresholds may create targets that result in “hacking”
where researchers are motivated to re-analyse their data until they achieve a “good” result
We extracted AUC values from PubMed abstracts to look for evidence of hacking
We used histograms of the AUC values in bins of size 0.01 and compared the observed distribution to a smooth distribution from a spline
The distribution of 306,888 AUC values showed clear excesses above the thresholds of 0.7
0.8 and 0.9 and shortfalls below the thresholds
The AUCs for some models are over-inflated
which risks exposing patients to sub-optimal clinical decision-making
Decisions guided by model probabilities or categories may rule out low-risk patients to reduce unnecessary treatments or identify high-risk patients for additional monitoring
If the model has good discrimination and gives estimated risks for all patients with the outcome that are higher than for all patients without, the AUC is 1
If the model discrimination is no better than a coin toss, the AUC is 0.5
Qualitative descriptors of model performance for AUC thresholds between 0.5 and 1 have been published
We searched abstracts for phrases including “area under the receiver operating characteristic curve” or the acronyms “AUC” and “AUROC”
We included all AUCs regardless of the study’s aim and therefore included model development and validation studies
We did not consider other commonly reported metrics for evaluating clinical prediction models
We examined abstracts published in PubMed because it is a large international database that includes most health and medical journals. To indicate its size, there were over 1.5 million abstracts published on PubMed in 2022. The National Library of Medicine make the PubMed data freely and easily available for research. We downloaded the entire database in XML format on 30 July 2022 from https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
We started with all the available PubMed data
Entries with an empty abstract or an abstract of 10 words or fewer were excluded, as were pharmacokinetic studies,
which often use area under the curve statistics to refer to dosages and volumes that are unrelated to prediction models
Publication types other than original articles were also excluded, as we were interested in original research
Our inclusion criterion was abstracts with one or more AUC values
We created a text-extraction algorithm to find AUC values using the team’s expertise and trial and error
We validated the algorithm by randomly sampling 300 abstracts with a Medical Subject Heading (MeSH) of “Area under curve” that had an abstract available and quantifying the number of AUC values that were correctly extracted
We also examined randomly selected results from the algorithm that equalled the thresholds of 0.7, 0.8, and 0.9
We report the validation in more detail in the results
but note here that the algorithm could not reliably extract AUC values that were exactly 1
AUC values equal to 1 were therefore excluded
Challenges in extracting the AUC values from abstracts included the frequent use of long lists of statistics
including the sensitivity and specificity; unrelated area under the curve statistics from pharmacokinetic studies; references to AUC values as a threshold (e.g
“The AUC ranges between 0.5 and 1”); and the many different descriptors used
AUC values reported as a percent were converted to 0 to 1
We removed any AUC values that were less than 0 or greater than or equal to 1
We categorised each AUC value as a mean or the lower or upper limit of the confidence interval
“0.704 (95% CI 0.603 to 0.806)” would be a mean
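A rough sketch of this kind of text extraction is shown below; the regular expression is a deliberate simplification, not the authors' algorithm, and would miss many of the non-standard presentations discussed in the validation.

```python
# Simplified extraction of AUC values from abstract text.
import re

AUC_PATTERN = re.compile(
    r"(?:AUC|AUROC|area under the (?:receiver operating characteristic )?curve)"
    r"[^0-9]{0,30}(0?\.\d{1,3}|\d{1,3}(?:\.\d+)?\s*%)",
    flags=re.IGNORECASE,
)

def extract_aucs(abstract: str):
    values = []
    for match in AUC_PATTERN.finditer(abstract):
        raw = match.group(1)
        value = float(raw.rstrip("% ")) / 100 if "%" in raw else float(raw)  # convert percentages to 0-1
        if 0 <= value < 1:                      # drop impossible values and exact 1s
            values.append(value)
    return values

print(extract_aucs("The model achieved an AUC of 0.704 (95% CI 0.603 to 0.806)."))
# -> [0.704]  (the confidence limits would need a second pattern to be captured and labelled)
```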
For the specific examples from published papers in the results
we give the PubMed ID number (PMID) rather than citing the paper
Our hypothesis was that there would be an excess of AUC values just above the thresholds of 0.7, 0.8, and 0.9
Each histogram bin had a lower threshold and an upper threshold that was +0.01 greater
for example, the bin (0.69, 0.70] included every AUC greater than 0.69 and less than or equal to 0.70
We excluded AUCs reported to only one decimal place (e.g., 0.7),
as these results would create spikes in the histogram that were simply due to rounding
We do not know what the distribution of AUC values from the health and medical literature would look like if there was no AUC-hacking
we are confident that it should be relatively smooth with no inflexion points
An excess of values just above a threshold would therefore suggest selective reporting, potentially caused by re-analysing the data to get a more publishable but inflated AUC
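The comparison of the observed histogram with a smooth reference distribution can be sketched as follows, assuming SciPy's smoothing spline; `auc_values` is a placeholder for the extracted sample and the smoothing factor must be chosen by the analyst (too small a value makes the spline follow the spikes it is meant to smooth over).

```python
# Bin AUC values in widths of 0.01, fit a smooth spline to the bin counts, and
# inspect the residuals just below and above the 0.7, 0.8 and 0.9 thresholds.
import numpy as np
from scipy.interpolate import UnivariateSpline

def threshold_residuals(auc_values, smoothing):
    edges = np.arange(0.50, 1.001, 0.01)              # bins like (0.69, 0.70]
    counts, _ = np.histogram(auc_values, bins=edges)
    centres = (edges[:-1] + edges[1:]) / 2

    spline = UnivariateSpline(centres, counts, s=smoothing)  # smooth reference distribution
    residuals = counts - spline(centres)              # excess (+) or shortfall (-) per bin

    for t in (0.70, 0.80, 0.90):
        below = residuals[np.isclose(centres, t - 0.005)][0]   # bin just below the threshold
        above = residuals[np.isclose(centres, t + 0.005)][0]   # bin just above the threshold
        print(f"threshold {t:.2f}: shortfall below = {below:.0f}, excess above = {above:.0f}")
    return residuals
```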
we noted that many abstracts gave multiple AUC values from competing models
we plotted the distribution using the highest AUC value per abstract
This subgroup analysis examined whether the best presented models were often just above the thresholds
We performed a subgroup analysis that used only AUC values from the results section of structured abstracts
This potentially increased the specificity of the extracted AUC values
as AUC values mentioned in the background, methods and discussion sections were more likely to be general references to the AUC rather than results
The flow chart of included abstracts is shown in Fig. 1.
The number of examined abstracts was over 19 million, and 96,986 (0.5%) included at least one AUC value. The use of AUC values has become more popular in recent years (see Additional file 2: Fig
The median publication year for the AUC values was 2018
with first to third quartile of 2015 to 2018
For abstracts with at least one AUC value, the median number of AUC values was 2, with a first to third quartile of 1 to 4 (see Additional file 3: Fig
There was a long tail in the distribution of AUC values
with 1.1% of abstracts reporting 20 or more AUC values
These high numbers were often from abstracts that compared multiple models
The total number of included AUC values was 306,888
There were 92,529 (31%) values reported as lower or upper confidence limits and the remainder as means
Histogram of AUC mean values (top panel) and residuals from a smooth fit to the histogram (bottom panel)
The dotted line in the top panel shows the smooth fit
The distribution from the largest AUC mean value per abstract excluding confidence intervals is shown in Fig. 3. The strong changes in the distribution at the thresholds observed in Fig. 2 remain.
Histogram of the largest AUC mean value per abstract (top panel) and residuals from a smooth fit to the histogram (bottom panel)
The distribution for AUC values published in PLOS ONE show a similar pattern to the full sample, with many more AUC values just above the 0.8 threshold (see Additional file 6: Fig
For abstracts where either the algorithm or manual entry found one or more AUC values, we made a Bland–Altman plot of the number of AUC values extracted (see Additional file 7: Fig
the algorithm missed more AUC values than the manual entry
a discrepancy that was generally due to non-standard presentations
We accepted this trade-off, as we would rather lean towards missing valid AUC values than wrongly including invalid AUC values
We used a regression model to examine differences in the AUC values extracted by the algorithm and manual entry
AUC values that were wrongly included by the algorithm were smaller on average than the AUC values that were correctly included
This is because the values extracted were often describing other aspects of the prediction model
The validation also helped identify MeSH terms for pharmacokinetic studies, which were then excluded from our main analysis
we manually checked 100 randomly sampled abstracts that the algorithm identified as not having an AUC statistic and another 100 randomly sampled abstracts that the algorithm identified as having an AUC statistic
All abstracts identified as not having an AUC statistic were correctly classified (95% confidence interval for negative predictive value: 0.964 to 1.000)
All but one abstract identified as having an AUC statistic was correct (95% confidence interval for positive predictive value: 0.946 to 1.000)
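These intervals are consistent with exact binomial confidence intervals for 100 abstracts per group; a minimal check in R:

```r
# Exact binomial 95% confidence intervals for the manual validation
binom.test(100, 100)$conf.int   # negative predictive value: 100 of 100 correct
binom.test(99, 100)$conf.int    # positive predictive value: 99 of 100 correct
```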
making for a larger gap between the thresholds and values exceeding the threshold
which would be stronger evidence of poor practice
To investigate the excess at (0.56, 0.57], we manually extracted the AUC values from 300 abstracts where our algorithm found an AUC value of 0.57 and another 300 from 0.58 as a nearby comparison with no excess. The error proportions from the algorithm were relatively low (see Additional file 7: Table S3)
indicating that the excess at 0.57 was not due to errors
with relatively low AUC values under 0.75 described as “excellent” (PMID35222547)
with the inflated AUC values regressing to the mean
The widespread use of these poor practices creates a biased evidence base and is misinforming health policy
We did not examine other commonly reported performance metrics used to evaluate clinical prediction model performance
It is possible that values such as model sensitivity and specificity may also be influenced by “acceptable” thresholds
It is likely that the highest AUC value presented in the abstract is also the highest in the full text
so the “best” model would be captured in the abstract
and the “best” AUC value is the one most likely to be created by hacking
In addition to hacking, publication bias likely also plays a role in the selection of AUC values, with higher values more likely to be accepted by peer reviewers and journal editors. Our subgroup analysis of PLOS ONE abstracts (Additional file 6: Fig
S6) provides some evidence that the “hacking” pattern in AUC values is due to author behaviour not journal behaviour
We used an automated algorithm that provided a large and generalisable sample but did not perfectly extract all AUC values
we were not able to reliably extract AUC values of 1
and this is an important value as it is the best possible result and could be a target for hacking
We believe that the errors and exclusions in the data are not large enough to change our key conclusion
Clinical prediction models are growing in popularity
likely because of increased patient data availability and accessible software tools to build models
many published models have serious flaws in their design and presentation
as the AUCs for some models have been over-inflated
Publishing overly optimistic models risks exposing patients to sub-optimal clinical decision-making
An urgent reset is needed in how clinical prediction models are built
Actionable steps towards greater transparency are as follows: the wider use of protocols and registered reports
Area under the receiver operating characteristic curve
Clinical prediction models: diagnosis versus prognosis
Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med. 1985;13(10):818–29. https://doi.org/10.1097/00003246-198510000-00009
Wynants L, van Smeden M, McLernon DJ, Timmerman D, Steyerberg EW, Calster BV. Three myths about risk thresholds for prediction models. BMC Med. 2019;17(1). https://doi.org/10.1186/s12916-019-1425-3
Search filters for finding prognostic and diagnostic prediction studies in Medline to enhance systematic reviews
Hand DJ. Classifier technology and the illusion of progress. Stat Sci. 2006;21(1). https://doi.org/10.1214/088342306000000060
Collins GS, de Groot JA, Dutton S, Omar O, Shanyinde M, Tajar A, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol. 2014;14(1). https://doi.org/10.1186/1471-2288-14-40
Miller E, Grobman W. Prediction with conviction: a stepwise guide toward improving prediction and clinical care. BJOG. 2016;124(3):433. https://doi.org/10.1111/1471-0528.14187
Steyerberg EW, Uno H, Ioannidis JPA, van Calster B, Ukaegbu C, Dhingra T, et al. Poor performance of clinical prediction models: the harm of commonly applied methods. J Clin Epidemiol. 2018;98:133–43. https://doi.org/10.1016/j.jclinepi.2017.11.013
Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020:m441. https://doi.org/10.1136/bmj.m441
Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal. BMJ. 2020;369. https://doi.org/10.1136/bmj.m1328
Methodological conduct of prognostic prediction models developed using machine learning in oncology: a systematic review
Clinical prediction models in psychiatry: a systematic review of two decades of progress and challenges
Trends in the conduct and reporting of clinical prediction model development and validation: a systematic review
Prognosis Research Strategy (PROGRESS) 3: prognostic model research
Mandrekar JN. Receiver operating characteristic curve in diagnostic test assessment. J Thorac Oncol. 2010;5(9):1315–6. https://doi.org/10.1097/jto.0b013e3181ec173d
Khouli RHE, Macura KJ, Barker PB, Habba MR, Jacobs MA, Bluemke DA. Relationship of temporal resolution to diagnostic performance for dynamic contrast enhanced MRI of the breast. J Magn Reson Imaging. 2009;30(5):999–1004. https://doi.org/10.1002/jmri.21947
Pitamberwale A, Mahmood T, Ansari AK, Ansari SA, Limgaokar K, Singh L, et al. Biochemical parameters as prognostic markers in severely Ill COVID-19 patients. Cureus. 2022. https://doi.org/10.7759/cureus.28594
Calster BV, Steyerberg EW, Wynants L, van Smeden M. There is no such thing as a validated prediction model. BMC Med. 2023;21(1). https://doi.org/10.1186/s12916-023-02779-w
de Hond AAH, Steyerberg EW, van Calster B. Interpreting area under the receiver operating characteristic curve. Lancet Digit Health. 2022;4(12):e853–5. https://doi.org/10.1016/s2589-7500(22)00188-1
Fraser H, Parker T, Nakagawa S, Barnett A, Fidler F. Questionable research practices in ecology and evolution. PLoS ONE. 2018;13(7):1–16. https://doi.org/10.1371/journal.pone.0200303
John LK, Loewenstein G, Prelec D. Measuring the Prevalence of questionable research practices with incentives for truth telling. Psychol Sci. 2012;23(5):524–32. https://doi.org/10.1177/0956797611430953
Stefan AM, Schönbrodt FD. Big little lies: a compendium and simulation of p-hacking strategies. R Soc Open Sci. 2023;10(2):220346. https://doi.org/10.1098/rsos.220346
Altman DG, Lausen B, Sauerbrei W, Schumacher M. Dangers of using “optimal” cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst. 1994;86(11):829–35. https://doi.org/10.1093/jnci/86.11.829
Picard D. Torch.manual_seed(3407) is all you need: on the influence of random seeds in deep learning architectures for computer vision. CoRR. 2021. arXiv:2109.08203
An observational analysis of the trope “A p-value of < 0.05 was considered statistically significant” and other cut-and-paste statistical methods
Masicampo EJ, Lalande DR. A peculiar prevalence of p values just below .05. Q J Exp Psychol (Hove). 2012;65(11):2271–2279. https://doi.org/10.1080/17470218.2012.711335
Barnett AG, Wren JD. Examination of confidence intervals in health and medical journals from 1976 to 2019: an observational study. BMJ Open. 2019;9(11). https://doi.org/10.1136/bmjopen-2019-032506
Zwet EW, Cator EA. The significance filter, the winner’s curse and the need to shrink. Stat Neerl. 2021;75(4):437–52. https://doi.org/10.1111/stan.12241
Hussey I, Alsalti T, Bosco F, Elson M, Arslan RC. An aberrant abundance of Cronbach’s alpha values at .70. 2023. https://doi.org/10.31234/osf.io/dm8xn
Regression modeling strategies: with applications to linear models
R Core Team. R: a language and environment for statistical computing. Vienna; 2023. https://www.R-project.org/
Barnett AG. Code and data for our analysis of area under the curve values extracted from PubMed abstracts. 2023. https://doi.org/10.5281/zenodo.8275064
Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press; 2003. https://doi.org/10.1017/CBO9780511755453
Chiu K, Grundy Q, Bero L. ‘Spin’ in published biomedical literature: a methodological systematic review. PLoS Biol. 2017;15(9):e2002173. https://doi.org/10.1371/journal.pbio.2002173
Brodeur A, Cook N, Heyes A. Methods matter: p-hacking and publication bias in causal analysis in economics. Am Econ Rev. 2020;110(11):3634–60. https://doi.org/10.1257/aer.20190687
Adda J, Decker C, Ottaviani M. P-hacking in clinical trials and how incentives shape the distribution of results across phases. Proc Natl Acad Sci U S A. 2020;117(24):13386–92. https://doi.org/10.1073/pnas.1919906117
Rohrer JM, Tierney W, Uhlmann EL, DeBruine LM, Heyman T, Jones B, et al. Putting the self in self-correction: findings from the loss-of-confidence project. Perspect Psychol Sci. 2021;16(6):1255–69. https://doi.org/10.1177/1745691620964106
Moons KGM, Donders ART, Steyerberg EW, Harrell FE. Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example. J Clin Epidemiol. 2004;57(12):1262–70. https://doi.org/10.1016/j.jclinepi.2004.01.020
Chambers CD, Tzavella L. The past, present and future of Registered Reports. Nat Hum Behav. 2021;6(1):29–42. https://doi.org/10.1038/s41562-021-01193-7
Penders B. Process and bureaucracy: scientific reform as civilisation. Bull Sci Technol Soc. 2022;42(4):107–16. https://doi.org/10.1177/02704676221126388
Chan AW, Hróbjartsson A, Haahr MT, Gøtzsche PC, Altman DG. Empirical evidence for selective reporting of outcomes in randomized trials. JAMA. 2004;291(20):2457. https://doi.org/10.1001/jama.291.20.2457
Mathieu S. Comparison of registered and published primary outcomes in randomized controlled trials. JAMA. 2009;302(9):977. https://doi.org/10.1001/jama.2009.1242
Goldacre B, Drysdale H, Powell-Smith A, Dale A, Milosevic I, Slade E, et al. The COMPare Trials Project. 2016. www.COMPare-trials.org
Schwab S, Janiaud P, Dayan M, Amrhein V, Panczak R, Palagi PM, et al. Ten simple rules for good research practice. PLoS Comput Biol. 2022;18(6):1–14. https://doi.org/10.1371/journal.pcbi.1010139
Assessing the performance of prediction models: a framework for some traditional and novel measures
Vickers AJ, Calster BV, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016:i6. https://doi.org/10.1136/bmj.i6
Parsons R, Blythe RD, Barnett AG, Cramb SM, McPhail SM. predictNMB: an R package to estimate if or when a clinical prediction model is worthwhile. J Open Source Softw. 2023;8(84):5328. https://doi.org/10.21105/joss.05328
Stark PB, Saltelli A. Cargo-cult statistics and scientific crisis. Significance. 2018;15(4):40–3. https://doi.org/10.1111/j.1740-9713.2018.01174.x
Christian K, ann Larkins J, Doran MR. We must improve conditions and options for Australian ECRs. Nat Hum Behav. 2023. https://doi.org/10.1038/s41562-023-01621-w
Wang MQ, Yan AF, Katz RV. Researcher requests for inappropriate analysis and reporting: a U.S. survey of consulting biostatisticians. Ann Intern Med. 2018;169(8):554. https://doi.org/10.7326/m18-1230
Thanks to the National Library of Medicine for making the PubMed data available for research
Twitter handles: @nicolem_white (Nicole White); @RexParsons8 (Rex Parsons); @GSCollins (Gary Collins)
GSC was supported by Cancer Research UK (programme grant: C49297/A27294)
Australian Centre for Health Services Innovation and Centre for Healthcare Transformation
Rheumatology & Musculoskeletal Sciences
All authors contributed to the interpretation of the results and critical revision of the manuscript
The corresponding author attests that all listed authors meet the authorship criteria and that no others meeting the criteria have been omitted
We used publicly available data that were published to be read and scrutinised by researchers and hence ethical approval was not required
Examples of qualitative descriptors for AUC thresholds
Number and proportion of abstracts with at least one AUC value over time
Bar chart of the number of AUC values per abstract
Distribution of AUC values and residuals from a smooth fit to the distribution using only AUC values that were in the results section of the abstract
Histograms of AUC values that were lower or upper confidence limits and residuals from a smooth fit to the histograms
Subgroup analysis of AUC values from the journal PLOS ONE
Bland–Altman plot of the difference in the number of AUC values per abstract extracted manually and by the algorithm
Box-plots of AUC values grouped by whether they were extracted by the algorithm or manual-check only
Estimates from a linear regression model examining the differences in AUC values extracted by the algorithm and manual checking
Proportion of correct AUC values from the algorithm for four selected AUC values
Proportion of correct AUC values from the algorithm for two selected AUC values
DOI: https://doi.org/10.1186/s12916-023-03048-6
Baseline outcome risk can be an important determinant of absolute treatment benefit and has been used in guidelines for “personalizing” medical decisions
We compared easily applicable risk-based methods for optimal prediction of individualized treatment effects
We simulated RCT data using diverse assumptions for the average treatment effect
the shape of its interaction with treatment (none
and the magnitude of treatment-related harms (none or constant independent of the prognostic index)
We predicted absolute benefit using: models with a constant relative treatment effect; stratification in quarters of the prognostic index; models including a linear interaction of treatment with the prognostic index; models including an interaction of treatment with a restricted cubic spline transformation of the prognostic index; an adaptive approach using Akaike’s Information Criterion
We evaluated predictive performance using root mean squared error and measures of discrimination and calibration for benefit
The linear-interaction model displayed optimal or close-to-optimal performance across many simulation scenarios with moderate sample size (N = 4,250; ~ 785 events)
The restricted cubic splines model was optimal for strong non-linear deviations from a constant treatment effect
particularly when sample size was larger (N = 17,000)
The adaptive approach also required larger sample sizes
These findings were illustrated in the GUSTO-I trial
An interaction between baseline risk and treatment assignment should be considered to improve treatment effect predictions
By assuming treatment effect is a function of baseline risk
risk modeling methods impose a restriction on the shape of treatment effect heterogeneity
With smaller sample sizes or limited information on effect modification
can provide a good option for evaluating treatment effect heterogeneity
with larger sample sizes and/or a limited set of well-studied strong effect modifiers
treatment effect modeling methods can potentially result in a better bias-variance tradeoff
the setting in which treatment effect heterogeneity is evaluated is crucial for the selection of the optimal approach
even though treatment effect estimates at the risk subgroup level may be accurate
these estimates may not apply to individual patients
as homogeneity of treatment effects is assumed within risk strata
With stronger overall treatment effect and larger variability in predicted risks
patients assigned to the same risk subgroup may still differ substantially with regard to their benefits from treatment
we aim to summarize and compare different risk-based models for predicting treatment effects
We simulate different relations between baseline risk and treatment effects and also consider potential harms of treatment
We illustrate the different models by a case study of predicting individualized effects of treatment for acute myocardial infarction in a large RCT
We observe RCT data \(\left(Z,X,Y\right)\)
where for each patient \({Z}_{i}=0,1\) is the treatment status
\({Y}_{i}=0,1\) is the observed outcome and \({X}_{i}\) is a set of measured covariates
Let \(\{{Y}_{i}\left(z\right),z=0,1\}\) denote the unobservable potential outcomes
We observe \({Y}_{i}={Z}_{i}{Y}_{i}\left(1\right)+\left(1-{Z}_{i}\right){Y}_{i}\left(0\right)\)
We are interested in predicting the conditional average treatment effect (CATE)
Assuming that \(\left(Y\left(0\right),Y\left(1\right)\right)\perp Z|X\)
comparing equally-sized treatment and control arms in terms of a binary outcome
For each patient we generated 8 baseline covariates \({X}_{1},\dots ,{X}_{4}\sim N\left(0,1\right)\) and \({X}_{5},\dots ,{X}_{8}\sim B\left(1,0.2\right)\)
Outcomes in the control arm were generated from Bernoulli variables with true probabilities following a logistic regression model including all baseline covariates
\(P\left(Y\left(0\right)=1 | X=x\right)={\text{expit}}\left(l{p}_{0}\right)={e}^{l{p}_{0}}/\left(1+{e}^{l{p}_{0}}\right)\)
with \(l{p}_{0}=l{p}_{0}\left(x\right)={x}^{t}\beta\)
In the base scenarios coefficient values \(\beta\) were such
that the control event rate was 20% and the discriminative ability of the true prediction model measured using Harrell’s c-statistic was 0.75
The c-statistic represents the probability that for a randomly selected discordant pair from the sample (patients with different outcomes) the prediction model assigns larger risk to the patient with the worse outcome
For the simulations this was achieved by selecting \(\beta\) values such that the true prediction model would achieve a c-statistic of 0.75 in a simulated control arm with 500,000 patients
We achieved a true c-statistic of 0.75 by setting \(\beta ={\left(-2.08,0.49,\dots ,0.49\right)}^{t}\)
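A minimal R sketch of this control-arm data-generating mechanism (the sample size and seed are illustrative):

```r
# Simulate the control arm: 4 standard-normal and 4 Bernoulli(0.2) covariates,
# outcomes from a logistic model with intercept -2.08 and coefficients 0.49
set.seed(2024)
n     <- 4250
expit <- function(lp) 1 / (1 + exp(-lp))

X <- cbind(matrix(rnorm(n * 4), ncol = 4),            # X1..X4 ~ N(0, 1)
           matrix(rbinom(n * 4, 1, 0.2), ncol = 4))   # X5..X8 ~ B(1, 0.2)
beta <- c(-2.08, rep(0.49, 8))

lp0 <- beta[1] + X %*% beta[-1]                        # linear predictor lp0 = x' beta
y0  <- rbinom(n, 1, expit(lp0))                        # control-arm outcomes
mean(y0)                                               # event rate, approximately 20%
```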
Outcomes in the treatment arm were first generated using 3 simple scenarios for a true constant odds ratio (OR): absent (OR = 1)
moderate (OR = 0.8) or strong (OR = 0.5) constant relative treatment effect
We also considered linear, quadratic, and non-monotonic deviations from constant treatment effects
We compared different methods for predicting absolute treatment benefit
that is the risk difference between distinct treatment assignments
We use the term absolute treatment benefit to distinguish from relative treatment benefit that relies on the ratio of predicted risk under different treatment assignments
Patients are stratified into equally-sized risk strata—in this case based on risk quartiles
Absolute treatment benefits within each stratum are estimated by the difference in event rate between control and treatment arm patients
We considered this approach as a reference
expecting it to perform worse than the other candidates
as its objective is to provide an illustration of HTE rather than to optimize individualized benefit predictions
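A sketch of this reference approach, assuming a data frame `dat` with outcome `y`, treatment indicator `z`, and a column `risk` holding predicted baseline risk (all names illustrative):

```r
# Risk stratification: quarters of predicted baseline risk, benefit estimated as the
# difference in event rates between control (z = 0) and treatment (z = 1) patients
dat$risk_quarter <- cut(dat$risk,
                        breaks = quantile(dat$risk, probs = 0:4 / 4),
                        include.lowest = TRUE)

benefit_by_quarter <- sapply(split(dat, dat$risk_quarter), function(d)
  mean(d$y[d$z == 0]) - mean(d$y[d$z == 1]))
benefit_by_quarter
```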
we fitted a logistic regression model which assumes constant relative treatment effect (constant odds ratio)
\(P\left(Y=1 | X=x,Z=z;\widehat{\beta }\right)={\text{expit}}\left({\widehat{lp}}_{0}+{\delta }_{1}z\right)\)
absolute benefit is predicted from \(\tau \left(x;\widehat{\beta }\right)={\text{expit}}\left({\widehat{lp}}_{0}\right)-{\text{expit}}\left({\widehat{lp}}_{0}+{\delta }_{1}\right)\)
where \({\delta }_{1}\) is the log of the assumed constant odds ratio and \({\widehat{lp}}_{0}={\widehat{lp}}_{0}\left(x;\widehat{\beta }\right)={x}^{t}\widehat{\beta }\) the linear predictor of the estimated baseline risk model
we fitted a logistic regression model including treatment
\(P\left(Y=1 | X=x,Z=z;\widehat{\beta }\right)={\text{expit}}\left({\delta }_{0}+{\delta }_{1}z+{\delta }_{2}{\widehat{lp}}_{0}+{\delta }_{3}z{\widehat{lp}}_{0}\right)\)
Absolute benefit is then estimated from \(\tau \left(x;\widehat{\beta }\right)={\text{expit}}\left({\delta }_{0}+{\delta }_{2}{\widehat{lp}}_{0}\right)-{\text{expit}}\left({(\delta }_{0}+{\delta }_{1})+{(\delta }_{2}{+{\delta }_{3})\widehat{lp}}_{0}\right)\)
We will refer to this method as the linear interaction approach
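Both approaches can be sketched schematically in R as follows, assuming a data frame `dat` with outcome `y`, treatment `z`, and covariates `x1`–`x8`; unlike the formulas above, the constant-effect model here re-estimates an intercept alongside the offset, a pragmatic simplification.

```r
# Baseline risk model (linear predictor lp0), here estimated in the control arm
risk_mod <- glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8,
                family = binomial, data = dat, subset = (z == 0))
dat$lp0  <- predict(risk_mod, newdata = dat)          # estimated linear predictor

benefit <- function(fit, d) {                         # predicted risk(z = 0) minus risk(z = 1)
  predict(fit, transform(d, z = 0), type = "response") -
    predict(fit, transform(d, z = 1), type = "response")
}

# Constant relative treatment effect (a single log odds ratio for treatment)
m_const <- glm(y ~ z + offset(lp0), family = binomial, data = dat)
dat$benefit_const <- benefit(m_const, dat)

# Linear interaction of treatment with the linear predictor
m_lin <- glm(y ~ z * lp0, family = binomial, data = dat)
dat$benefit_lin <- benefit(m_lin, dat)
```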
we considered an adaptive approach using Akaike’s Information Criterion (AIC) for model selection
we ranked the constant relative treatment effect model, the linear interaction model, and the RCS models with 3, 4, and 5 knots based on their AIC and selected the one with the lowest value
The extra degrees of freedom were 1 (linear interaction)
3 and 4 (RCS models) for these increasingly complex interactions with the treatment effect
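A sketch of this adaptive selection, reusing the `dat` and estimated `lp0` from the previous sketch; `rcs()` is the restricted cubic spline from the rms package:

```r
# Adaptive AIC-based selection among the candidate models, assuming `dat` holds
# the outcome y, treatment z, and the estimated baseline linear predictor lp0
library(rms)

fits <- list(
  constant = glm(y ~ z + offset(lp0), family = binomial, data = dat),
  linear   = glm(y ~ z * lp0,         family = binomial, data = dat),
  rcs3     = glm(y ~ z * rcs(lp0, 3), family = binomial, data = dat),
  rcs4     = glm(y ~ z * rcs(lp0, 4), family = binomial, data = dat),
  rcs5     = glm(y ~ z * rcs(lp0, 5), family = binomial, data = dat)
)
sapply(fits, AIC)                              # compare models
best <- fits[[which.min(sapply(fits, AIC))]]   # model with the lowest AIC
```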
We evaluated the predictive accuracy of the considered methods by the root mean squared error, \({\text{RMSE}}=\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{\left(\widehat{\tau }\left({x}_{i}\right)-\tau \left({x}_{i}\right)\right)}^{2}}\), where \(\tau \left({x}_{i}\right)\) is the true and \(\widehat{\tau }\left({x}_{i}\right)\) the predicted absolute benefit for patient \(i\)
The observed benefits are regressed on the predicted benefits using a locally weighted scatterplot smoother (loess)
The ICI-for-benefit is the average absolute difference between predicted and smooth observed benefit
Values closer to 0 represent better calibration
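In code, the ICI-for-benefit reduces to a loess smooth of observed against predicted benefit; the construction of matched pairs (and hence of observed benefit) is omitted in this sketch.

```r
# Sketch of the ICI-for-benefit, assuming vectors `pred_benefit` (predicted) and
# `obs_benefit` (observed benefit in matched patient pairs; pairing not shown here)
cal_fit  <- loess(obs_benefit ~ pred_benefit)
smoothed <- predict(cal_fit, newdata = data.frame(pred_benefit = pred_benefit))
ici_benefit <- mean(abs(pred_benefit - smoothed))   # average absolute difference
ici_benefit
```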
For each scenario we performed 500 replications
within which all the considered models were fitted
We simulated a super-population of size 500,000 for each scenario within which we calculated RMSE and discrimination and calibration for benefit of all the models in each replication
We demonstrated the different methods using 30,510 patients with acute myocardial infarction (MI) included in the GUSTO-I trial
10,348 patients were randomized to tissue plasminogen activator (tPA) treatment and 20,162 were randomized to streptokinase
The outcome of interest was 30-day mortality (total of 2,128 events)
Predicted baseline risk is derived by setting the treatment indicator to 0 for all patients
RMSE of the considered methods across 500 replications was calculated from a simulated super-population of size 500,000
The scenario with true constant relative treatment effect (panel A) had a true prediction c-statistic of 0.75 and sample size of 4250
The RMSE is also presented for strong linear (panel B)
and non-monotonic (panel D) deviations from constant relative treatment effects
Panels on the right side present the true relations between baseline risk (x-axis) and absolute treatment benefit (y-axis)
and 97.5 percentiles of the risk distribution are expressed by the boxplot on the top
and 97.5 percentiles of the true benefit distributions are expressed by the boxplots on the side of the right-hand panel
RMSE of the considered methods across 500 replications calculated in simulated samples of size 17,000 rather than 4,250 in Fig. 1
RMSE was calculated on a super-population of size 500,000
RMSE of the considered methods across 500 replications calculated in simulated samples of size 4,250
Discrimination for benefit of the considered methods across 500 replications calculated in simulated samples of size 4,250 using the c-statistic for benefit
The c-statistic for benefit represents the probability that from two randomly chosen matched patient pairs with unequal observed benefit
the pair with greater observed benefit also has a higher predicted benefit
Calibration for benefit of the considered methods across 500 replications calculated in a simulated sample of size 500,000
True prediction c-statistic of 0.75 and sample size of 4,250
Our main conclusions remained unchanged in the sensitivity analyses where correlations between baseline characteristics were introduced (Supplement
The results from all individual scenarios can be explored online at https://mi-erasmusmc.shinyapps.io/HteSimulationRCT/. Additionally, all the code for the simulations can be found at https://github.com/mi-erasmusmc/HteSimulationRCT
We used the derived prognostic index to fit a constant treatment effect
a linear interaction and an RCS-3 model individualizing absolute benefit predictions
an adaptive approach with the 3 candidate models was applied
Individualized absolute benefit predictions based on baseline risk when using a constant treatment effect approach
a linear interaction approach and RCS smoothing using 3 knots
Risk stratified estimates of absolute benefit are presented within quartiles of baseline risk as reference
95% confidence bands were generated using 10,000 bootstrap resamples
where the prediction model was refitted in each run to capture the uncertainty in baseline risk predictions
we also provide 95% confidence intervals for the baseline risk quarter-specific average predicted risk over the 10,000 bootstrap samples
The linear interaction and the RCS-3 models displayed very good performance under many of the considered simulation scenarios
The linear interaction model was optimal in cases with moderate sample sizes (4,250 patients; ~ 785 events) and moderately performing baseline risk prediction models
was better calibrated for benefit and had better discrimination for benefit
even in scenarios with strong quadratic deviations
In scenarios with true non-monotonic deviations
the linear interaction model was outperformed by RCS-3
especially in the presence of treatment-related harms
Increasing the sample size or the prediction model’s discriminative ability favored RCS-3
especially in scenarios with strong non-linear deviations from a constant treatment effect
RCS-4 and RCS-5 were too flexible in all considered scenarios
increased variability of discrimination for benefit and worse calibration of benefit predictions
Even with larger sample sizes and strong quadratic or non-monotonic deviations
these more flexible methods did not outperform the simpler RCS-3 approach
Higher flexibility may only be helpful under more extreme patterns of HTE compared to the quadratic deviations considered here
Considering interactions in RCS-3 models as the most complex approach may often be reasonable
Our results can also be interpreted in terms of bias-variance trade-off
The increasingly complex models considered allow for more degrees of freedom which
increase the variance of our absolute benefit estimates
this increased complexity did not always result in substantial decrease in bias
especially with lower sample sizes and weaker treatment effects
in most scenarios the simpler linear interaction model achieved the best bias-variance balance and outperformed the more complex RCS methods
even in the presence of non-linearity in the true underlying relationship between baseline risk and treatment effect
the simpler constant treatment effect model was often heavily biased and
was outperformed by the other methods in the majority of the considered scenarios
Increasing the discriminative ability of the risk model reduced RMSE for all methods
Higher discrimination translates into higher variability of predicted risks
which allows the considered methods to better capture absolute treatment benefits
better risk discrimination also led to higher discrimination between those with low or high benefit (as reflected in values of c-for-benefit)
The adaptive approach had adequate median performance, following the “true” model in most scenarios. With smaller sample sizes it tended to miss the treatment-baseline risk interaction and selected simpler models (Supplement Sect
This conservative behavior resulted in increased RMSE variability in these scenarios
especially with true strong linear or non-monotonic deviations
with smaller sample sizes the simpler linear interaction model may be a safer choice for predicting absolute benefits
especially in the presence of any suspected treatment-related harms
Even though the average error rates increased for all the considered methods
due to the misspecification of the outcome model
the linear interaction model had the lowest error rates
The constant treatment effect model was often biased
especially with moderate or strong treatment-related harms
Future simulation studies could explore the effect of more extensive deviations from risk-based treatment effects
in all our simulation scenarios we assumed all covariates to be statistically independent
the effect of continuous covariates to be linear
and no interaction effects between covariates to be present
This can be viewed as a limitation of our extensive simulation study
as all our methods are based on the same fitted risk model
we do not expect these assumptions to significantly influence their relative performance
the linear interaction approach is a viable option with moderate sample sizes and/or moderately performing risk prediction models
provided that a non-constant relative treatment effect is considered plausible
RCS-3 is a better option with more abundant sample size and when non-monotonic deviations from a constant relative treatment effect and/or substantial treatment-related harms are anticipated
Increasing the complexity of the RCS models by increasing the number of knots does not improve benefit prediction
Using AIC for model selection is attractive with larger sample size
The dataset supporting the conclusions of this article is available in the Vanderbilt University repository maintained by the Biostatistics Department, https://hbiostat.org/data/gusto.rda
Global Utilization of Streptokinase and Tissue plasminogen activator for Occluded coronary arteries
Synergy Between PCI With Taxus and Cardiac Surgery
A framework for the analysis of heterogeneity of treatment effect in patient-centered outcomes research
Personalized evidence based medicine: predictive approaches to heterogeneous treatment effects
Recursive partitioning for heterogeneous causal effects
Some methods for heterogeneous treatment effect estimation in high dimensions
Estimation and Inference of Heterogeneous Treatment Effects using Random Forests
Predictive approaches to heterogeneous treatment effects: a scoping review
Assessing and reporting heterogeneity in treatment effects in clinical trials: a proposal
Benefit and harm of intensive blood pressure treatment: Derivation and validation of risk models using data from the SPRINT and ACCORD trials
Analysis of randomized comparative clinical trial data for personalized treatment selections
Metalearners for estimating heterogeneous treatment effects using machine learning
A robust method for estimating optimal treatment regimes
Estimating Optimal Treatment Regimes from a Classification Perspective
Simple subgroup approximations to optimal treatment regimes from randomized clinical trial data
Regularized outcome weighted subgroup identification for differential treatment effects
Models with interactions overestimated heterogeneity of treatment effects and were prone to treatment mistargeting
A tutorial on individualized treatment effect prediction from randomized trials with a binary endpoint
Simple risk stratification at admission to identify patients with reduced mortality from primary angioplasty
Should Vitamin A injections to prevent bronchopulmonary dysplasia or death be reserved for high-risk infants
Reanalysis of the National Institute of Child Health and Human Development Neonatal Research Network Randomized Trial
Improving diabetes prevention with benefit based tailored treatment: risk based reanalysis of Diabetes Prevention Program
The Predictive Approaches to Treatment effect Heterogeneity (PATH) statement
The Predictive Approaches to Treatment effect Heterogeneity (PATH) Statement: explanation and elaboration
Kent DM, Nelson J, Dahabreh IJ, Rothwell PM, Altman DG, Hayward RA. Risk and treatment effect heterogeneity: re-analysis of individual participant data from 32 large clinical trials. Int J Epidemiol. 2016;45(6):2075–88. https://doi.org/10.1093/ije/dyw118
using internally developed risk models to assess heterogeneity in treatment effects in clinical trials
Endogenous stratification in randomized experiments
Regression models in clinical studies: determining relationships between predictors and response
The proposed `concordance-statistic for benefit’ provided a useful metric when modeling heterogeneous treatment effects
The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models
Selection of thrombolytic therapy for individual patients: development of a clinical model
Clinical trials in acute myocardial infarction: should we adjust for baseline characteristics
Can overall results of clinical trials be applied to all patients
An evidence based approach to individualising treatment
Treatment selections using risk–benefit profiles based on data from comparative randomized clinical trials with multiple endpoints
Identification of biomarker-by-treatment interactions in randomized clinical trials with survival outcomes and high-dimensional spaces
A bayesian approach to subgroup identification
Athey S, Tibshirani J, Wager S. Generalized random forests. Annals Stat. 2019;47(2):1148–78. https://doi.org/10.1214/18-AOS1709
Estimating Individual Treatment Effect in Observational Data Using Random Forest Methods
Anatomical and clinical characteristics to guide decision making between coronary artery bypass surgery and percutaneous coronary intervention for individual patients: development and validation of SYNTAX score II
Redevelopment and validation of the SYNTAX score II to individualise decision making between percutaneous and surgical revascularisation in patients with complex coronary artery disease: secondary analysis of the multicentre randomised controlled SYNTAXES trial with external cohort validation
This work has been performed in the European Health Data and Evidence Network (EHDEN) project
This project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement No 806968
The JU receives support from the European Union’s Horizon 2020 research and innovation program and EFPIA
Institute for Clinical Research and Health Policy Studies
created the software used in this work and ran the analysis; All authors interpreted the results
The author(s) read and approved the final manuscript
Ethics approval was not required as the empirical illustration of this study was based on anonymized
work for a research group that received/receives unconditional research grants from Yamanouchi
None of these relate to the content of this paper
The remaining authors have disclosed that they do not have any potential conflicts of interest
DOI: https://doi.org/10.1186/s12874-023-01889-6
there is an emergent need to develop a robust prediction model for estimating an individual absolute risk for all-cause mortality
so that relevant assessments and interventions can be targeted appropriately
evaluate and validate (internally and externally) a risk prediction model allowing rapid estimations of an absolute risk of all-cause mortality in the following 10 years
data came from the English Longitudinal Study of Ageing (ELSA)
which comprised 9154 population-representative individuals aged 50–75 years
1240 (13.5%) of whom died during the 10-year follow-up
Internal validation was carried out using Harrell’s optimism-correction procedure; external validation was carried out using Health and Retirement Study (HRS)
which is a nationally representative longitudinal survey of adults aged ≥50 years residing in the United States
Cox proportional hazards model with regularisation by the least absolute shrinkage and selection operator
where optimisation parameters were chosen based on repeated cross-validation
was employed for variable selection and model fitting
sensitivity and specificity were determined in the development and validation cohorts
The model selected 13 prognostic factors of all-cause mortality encompassing information on demographic characteristics
The internally validated model had good discriminatory ability (c-index=0.74)
specificity (72.5%) and sensitivity (73.0%)
the model’s prediction accuracy remained within a clinically acceptable range (c-index=0.69
The main limitation of our model is twofold: 1) it may not be applicable to nursing home and other institutional populations
and 2) it was developed and validated in the cohorts with predominately white ethnicity
A new prediction model that quantifies absolute risk of all-cause mortality in the following 10-years in the general population has been developed and externally validated
It has good prediction accuracy and is based on variables that are available in a variety of care and research settings
This model can facilitate identification of high risk for all-cause mortality older adults for further assessment or interventions
which are now included in clinical guidelines for therapeutic management
a prediction model for all-cause mortality in older people can be used to communicate risk to individuals and their families (if appropriate) and guide strategies for risk reduction
we used data from England to develop our mortality model and data from United States to externally validate it
To ensure that the cohorts employed were as representative of the general populations as possible
we did not limit them based on their help and health statuses
this sample was followed-up every two years
wave 1 formed our baseline and follow-up data were obtained from wave 6 (2012–2013)
To limit the overriding influence of age in a “cohort of survivors”
we excluded participants who were > 75 years old
A more detailed description of the HRS sample is provided in Supplementary Materials
For the purpose of validating our mortality model
we included information on mortalities that occurred from 30 January 2004 to 1 August 2015 giving us a 10-year follow-up period
which is in line with the derivation cohort
To make the external sample more consistent with the derivation data
we further limited it to those who were aged 50–75 years old
The outcome was all-cause mortality that occurred from 2002 to 2003 through to 2013
which was ascertained from the National Health Service central register
which captures all deaths occurring in the UK
All participants included in this study provided written consent for linkage to their official records
Survival time was defined as the period from baseline when all ELSA participants were alive to the date when an ELSA participant was reported to have died during the 10-year follow-up
For those who did not die during follow-up
the survival time was calculated using the period spanning from baseline until the end of the study
A more detailed description of these methods is provided in the Supplementary Methods
estimates their effects and introduces parsimony
Cox-Lasso automatically performs variable selection and deals with collinearity
Selection of the tuning parameter λ optimising the model performance is described below
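The fitting step can be sketched with the glmnet package; the study used repeated cross-validation, whereas a single 10-fold run is shown here for brevity, and all object names are illustrative.

```r
# Sketch of a lasso-penalised Cox model with cross-validated lambda, assuming a
# numeric predictor matrix `x`, follow-up time `time`, and event indicator `status`
library(glmnet)
library(survival)

y_surv <- Surv(time, status)
cv_fit <- cv.glmnet(x, y_surv, family = "cox", alpha = 1, nfolds = 10)  # alpha = 1: lasso
fit    <- glmnet(x, y_surv, family = "cox", alpha = 1, lambda = cv_fit$lambda.min)
coef(fit)                        # variables with non-zero coefficients are selected
```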
Calibration plot presenting agreement between the predicted and observed survival rates at 10-years as estimated by our newly developed model
Nomogram for Cox-Lasso regression which enables calculating individual normalized prognostic indexes (PI
given by the linear predictor line) for all-cause mortality in the following 10 years
Coefficients are based on the Lasso-Cox model as estimated by the final model for the all-cause mortality
The nomogram allows computing the normalized prognostic index (PI) for a new individual
The PI is a single-number summary of the combined effects of a patient’s risk factors and is a common method of describing the risk for an individual
the PI is a linear combination of the risk factors
with the estimated regression coefficients as weights
The exponentiated PI gives the relative risk of each participant in comparison with a baseline participant (in this context the baseline participant would have value 0 for all the continuous covariates and being at the reference category for the categorical ones)
The PI is normalized by subtracting the mean PI
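A sketch of these PI calculations, assuming the lasso-Cox fit and predictor matrix from the previous sketch:

```r
# Prognostic index (PI) as the linear combination of risk factors weighted by the
# estimated coefficients, using the fitted object `fit` and predictor matrix `x`
pi_raw  <- as.numeric(predict(fit, newx = x, type = "link"))  # linear predictor x %*% beta
rr_base <- exp(pi_raw)        # relative risk versus the baseline (all-zero/reference) participant
pi_norm <- pi_raw - mean(pi_raw)                              # normalized PI, centred at the mean
```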
it can be used as a first-stage screening aid that might prolong life-expectancy by alerting to an individual’s heightened risk profile and a need for more targeted evaluation and prevention
It could also be used by non-professionals to improve self-awareness of their health status
and by governmental and health organisations to decrease the burden of certain risk factors in the general population of older people
the consideration of these factors will help identify high-risk groups who might otherwise be under-detected
based on prognostic factors chosen through multiple sequential hypothesis testing
Specificity of the externally validated 10-item index was also considerably lower (64.4%) compared to our externally validated model (70.5%)
implying it is likely to falsely classify a higher proportion of older adults as high risk for all-cause mortality in the following 10 years
using baseline variables reflects the real-life clinical information available to a physician and a participant when they need to make decisions on the likely risk of all-cause mortality for an individual during the next 10 years
it would be of interest to include potential interaction with a smaller set of candidate predictors in the future studies
Having employed modern statistical learning algorithms and addressed the weaknesses of previous models
a new mortality model achieved good discrimination and calibration as shown by its performance in a separate validation cohort
which are available by patient report in a variety of care and research settings
It allows rapid estimations of an individual’s risk of all-cause mortality based on an individual risk profile
These characteristics suggest that our model may be useful for clinical
The English Longitudinal Study of Ageing (ELSA) was developed by a team of researchers based at University College London, the Institute for Fiscal Studies and the National Centre for Social Research. The datasets generated and/or analysed during the current study are available in UK Data Services and can be accessed at: https://discover.ukdataservice.ac.uk
No administrative permissions were required to access these data
Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis guidelines
Prediction of coronary heart disease using risk factor categories
Validation studies for models projecting the risk of invasive and total breast cancer incidence
and evaluation of a new QRISK model to estimate lifetime risk of cardiovascular disease: cohort study using QResearch database
Predicting 10-year mortality for older adults
Development and validation of a prognostic index for 4-year mortality in older adults
Development and validation of a prognostic index for 1-year mortality in older adults after hospitalization
The development and validation of an index to predict 10-year mortality risk in a longitudinal cohort of older English adults
Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets
Predictive analytics in information systems research
Trends in life expectancy and age-specific mortality in England and Wales
in comparison with a set of 22 high-income countries: an analysis of vital statistics data
Second Edition ed: Springer Nature Switzerland; 2019
Cohort profile: the English longitudinal study of ageing
Cohort profile: the health and retirement study (HRS)
A 10-year follow-up of the health and retirement study
Development and validation of a prediction model to estimate the risk of liver cirrhosis in primary care patients with abnormal liver blood test results: protocol for an electronic health record study in clinical practice research Datalink
Calculating the sample size required for developing a clinical prediction model
MissForest—non-parametric missing value imputation for mixed-type data
A Bayesian missing value estimation method for gene expression profile data
The lasso method for variable selection in the Cox model
Validation of prediction models based on lasso regression with multiply imputed data
A selective overview of variable selection in high dimensional feature space
The elements of statistical learning: data mining
A review and suggested modifications of methodological standards
Classifier technology and the illusion of Progress
Multivariable prognostic models: issues in developing models
Three myths about risk thresholds for prediction models
The inconsistency of "optimal" cutpoints obtained using two criteria based on the receiver operating characteristic curve
Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation
Nomograms in oncology: more than meets the eye
Comparisons of nomograms and urologists' predictions in prostate cancer
Guidelines on preventing cardiovascular disease in clinical practice
Long-term effects of wealth on mortality and self-rated health status
Purpose in life is associated with mortality among community-dwelling older persons
The association between self-rated health and mortality in a well-characterized sample of coronary artery disease patients
Regularization and variable selection via the elastic net
Screening for prediabetes using machine learning models
Predicting urinary tract infections in the emergency department with machine learning
Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints
Using the outcome for imputation of missing predictor values was preferred
Multiple imputation in the presence of high-dimensional data
Estimation of ten-year risk of fatal cardiovascular disease in Europe: the SCORE project
Cardiovascular risk prediction models for people with severe mental illness: results from the prediction and management of cardiovascular risk in people with severe mental illnesses (PRIMROSE) research program
The English Longitudinal Study of Ageing is funded by the National Institute on Aging (grant RO1AG7644) and by a consortium of UK government departments coordinated by the Economic and Social Research Council (ESRC)
is further funded by the National Institute for Health Research (NIHR) (NIHR Post-Doctoral Fellowship - PDF-2018-11-ST2–020)
DS and DA were part funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London
receive salary support from the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust the NIHR Maudsley BRC
The views expressed in this publication are those of the authors and not necessarily those of the NHS
the National Institute for Health Research or the Department of Health and Social Care
The Health and Retirement Study is funded by the National Institute on Aging (NIA U01AG009740) and the US Social Security Administration
M receive salary support from the National Institute on Aging (NIA U01AG009740)
The sponsors had no role in the design and conduct of the study; collection
and interpretation of the data; preparation
or approval of the manuscript; and decision to submit the manuscript for publication
Department of Behavioural Science and Health
Department of Biostatistics & Health Informatics
Experimental Biomedicine and Clinical Neuroscience (BIONEC)
OA had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis
AS and OA conceived the idea for the study
RMC and JF conducted data preparation and management
OA wrote the first draft of the manuscript
DA and DS edited the manuscript and approved the final version
The authors read and approved the final manuscript
The London Multicentre Research Ethics Committee granted ethical approval for the ELSA (MREC/01/2/91)
and informed consent was obtained from all participants
This manuscript is approved by all authors for publication
and is an editor of Psychological Medicine Journal
All other authors declare that they have no conflict of interest
Outlines a list of all variables considered in the analyses and whether they have been included or excluded from the model building
Distribution of missing and observed variables included in the analyses in ELSA
Sample size calculations for survival outcomes (Cox prediction models)
Distributions of the variables at baseline before and after multiple imputations
Apparent coefficients for the Cox-LASSO regression for all-cause mortality during the 10-year follow-up
Apparent models’ performance in predicting the 10-year risk of all-cause mortality in older adults
Optimism-corrected models’ performance in predicting the 10-year risk of all-cause mortality in older adults
Apparent models’ discrimination in predicting the 10-year risk of all-cause mortality in older adults
Internally validated (through optimism correction) models’ discrimination for predicting the 10-year risk of all-cause mortality in older adults
Histogram depicting distribution of prognostic index (PI) estimated based on 13 variables included in the model in the development cohort and external cohort
The distribution of survival probabilities estimated based on 13 variables included in the model in the development and validation cohorts
Distributions of the variables included in the final all-cause mortality model in derivation cohort (ELSA) and validation cohort (HRS)
DOI: https://doi.org/10.1186/s12874-020-01204-7
Recent evidence suggests that there is often substantial variation in the benefits and harms across a trial population
We aimed to identify regression modeling approaches that assess heterogeneity of treatment effect within a randomized clinical trial
We performed a literature review using a broad search strategy
complemented by suggestions of a technical expert panel
The approaches are classified into 3 categories: 1) Risk-based methods (11 papers) use only prognostic factors to define patient subgroups
relying on the mathematical dependency of the absolute risk difference on baseline risk; 2) Treatment effect modeling methods (9 papers) use both prognostic factors and treatment effect modifiers to explore characteristics that interact with the effects of therapy on a relative scale
These methods couple data-driven subgroup identification with approaches to prevent overfitting
such as penalization or use of separate data sets for subgroup identification and effect estimation
3) Optimal treatment regime methods (12 papers) focus primarily on treatment effect modifiers to classify the trial population into those who benefit from treatment and those who do not
we also identified papers which describe model evaluation methods (4 papers)
Three classes of approaches were identified to assess heterogeneity of treatment effect
including both simulations and empirical evaluations
is required to compare the available methods in different settings and to derive well-informed guidance for their application in RCT analysis
is the cornerstone of precision medicine; its goal is to predict the optimal treatments at the individual level
accounting for an individual’s risk for harm and benefit outcomes
In this scoping review [9]
we aim to identify and categorize the variety of regression-based approaches for predictive heterogeneity of treatment effects analysis
Predictive approaches to HTE analyses are those that provide individualized predictions of potential outcomes in a particular patient with one intervention versus an alternative or
that can predict which of 2 or more treatments will be better for a particular patient
taking into account multiple relevant patient characteristics
We distinguish these analyses from the typical one-variable-at-a-time subgroup analyses that appear in forest plots of most major trial reports
and from other HTE analyses which explore or confirm hypotheses regarding whether a specific covariate or biomarker modifies the effects of therapy
To guide future work on individualizing treatment decisions
we aimed to summarize the methodological literature on regression modeling approaches to predictive HTE analysis
Titles, abstracts and full texts were retrieved and double-screened by six independent reviewers against eligibility criteria. Disagreements were resolved by group consensus in consultation with a seventh senior expert reviewer (DMK) in meetings.
Treatment effect modeling methods use both the main effects of risk factors and covariate-by-treatment interaction terms (on the relative scale) to estimate individualized benefits. They can be used either for making individualized absolute benefit predictions or for defining patient subgroups with similar expected treatment benefits (Table 2)
Publications included in the review from 1999 until 2019
Numbers inside the bars indicate the method-specific number of publications made in a specific year
In a range of plausible scenarios evaluating HTE when considering binary endpoints
simulations showed that studies were generally underpowered to detect covariate-by-treatment interactions
but adequately powered to detect risk-by-treatment interactions
even when a moderately performing prediction model was used to stratify patients
risk stratification methods can detect patient subgroups that have net harm even when conventional methods conclude consistency of effects across all major subgroups
Primarily binary or time-to-event outcomes were considered
Researchers should demonstrate how relative and absolute risk reduction vary by baseline risk and test for HTE with interaction tests
Externally validated prediction models should be used
this approach may not be optimal for risk-based assessment of HTE
where accurate ranking of risk predictions is of primary importance for the calibration of treatment benefit predictions
Their proportional interactions model assumes that the effects of prognostic factors in the treatment arm are equal to their effects in the control arm multiplied by a constant
Testing for an interaction along the linear predictor amounts to testing that the proportionality factor is equal to 1
If high risk patients benefit more from treatment (on the relative scale) and disease severity is determined by a variety of prognostic factors
the proposed test results in greater power to detect HTE on the relative scale compared to multiplicity-corrected subgroup analyses
Even though the proposed test requires a continuous response
it can be readily implemented in large clinical trials with binary or time-to-event endpoints
For model selection an all subsets approach combined with a modified Bonferroni correction method can be used
This approach accounts for correlation among nested subsets of considered proportional interactions models
thus allowing the assessment of all possible proportional interactions models while controlling for the familywise error rate
They compared different Cox regression models for the prediction of treatment benefit: 1) a model without any risk factors; 2) a model with risk factors and a constant relative treatment effect; 3) a model with treatment
a prognostic index and their interaction; and 4) a model including treatment interactions with all available prognostic factors
fitted both with conventional and with penalized ridge regression
Benefit predictions at the individual level were highly dependent on the modeling strategy
with treatment interactions improving treatment recommendations under certain circumstances
They compared 12 different approaches in a high-dimensional setting with survival outcomes
Their methods ranged from a straightforward univariate approach as a baseline
where Wald tests accounting for multiple testing were performed for each treatment-covariate interaction to different approaches for dealing with hierarchy of effects—whether they enforce the inclusion of the respective main effects if an interaction is selected—and also different magnitude of penalization of main and interaction effects
by assigning outcomes into meaningful ordinal categories
Overfitting can be avoided by randomly splitting the sample into two parts; the first part is used to select and fit ordinal regression models in both the treatment and the control arm
the models that perform best in terms of a cross-validated estimate of concordance between predicted and unobservable true treatment difference— defined as the difference in probability of observing a worse outcome under control compared to treatment and the probability of observing a worse outcome under treatment compared to control—are used to define treatment benefit scores for patients
Treatment effects conditional on the treatment benefit score are then estimated through a non-parametric kernel estimation procedure
focusing on the identification of a subgroup that benefits from treatment
They repeatedly split the sample population based on the first-stage treatment benefit scores and estimate the treatment effect in subgroups above different thresholds
These estimates are plotted against the score thresholds to assess the adequacy of the selected scoring rule
This method could also be used for the evaluation of different modeling strategies by selecting the one that identifies the largest subgroup with an effect estimate above a desired threshold
They also start by fitting separate outcome models within treatment arms
rather than using these models to calculate treatment benefit scores
they imputed individualized absolute treatment effects
defined as the difference between the observed outcomes and the expected counterfactual (potential) outcomes based on model predictions
two separate regression models—one in each treatment arm—are fitted to the imputed treatment effects
they combined these two regression models for a particular covariate pattern by taking a weighted average of the expected treatment effects
the binary cadit is 1 when a treated patient has a good outcome or when an untreated patient does not
the dependent variable implicitly codes treatment assignment and outcome simultaneously
They first demonstrated that the absolute treatment benefit equals 2 × P(cadit = 1) − 1 and then they derived patient-specific treatment effect estimates by fitting a logistic regression model to the cadit
A similar approach was described for continuous outcomes with the continuous cadit defined as − 2 or 2 times the centered outcome (i.e.
the outcome minus the overall average outcome) for untreated and treated patients, respectively
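A minimal sketch of the binary cadit is given below, using simulated data with made-up variable names (good = 1 denotes a good outcome); the continuous analogue is indicated in a comment.

# Minimal sketch of the binary cadit approach; data and variable names are illustrative.
set.seed(4)
n <- 3000
x <- rnorm(n); treat <- rbinom(n, 1, 0.5)
good <- rbinom(n, 1, plogis(-0.2 + 0.3 * treat + 0.4 * treat * x))

# cadit = 1 for a treated patient with a good outcome or an untreated patient without one
cadit <- ifelse(treat == 1, good, 1 - good)

# Patient-specific absolute benefit = 2 * P(cadit = 1 | x) - 1
cadit_mod <- glm(cadit ~ x, family = binomial)
benefit   <- 2 * predict(cadit_mod, type = "response") - 1

# Continuous-outcome analogue: cadit = +2 or -2 times the centered outcome, i.e.
# cadit_cont <- ifelse(treat == 1, 2, -2) * (y - mean(y))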
The approach identifies single covariates likely to modify treatment effect
along with the expected individualized treatment effect
The authors also extended their methodology to include two covariates simultaneously
allowing for the assessment of multivariate subgroups
Real-valued (continuous or binary) outcomes are considered, without accounting for censoring
if common approaches for the assessment of model fit had been examined
They argue that if adequately fitting outcome models had been thoroughly sought
the extra modeling required for the robust methods of Zhang et al
they recursively update non-parametric estimates of the treatment-covariate interaction function from baseline risk estimates and vice-versa until convergence
The estimates of absolute treatment benefit are then used to restrict treatment to a contiguous sub-region of the covariate space
Starting from continuous responses they generalized their methodology to binary and time-to-event outcomes
Using LASSO regression to reduce the space of all possible combinations of covariates and their interaction with treatment to a limited number of covariate subsets
their approach selects the optimal subset of candidate covariates by assessing the increase in the expected response from assigning based on the considered treatment effect model
versus the expected response of treating everyone with the treatment found best from the overall RCT result
The considered criterion also penalizes models for their size
providing a tradeoff between model complexity and the increase in expected response
The method focuses solely on continuous outcomes
suggestions are made on its extension to binary type of outcomes
The GEM is defined as the linear combination of candidate effect modifiers and the objective is to derive their individual weights
This is done by fitting linear regression models within treatment arms where the independent variable is a weighted sum of the baseline covariates
while keeping the weights constant across treatment arms
The intercepts and slopes of these models along with the individual covariate GEM contributions are derived by maximizing the interaction effect in the GEM model
or by maximizing the statistical significance of an F-test for the interaction effects—a combination of the previous two
The authors derived estimates that can be calculated analytically
the subgroup that is assigned treatment based on the OTR
Their methodology returns an estimate of the population level effect of treating based on the OTR compared to treating no one
μ-risk metrics evaluate the ability of models to predict the outcome of interest conditional on treatment assignment
Treatment effect is either explicitly modeled by treatment interactions or implicitly by developing separate models for each treatment arm
τ-risk metrics focus directly on absolute treatment benefit
since absolute treatment benefit is unobservable
Value-metrics originate from OTR methods and evaluate the outcome in patients that were assigned to treatment in concordance with model recommendations
The method relies on the expression of disease-related harms and treatment-related harms on the same scale
The minimum absolute benefit required for a patient to opt for treatment (treatment threshold) can be viewed as the ratio of treatment-related harms and harms from disease-related events
Net benefit is then calculated as the difference between the decrease in the proportion of disease-related events and the proportion of treated patients multiplied by the treatment threshold
The latter quantity can be viewed as harms from treatment translated to the scale of disease-related harms
the net benefit of a considered prediction model at a specific treatment threshold can be derived from a patient-subset where treatment received is congruent with treatment assigned based on predicted absolute benefits and the treatment threshold
The model’s clinical relevance is derived by comparing its net benefit to the one of a treat-all policy
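The sketch below illustrates this net benefit calculation on simulated trial data; the benefit model, the 5% treatment threshold, and all variable names are illustrative assumptions rather than quantities from any of the cited studies.

# Minimal sketch of net benefit for a treatment-directing model in an RCT.
set.seed(6)
n <- 4000
x     <- rnorm(n)
treat <- rbinom(n, 1, 0.5)                                  # randomized treatment
event <- rbinom(n, 1, plogis(-1 + 0.8 * x - 0.6 * treat))   # disease-related event
pred_benefit <- plogis(-1 + 0.8 * x) - plogis(-1.6 + 0.8 * x)  # assumed benefit predictions

threshold    <- 0.05                          # minimum absolute benefit required to opt for treatment
assign_treat <- as.integer(pred_benefit >= threshold)

# Event rate if no one is treated (control arm) and under the model-based policy,
# the latter estimated in patients whose randomized arm matches the policy's assignment
p_none       <- mean(event[treat == 0])
congruent    <- treat == assign_treat
p_policy     <- mean(event[congruent])
prop_treated <- mean(assign_treat)

nb_model <- (p_none - p_policy) - prop_treated * threshold
nb_all   <- (p_none - mean(event[treat == 1])) - 1 * threshold   # treat-all policy
c(nb_model = nb_model, nb_all = nb_all)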
A model’s ability to discriminate between patients with higher or lower benefits is challenging
since treatment benefits are unobservable in the individual patient (since only one of two counterfactual potential outcomes can be observed)
Under the assumption of uncorrelated counterfactual outcomes
the authors matched patients from different treatment arms by their predicted treatment benefit
The difference of the observed outcomes between the matched patient pairs (1: benefit; 0: no effect; − 1: harm) acts as a proxy for the unobservable absolute treatment difference
The c-statistic for benefit can then be defined on the basis of this ternary outcome as the proportion of all possible pairs of patient pairs in which the patient pair observed to have greater treatment benefit was also predicted to do so
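A minimal sketch of such a c-statistic for benefit is given below, using simulated data and a simple rank-based matching of treated and control patients on predicted benefit; the published implementations may differ in how pairs are matched and how ties are handled.

# Minimal sketch of a c-statistic for benefit via rank-matching on predicted benefit.
set.seed(7)
n <- 400                                    # patients per arm (illustrative)
x_t <- rnorm(n); x_c <- rnorm(n)
pb_t <- plogis(x_t) * 0.2                   # predicted benefit, treated patients
pb_c <- plogis(x_c) * 0.2                   # predicted benefit, control patients
y_t  <- rbinom(n, 1, 0.3 - pb_t)            # outcome (1 = event), treated
y_c  <- rbinom(n, 1, 0.3)                   # outcome, control

# Match by rank of predicted benefit within arm
ord_t <- order(pb_t); ord_c <- order(pb_c)
obs_benefit  <- y_c[ord_c] - y_t[ord_t]     # 1 = benefit, 0 = no effect, -1 = harm
pred_benefit <- (pb_t[ord_t] + pb_c[ord_c]) / 2

# Proportion of pairs of matched pairs (with different observed benefit) in which the
# pair with greater observed benefit also had greater predicted benefit
num <- 0; den <- 0
for (i in 1:(n - 1)) for (j in (i + 1):n) {
  if (obs_benefit[i] != obs_benefit[j]) {
    den <- den + 1
    hi  <- if (obs_benefit[i] > obs_benefit[j]) i else j
    lo  <- if (hi == i) j else i
    num <- num + (pred_benefit[hi] > pred_benefit[lo]) + 0.5 * (pred_benefit[hi] == pred_benefit[lo])
  }
}
c_for_benefit <- num / den
c_for_benefit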
they link observed outcomes to unobservable quantities
they derive posterior probability estimates of false inclusion or false exclusion in the final model for the considered covariates
Following the definition of an outcome-space sub-region that is considered beneficial
individualized posterior probabilities of belonging to that beneficial sub-region can be derived as a by-product of the proposed methodology
while 2) risk stratification analyzes treatment effects within strata of predicted risk
This approach is straightforward to implement
and may provide adequate assessment of HTE in the absence of strong prior evidence for potential effect modification
The approach might better be labeled ‘benefit magnification’
since absolute benefit increases with higher baseline risk under a constant relative treatment effect
Treatment effect modeling methods focus on predicting the absolute benefit of treatment through the inclusion of treatment-covariate interactions alongside the main effects of risk factors
modeling such interactions can result in serious overfitting of treatment benefit
especially in the absence of well-established treatment effect modifiers
Penalization methods such as LASSO regression
ridge regression or a combination (elastic net penalization) can be used as a remedy when predicting treatment benefits in other populations
Staging approaches starting from—possibly overfitted— “working” models predicting absolute treatment benefits that can later be used to calibrate predictions in groups of similar treatment benefit provide another alternative
While these approaches should yield well calibrated personalized effect estimates when data are abundant
it is yet unclear how broadly applicable these methods are in RCTs of conventional size
the additional discrimination of benefit of these approaches compared to the less flexible risk modeling approaches remains uncertain
Simulations and empirical studies should be informative regarding these questions
Because prognostic factors do not affect the sign of the treatment effect
several OTR methods rely primarily on treatment effect modifiers
when treatments are associated with adverse events or treatment burdens (such as costs) that are not captured in the primary outcome—as is often the case—estimates of the magnitude of treatment effect are required to ensure that only patients above a certain expected net benefit threshold (i.e
outweighing the harms and burdens of therapy) are treated
these classification methods do not provide the same opportunity as prediction methods for incorporating patient values and preferences into shared decision making
While there is an abundance of proposed methodological approaches
examples of clinical application of HTE prediction models remain quite rare
This may reflect the fact that all these approaches confront the same fundamental challenges
These challenges include the unobservability of individual treatment response
the curse of dimensionality from the large number of covariates
the lack of prior knowledge about the causal molecular mechanisms underlying variation in treatment effects and the relationship of these mechanisms to observable variables
and the very low power available to explore interactions
Because of these challenges there might be very serious constraints on the usefulness of these methods as a class; while some methods may be shown to have theoretical advantages
the practical import of these theoretical advantages may not be ascertainable
it is uncertain whether any of these approaches will add value to the more conventional EBM approach of using an overall estimate of the main effect
or to the risk magnification approach of applying that relative estimate to a risk model
our review is descriptive and did not compare the approaches for their ability to predict individualized treatment effects or to identify patient subgroups with similar expected treatment benefits
we identified a large number of methodological approaches developed in the past 20 years for the assessment of heterogeneity of treatment effects in RCTs, which we divided into 3 broad categories
Extensive simulations along with empirical evaluations are required to assess those methods’ relative performance under different settings and to derive well-informed guidance for their implementation
This may allow these novel methods to inform clinical practice and provide decision makers with reliable individualized information on the benefits and harms of treatments
While we documented an exuberance of new methods
we do note a marked dearth of comparative studies in the literature
Future research could shed light on advantages and drawbacks of methods in terms of predictive performance in different settings
Users’ guides to the medical literature: II
How to use an article about therapy or prevention a
Explanatory and pragmatic attitudes in therapeutical trials
Evidence based medicine: concerns of a clinical neurologist
Limitations of applying summary results of clinical trials to individual patients: the need for risk stratification
Enhancing the scoping study methodology: a large
inter-professional team’s experience with Arksey and O’Malley’s framework
Harrell F. Viewpoints on Heterogeneity of Treatment Effect and Precision Medicine [Internet]. Statistical Thinking. 2018 [cited 2020 Jun 14]. Available from: https://fharrell.com/post/hteview/
Rothman K, Greenland S, Lash TL. Modern Epidemiology, 3rd Edition. 2007 31 [cited 2020 Jul 27]; Available from: https://www.rti.org/publication/modern-epidemiology-3rd-edition
The predictive approaches to treatment effect heterogeneity (PATH) statement
The predictive approaches to treatment effect heterogeneity (PATH) statement: explanation and elaboration
Using group data to treat individuals: understanding heterogeneous treatment effects in the age of precision medicine and patient-centred evidence
Estimating treatment effects for individual patients based on the results of randomised clinical trials
Method for evaluating prediction models that apply the results of randomized trials to individual patients
Profile-specific survival estimates: making reports of clinical trials more patient-relevant
Selection of thrombolytic therapy for individual patients: development of a clinical model GUSTO-I Investigator
Multivariable risk prediction can greatly enhance the statistical power of clinical trial subgroup analysis
Implications of heterogeneity of treatment effect for reporting and analysis of randomized trials in critical care
Using internally developed risk models to assess heterogeneity in treatment effects in clinical trials
Risk and treatment effect heterogeneity: re-analysis of individual participant data from 32 large clinical trials
Baseline characteristics predict risk of progression and response to combined medical therapy for benign prostatic hyperplasia (BPH)
Improving diabetes prevention with benefit based tailored treatment: risk based reanalysis of diabetes prevention program
Multistate Model to Predict Heart Failure Hospitalizations and All-Cause Mortality in Outpatients With Heart Failure With Reduced Ejection Fraction: Model Derivation and External Validation
Explicit inclusion of treatment in prognostic modeling was recommended in observational and randomized settings
A multivariate test of interaction for use in clinical trials
Assessing heterogeneity of treatment effect in a clinical trial with the proportional interactions model
Percutaneous coronary intervention versus coronary-artery bypass grafting for severe coronary artery disease
Estimates of absolute treatment benefit for individual patients required careful modeling of statistical interactions
Benefit and harm of intensive blood pressure treatment: derivation and validation of risk models using data from the SPRINT and ACCORD trials
Action to Control Cardiovascular Risk in Diabetes Study Group
Effects of intensive glucose lowering in type 2 diabetes
Treatment selections using risk-benefit profiles based on data from comparative randomized clinical trials with multiple endpoints
Effectively selecting a target population for a future comparative study
Post hoc subgroups in clinical trials: anathema or analytics
A Bayesian approach to subgroup identification
Performance guarantees for individualized treatment rules
Reader reaction to “a robust method for estimating optimal treatment regimes” by Zhang et al
Estimating optimal treatment regimes from a classification perspective
A simple method for estimating interactions between a treatment and a large number of covariates
and combining moderators of treatment on outcome after randomized clinical trials: a parametric approach
A novel approach for developing and interpreting treatment moderator profiles in randomized clinical trials
Advancing personalized medicine: application of a novel statistical method to identify treatment moderators in the coordinated anxiety learning and management study
Variable selection for qualitative interactions in personalized medicine while controlling the family-wise error rate
Generated effect modifiers (GEM’s) in randomized clinical trials
Statistical Inference For The Mean Outcome Under A Possibly Non-Unique Optimal Treatment Strategy
Targeted learning of the mean outcome under an optimal dynamic treatment rule
Inference about the expected performance of a data-driven dynamic treatment regime
Evaluating the impact of treating the optimal subgroup
Discussion of “Dynamic treatment regimes: Technical challenges and applications”
Schuler A, Baiocchi M, Tibshirani R, Shah N. A comparison of methods for model selection when estimating individual treatment effects. arXiv:180405146 [cs, stat] [Internet]. 2018 13 [cited 2020 Jun 14]; Available from: http://arxiv.org/abs/1804.05146
The proposed “concordance-statistic for benefit” provided a useful metric when modeling heterogeneous treatment effects
Bayesian variable selection with joint modeling of categorical and survival outcomes: an application to individualizing chemotherapy treatment in advanced colorectal cancer
Measuring the performance of markers for guiding treatment decisions
The Fundamental Difficulty With Evaluating the Accuracy of Biomarkers for Guiding Treatment
Assessing treatment-selection markers using a potential outcomes framework
Statistical and practical considerations for clinical evaluation of predictive biomarkers
Harrell F. EHRs and RCTs: Outcome Prediction vs. Optimal Treatment Selection [Internet]. Statistical Thinking. 2017 [cited 2020 Jun 14]. Available from: https://fharrell.com/post/ehrs-rcts/
Estimating individualized treatment rules using outcome weighted learning
Doubly robust learning for estimating individualized treatment with censored data
Louizos C, Shalit U, Mooij J, Sontag D, Zemel R, Welling M. Causal Effect Inference with Deep Latent-Variable Models. arXiv:170508821 [cs, stat] [Internet]. 2017 [cited 2020 Jun 14]; Available from: http://arxiv.org/abs/1705.08821
Use of open access platforms for clinical trial data
Clinical research data sharing: what an open science world means for researchers involved in evidence synthesis
Overview and experience of the YODA Project with clinical trial data sharing after 5 years
We acknowledge support from the Innovative Medicines Initiative (IMI) and helpful comments from Victor Talisa
Data Scientist from the University of Pittsburgh
and writing for this work were supported by a Patient Centered Outcomes Research Institute (PCORI) contract
the Predictive Analytics Resource Center [SA.Tufts.PARC.OSCO.2018.01.25]
Predictive Analytics and Comparative Effectiveness (PACE) Center
Institute for Clinical Research and Health Policy Studies (ICRHPS)
DK and DVK contributed to the conception and design of the work
DK and DVK contributed to the acquisition of the data
DK and DVK contributed to the interpretation of the data
All authors have approved the submitted version
DOI: https://doi.org/10.1186/s12874-020-01145-1
The number of clinicians experiencing burnout is increasing and has been linked to a high administrative burden
Automatic speech recognition (ASR) and natural language processing (NLP) techniques may address this issue by creating the possibility of automating clinical documentation with a “digital scribe”
We reviewed the current status of the digital scribe in development towards clinical practice and present a scope for future research
We performed a literature search of four scientific databases (Medline
and Arxiv) and requested several companies that offer digital scribes to provide performance data
We included articles that described the use of models on clinical conversational data
either automatically or manually transcribed
Of the 20 included articles, three described ASR models for clinical conversations
The other 17 articles presented models for entity extraction
or summarization of clinical conversations
Two studies examined the system’s clinical validity and usability
while the other 18 studies only assessed their model’s technical validity on the specific NLP task
The most promising models use context-sensitive word embeddings in combination with attention-based neural networks
the studies on digital scribes only focus on technical validity
while companies offering digital scribes do not publish information on any of the research phases
Future research should focus on more extensive reporting
iteratively studying technical validity and clinical validity and usability
and investigating the clinical utility of digital scribes
This digital scribe uses techniques such as automatic speech recognition (ASR) and natural language processing (NLP) to automate (parts of) clinical documentation
The proposed structure for a digital scribe includes a microphone that records a conversation
an ASR system that transcribes this conversation
and a set of NLP models to extract or summarize relevant information and present it to the physician
or use the extracted information for diagnosis support
A scoping review of current evidence is needed to determine the current status of the digital scribe and to make recommendations for future research
researchers can find a suitable dataset or collect data themselves
Researchers should also check if the dataset contains any unintended bias or underrepresented groups
the researchers should prospectively study the model in clinical practice
the model might run in clinical practice without showing the output to the end-users
end-users analyze the output to identify any errors
a prospective study can be set up to determine clinical impact
The purpose of the present study is to perform a scoping review of the literature and contact companies on the current status of digital scribes in healthcare
Which methods are being used to develop (part of) a digital scribe
Have any of these methods been evaluated in clinical practice
These companies were requested to provide unpublished performance data for their digital scribe
Our definition of a digital scribe is any system that uses a clinical conversation as input
and automatically extracts information that can be used to generate an encounter note
We included articles that describe the performance of either ASR or NLP on clinical conversational data
A clinical conversation was defined as a conversation—in real life
or via chat—between at least one patient and one healthcare professional
Because ASR and NLP are different fields of expertise and will often be described in separate studies
we chose to include studies that only focused on part of a digital scribe
Studies that described NLP models that were not aimed at creating an encounter note but
instead extracted information for research purposes were excluded
Articles written in any language other than English were excluded
Because of the rapidly evolving research field and the time lag for publications
and S.A.C.) independently screened all articles on title and abstract
using the inclusion and exclusion criteria
The selected articles were assessed for eligibility by reading the full text
The first reviewer extracted information from the included articles and the unpublished data provided by companies
The second reviewer verified the extracted information
The following aspects were extracted and assessed:
The four phases of article selection following the PRISMA-ScR statement
We were unable to obtain performance data from other companies
None of the studies investigated the clinical utility
WER: This metric counts the number of substitutions
deletions, and insertions in the automatic transcript, divided by the number of words in the reference transcript
F1 score: The F1 score is the harmonic mean between the precision (or positive predictive value) and the recall (or sensitivity)
ROUGE: This is a score that measures the similarity between the automatic summary and the gold standard summary
based on overlapping unigrams (ROUGE-1), bigrams (ROUGE-2), or the longest common subsequence (ROUGE-L)
The ROUGE-L score considers sentence-level structure
while the ROUGE-1 and ROUGE-2 scores only examine if a uni- or bigram occurs in both the automatic and gold standard summary
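For concreteness, the following R sketch illustrates two of these metrics: the word error rate via a word-level edit distance, and the F1 score from assumed true positive, false positive, and false negative counts. The example strings and counts are made up.

# Minimal sketches of WER and F1; examples are illustrative.
wer <- function(reference, hypothesis) {
  r <- strsplit(reference, "\\s+")[[1]]
  h <- strsplit(hypothesis, "\\s+")[[1]]
  d <- matrix(0, length(r) + 1, length(h) + 1)
  d[, 1] <- 0:length(r)
  d[1, ] <- 0:length(h)
  for (i in seq_along(r)) for (j in seq_along(h)) {
    cost <- if (r[i] == h[j]) 0 else 1           # substitution cost
    d[i + 1, j + 1] <- min(d[i, j + 1] + 1,      # deletion
                           d[i + 1, j] + 1,      # insertion
                           d[i, j] + cost)       # substitution / match
  }
  d[length(r) + 1, length(h) + 1] / length(r)    # errors per reference word
}
wer("patient reports chest pain since monday", "patient report chest pain monday")

f1 <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}
f1(tp = 80, fp = 20, fn = 10)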
Scope of the different aspects and techniques of the included digital scribes
Highest F1 scores per entity extraction task
One study33 tested their classification model on manually transcribed data and automatically transcribed data
The model performed better on the manually transcribed data
with a difference in F1 score ranging from 0.03 to 0.06
although they did not mention if the difference was significant
They formed 18 disadvantaged and advantaged groups based on gender
there was a statistically significant difference in favor of the advantaged group
The main reason for the disparity is a difference in the type of medical visit
“blood” is a strong lexical cue to classify a sentence as important for the “Plan” section of the summary
but this word is said less often in conversations with Asian patients
The F1 score for the latter study was 0.61
limiting the comparability with the other studies
When using the same model with automatically extracted noteworthy utterances
Physicians found that 80% of the summaries included “all” or “most” relevant facts
The study did not specify which parts were deemed relevant or not or if the model missed specific information
DeepScribe did not provide information on the models used for summarization but included how often a summary needed to be adjusted in practice
They report that 77% of their summaries do not need modification by a medical scribe before being sent to the physician
74% of their summaries do not need modification from a medical scribe or a physician before being accepted as part of the patient’s record
Attention-based neural networks: These models specifically take the sequence of the words into account
only passing the relevant subset of the input to the next layer
and has an attention mechanism to focus on the relevant parts of the input
It first identifies the relevant parts of the text and then classifies those relevant parts into symptoms that are or are not present
The relation-span-attribute tagging model (R-SAT) is a variant of the SAT that focuses on relations between attributes
The added value of a PGNet is that it has the ability to generate new words or copy words from the text
This scoping review provides an overview of the current state of the development
Although the digital scribe is still in an early research phase
there appears to be a substantial research body testing various techniques in different settings
The first results are promising: state-of-the-art models are trained on vast corpora of annotated clinical conversations
Although the performance of these models varies per task
the results give a clear view of which tasks and which models yield high performance
Reports of clinical validity and usability
These approaches are promising new ways to decrease the WER
what is most important is whether the WER is good enough to extract all the relevant information
the NLP models trained on manually transcribed data outperform those trained on automatically transcribed data
which means there is room for improvement of the WER
the diverseness in both tasks and underlying models was large
The classification models focused mainly on extracting metadata
such as relevance or structure induction of an utterance
and used various models ranging from logistic regression to neural networks
The entity extraction models were more homogeneous in models but extracted many different entities
whereas the summarization task was mostly uniform
One notable aspect of the NLP tasks overall is the use of word embeddings
Only one study did not use word embeddings
but this was a study from 2006 when context-sensitive word embeddings were not yet available
All the other studies were published after 2019 and used various word embeddings as input
The introduction of context-sensitive word embeddings has been essential for extracting entities and summarizing clinical conversations
led to better performance than more general tasks
such as extracting symptoms and their properties
An explanation for this is the heterogeneity in phrasing
These properties can be phrased in various ways
which will be much more homogeneous in phrasing
this homogeneity leads to many more annotations per entity
which describe the decrease in neural networks’ performance with increased input length
the model knows which parts of the text are important for its task
Adding attention not only improves performance; it also decreases the amount of training data needed
which is useful in a field such as healthcare
where gathering large datasets can be challenging
including how they dealt with ambiguity and labeling errors
it would have been interesting to include error analyses to investigate the models’ blind spots
we believe it is vital to improve the ASR for clinical conversations further and use them as input for NLP models
A remarkable finding was that most studies used manually transcribed conversations as input to their NLP model
These manual transcripts may outperform automatically transcribed conversations regarding data quality
leading to an overestimation of the results
NLP models that require manual transcription may increase administrative burden when implemented in clinical practice
which should be the basis for reporting on digital scribes as well
where physicians qualitatively analyze the model’s output
These results lead to new insights for improving technical validity
Studying these two research phases iteratively leads to a solution that is well-suited for clinical practice
These studies should be the starting point for researchers and developers working on a digital scribe
The current work is the first effort to review all available literature on developing a digital scribe
We believe our search strategy was complete
leading to a comprehensive and focused scope of the digital scribe’s current research body
we create a broader overview than just the digital scribe’s scientific status
which means we have to trust the company in providing us with legitimate data
We hope this review is an encouragement for other companies to study their digital scribes scientifically
One limitation is the small amount of journal papers included in this review
as opposed to the amount of Arxiv preprints and workshop proceedings
These types of papers are often refereed very loosely
only including journal papers would not lead to a complete scope of this quickly evolving field
Contacting various digital scribe companies was a first step towards gaining insight into implemented digital scribes and their performance on the different ASR and NLP tasks
we believe it is a valuable addition to this review
It indicates that their implemented digital scribe does not differ significantly in techniques or performance from the included studies’ models while already saving physicians’ time
it highlights the gap between research and practice
The studies published by companies all describe techniques that are not part of a fully functional digital scribe (yet)
none of the companies offering digital scribes have published about the technical validity
Although the digital scribe field has only recently started to accelerate
the presented techniques achieve promising results
while companies offering digital scribes do not publish on any of the research phases
Any data generated or analyzed are included in this article and the Supplementary Information files
Aggregate data analyzed in this study are available from the corresponding author on reasonable request
Changes in burnout and satisfaction with work-life integration in physicians and the general US working population between 2011 and 2017
Taking Action Against Clinician Burnout: A Systems Approach to Professional Well-Being (The National Academies Press
Tethered to the EHR: primary care physician workload assessment using EHR event log data and time-motion observations
Allocation of physician time in ambulatory practice: a time and motion study in 4 specialties
Electronic health record logs indicate that physicians split time evenly between seeing patients and desktop medicine
The impact of administrative burden on academic physicians
“It’s like texting at the dinner table”: a qualitative analysis of the impact of electronic health records on patient-physician interaction in hospitals
Electronic health record effects on work-life balance and burnout within the I3 population collaborative
Physician stress and burnout: the impact of health information technology
Impact of scribes on physician satisfaction
and charting efficiency: a randomized controlled trial
Association of medical scribes in primary care with physician workflow and patient experience
Challenges of developing a digital scribe to reduce clinical documentation burden
Ambient clinical intelligence: the exam of the future has arrived. Nuance Communications (2019). Available at: https://www.nuance.com/healthcare/ambient-clinical-intelligence.html
Amazon comprehend medical. Amazon Web Services, Inc (2018). Available at: https://aws.amazon.com/comprehend/medical/
Robin Healthcare | automated clinic notes, coding and more. Robin Healthcare (2019). Available at: https://www.robinhealthcare.com
Reimagining clinical documentation with artificial intelligence
PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation
Speech recognition for medical conversations
properties and their relations from clinical conversations
of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) 4979–4990 (Association for Computational Linguistics
Joint speech recognition and speaker diarization via sequence transduction
Extracting relevant information from physician-patient dialogues for automated clinical note taking
of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI)
65–74 (Association for Computational Linguistics
A systematic comparison of contemporary automatic speech recognition engines for conversational clinical speech
683–689 (American Medical Informatics Association
Automatic analysis of medical dialogue in the home hemodialysis domain: structure induction and summarization
Automatically charting symptoms from patient-physician conversations using machine learning
Medication regimen extraction from medical conversations
of International Workshop on Health Intelligence of the 34th AAAI Conference on Artificial Intelligence (Association for Computational Linguistics
The medical scribe: corpus development and model performance analyses
of the 12th Language Resources and Evaluation Conference (European Language Resources Association
summarize: global summarization of medical dialogue by exploiting local structures
In Findings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
3755–3763 (Association for Computational Linguistics
Topic-aware pointer-generator networks for summarizing spoken conversations
IEEE Automatic Speech Recognition Understanding Workshop 2019
Extracting Structured Data from Physician-Patient Conversations by Predicting Noteworthy Utterances
(eds) Explainable AI in Healthcare and Medicine
vol 914 (Springer International Publishing
Generating SOAP notes from doctor-patient conversations
MedFilter: improving extraction of task-relevant utterances through integration of discourse structure and ontological knowledge
of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
7781–7797 (Association for Computational Linguistics
Towards an automated SOAP note: classifying utterances from medical conversations
Towards fairness in classifying medical conversations into SOAP sections
In To be presented at AAAI 2021 Workshop: Trustworthy AI for Healthcare (AAAI Press
Weakly supervised medication regimen extraction from medical conversations
of the 3rd Clinical Natural Language Processing Workshop
178–193 (Association for Computational Linguistics
Towards understanding ASR error correction for medical conversations
of the First Workshop on Natural Language Processing for Medical Conversations
7–11 (Association for Computational Linguistics
Generating medical reports from patient-doctor conversations using sequence-to-sequence models
22–30 (Association for Computational Linguistics
Extracting symptoms and their status from clinical conversations
of the 57th Annual Meeting of the Association for Computational Linguistics
915–925 (Association for Computational Linguistics
DeepScribe - AI-Powered Medical Scribe. DeepScribe (2020). Available at: https://www.deepscribe.ai
Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension
Transcribing videos | Cloud speech-to-text documentation. Google Cloud (2016). Available at: https://cloud.google.com/speech-to-text/docs/video-model
Watson speech to text - Overview. IBM (2021). Available at: https://www.ibm.com/cloud/watson-speech-to-text
Kaldi ASR. Kaldi (2015). Available at: https://kaldi-asr.org
mozilla/DeepSpeech. GitHub (2020). Available at: https://github.com/mozilla/DeepSpeech
Speech-to-text: automatic speech recognition | Google Cloud. Google Cloud (2016). Available at: https://cloud.google.com/speech-to-text
Jhu aspire system: robust LVCSR with TDNNs
In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
Deliberation model based two-pass end-to-end speech recognition
In IEEE International Conference on Acoustics
Neural machine translation by jointly learning to align and translate
Learning phrase representations using RNN encoder–decoder for statistical machine translation
of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
1724–1734 (Association for Computational Linguistics
MINIMAR (MINimum Information for Medical AI Reporting): developing reporting standards for artificial intelligence in health care
The practical implementation of artificial intelligence technologies in medicine
Do no harm: a roadmap for responsible machine learning for health care
Envisioning an artificial intelligence documentation assistant for future primary care consultations: a co-design study with general practitioners
Identifying relevant information in medical conversations to summarize a clinician-patient encounter
Regulatory frameworks for development and evaluation of artificial intelligence–based diagnostic imaging models: summary and recommendations
Gender and dialect bias in YouTube’s automatic captions
of the First ACL Workshop on Ethics in Natural Language Processing
53–59 (Association for Computational Linguistics
DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence
Sequence to sequence learning with neural networks
of the 27th International Conference on Neural Information Processing Systems (NIPS) 2
Get to the point: summarization with pointer-generator networks
of the 55th Annual Meeting of the Association for Computational Linguistics
1073–1083 (Association for Computational Linguistics
Distributed representations of words and phrases and their compositionality
of the 26th International Conference on Neural Information Processing Systems (NIPS) 2
of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) 1
2227–2237 (Association for Computational Linguistics
BERT: pre-training of deep bidirectional transformers for language understanding
of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL)
4171–4186 (Association for Computational Linguistics
Department of Information Technology & Digital Innovation
Department of Quality & Patient Safety
contributed to design and critical revision of the manuscript
All authors gave their final approval and accepted accountability for all aspects of the work
DOI: https://doi.org/10.1038/s41746-021-00432-5
Report cards on the health care system increasingly report provider-specific performance on indicators that measure the quality of health care delivered
A natural reaction to the publishing of hospital-specific performance on a given indicator is to create ‘league tables’ that rank hospitals according to their performance
many indicators have been shown to have low to moderate rankability
meaning that they cannot be used to accurately rank hospitals
Our objective was to define conditions for improving the ability to rank hospitals by combining several binary indicators with low to moderate rankability
Monte Carlo simulations to examine the rankability of composite ordinal indicators created by pooling three binary indicators with low to moderate rankability
We considered scenarios in which the prevalences of the three binary indicators were 0.05
and 0.25 and the within-hospital correlation between these indicators varied between − 0.25 and 0.90
Creation of an ordinal indicator with high rankability was possible when the three component binary indicators were strongly correlated with one another (the within-hospital correlation in indicators was at least 0.5)
When the binary indicators were independent or weakly correlated with one another (the within-hospital correlation in indicators was less than 0.5)
the rankability of the composite ordinal indicator was often less than at least one of its binary components
The rankability of the composite indicator was most affected by the rankability of the most prevalent indicator and the magnitude of the within-hospital correlation between the indicators
Pooling highly-correlated binary indicators can result in a composite ordinal indicator with high rankability
the composite ordinal indicator may have lower rankability than some of its constituent components
It is recommended that binary indicators be combined to increase rankability only if they represent the same concept of quality of care
or length of stay) or a process of care (e.g.
discharge prescribing of evidence-based medications in specific patient populations) that is used to assess the quality of health care
A common practice is to report hospital-specific means of health care indicators (e.g.
the proportion of patients who died in each hospital or mean length of stay)
Crude (or unadjusted) or risk-adjusted estimates of hospital performance on specific indicators can be reported
They found that rankability ranged from 0.01 for patients with osteoarthritis undergoing total hip arthroplasty/total knee arthroplasty to 0.71 following hospitalization for stroke
A question when developing indicators for assessing quality of health care is whether several binary indicators reflecting outcomes of increasing severity
which individually have poor to moderate rankability
can be combined into an ordinal indicator to increase rankability
The objective of the current study was to examine how the rankability of composite ordinal indicators compared to the rankabilities of the component binary indicators
The paper is structured as follows: In Section 2
we provide background and formally define rankability
In Section 3, we conduct a series of Monte Carlo simulations to examine the relationship between the rankability of a binary indicator and the intraclass correlation coefficient (ICC) of that indicator across hospitals (as a measure of the between-hospital variation)
In Section 4, we conduct a series of Monte Carlo simulations to examine the relationship between the rankability of a composite ordinal indicator and the rankabilities of the individual binary indicators from which it was formed
in Section 5 we summarize our findings and place them in the context of the existing literature
Let Y denote a binary indicator that is used to assess the performance of a health care provider (e.g.
we will refer to the hospital as the provider
but the methods are equally applicable to other healthcare providers (e.g.
physicians or health care administrative regions)
Let Yij = 1 denote that the indicator was positive or present (e.g.
the patient died or SSI occurred) for the ith patient at the jth hospital
while Yij = 0 denotes that the indicator was negative for this patient (e.g.
the patient did not die or SSI did not occur)
Let Xij denote a vector of covariates measured on the ith patient at the jth hospital (e.g.
A random effects logistic regression model can be fit to model the variation in the indicator: \( \mathrm{logit}\left(\Pr \left({Y}_{ij}=1\right)\right)={\alpha}_{0j}+{\beta}^{\prime }{X}_{ij} \), where the hospital-specific random intercepts follow a normal distribution: \( {\alpha}_{0j}\sim N\left({\alpha}_0,{\tau}^2\right) \)
we used the above definition because it appears to be the most frequently used definition in the context of multilevel analysis
Instead of fitting a random effects model to model variation in the indicator
one could replace the hospital-specific random effects by fixed hospital effects:
where there are k-1 indicator or dummy variables to represent the fixed effects of the k hospitals
Let sj denote the standard error of the estimated hospital effect for the jth hospital
These standard errors denote the precision with which the hospital-specific fixed effects are estimated
The rankability relates the total variation from the random effects model to the uncertainty of the individual hospital effects from the fixed effects model
It can be interpreted as the proportion of the variation between hospitals that is not due to chance
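One commonly used formulation, consistent with this interpretation, expresses rankability as \( \rho ={\hat{\tau}}^2/\left({\hat{\tau}}^2+{\mathrm{median}}_j\left({s}_j^2\right)\right) \), where \( {\hat{\tau}}^2 \) is the estimated between-hospital variance from the random effects model and sj is the standard error of the jth hospital effect from the fixed effects model; whether the median or the mean of the squared standard errors is used varies between applications.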
We conducted a series of Monte Carlo simulations to examine the relationship between ICC and the rankability of a single binary indicator
Let X and Y denote a continuous risk score and a binary indicator
The following random effects model relates the continuous risk score to the presence of the binary indicator:
The hospital-specific random effects follow a normal distribution: α0j~N(α0
The average intercept (α0) determines the overall prevalence of the binary indicator
while the fixed slope (α1) determines the strength of the relationship between the risk score and the presence of the binary indicator
Fixing the standard deviation of the random effects distribution at \( \tau =\pi \sqrt{\frac{\mathrm{ICC}}{3\left(1-\mathrm{ICC}\right)}} \) will result in a model with the desired value of the ICC
We then simulated a binary outcome for the indicator from a Bernoulli distribution with subject-specific parameter Pr(Yij = 1)
We designed the simulations so that hospital volume was fixed across hospitals
This was done to remove any effect of varying hospital volume on rankability
We allowed the following three factors to vary: (i) the ICC; (ii) the average intercept (α0); (iii) the fixed slope (α1)
The ICC was allowed to take on 13 values from 0 to 0.24 in increments of 0.02
These values were selected as they range from no effect of clustering (ICC = 0) to a strong effect of clustering
The average intercept was allowed to take on four values: − 3
The fixed slope was allowed to take on three values: − 0.25
and thus considered 156 different scenarios
In each of the 156 different scenarios we simulated 100 datasets
we estimated the rankability of the binary indicator using the methods described in Section 2 (in each simulated dataset rankability was estimated using the estimated variance of the random effects and the standard errors of the estimated hospital-specific fixed effects)
we then computed the average rankability across the 100 simulated datasets for that scenario
The simulations were conducted using the R statistical programming language (version 3.5.1)
The random effects logistic regression models were fit using frequentist methods using the glmer function from the lme4 package for R
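A minimal sketch of one such scenario is given below; the number of hospitals, hospital volume, parameter values, and the rankability formula \( {\hat{\tau}}^2/\left({\hat{\tau}}^2+\mathrm{median}\left({s}_j^2\right)\right) \) are illustrative assumptions rather than the exact settings of the study.

# Minimal sketch of one simulation scenario and the rankability calculation.
library(lme4)
set.seed(8)
k <- 50; m <- 200                                    # hospitals and patients per hospital (illustrative)
icc    <- 0.10
tau    <- pi * sqrt(icc / (3 * (1 - icc)))           # SD of random intercepts for the target ICC
alpha0 <- -2; alpha1 <- 0.5

hosp <- rep(1:k, each = m)
x    <- rnorm(k * m)                                 # continuous risk score
a0j  <- rnorm(k, alpha0, tau)[hosp]                  # hospital-specific intercepts
y    <- rbinom(k * m, 1, plogis(a0j + alpha1 * x))
dat  <- data.frame(y, x, hosp = factor(hosp))

# Random effects model: between-hospital variance
re_fit   <- glmer(y ~ x + (1 | hosp), data = dat, family = binomial)
tau2_hat <- as.numeric(VarCorr(re_fit)$hosp)

# Fixed effects model: precision of the individual hospital effects
# (cell-means parameterization, equivalent to an intercept plus k-1 dummies)
fe_fit <- glm(y ~ 0 + hosp + x, data = dat, family = binomial)
s2     <- coef(summary(fe_fit))[1:k, "Std. Error"]^2

rankability <- tau2_hat / (tau2_hat + median(s2))
rankability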
We used an extensive series of Monte Carlo simulations to examine whether combining three binary indicators into an ordinal indicator resulted in an ordinal indicator with greater rankability compared to that of its binary components
We examined scenarios with three binary indicators: Y1
The following three random effects models relate an underlying continuous risk factor to the presence of each of the three binary indicators:
we assumed that the hospital-specific random effects followed a normal distribution: \( {\alpha}_{0 kj}\sim N\left({\alpha}_{0k},{\tau}_{kk}^2\right) \)
We assumed that the distribution of the triplet of hospital-specific random effects followed a multivariate normal distribution:
This indicator would have an overall prevalence of 25% by construction
The ordinal indicator in our study was defined as follows
Thus, a subject had the most severe/serious level of the composite ordinal indicator (5) if the most serious of the binary indicators (Y1) was present, regardless of whether or not any of the other two indicators had occurred. A subject had the least severe/serious level of the composite ordinal indicator (1) if none of the binary indicators was present
We computed the rankability of the ordinal indicator
The mean rankability of each of the three binary indicators and the one ordinal indicator was determined over 100 iterations for each scenario
For each of the 16 combinations of the above two factors we considered three different sets of rankability values for the three binary indicators
The ordinal logistic regression model was fit using the polr function from the MASS package
while the random effects ordinal logistic regression model was fit using the clmm function from the ordinal package for R
and third binary indicators across the 48 scenarios were 0.05
and third binary indicators across the 48 scenarios were 0.36 (range 0.22 to 0.43)
Rankability of binary and ordinal indicators
restricting the analysis to those scenarios in which the correlation between hospital-specific random effects was less than or equal to 0.5
The use of 100 replications in each of the 48 scenarios in the Monte Carlo simulations allowed us to estimate rankability with relatively good precision
For each scenario and for each of the indicators we computed the standard deviation of the rankability across the 100 replications for that scenario
The mean standard deviation of the rankability of the first binary indicator was 0.067 across the 48 scenarios (ranging from 0.062 to 0.074)
The mean standard deviation of the rankability of the second binary indicator was 0.058 across the 48 scenarios (ranging from 0.046 to 0.069)
The mean standard deviation of the rankability of the third binary indicator was 0.056 across the 48 scenarios (ranging from 0.037 to 0.072)
The mean standard deviation of the rankability of the composite ordinal indicator was 0.057 across the 48 scenarios (ranging from 0.032 to 0.078)
Rankability of binary and ordinal indicators (equal prevalences)
We conducted a series of simulations to examine whether combining three binary indicators reflecting outcomes with increasing severity
which individually had low or moderate rankability
could produce an ordinal indicator with high rankability
We found that this was feasible when the three binary indicators had at least moderate rankability and were strongly correlated with one another
When the binary indicators were independent or weakly correlated with one another
the rankability of the composite ordinal indicator was often less than that of at least one of its binary components
There is an increasing interest in many countries and jurisdictions in reporting on the quality and outcomes of health care delivery
Public reporting of hospital-specific performance on indicators of health care quality can lead to the production of ‘league tables’
in which hospitals are ranked according to their performance
The rankability of an indicator denotes its ability to allow for the accurate ranking of hospitals
many indicators have been shown to have poor to moderate rankability
Our focus was on pooling binary indicators reflecting outcomes of increasing severity to create a composite ordinal indicator that described a gradient from lowest (least severe/serious) to highest (most severe/serious)
We did not consider other methods of creating composite indicators such as summing up the number of positive binary indicators
Such an approach would not necessarily preserve the ordering of severity present in the individual indicators
For instance given three indicators of differing severity (e.g.
then a subject who died (and who was not readmitted and who had a short length of hospital stay) and a subject who had a long hospital stay (but who did not die and who was not readmitted) would both have one positive indicator
they would have very different severity of the underlying binary indicators
Our composite ordinal indicator reflects this ordering of severity/seriousness
while counting the number of positive indicators would not
Our research has shown that rankability is increased when individual indicators are combined with other indicators with which they are highly correlated
Individual indicators underlying the same concepts of (quality) of care can thereby be combined to produce a more reliable ranking with the added advantage of showing a more complete picture of quality of care
indicators that are not correlated might represent other important quality domains
although their limited rankability should be taken into account in the interpretation of potential differences between hospitals
The finding that combining binary outcomes that are negatively correlated
into an ordinal outcome decreases rankability is a result of violation of the proportional odds assumption
The proportional odds model assumes that the effect of the parameter of interest
in this case the hospital-specific random effects
on the outcome is comparable across the cut-offs of the ordinal scale
If the binary indicators are not correlated this assumption is not satisfied
when a specific hospital has a low mortality rate (meaning a negative random effect estimate at one cut-off) but a high readmission rate (a positive random effect estimate at the other cut-off), these random effect estimates average out
This reduces the variation of the hospital-specific random effects
to obtain a composite ordinal indicator with high rankability
the proportional odds assumption must be met to some extent
in order for a composite indicator to provide information on which a hospital can take action
it would be reasonable to combine indicators that address aspects of health care quality for the same set of patients (e.g.
that pertain to the same surgical procedure or to the treatment of the same set of patients)
Identifying indicators that satisfy these requirements may be challenging in some settings
when binary indicators have low to moderate within-hospital correlation
It is recommended that related binary indicators be combined in order to increase rankability
which reflects that they represent the same concept of quality of care
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request
This study was supported by ICES, which is funded by an annual grant from the Ontario Ministry of Health and Long-Term Care (MOHLTC). The opinions, results, and conclusions reported in this paper are those of the authors and are independent from the funding sources. No endorsement by ICES or the Ontario MOHLTC is intended or should be inferred.
This research was supported by an operating grant from the Canadian Institutes of Health Research (CIHR) (MOP 86508).
Austin is supported in part by a Mid-Career Investigator award from the Heart and Stroke Foundation of Ontario
PA and PM contributed to the design of the simulations. PA coded the simulations and conducted the statistical analyses. PA and PM contributed to revising the manuscript, and both read and approved the final manuscript.
The study consisted of Monte Carlo simulations that used simulated data
No ethics approval or consent to participate was necessary
Consent for publication was not required as only simulated data were used
DOI: https://doi.org/10.1186/s12874-019-0769-x
Computed tomography (CT) is presently a standard procedure for the detection of distant metastases in patients with oesophageal or gastric cardia cancer
We aimed to determine the additional diagnostic value of alternative staging investigations
We included 569 oesophageal or gastric cardia cancer patients who had undergone CT neck/thorax/abdomen
Sensitivity and specificity were first determined at an organ level (results of separate investigations) and then at a patient level (results for combinations of investigations), considering that the detection of distant metastases is a contraindication to surgery.
We compared three strategies for each organ: CT alone; CT plus another investigation if CT was negative for metastases (one-positive scenario); and CT plus another investigation if CT was positive, but requiring that both were positive for a final positive result (two-positive scenario).
Costs, life expectancy, and quality-adjusted life years (QALYs) were compared between the different diagnostic strategies.
CT showed sensitivities of 69% for detecting metastases in celiac lymph nodes and 73% for liver metastases, which were higher than the sensitivities of US abdomen (44% for celiac lymph nodes and 65% for liver metastases).
US neck showed a higher sensitivity for the detection of malignant supraclavicular lymph nodes than CT (85 vs 28%)
At the patient level, sensitivity for detecting distant metastases was 66% and specificity was 95% if only CT was performed.
A higher sensitivity (86%) was achieved when US neck was added to CT (one-positive scenario)
This strategy resulted in lower costs compared to CT only, at an almost similar (quality-adjusted) life expectancy.
Slightly higher specificities (97–99%) were achieved if liver and/or lung metastases found on CT were confirmed by US abdomen or chest X-ray.
These strategies had only slightly higher QALYs
The combination of CT neck/thorax/abdomen and US neck was most cost-effective for the detection of metastases in patients with oesophageal or gastric cardia cancer
whereas the performance of CT only had a lower sensitivity for metastases detection and higher costs
The additional value of EUS appeared limited, which may be due to the low number of M1b celiac lymph nodes detected in this series.
It remains to be determined whether the application of positron emission tomography will further increase sensitivities and specificities of metastases detection without jeopardising costs and QALYs
The presence of distant metastases from oesophageal or gastric cardia cancer is usually investigated by more than one modality
CT neck/thorax/abdomen is a standard investigation, but it is unclear whether US neck, US abdomen, EUS, and chest X-ray are also necessary for assessing the presence of distant metastases in these patients.
In this study, we aimed to determine the diagnostic value of EUS, US neck, US abdomen, and chest X-ray in addition to CT in patients with oesophageal or gastric cardia cancer.
We evaluated these diagnostic procedures both at an organ level and at a patient level for the detection of metastases
The assumption was that the finding of distant metastases in patients with oesophageal or gastric cardia cancer would eliminate the option of a curative surgical treatment
We used a prospectively collected database with information on 1088 patients with oesophageal or gastric cardia cancer who were diagnosed and treated between January 1994 and October 2003 at the Erasmus MC – University Medical Center Rotterdam
Data that were collected included general patient characteristics. Information that was not present in the database but was necessary for this study was obtained from the electronic hospital information system.
We assessed which preoperative investigations had been performed in these 1088 patients
FNA was performed if the result could change the treatment decision
If multiple suspicious lesions were present
FNA of the most suspicious lesion was performed
The results of the investigations were compared with the gold standard, which was the postoperative pathological TNM stage or a radiological finding in the relevant organ with ≥6 months of follow-up.
If a suspicious lesion was seen on CT but not on the initial US neck or abdomen, the latter was repeated to determine whether the lesion could be found using the CT information and to evaluate whether FNA could be performed. For the analyses, we did not use the results of this repeated investigation, but used the result of the initial US neck or abdomen.
False-positive and false-negative results of CT for the detection of metastases in the various organs were calculated.
The combined results were calculated twice
First, the result was considered positive for metastases if at least one of the two investigations performed for a particular organ was positive, and negative if both investigations were negative (one-positive scenario).
This is a strategy that uses the possible additional diagnostic information of the second investigation in case of a negative CT
If CT is positive, the result of another investigation is irrelevant in this strategy, because the final result will remain positive irrespective of the result of the other investigation.
Second, the result was considered positive if both CT and another investigation were positive, and negative if at least one of the investigations was negative (two-positive scenario).
This is a strategy that uses additional diagnostic investigations to confirm a positive CT finding
If CT is negative, the performance of another investigation is unnecessary in this strategy, because the final result will remain negative irrespective of the result of the other investigation.
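A minimal sketch of these two combination rules, assuming a simple per-patient layout of binary results (the arrays below are hypothetical, not the study's data): the one-positive scenario is a logical OR of CT and the second investigation, and the two-positive scenario is a logical AND.

```python
import numpy as np

def sens_spec(prediction, truth):
    """Sensitivity and specificity of a binary prediction against the gold standard."""
    tp = np.sum(prediction & truth)
    fn = np.sum(~prediction & truth)
    tn = np.sum(~prediction & ~truth)
    fp = np.sum(prediction & ~truth)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical per-patient results for one organ (True = metastasis / positive test)
truth = np.array([1, 1, 1, 0, 0, 0, 0, 1], dtype=bool)    # gold standard
ct = np.array([1, 0, 1, 0, 1, 0, 0, 0], dtype=bool)       # CT result
second = np.array([1, 1, 0, 0, 1, 0, 1, 1], dtype=bool)   # e.g., US neck

one_positive = ct | second   # positive if at least one investigation is positive
two_positive = ct & second   # positive only if both investigations are positive

for name, pred in [("CT alone", ct), ("one-positive", one_positive),
                   ("two-positive", two_positive)]:
    se, sp = sens_spec(pred, truth)
    print(f"{name:12s} sensitivity={se:.2f} specificity={sp:.2f}")
```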
The number of false-positive and false-negative results was also calculated for the combinations of CT with the other investigations.
In addition to analyses at the organ level, we considered analyses at the patient level. For this, we assessed whether distant metastases (M1b) were present in the liver, lungs, supraclavicular lymph nodes, or celiac lymph nodes, and whether a curative oesophageal resection should have been performed or not on the basis of combinations of staging investigations, using the data of 264 patients who had undergone all investigations.
The assumption was that an oesophageal resection should only be performed if no distant metastases are detected
Similarly to the analyses at the organ level, we used CT alone and the one-positive and two-positive scenarios for the detection of metastases in the liver, lungs, supraclavicular lymph nodes, and celiac lymph nodes; 81 different combinations of investigations were thus possible (3 strategies for 4 organs).
Sensitivities and specificities for the detection of distant metastases at the patient level were calculated for each combination
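The patient-level enumeration can be sketched in the same spirit, again under assumed data structures (randomly generated placeholder arrays rather than the study's data): one of the three strategies is chosen for each of the four organ sites, giving 3^4 = 81 combinations, and a patient is classified as having distant metastases if any organ-level result is positive.

```python
from itertools import product
import numpy as np

STRATEGIES = ("ct", "one_positive", "two_positive")

def organ_result(strategy, ct, other):
    """Organ-level result under one of the three strategies."""
    if strategy == "ct":
        return ct
    if strategy == "one_positive":
        return ct | other
    return ct & other  # two-positive

def patient_level(strategy_per_organ, ct, other):
    """Patient is positive if any organ-level result is positive."""
    per_organ = [organ_result(s, ct[:, j], other[:, j])
                 for j, s in enumerate(strategy_per_organ)]
    return np.any(per_organ, axis=0)

# Placeholder data: 264 patients x 4 organ sites of binary results
rng = np.random.default_rng(0)
n_patients, n_organs = 264, 4
truth = rng.random((n_patients, n_organs)) < 0.1
ct = (truth & (rng.random((n_patients, n_organs)) < 0.7)) | (rng.random((n_patients, n_organs)) < 0.03)
other = (truth & (rng.random((n_patients, n_organs)) < 0.8)) | (rng.random((n_patients, n_organs)) < 0.05)
truth_patient = truth.any(axis=1)

results = []
for combo in product(STRATEGIES, repeat=n_organs):   # 3**4 = 81 combinations
    pred = patient_level(combo, ct, other)
    se = (pred & truth_patient).sum() / truth_patient.sum()
    sp = (~pred & ~truth_patient).sum() / (~truth_patient).sum()
    results.append((combo, se, sp))
print(f"{len(results)} combinations evaluated")
```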
We plotted sensitivity against 1 − specificity in a receiver operating characteristic (ROC) curve for a visual comparison of the accuracy of combinations of staging investigations, using the data of 264 patients who had undergone all investigations.
Sensitivity is the proportion of patients who are correctly identified as having distant metastases (true positive results)
and 1 − specificity is the proportion of patients in whom the gold standard is negative for distant metastases but who are incorrectly identified as positive by the staging investigation (false-positive results)
ROC curves were made for the detection of distant metastases (M1b) with CT and with the combination of CT and another investigation (both the two-positive and one-positive scenarios) in one organ, whereas for the other organs we only included the CT result.
For example, to assess whether both CT and US abdomen should be performed to determine whether liver metastases were present, we compared three different strategies: (1) the combination of CT and US abdomen in the two-positive scenario for the liver and CT for the other organs; (2) the combination of CT and US abdomen in the one-positive scenario for the liver and CT for the other organs; and (3) CT for all organs.
All P-values were based on two-sided tests of significance
A P-value<0.05 was considered as statistically significant
Life expectancy was assumed to be 2.41 and 1.00 years for local/regional disease with and without resection, respectively, and 0.42 and 0.37 years for distant disease with and without resection. QALYs were estimated to be 1.45 and 0.70 for local/regional disease with and without resection, respectively.
A cost-effectiveness plane was constructed in which the differences in costs between strategies (Δ costs) were plotted against the differences in QALY (Δ QALY)
Costs were expressed per $1000 (k$) for easier interpretation
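A minimal sketch of constructing such a plane, with placeholder cost and QALY values that are not taken from the study: each strategy's expected costs and QALYs are compared with those of the reference strategy, and the differences are expressed as Δ costs (in k$) and Δ QALYs.

```python
# Placeholder per-patient expected costs (in k$) and QALYs per strategy;
# the strategy names mirror the text, the numbers are purely illustrative.
strategies = {
    "CT only": {"cost_k": 1.20, "qaly": 0.930},
    "CT + USn (reference)": {"cost_k": 1.10, "qaly": 0.940},
    "CT + USn + USa": {"cost_k": 1.25, "qaly": 0.941},
    "CT + USn + USa + CXR": {"cost_k": 1.30, "qaly": 0.942},
}

ref = strategies["CT + USn (reference)"]
for name, s in strategies.items():
    d_cost = s["cost_k"] - ref["cost_k"]  # Δ costs (k$) vs reference strategy
    d_qaly = s["qaly"] - ref["qaly"]      # Δ QALYs vs reference strategy
    print(f"{name:22s} Δcost={d_cost:+.2f} k$  ΔQALY={d_qaly:+.4f}")
```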
In Table 1, patient and tumour characteristics are shown for all 569 patients who had undergone both CT neck/thorax/abdomen and at least one other investigation, for the 264 patients who had undergone all investigations, and for the 305 patients who had undergone only some of the diagnostic investigations.
χ² testing revealed that the differences between the patients with all (n=264) or some (n=305) diagnostic investigations were not statistically significant.
In Table 2, the gold standard diagnoses are shown per organ.
Positive gold standard diagnoses were confirmed by FNA or resection in the majority of cases (92/135; 68%), whereas such confirmation could not be used in the remaining cases. One reason for this was that several patients had two or more suspicious lesions and FNA had already been performed for one of these lesions, which confirmed the presence of a distant metastasis; FNA of the other suspicious lesions was therefore not indicated in these patients.
Sensitivity for the detection of liver metastases was higher for CT than for US abdomen, but this difference was not statistically significant (73 vs 65%, P=0.63; Table 3).
Sensitivity for celiac lymph node metastases was higher for CT than for US abdomen (69 vs 44%). Sensitivity for supraclavicular lymph node metastases was higher for US neck than for CT (85 vs 28%). Sensitivity for lung metastases was higher for CT than for chest X-ray, but this difference was not statistically significant (90 vs 68%).
The results of EUS for the detection of malignant celiac lymph nodes were inferior to those of CT and US abdomen. EUS was therefore considered less relevant for the detection of distant metastases and was not included in the patient-level analyses.
ROC curves for the detection of metastases with CT alone and with the combination of CT and another investigation for the investigated organ, with a positive result if at least one investigation is positive (one-positive scenario) or only if both investigations are positive (two-positive scenario); for the other organs, only the CT result was included and compared with the gold standard.
The sensitivity for detecting distant metastases was 66% and specificity was 95% if only CT was performed for all organs (Table 4). Higher sensitivities and specificities could be obtained by the addition of one or more other staging investigations. A sensitivity of 86% could be obtained with 12 of the 81 different combinations of staging investigations, whereas for 6 other combinations the specificity was slightly higher (94.9%).
The lowest number of investigations for a sensitivity of 86% and a specificity of 94.9% was the combination of CT plus US neck for the detection of supraclavicular lymph node metastases (one-positive scenario) and CT only for the detection of metastases in celiac lymph nodes, liver, and lung. A slightly higher specificity of 97% was achieved by the addition of US abdomen for liver metastases (two-positive scenario). When chest X-ray (two-positive scenario) for the detection of lung metastases was also added, specificity increased slightly further.
Sensitivity declined with increasing specificity, meaning that more patients would have undergone a curative treatment in the presence of distant metastases (more false-negative results).
The addition of US abdomen for the detection of malignant celiac lymph nodes did not improve the results; however, only 3/264 patients had M1b celiac lymph nodes, whereas 49 other patients had M1a celiac lymph nodes that did not preclude a resection.
The average results obtained from the data with imputation of missing values (n=569) were roughly similar to the results obtained from the complete data of patients who had undergone all staging investigations (n=264; Table 4).
Marginal cost-effectiveness plane calculated in patients with oesophageal or gastric cardia cancer who had undergone all staging investigations (n=264) and using the five completed data sets (n=569). The combination of CT and US neck for the detection of supraclavicular lymph node metastases (one-positive scenario) and CT for the detection of metastases in celiac lymph nodes, liver, and lung was considered the reference strategy. CT=computed tomography; CXR=chest X-ray; QALY=quality-adjusted life year; USa=ultrasound abdomen; USn=ultrasound neck.
Surgery is presently the only established curative treatment option for patients with oesophageal or gastric cardia cancer, but it carries a substantial risk of morbidity and mortality. Adequate staging is therefore of utmost importance to select patients without distant metastases for surgery. In this study, we assessed which traditional staging investigations should be performed in patients with oesophageal or gastric cardia cancer to determine whether distant metastases were present and, consequently, whether a curative resection was a reasonable option.
Our findings demonstrated that the performance of CT only was not sensitive enough for the detection of distant metastases
The addition of US neck to CT for the detection of supraclavicular lymph node metastases resulted in the highest sensitivity
For a slightly higher specificity (fewer false positives), US abdomen and/or chest X-ray could be added, but this required that both CT and these investigations were positive for metastases to define the result as positive (two-positive scenario). A higher specificity would, however, result in a decline in sensitivity and consequently in more resections in patients with distant metastases.
We recognise that requiring two staging procedures to be positive is not a common clinical strategy, although a second investigation is sometimes already used to confirm the suspicion of metastases on CT.
Moreover, no single combination of investigations was more cost-effective than the combination of CT and US neck. A limitation of this study is that only patients who had undergone CT neck/thorax/abdomen and one or more other investigations were included; however, no statistically significant differences were found within the whole group of patients (n=569) according to whether all or some investigations had been performed.
It should be noted that the optimal strategy to stage patients with oesophageal or gastric cardia cancer is not automatically the combination of CT and US neck, as the sensitivities and specificities of combinations of investigations largely depend on the quality of the staging investigations in a centre. This quality is determined by both the experience of the investigator and the quality of the equipment.
Further studies need to determine the exact role of positron emission tomography (PET) in the staging of oesophageal or gastric cardia cancer.
In conclusion, the combination of CT neck and US neck for the detection of supraclavicular lymph node metastases and CT thorax/abdomen for the detection of metastases in celiac lymph nodes, liver, and lung is a cost-effective strategy for the detection of distant metastases in patients with oesophageal or gastric cardia cancer.
US abdomen and chest X-ray have only limited additional value in the detection of distant metastases in these patients
These staging investigations should only be performed for specific indications in patients with oesophageal or gastric cardia cancer
as the treatment decision is not improved in most of the patients if these investigations are added to the diagnostic work-up
The role of EUS for the detection of distant metastases also seems to be limited, which may be particularly due to the low number of M1b celiac lymph nodes in the present study.
We are grateful to Mrs Conny Vollebregt for collecting the data of the database
EPMV was funded by a grant from the ‘Doelmatigheidsonderzoek’ fund of Erasmus MC Rotterdam
Department of Gastroenterology and Hepatology, Erasmus MC – University Medical Center Rotterdam
From twelve months after its original publication, this work is licensed under the Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/
DOI: https://doi.org/10.1038/sj.bjc.6603960