Clinical prediction models should be validated before implementation in clinical practice. But is favorable performance at internal validation or one external validation sufficient to claim that a prediction model works well in the intended clinical context? We argue to the contrary, because (1) patient populations vary, (2) measurement procedures vary, and (3) populations and measurements change over time. We therefore have to expect heterogeneity in model performance between locations and settings. It follows that prediction models are never truly validated. This does not imply that validation is not important; rather, the current focus on developing new models should shift to a focus on more extensive and well-reported validation studies of promising models. Principled validation strategies are needed to understand and quantify heterogeneity and to update prediction models when appropriate. Such strategies will help to ensure that prediction models stay up-to-date and safe to support clinical decision-making.

Whereas internal validation focuses on reproducibility and overfitting, external validation focuses on transportability. Although assessing the transportability of model performance is vital, an external validation with favorable performance does not prove universal applicability and does not justify the claim that the model is 'externally valid'. The aim should be to assess performance across many locations and over time in order to maximize the understanding of model transportability. We argue that it is impossible to definitively claim that a model is 'externally valid' and that such terminology should be avoided. We discuss three reasons for this argument.

Figure: distribution of patient age in the 9 largest centers from the ovarian cancer study; histograms, density estimates, and mean (standard deviation) are given per center. Figure: distribution of maximum lesion diameter in the 9 largest centers from the ovarian cancer study; median (interquartile range) is given per center.

Median cohort size was 283 (range 25 to 25,056), mean patient age varied between 45 and 71 years, and the percentage of male patients varied between 45 and 74%. Pooled performance estimates were 0.77 for the c-statistic and 0.65 for the observed over expected (O:E) ratio. The O:E ratio < 1 suggests that the model tends to overestimate the risk of in-hospital mortality. The calibration slope < 1 suggests that risk estimates also tend to be too extreme (i.e., too high for high-risk patients and too low for low-risk patients).
Large heterogeneity in performance was observed, with 95% prediction intervals of 0.63 to 0.87 for the c-statistic and of 0.34 to 0.66 for the calibration slope. 95% prediction intervals indicate the performance that can be expected when evaluating the model in new clusters. After adjusting for differences in patient characteristics, about one third of the decrease in discrimination at external validation appeared to be due to more homogeneous patient samples, given that clinical trial datasets, which often contain more homogeneous samples than observational datasets, were used for external validation.

Such measurements are increasingly used in prediction modeling studies based on electronic health records. The c-statistic on the test set (5970 radiographs from 2256 patients; random train-test split) was 0.78. When non-fracture and fracture test set cases were matched on patient variables (age, among others), the c-statistic for hip fracture decreased to 0.67. When matching also included hospital process variables (including scanner model), it decreased further. This suggests that variables such as the type of scanner can inflate predictions for hip fracture.

Reported methods included the Confusion Assessment Method (CAM); frequency varied between once and more than once per day. These images were randomly selected after stratification by the classification given by a deep learning model (50 images labeled as positive for pneumonia). There was complete agreement for 52 cases; pairwise kappa statistics varied between 0.38 and 0.80.

Each patient was examined by one of 40 different clinicians across 19 hospitals. The researchers calculated the proportion of the variance in the measurements that is attributable to systematic differences between clinicians. For the binary variable indicating whether the patient was using hormonal therapy, the analysis suggested that 20% of the variability was attributed to the clinician doing the assessment. The percentage of patients reporting the use of hormonal therapy varied roughly between 0 and 20%. A subsequent survey among clinicians revealed that clinicians reporting high rates of hormonal therapy had assessed this more thoroughly, and that there was disagreement about the definition of hormonal therapy.

The radiologists evaluated the risk of MVI on a five-point scale (up to 'definitely positive'). Kappa values were between 0.42 and 0.47 for the features. The c-statistic of the risk for MVI (with histopathology as the reference standard) was also assessed, but the validity of predictions may be distorted.

The models were developed using different algorithms and were validated over time using similar data from patients admitted up to and including 2012. Although discrimination remained fairly stable, there was clear evidence of calibration drift for all models: the risk of the event became increasingly overestimated over time. Accompanying shifts in the patient population were noted: for example, the incidence of the event steadily decreased from 7.7 to 6.2%, the proportion of patients with a history of cancer or diabetes increased, and the use of various medications increased. Observed mortality was 4.1% whereas the EuroSCORE had an average estimated risk of 5.6%; later, observed mortality was 2.8% but the average estimated risk was 7.6%. The c-statistic showed no systematic deterioration. Temporal changes were observed for several predictors (e.g., average age and the prevalence of recent myocardial infarction increased) and surgical procedures (e.g., fewer isolated coronary artery bypass graft procedures).
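To make the notion of a 95% prediction interval for performance across clusters concrete, the following minimal sketch pools hypothetical center-specific c-statistics with a DerSimonian-Laird style random-effects model on the logit scale and derives an approximate prediction interval for a new center. This is an illustration under assumed inputs, not the analysis performed in the studies cited here; all numbers are invented.

```python
import numpy as np

# Hypothetical per-center external-validation results.
c_stats = np.array([0.74, 0.81, 0.69, 0.78, 0.83, 0.72])
se_c    = np.array([0.03, 0.02, 0.04, 0.03, 0.02, 0.05])

# Work on the logit scale, where a normal approximation is more reasonable.
theta = np.log(c_stats / (1 - c_stats))
se_theta = se_c / (c_stats * (1 - c_stats))   # delta method
w = 1 / se_theta**2

# DerSimonian-Laird estimate of between-center heterogeneity (tau^2).
theta_fixed = np.sum(w * theta) / np.sum(w)
Q = np.sum(w * (theta - theta_fixed) ** 2)
df = len(theta) - 1
c_dl = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - df) / c_dl)

# Random-effects pooled estimate and an approximate 95% prediction interval
# for the c-statistic in a new center (a t-quantile is often recommended;
# a normal quantile is used here for simplicity).
w_star = 1 / (se_theta**2 + tau2)
theta_pooled = np.sum(w_star * theta) / np.sum(w_star)
se_pooled = np.sqrt(1 / np.sum(w_star))
half_width = 1.96 * np.sqrt(tau2 + se_pooled**2)

def inv_logit(x):
    return 1 / (1 + np.exp(-x))

print("Pooled c-statistic:", round(inv_logit(theta_pooled), 3))
print("95% prediction interval:",
      round(inv_logit(theta_pooled - half_width), 3), "to",
      round(inv_logit(theta_pooled + half_width), 3))
```

A wide interval despite a reassuring pooled estimate is exactly the situation the text warns about: performance in the next location is far less certain than a single summary number suggests.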
The authors further stated that surgeons may have been more willing to operate on patients due to improvements in anesthetic care. Such criteria lack scientific underpinning. We question the requirement from some journals that model development studies should include "an external validation"; this requirement may induce selective reporting of a favorable result in a single setting. Imagine a model that has been externally validated in tens of locations, with good discrimination and calibration results and limited heterogeneity between locations. This would obviously be an important and reassuring finding, but there is still no 100% guarantee that the prediction model will also work fine in a new location, and it remains unclear how populations will change in the future. Such strategies help to ensure that prediction models stay up-to-date to support medical decision-making.

Information from all examples (except the example on ovarian cancer) was based on information available in published manuscripts. The data used for the ovarian cancer example were not generated in the context of this study and were reused to describe differences between populations from different centers. The data cannot be shared for ethical/privacy reasons.
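Relating to the calibration drift examples discussed above (the EuroSCORE and the models validated over time), a simple way to monitor a deployed model is to track the observed-to-expected (O:E) ratio per calendar period. The sketch below uses synthetic data with a built-in downward drift in the true risk; it is a generic illustration, not code from any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic deployment data: each record has a calendar year, a predicted risk
# from a frozen model, and an observed outcome whose true risk slowly drifts
# downward over the years (so the model increasingly overestimates risk).
years = np.repeat(np.arange(2005, 2013), 2000)
pred_risk = np.clip(rng.beta(2, 25, size=years.size), 0.001, 0.5)
drift = 1.0 - 0.05 * (years - years.min())        # multiplicative drift in true risk
outcome = rng.binomial(1, np.clip(pred_risk * drift, 0, 1))

# O:E ratio per year; values drifting below 1 signal growing overestimation
# and may indicate that model updating should be considered.
for year in np.unique(years):
    mask = years == year
    oe = outcome[mask].mean() / pred_risk[mask].mean()
    print(year, round(oe, 2))
```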
BVC was funded by the Research Foundation – Flanders (FWO; grant G097322N), Internal Funds KU Leuven (grant C24M/20/064), and University Hospitals Leuven (grant COPREDICT). Author affiliations include the Department of Development and Regeneration, the CAPHRI Care and Public Health Research Institute, and the Julius Center for Health Sciences and Primary Care. The manuscript was reviewed and edited by co-authors including MvS, and all authors agree to take accountability for this work. The reuse of the ovarian cancer data for methodological purposes was approved by the Research Ethics Committee UZ / KU Leuven (number S64709), and the need for individual information letters was waived. The authors declare that they have no competing interests.

The assessment of calibration performance of risk prediction models based on regression or more flexible machine learning algorithms receives little attention. We argue that this needs to change immediately, because poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making. We summarize how to avoid poor calibration at algorithm development and how to assess calibration at algorithm validation, emphasizing the balance between model complexity and the available sample size; calibration curves require sufficiently large samples. Algorithm updating should be considered for appropriate support of clinical practice. Efforts are required to avoid poor calibration when developing prediction models and to evaluate calibration when validating models. The ultimate aim is to optimize the utility of predictive analytics for shared decision-making and patient counseling.

These predictions may support clinical decision-making and better inform patients. Algorithms (or risk prediction models) should give higher risk estimates for patients with the event than for patients without the event ('discrimination'). Discrimination is quantified using the area under the receiver operating characteristic curve (AUROC or AUC), also known as the concordance statistic or c-statistic. It may be desirable to present classification performance at one or more risk thresholds, such as sensitivity. Calibration is another key aspect of performance that is often overlooked. We explain the relevance of calibration in this paper, summarize how calibration can be assessed, and suggest solutions to prevent or correct poor calibration and thus make predictive algorithms more clinically relevant.

Irrespective of how well the models can discriminate between treatments that end in live birth versus those that do not, it is clear that strong over- or underestimation of the chance of a live birth makes the algorithms clinically unacceptable. A strong overestimation of the chance of live birth after IVF would give false hope to couples going through an already stressful and emotional experience; conversely, intervening in a patient who has a favorable prognosis exposes the woman unnecessarily to possible harmful side effects. When using the traditional risk threshold of 20% to identify high-risk patients for intervention, QRISK2–2011 would select 110 per 1000 men aged between 35 and 74 years; NICE Framingham would select almost twice as many (206 per 1000 men), because a predicted risk of 20% based on this model actually corresponded to a lower event rate. This example illustrates that overestimation of risk leads to overtreatment.
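As a toy illustration of the flexible calibration curves mentioned above (which require sufficiently large samples), one can smooth observed outcomes against predicted risks, for example with a lowess smoother. The data below are simulated and mildly miscalibrated on purpose; this is a generic sketch, not the analysis from the paper.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(3)
n = 5000
p_hat = np.clip(rng.beta(2, 5, n), 0.01, 0.99)           # predicted risks
y = rng.binomial(1, np.clip(p_hat * 0.8 + 0.02, 0, 1))    # observed outcomes

# Smoothed (lowess) calibration curve: observed event fraction vs predicted risk.
curve = lowess(y, p_hat, frac=0.3, return_sorted=True)
for pred, obs in curve[::1000]:
    print(f"predicted {pred:.2f} -> observed (smoothed) {obs:.2f}")
```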
Figure: illustrations of different types of miscalibration. Illustrations are based on an outcome with a 25% event rate and a model with an area under the ROC curve (AUC or c-statistic) of 0.71; the calibration intercept and slope are indicated for each illustrative curve. Panel a: general over- or underestimation of predicted risks. Panel b: predicted risks that are too extreme or not extreme enough.

We recommend against using the Hosmer–Lemeshow test to assess calibration. In some situations it is reasonable for a model not to be developed at all. Internal validation procedures can quantify the calibration slope. At model development, calibration-in-the-large is irrelevant, since the average of predicted risks will match the event rate. In contrast, calibration-in-the-large is highly relevant at external validation, where we often note a mismatch between the predicted and observed risks.
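To connect the calibration intercept and slope to something tangible, here is a minimal sketch of how both are typically estimated at external validation: the slope from a logistic regression of the outcome on the logit of the predicted risk, and the intercept (calibration-in-the-large) from a refit with that logit as an offset. The data are simulated and the statsmodels-based approach is our own illustration, not code from the paper.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Hypothetical external-validation data: predicted risks from an existing
# model and observed binary outcomes (simulated, deliberately miscalibrated).
n = 2000
p_hat = np.clip(rng.beta(2, 6, size=n), 0.01, 0.99)
y = rng.binomial(1, np.clip(1.3 * p_hat, 0, 1))

lp = np.log(p_hat / (1 - p_hat))                 # logit of predicted risk

# Calibration slope: logistic regression of the outcome on the linear predictor.
slope_fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
cal_slope = slope_fit.params[1]

# Calibration intercept (calibration-in-the-large): refit with the linear
# predictor as an offset, so only an intercept is estimated.
int_fit = sm.GLM(y, np.ones((n, 1)), family=sm.families.Binomial(), offset=lp).fit()
cal_intercept = int_fit.params[0]

# Observed-to-expected ratio as a companion summary.
oe_ratio = y.mean() / p_hat.mean()

print(f"calibration slope:     {cal_slope:.2f}")      # < 1: predictions too extreme
print(f"calibration intercept: {cal_intercept:.2f}")  # < 0: risks overestimated on average
print(f"O:E ratio:             {oe_ratio:.2f}")
```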
This work was developed as part of the international STRengthening Analytical Thinking for Observational Studies (STRATOS) initiative. The objective of STRATOS is to provide accessible and accurate guidance in the design and analysis of observational studies (http://stratos-initiative.org/). Members of the STRATOS Topic Group 'Evaluating diagnostic tests and prediction models' include (alphabetically) Patrick Bossuyt. This work was funded by the Research Foundation – Flanders (FWO; grant G0B4716N) and Internal Funds KU Leuven (grant C24/15/037). All authors reviewed and edited the manuscript and approved the final version. Additional file: detailed illustration of the assessment of calibration and model updating: the ROMA logistic regression model.

Machine learning is increasingly being used to predict clinical outcomes. Most comparisons of different methods have been based on empirical analyses in specific datasets. We used Monte Carlo simulations to determine when machine learning methods perform better than statistical learning methods in a specific setting. We evaluated six learning methods, including stochastic gradient boosting machines using trees as the base learners and linear regression estimated using ordinary least squares (OLS). Our simulations were informed by empirical analyses in patients with acute myocardial infarction (AMI) and congestive heart failure (CHF) and used six data-generating processes, each based on one of the six learning methods, to simulate continuous outcomes in the derivation and validation samples. The outcome was systolic blood pressure at hospital discharge. We applied the six learning methods in each of the simulated derivation samples and evaluated performance in the simulated validation samples. The primary observation was that neural networks tended to result in estimates with worse predictive accuracy than the other five methods in both disease samples and across all six data-generating processes. Boosted trees and OLS regression tended to perform well across a range of scenarios.

Three of the above four studies focused on binary outcomes, while that of Shin and colleagues considered both binary and time-to-event outcomes. The relative performance of ML methods and conventional statistical methods for predicting continuous outcomes has received substantially less attention.
In the current study we focus on prediction of a specific continuous outcome important in clinical medicine: systolic blood pressure. We summarize our findings and place them in the context of the existing literature.

We conducted a set of empirical analyses to compare the performance of different machine and statistical learning methods in two different disease groups: patients hospitalized with acute myocardial infarction (AMI) and patients hospitalized with congestive heart failure (CHF). In each disease group we examined the ability of different methods to predict a patient's systolic blood pressure at hospital discharge. Model performance was assessed using independent validation samples. For AMI, the derivation sample consisted of 8145 patients discharged alive from hospital, while the validation sample consisted of 4444 patients discharged alive from hospital. For CHF, the derivation sample consisted of 7156 patients discharged alive from hospital, while the validation sample consisted of 6818 patients discharged alive from hospital. The derivation and validation samples came from distinct time periods. Vital signs and physical examination at presentation and results of laboratory tests were collected for these samples. The outcome was a continuous variable denoting the patient's systolic blood pressure at the time of hospital discharge. Differences in covariates between derivation and validation samples were tested using a t-test for continuous covariates and a Chi-squared test for binary variables. The use of the data in this project is authorized under Section 45 of Ontario's Personal Health Information Protection Act (PHIPA) and does not require review by a Research Ethics Board. All research was performed in accordance with relevant guidelines and regulations.

The grid searches resulted in the following values for the hyper-parameters. In the AMI sample: boosted trees (interaction depth: 4; shrinkage/learning rate: 0.065), random forests (number of randomly sampled variables: 6; minimum terminal node size: 20), and neural networks (5 neurons in the hidden layer, from a grid search that considered the number of neurons ranging from 2 to 15 in increments of 1; weight decay parameter: 0.05). In the CHF sample: random forests (number of randomly sampled variables: 8; minimum terminal node size: 20) and neural networks (6 neurons in the hidden layer, from a grid search that considered the number of neurons ranging from 2 to 15 in increments of 1; weight decay parameter: 0).

R2 was computed as the square of the Pearson correlation coefficient between observed and predicted discharge blood pressure, while MSE and MAE were estimated as \(\frac{1}{N}\sum\limits_{i = 1}^{N} {(Y_{i} - \hat{Y}_{i} )^{2} }\) and \(\frac{1}{N}\sum\limits_{i = 1}^{N} {|Y_{i} - \hat{Y}_{i} |}\), where \(Y\) denotes the observed blood pressure and \(\hat{Y}\) denotes the estimated blood pressure. We used implementations available in R statistical software (R version 3.6.1). For random forests we used the randomForest function from the randomForest package (version 4.6-14); the number of trees (500) was the default in this implementation. For boosted trees we used the gbm function from the gbm package (version 2.5.1); the number of trees (100) was the default in this implementation. We used the ols and rcs functions from the rms package (version 5.1-3.1) to estimate the OLS regression model incorporating restricted cubic regression splines.
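The three performance metrics just defined are straightforward to compute. A small sketch with made-up numbers is shown below in Python (the study itself used R); it simply implements the formulas above.

```python
import numpy as np

def performance_metrics(y_obs, y_pred):
    """R^2 (squared Pearson correlation), MSE, and MAE as defined in the text."""
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2
    mse = np.mean((y_obs - y_pred) ** 2)
    mae = np.mean(np.abs(y_obs - y_pred))
    return r2, mse, mae

# Illustrative use with made-up discharge systolic blood pressures (mmHg).
rng = np.random.default_rng(0)
y_obs = rng.normal(130, 15, size=500)
y_pred = y_obs + rng.normal(0, 12, size=500)
print(performance_metrics(y_obs, y_pred))
```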
Feed-forward (or multilayer perceptron) neural networks with a single hidden layer were fit using the nnet package (version 7.3-12) with a linear activation function. Ridge regression and the lasso were implemented using the functions cv.glmnet (for estimating the λ parameter using tenfold cross-validation) and glmnet from the glmnet package (version 2.0-18).

Performance in the validation sample (case study): random forests resulted in predictions with the highest R2 (23.7%); however, differences between five of the six methods were minimal again (range: 22.2 to 23.7%). Random forests resulted in estimates with the lowest MSE, while boosted trees resulted in estimates with the lowest MAE. MAE did not vary meaningfully across five of the six methods (range: 15.0 to 15.2). The neural network had substantially worse performance than the other five methods across all three metrics. When comparing the three linear model-based approaches, neither of the two penalized approaches (lasso and ridge regression) had an advantage over conventional OLS regression in either disease sample, and the lasso and ridge regression had very similar performance to each other. A tree-based machine learning method (either boosted trees or random forest) tended to result in estimates with the greatest predictive accuracy in the validation samples, but differences between five of the methods were minimal. Neural networks resulted in estimates with substantially worse performance compared to the other five methods.

We considered six different data-generating processes for each of the two diseases (AMI and CHF). We describe the approach in detail for the AMI sample; an identical approach was used with the CHF sample. We used the derivation and validation samples described in the empirical analyses above, with one modification to the validation samples. The validation samples used above consisted of 4444 subjects (AMI validation sample) and 6818 (CHF validation sample); in order to remove variation in external performance due to small sample sizes, we sampled with replacement from each validation sample to create validation samples consisting of 100,000 subjects. For each data-generating process, the method was fit in the derivation sample. The fitted model was then applied to both the derivation sample and the validation sample. Using the model/algorithm fit in the derivation sample, a predicted outcome (discharge systolic blood pressure) was obtained for each subject in each of the two datasets (derivation and validation samples). Using these predicted blood pressures at discharge, a continuous outcome was simulated for each subject as follows: a residual or prediction error was computed as the difference between the true observed discharge blood pressure and the estimated blood pressure obtained from the fitted model; a residual was drawn with replacement from the empirical distribution of residuals estimated in the previous step; and the sampled residual was added to the estimated discharge blood pressure. This quantity is the simulated outcome for the given patient. This process was then repeated in the validation sample to obtain a simulated outcome for each subject in the validation sample. Note that the given prediction model was only fit once (in the derivation sample) but was then applied in both the derivation and validation samples to obtain estimated values of discharge blood pressure. These simulated outcomes were then used as the 'true' outcomes in all subsequent analyses. The above process was used when the data-generating process was based on random forests.
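The residual-resampling scheme just described can be sketched as follows. This is a simplified illustration using scikit-learn's RandomForestRegressor as a stand-in for the fitted data-generating model (the study used R's randomForest) and synthetic covariates rather than the AMI/CHF data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Stand-in derivation and validation covariate matrices and outcomes.
n_dev, n_val, n_pred = 1000, 1000, 10
X_dev = rng.normal(size=(n_dev, n_pred))
X_val = rng.normal(size=(n_val, n_pred))
y_dev = 120 + 8 * X_dev[:, 0] + 5 * X_dev[:, 1] + rng.normal(0, 10, n_dev)

# Step 1: fit the data-generating model once, in the derivation sample only.
dgm = RandomForestRegressor(n_estimators=500, min_samples_leaf=20, random_state=0)
dgm.fit(X_dev, y_dev)

# Step 2: model-based predictions in both samples.
pred_dev = dgm.predict(X_dev)
pred_val = dgm.predict(X_val)

# Step 3: empirical residuals in the derivation sample.
residuals = y_dev - pred_dev

# Step 4: simulate 'true' outcomes by adding residuals sampled with
# replacement to the predictions in each sample.
y_sim_dev = pred_dev + rng.choice(residuals, size=n_dev, replace=True)
y_sim_val = pred_val + rng.choice(residuals, size=n_val, replace=True)

# These simulated outcomes would then be treated as the truth when fitting and
# validating each candidate learning method. (For an OLS-based process,
# normally distributed residuals would be drawn instead; see the text below.)
```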
When the data-generating process was based on OLS regression, we used a modified version of this process. Instead of sampling from the empirical distribution of residuals, we sampled residuals from a normal distribution with mean zero and standard deviation equal to that estimated for the error distribution from the OLS model. These sampled residuals were then added to the estimated discharge blood pressure to produce simulated continuous outcomes.

For a given pair of derivation and validation samples, we fit each of the six statistical/machine learning methods (boosted trees, random forests, neural networks, the lasso, ridge regression, and OLS regression) in the derivation sample and then applied the fitted model to the validation sample, yielding an estimated discharge blood pressure for each of the six prediction methods. The performance of the predictions obtained using each method was assessed using the three metrics described above (R2, MSE, and MAE). Thus, for a given data-generating process and a given prediction method, we obtained 1000 values of R2. For example, when outcomes were simulated in the derivation and validation samples using random forests, we assessed the predictive accuracy of boosted trees. This process was repeated using the datasets in which outcomes were simulated using the five other data-generating processes.

Performance in the AMI sample (external validation): across all six data-generating processes and across all three performance metrics, the use of neural networks tended to result in predictions with the lowest accuracy. Even when outcomes were simulated using a neural network, the other five methods tended to result in predictions with higher accuracy than did the use of neural networks. The difference in performance between neural networks and that of the other five methods was substantially greater than the differences amongst the other five methods. When outcomes were generated using boosted trees, the use of boosted trees tended to result in estimates with the highest R2, while estimates obtained using OLS regression tended to have comparable performance. When outcomes were generated using an OLS regression model, the use of OLS regression tended to result in estimates with the highest R2; the performance of OLS regression was followed by that of boosted trees and the two penalized regression methods. When outcomes were generated using a penalized regression method, the three linear regression models tended to result in estimates with the highest R2. When outcomes were generated using random forests, the use of boosted trees and random forests tended to result in estimates with the highest R2. When considering the three linear regression-based approaches, there was no advantage to using a penalized regression approach compared to using OLS regression. The differences between the five non-neural network approaches tended to be minimal; regardless of the data-generating process, the use of OLS regression tended to perform well and there were no meaningful benefits to using a different approach. MSE and MAE of estimates obtained using neural networks displayed high variability across the 1000 simulation replicates.

Performance in the CHF sample (external validation): when outcomes were simulated using random forests, the use of random forests tended to result in estimates with the highest R2, although the performance of boosted trees was comparable. When outcomes were generated using a linear regression-based approach, the three linear regression-based approaches tended to result in estimates with the highest R2. Similar results were observed when MSE and MAE were used to assess performance accuracy. When considering the three linear regression-based
estimation methods there were rarely meaningful benefits to using a penalized estimation method compared to using OLS regression There is a growing interest in comparing the relative performance of different machine and statistical learning methods for predicting patient outcomes To better understand differences in the relative performance of competing learning methods for predicting continuous outcomes we used two empirical comparisons and Monte Carlo simulations using six different data-generating processes each based upon a different learning method These simulations enabled us to examine the performance of methods different from those under which the data were generated compared to the method that was used to generate the data In both of the empirical analyses and in all six sets of Monte Carlo simulations the performance of neural networks was substantially poorer than that of the other five learning methods the number of subjects in both of our derivation samples and in both of our validation samples were substantially higher than those used in these previous studies An advantage to the current study was its use of simulations to compare the relative performance of different learning methods for predicting blood pressure A strength of the design of these simulations is that they were based on two real data sets each with a realistic correlation structure between predictors and with realistic associations between predictors and outcomes we were able to simulate datasets reflective of those that would be seen in specific clinical contexts both the sizes of the simulated dataset and the number of predictors that we considered are reflective of what is often encountered in clinical research Some might argue that the number of predictors (33 and 28 in the AMI and CHF studies respectively) is relatively high for conventional regression modeling and relatively low for modern machine learning techniques the use of boosting resulted in improved performance The objective of the current study was not to develop a new learning method nor was it to improve existing learning methods17 Our objective was to compare the relative performance of different learning methods for predicting a continuous outcome while there is a growing number of studies comparing different learning methods the large majority of these studies rely on empirical comparisons using a single dataset A strength of the current study is its use of Monte Carlo simulations to conduct these comparisons systematically A methodological contribution of the current study is providing a framework for Monte Carlo simulations that allows for a more informed comparison of different learning methods Because we knew which learning method was the true model that generated the outcomes the performance of each of the other five methods could be compared to that of the true method we demonstrated that when outcomes were generated using boosted trees the use of OLS regression had performance comparable to that of boosted trees for predicting blood pressure (in the AMI sample) we found that a default implementation of a neural network had substantially poorer performance compared to five other learning methods for predicting discharge systolic blood pressure in patients hospitalized with heart disease This finding was observed both in two sets of empirical analyses and in six sets of Monte Carlo simulations We also observed that there was no meaningful advantage to the use of penalized linear models (i.e. 
the lasso or ridge regression) compared to using OLS regression Boosted trees tended to have the best performance of the different machine learning methods for the number of covariates studied Investigators interested in predicting blood pressure may often be able to limit their attention to OLS regression and boosted trees and select the method that performs best in their specific context We encourage researchers to apply our simulation framework to other diseases and other empirical datasets to examine whether our findings persist across different settings and diseases The use of data in this project was authorized under Section 45 of Ontario’s Personal Health Information Protection Act which does not require review by a Research Ethics Board This study did not include experiments involving human subjects or tissue samples The data sets used for this study were held securely in a linked, de-identified form and analysed at ICES. While data sharing agreements prohibit ICES from making the data set publicly available, access may be granted to those who meet pre-specified criteria for confidential access, available at www.ices.on.ca/DAS If you are interested in requesting ICES Data & Analytic Services please contact ICES DAS (e-mail: das@ices.on.ca or at 1-888-480-1327) Random forest versus logistic regression: A large-scale benchmark experiment Comparison of artificial neural network and logistic regression models for prediction of outcomes in trauma patients: A systematic review and meta-analysis conventional statistical models for predicting heart failure readmission and mortality Effectiveness of public report cards for improving the quality of cardiac care: the EFFECT study: a randomized trial Regression trees for predicting mortality in patients with cardiovascular disease: what improvement is achieved by using ensemble-based methods? Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N and multivariate adaptive regression splines for predicting AMI mortality In Machine Learning: Proceedings of the Thirteenth International Conference 148–156 (Morgan Kauffman Additive logistic regression: A statistical view of boosting (with discussion) Propensity score estimation with boosted regression for evaluating causal effects in observational studies The Elements of Statistical Learning 2nd edn Machine learning compared with conventional statistical models for predicting myocardial infarction readmission and mortality: A systematic review A plea for neutral comparison studies in computational sciences Ten quick tips for machine learning in computational biology Introduction to Neural Networks with Java 2nd edn Predicting increased blood pressure using machine learning Predicting hypertension using machine learning: Findings from Qatar Biobank Study Predicting systolic blood pressure using machine learning In 7th International Conference on Information and Automation for Sustainability 1–6 (2014) Predicting blood pressure from physiological index data using the SVR algorithm Modern modelling techniques are data hungry: A simulation study for predicting dichotomous endpoints Random Forest vs Logistic Regression: Binary classification for heterogeneous datasets A comparison of machine learning techniques for customer churn prediction Predictive analytics in health care: How can we know it works? 
This study was supported by ICES, which is funded by an annual grant from the Ontario Ministry of Health (MOH) and the Ministry of Long-Term Care (MLTC). As a prescribed entity under Ontario's privacy legislation, ICES is authorized to collect and use health care data for the purposes of health system analysis. Secure access to these data is governed by policies and procedures that are approved by the Information and Privacy Commissioner of Ontario. This research was supported by an operating grant from the Canadian Institutes of Health Research (CIHR) (PJT-166161). Austin is supported in part by Mid-Career Investigator awards from the Heart and Stroke Foundation. Harrell's work on this paper was supported by CTSA award No. UL1 TR002243 from the National Center for Advancing Translational Sciences; its contents are solely the responsibility of the authors and do not necessarily represent official views of the National Center for Advancing Translational Sciences or the National Institutes of Health. The use of data in this project was authorized under section 45 of Ontario's Personal Health Information Protection Act. Author contributions: coded the simulations and wrote the first draft of the manuscript; contributed to the design of the simulations; provided clinical expertise and revised the manuscript. The authors declare no competing interests.

A Commentary on "Artificial Intelligence and Statistics: Just the Old Wine in New Wineskins?" by Faes, L., Sim, D. A., van Smeden, M., Held, U., Bossuyt, P. M., and Bachmann, L. M. (2022). Front. Digit. Health 4:833912. doi: 10.3389/fdgth.2022.833912.

We write to expand on Faes et al.'s recent publication "Artificial intelligence and statistics: Just the old wine in new wineskins?" (1). The authors rightly address a lack of consensus regarding terminology between the statistics and machine learning fields. Guidance is needed to provide a more unified way of reporting and comparing study results between the different fields. Major differences can be observed in the measures commonly used across these axes to evaluate predictive performance in the statistics and machine learning fields. We here highlight key measures focusing on discriminative ability and clinical utility [or effectiveness (6)]. Table 1 (evaluation measures from the statistics and machine learning fields) provides a non-exhaustive overview. All measures relate to the evaluation of probability predictions for binary outcomes. They are derived from the 2 × 2 confusion matrix for specific or consecutive decision thresholds. Sensitivity (fraction true positive) and specificity (fraction true negative) can be considered as independent of the event rate. Some measures are considered outdated in the classic statistical learning field while still popular in the machine learning field. One such measure is the crude accuracy (the fraction of correct classifications): for example, in a setting with a 1% event rate, classifying all subjects as "low risk" already yields 99% accuracy. Decision analytical approaches move away from pure discrimination and toward clinical utility.
Net benefit is the most popular among some recently proposed measures for clinical utility (4, 5). It is derived from a decision analytical framework and weighs sensitivity and specificity by clinical consequences. Net benefit has a clear interpretation when compared to treat-all and treat-none strategies (4, 5). We recommend that the aim of the evaluation of a model should determine our focus, whether on clinical performance (discrimination, with quantification by appropriate measures) or on clinical utility.

All authors contributed to the article and approved the submitted version. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References:
Artificial intelligence and statistics: just the old wine in new wineskins
Measures to summarize and compare the predictive capacity of markers
Calibration: the achilles heel of predictive analytics
Decision curve analysis: a novel method for evaluating prediction models
Net benefit approaches to the evaluation of prediction models
From biomarkers to medical tests: the changing landscape of test evaluation
The need for reorientation toward cost-effective prediction: comments on 'Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond' by M
Using relative utility curves to evaluate risk prediction
The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets
The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. PA: Association for Computing Machinery (2006)

Citation: de Hond AAH, van Calster B and Steyerberg EW (2022). Commentary: Artificial Intelligence and Statistics: Just the Old Wine in New Wineskins. Received: 19 April 2022; Accepted: 03 May 2022; Published: 20 May 2022. Copyright © 2022 de Hond, van Calster and Steyerberg. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). Correspondence: Anne A. H. de Hond.
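For readers unfamiliar with the net benefit measure discussed above, it can be computed at a given risk threshold as in the following sketch; the data, threshold, and function names are illustrative only and do not come from the commentary.

```python
import numpy as np

def net_benefit(y, p_hat, threshold):
    """Net benefit of 'treat if predicted risk >= threshold' at one threshold."""
    n = len(y)
    treat = p_hat >= threshold
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(y, threshold):
    """Net benefit of treating everyone, the usual reference strategy."""
    event_rate = np.mean(y)
    return event_rate - (1 - event_rate) * threshold / (1 - threshold)

# Synthetic predicted risks and outcomes.
rng = np.random.default_rng(7)
p_hat = rng.uniform(0, 1, 1000)
y = rng.binomial(1, p_hat)

t = 0.20
print("model:     ", round(net_benefit(y, p_hat, t), 3))
print("treat all: ", round(net_benefit_treat_all(y, t), 3))
print("treat none:", 0.0)
```

Plotting these quantities over a range of thresholds gives a decision curve, which is how net benefit is usually compared against the treat-all and treat-none strategies.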
The use of evidence from clinical trials to support decisions for individual patients is a form of "reference class forecasting": implicit predictions for an individual are made on the basis of outcomes in a reference class of "similar" patients treated with alternative therapies. Evidence based medicine has generally emphasized the broad reference class of patients qualifying for a trial. Yet patients in a trial (and in clinical practice) differ from one another in many ways that can affect the outcome of interest and the potential for benefit. One approach is to narrow the reference class to yield more patient specific effect estimates to support more individualized clinical decision making. This article will review fundamental conceptual problems with the prediction of outcome risk and heterogeneity of treatment effect (HTE), as well as the limitations of conventional (one-variable-at-a-time) subgroup analysis. It will also discuss several regression based approaches to "predictive" heterogeneity of treatment effect analysis, including analyses based on "risk modeling" (such as stratifying trial populations by their risk of the primary outcome or their risk of serious treatment-related harms) and analysis based on "effect modeling" (which incorporates modifiers of relative effect). It will illustrate these approaches with clinical examples and discuss their respective strengths and vulnerabilities.

Series explanation: State of the Art Reviews are commissioned on the basis of their relevance to academics and specialists in the US and internationally. For this reason they are written predominantly by US authors. Contributors: the concepts of this manuscript were discussed among all authors; DMK prepared the initial draft of the manuscript; substantial revisions were made by all authors. Funding: this work was partially supported through two Patient-Centered Outcomes Research Institute (PCORI) grants (the Predictive Analytics Resource Center (PARC) (SA.Tufts.PARC.OSCO.2018.01.25) and Methods Award (ME-1606-35555)), as well as by the National Institutes of Health (U01NS086294). Competing interests: all authors have read and understood BMJ policy on declaration of interests and declare no competing interests. Provenance and peer review: commissioned; externally peer reviewed. Disclosures: all statements in this report are solely those of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI).
Early detection of severe asthma exacerbations through home monitoring data in patients with stable mild-to-moderate chronic asthma could help to timely adjust medication. We evaluated the potential of machine learning methods, compared to a clinical rule and logistic regression, to predict severe exacerbations. We used daily home monitoring data from two studies in asthma patients (development: n = 165 and validation: n = 101 patients). Machine learning models (XGBoost, one class SVM) and a logistic regression model provided predictions based on peak expiratory flow and asthma symptoms. These models were compared with an asthma action plan rule. Severe exacerbations occurred in 0.2% of all daily measurements in the development (154/92,787 days) and validation cohorts (94/40,185 days). The AUC of the best performing XGBoost was 0.85 (0.82–0.87) and 0.88 (0.86–0.90) for logistic regression in the validation cohort. The XGBoost model provided overly extreme risk estimates, whereas the logistic regression underestimated predicted risks. Sensitivity and specificity were better overall for XGBoost and logistic regression compared to one class SVM and the clinical rule. We conclude that ML models did not beat logistic regression in predicting short-term severe asthma exacerbations based on home monitoring data. Clinical application remains challenging in settings with low event incidence and high false alarm rates at high sensitivity; moreover, few such models have been externally validated.

Figure: (c) use of \(\upbeta \)2 reliever (No M&E = no morning and evening; Yes M&E = yes morning and evening) over time for three patients; the case of no exacerbations (top figure) is most prevalent in the data. Figure: ROC curve for predictions from XGBoost and the logistic regression model; the sensitivity and specificity of the one class SVM and clinical prediction rule are also plotted on the left curve. On the left, the points corresponding to the 0.001 ('t = 0.001') and 0.002 ('t = 0.002') probability thresholds are plotted for the XGBoost and logistic regression model. On the right, the points corresponding to the thresholds resulting in 138 positive predictions ('t for 138 pos pred', equaling the clinical rule positive predictions) are plotted for the XGBoost and logistic regression model.

For the 0.2% threshold, the XGBoost model obtained a sensitivity of 0.59, a specificity of 0.89, a positive predictive value (PPV) of 0.02, and a negative predictive value (NPV) of 1 (Table 3). With 138 positive predictions, as for the clinical rule, the XGBoost and logistic regression models again had a higher sensitivity and PPV. The differences between the AUCs of the best performing logistic regression model with one lag and XGBoost model with five lags were still significant (p = 0.02).

We aimed to assess the performance of ML techniques and classic models for short-term prediction of severe asthma exacerbations based on home monitoring data. ML and logistic regression both reached higher discriminative performance than a previously proposed simple clinical rule. Logistic regression provided slightly better discriminative performance than the XGBoost algorithm, but logistic regression still produced many false positives at high levels of sensitivity. This finding may be explained by the (lack of) complexity of the data that was studied. An advantage of ML techniques is the natural flexibility they offer to model complex (e.g., non-linear) relations, versus logistic regression techniques that have the advantage of being easily interpretable. Our findings illustrate that the flexibility provided by ML models may not always be needed to arrive at the best performing prediction model for medical data.
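To illustrate how the threshold-specific metrics reported above are obtained, the sketch below computes sensitivity, specificity, PPV, and NPV at a probability threshold. The data are simulated to mimic a rare outcome; the numbers are not the study data.

```python
import numpy as np

def threshold_metrics(y, p_hat, threshold):
    """Sensitivity, specificity, PPV, and NPV for 'alarm if risk >= threshold'."""
    pred = p_hat >= threshold
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    tn = np.sum(~pred & (y == 0))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if (tp + fp) > 0 else float("nan"),
        "npv": tn / (tn + fn) if (tn + fn) > 0 else float("nan"),
        "positive_predictions": int(tp + fp),
    }

# Synthetic example with a rare outcome (about 0.2% of days), so that even a
# well-discriminating model produces many false alarms at a low threshold.
rng = np.random.default_rng(5)
y = rng.binomial(1, 0.002, size=100_000)
p_hat = np.clip(0.002 + 0.05 * y + rng.normal(0, 0.01, size=y.size), 0, 1)

for t in (0.001, 0.002):
    print(t, threshold_metrics(y, p_hat, t))
```

The very low PPV at such thresholds mirrors the false alarm problem discussed in the text: with an event incidence around 0.2%, most positive predictions are false positives even when sensitivity and specificity look reasonable.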
The benefits of ML methods may differ between settings and should be further investigated. Improvement in discriminative ability may be achieved by reducing the noise in the exacerbation event at the time of data collection: the recording of severe exacerbations in our dataset might have been incomplete, or there might have been a delay between the recording of the exacerbations and their true onset. Alternatively, better predicting variables of exacerbations may be needed. Our findings form a counterexample by showing that inherently interpretable techniques such as logistic regression may outperform ML for certain application types and clinical settings. Interpretability is especially relevant for clinical settings, as physicians often prefer interpretable models to assist in clinical decision making. Our findings therefore contribute to answering the question of when and how to apply ML methods safely and effectively. The data used in this study contained few missing values, so the quality of the data was high; this implies that the registration method is unlikely to affect our conclusions. In conclusion, ML models may not outperform a classical regression prediction model in predicting short-term asthma exacerbations based on home monitoring data. A simple regression model outperforms a simple rule due to the high false alarm rate associated with the low probability thresholds required for high sensitivity.

All patients had stable mild-to-moderate chronic asthma. Both studies were conducted in an asthma clinic in New Zealand on patients referred by their general practitioners. Patients recorded their peak expiratory flow and use of \(\upbeta \)2-reliever (yes/no) in the morning and evening of every trial day in diaries. Nocturnal awakening (yes/no) was recorded in the morning (see below). All predictors were measured or calculated daily: the average of morning and evening peak expiratory flow (PEF, measured in liters per minute) and the use of \(\upbeta \)2-reliever in morning and evening (used in both morning and evening/used in morning or evening/not used in morning and evening) were considered as potential predictors. Over a rolling window we calculated summaries such as the maximum and minimum and added these as predictors; this rolling window consisted of the current day and all 6 preceding days. The PEF personal best was determined per patient during a run-in period of 4 weeks and added to the models. We also constructed and added first differences (the difference in today's measurement with respect to yesterday's measurement) and lags (yesterday's measurement) for PEF. Demographics and descriptive statistics of predictors (i.e.
and use of \(\upbeta \)2-reliever) were calculated for each individual patient over their respective observational periods The XGBoost model estimates many decision-trees sequentially These decision tree predictions are combined into an ensemble model to arrive at the final predictions The sequential training makes the XGBoost model faster and more efficient than other tree-based algorithms tuning an XGBoost model may become increasingly difficult which is less of an issue with other tree-based models like random forest Second, we trained an outlier detection model (one class SVM with Radial Basis Kernel)34 The one class SVM aims to find a frontier that delimits the contours of the original distribution it can identify whether a new data point falls outside of the original distribution and should therefore be classified as ‘irregular’ An advantage of this model is that it is particularly apt at dealing with the low event rate in the asthma data A downside of this model is that it does not provide probability estimates like a regular support vector machine and we therefore must base its predictive performance on its classification metrics only (see below) it may however not provide the level of complexity needed to adequately model certain prediction problems which comes at a cost of the interpretability of these methods Confidence intervals were obtained through bootstrapping (based on a 1000 iterations) and negative predictive value (NPV) were calculated for all models at the following probability thresholds (the cut-off point at which probabilities are converted into binary outcomes): 0.1% and 0.2% These were chosen as they circle the prevalence rate of the outcome in our data For a fair comparison with the clinical rule we also calculated these performance metrics (sensitivity etc.) for the XGBoost and logistic regression models at the probability thresholds producing the same number of positive predictions as produced by the one class SVM and the clinical rule We performed a sensitivity analysis for predicting exacerbations within 4 and 8 days as opposed to 2 days (Table 4) This enabled us to study the effect of a variation in the length of the outcome window on the models’ discrimination and calibration capacities we performed a sensitivity analysis to assess the effect of the number of lags on model performance we varied the number of lags from 1 to 5 for the models predicting exacerbations within 2 days For the XGBoost and logistic regression model All analyses were performed in Python 3.8.0. with R 3.6.3 plug-ins to obtain calibration results. The key functions and libraries can be found in additional file 2 Ethics approval was obtained for the original data collection These studies were conducted in accordance with the principles of the Declaration of Helsinki on biomedical research The protocols were approved by the Otago and Canterbury ethics committees and all patients gave written informed consent prior to participation The datasets analyzed during the current study are not publicly available due to privacy restrictions but are available to reviewers on reasonable request Remote patient monitoring: A comprehensive study Honkoop, P. J., Taylor, D. R., Smith, A. D., Snoeck-Stroband, J. B. & Sont, J. K. Early detection of asthma exacerbations by using action points in self-management plans. Eur. Respir. J. 41, 53–59. https://doi.org/10.1183/09031936.00205911 (2013) Fine, M. J. et al. A prediction rule to identify low-risk patients with community-acquired pneumonia. N. Engl. J. 
Med. 336, 243–250. https://doi.org/10.1056/NEJM199701233360402 (1997) Derivation of a simple clinical model to categorize patients probability of pulmonary embolism: Increasing the models utility with the SimpliRED d-dimer British Thoraic Society. British Guideline on the Management of Asthmahttps://doi.org/10.1136/thx.2008.097741 (2019) Mak, R. H. et al. Use of crowd innovation to develop an artificial intelligence-based solution for radiation therapy targeting. JAMA Oncol. 5, 654–661. https://doi.org/10.1001/jamaoncol.2019.0159 (2019) Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118. https://doi.org/10.1038/nature21056 (2017) McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94. https://doi.org/10.1038/s41586-019-1799-6 (2020) Cearns, M., Hahn, T. & Baune, B. T. Recommendations and future directions for supervised machine learning in psychiatry. Transl. Psychiatry 9, 271. https://doi.org/10.1038/s41398-019-0607-2 (2019) Neuhaus, A. H. & Popescu, F. C. Sample size, model robustness, and classification accuracy in diagnostic multivariate neuroimaging analyses. Biol. Psychiatry 84, e81–e82. https://doi.org/10.1016/j.biopsych.2017.09.032 (2018) Chen, P.-H.C., Liu, Y. & Peng, L. How to develop machine learning models for healthcare. Nat. Mater. 18, 410–414. https://doi.org/10.1038/s41563-019-0345-0 (2019) Altman, D. G., Vergouwe, Y., Royston, P. & Moons, K. G. M. Prognosis and prognostic research: Validating a prognostic model. BMJ 338, b605. https://doi.org/10.1136/bmj.b605 (2009) Wynants, L., Smits, L. J. M. & Van Calster, B. Demystifying AI in healthcare. BMJ 370, m3505. https://doi.org/10.1136/bmj.m3505 (2020) In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) Christodoulou, E. et al. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J. Clin. Epidemiol. 110, 12–22. https://doi.org/10.1016/j.jclinepi.2019.02.004 (2019) Gravesteijn, B. Y. et al. Machine learning algorithms performed no better than regression models for prognostication in traumatic brain injury. J. Clin. Epidemiol. 122, 95–107. https://doi.org/10.1016/j.jclinepi.2020.03.005 (2020) Nusinovici, S. et al. Logistic regression was as good as machine learning for predicting major chronic diseases. J. Clin. Epidemiol. 122, 56–69. https://doi.org/10.1016/j.jclinepi.2020.03.002 (2020) Martin, A. et al. Development and validation of an asthma exacerbation prediction model using electronic health record (EHR) data. J. Asthma 57, 1339–1346. https://doi.org/10.1080/02770903.2019.1648505 (2020) Sanders, S., Doust, J. & Glasziou, P. A systematic review of studies comparing diagnostic clinical prediction rules with clinical judgment. PLoS ONE 10, e0128233. https://doi.org/10.1371/journal.pone.0128233 (2015) Satici, C. et al. Performance of pneumonia severity index and CURB-65 in predicting 30-day mortality in patients with COVID-19. Int. J. Infect. Dis. 98, 84–89. https://doi.org/10.1016/j.ijid.2020.06.038 (2020) Obradović, D. et al. Correlation between the Wells score and the Quanadli index in patients with pulmonary embolism. Clin. Respir. J. 10, 784–790. https://doi.org/10.1111/crj.12291 (2016) Winters, B. D. et al. Technological distractions (Part 2): A summary of approaches to manage clinical alarms with intent to reduce alarm fatigue. Crit. Care Med. 46, 130–137. 
https://doi.org/10.1097/ccm.0000000000002803 (2018) Mori, T. & Uchihira, N. Balancing the trade-off between accuracy and interpretability in software defect prediction. Empir. Softw. Eng. 24, 779–825. https://doi.org/10.1007/s10664-018-9638-1 (2019) Johansson, U., Sönströd, C., Norinder, U. & Boström, H. Trade-off between accuracy and interpretability for predictive in silico modeling. Future Med. Chem. 3, 647–663. https://doi.org/10.4155/fmc.11.23 (2011) Wallace, B. C. & Dahabreh, I. J. Improving class probability estimates for imbalanced data. Knowl. Inf. Syst. 41, 33–52. https://doi.org/10.1007/s10115-013-0670-6 (2014) Van Calster, B. et al. Calibration: The Achilles heel of predictive analytics. BMC Med. 17, 230. https://doi.org/10.1186/s12916-019-1466-7 (2019) Honkoop, P. J. et al. MyAirCoach: The use of home-monitoring and mHealth systems to predict deterioration in asthma control and the occurrence of asthma exacerbations; study protocol of an observational study. BMJ Open 7, e013935. https://doi.org/10.1136/bmjopen-2016-013935 (2017) Finkelstein, J. & Jeong, I. C. Machine learning approaches to personalize early prediction of asthma exacerbations. Ann. N. Y. Acad. Sci. 1387, 153–165. https://doi.org/10.1111/nyas.13218 (2017) Sanchez-Morillo, D., Fernandez-Granero, M. A. & Leon-Jimenez, A. Use of predictive algorithms in-home monitoring of chronic obstructive pulmonary disease and asthma: A systematic review. Chron. Respir. Dis. 13, 264–283. https://doi.org/10.1177/1479972316642365 (2016) Smith, A. D., Cowan, J. O., Brassett, K. P., Herbison, G. P. & Taylor, D. R. Use of exhaled nitric oxide measurements to guide treatment in chronic asthma. N. Engl. J. Med. 352, 2163–2173. https://doi.org/10.1056/NEJMoa043596 (2005) Taylor, D. R. et al. Asthma control during long-term treatment with regular inhaled salbutamol and salmeterol. Thorax 53, 744–752. https://doi.org/10.1136/thx.53.9.744 (1998) Smith, A. E., Nugent, C. D. & McClean, S. I. Evaluation of inherent performance of intelligent medical decision support systems: Utilising neural networks as an example. Artif. Intell. Med. 27, 1–27. https://doi.org/10.1016/s0933-3657(02)00088-x (2003) Tree boosting with xgboost-why does xgboost win" every" machine learning competition In Proceedings of the International Joint Conference on Neural Networks Schober, P. & Vetter, T. R. Logistic regression in medical research. Anesth. Analg. 132, 365–366. https://doi.org/10.1213/ANE.0000000000005247 (2021) Clinical Prediction Models (Springer Nature Download references Taylor for contributing to the data collection Department of Information Technology and Digital Innovation Clinical AI Implementation and Research Lab analyzed the data and drafted the manuscript All authors read and approved the final manuscript Download citation DOI: https://doi.org/10.1038/s41598-022-24909-9 Sign up for the Nature Briefing newsletter — what matters in science Metrics details Clinical prediction models are often not evaluated properly in specific settings or updated These key steps are needed such that models are fit for purpose and remain relevant in the long-term We aimed to present an overview of methodological guidance for the evaluation (i.e. 
validation and impact assessment) and updating of clinical prediction models We systematically searched nine databases from January 2000 to January 2022 for articles in English with methodological recommendations for the post-derivation stages of interest Qualitative analysis was used to summarize the 70 selected guidance papers Key aspects for validation are the assessment of statistical performance using measures for discrimination (e.g. calibration-in-the-large and calibration slope) For assessing impact or usefulness in clinical decision-making recent papers advise using decision-analytic measures (e.g. the Net Benefit) over simplistic classification measures that ignore clinical consequences (e.g. Commonly recommended methods for model updating are recalibration (i.e. adjustment of intercept or baseline hazard and/or slope) re-estimation of individual predictor effects) Additional methodological guidance is needed for newer types of updating (e.g. meta-model and dynamic updating) and machine learning-based models Substantial guidance was found for model evaluation and more conventional updating of regression-based models An important development in model evaluation is the introduction of a decision-analytic framework for assessing clinical usefulness Consensus is emerging on methods for model updating Framework from derivation to implementation of clinical prediction models The focus of this systematic review is on model evaluation (validation Further clarification of terminologies and methods for model evaluation may benefit applied researchers We therefore aim to provide an overview of methodological guidance for the post-derivation stages of clinical prediction models we focus on methods for examining an existing model’s validity in specific settings we outline consensus on definitions to support the methodological discussion and we highlight gaps that require further research Articles were included if they 1) provided methodological “guidance” (i.e. or model updating; 2) were written in English; and 3) were published between January 2000 and January 2022 as well as papers that discussed only one statistical technique or provided guidance not generalizable outside of a specific disease area Initial selection based on title and abstract were conducted independently by two researchers (M.A.E.B and any discrepancies were resolved through consensus meetings methodological topic(s) discussed) were extracted and thematic analysis was used for summarization Full text assessment and data extraction were performed by one researcher (M.A.E.B.) The results were reviewed by three researchers (E.W.S. Ethics approval was not required for this review A summary of methodological guidance for model validation Internal validation is the minimum requirement for clinical prediction models External validation is recommended to evaluate model generalizability in different but plausibly related settings Designs for validation studies differ in strength (e.g. 
temporal validation is a weak form of validation Examination of two validation aspects (discrimination and calibration) is recommended for assessing statistical performance irrespective of the type of validation Clinical usefulness is a common area between validation and impact assessment and its examination is advised for assessing the clinical performance of models intended to be used for medical decision-making Several performance aspects can be examined in a validation study, with various measures proposed for each (see Additional file 3 for a more complete list): The minimum threshold for useful models can only be defined by examining decision-analytic measures (e.g. A summary of methodological guidance for the assessment of a model’s impact Potential impact can be examined through clinical performance measures (e.g. Decision Curve Analysis) or health economic analysis (e.g. Assessing actual impact requires comparative empirical studies such as cluster randomized trials or other designs (e.g. The literature distinguishes four types of model updating for regression-based models (Fig. 4). Updating methods for more computationally-intensive models (e.g., deep neural networks) were not identified. A summary of methodological guidance for model updating recalibration) is often sufficient when the differences between the derivation and new data are minimal partial to full revision) may be appropriate Model extension allows the incorporation of new markers in a model to develop a meta-model that can be further updated for a new dataset Updating can also be done periodically or continuously Clinical prediction models are evidence-based tools that can aid in personalized medical decision-making their applicability and usefulness are ideally evaluated prior to their clinical adoption Suboptimal performance may be improved by model adjustment or re-specification to incorporate additional information from a specific setting or to include new markers We aimed to provide a summary of contemporary methodological guidance for the evaluation (validation and impact assessment) and updating of clinical prediction models this is the first comprehensive review of guidance for these post-derivation stages Guidance for updating is limited to regression-based models only the validation of dynamic prediction models We did not identify caveats for model updating when the clinical setting is not ideal (e.g. 
very effective treatments are used for high-risk patients defined by the prediction model) We also did not identify methods for retiring or replacing predictors that may have lost their clinical significance over time Further research and additional guidance are necessary in these areas which may help standardize concepts and methods The post-derivation stages of clinical prediction models are important for optimizing model performance in new settings that may be contextually different from or beyond the scope of the initial model development Substantial methodological guidance is available for model evaluation (validation and impact assessment) and updating we found that performance measures based on decision analysis provide additional practical insight beyond statistical performance (discrimination and calibration) measures we identified various methods including recalibration Additional guidance is necessary for machine learning-based models and relatively new types of updating Our summary can be used as a starting point for researchers who want to perform post-derivation research or critique published studies of similar nature All data generated and analyzed during this review are included in the manuscript and its additional files Area Under the Receiver Operating Characteristic curve Preferred Reporting Items for Systematic Reviews and Meta-Analyses PREDICT: a new UK prognostic model that predicts survival following surgery for invasive breast cancer NABON. Dutch Guideline Breast Cancer (Landelijke richtlijn borstkanker). [Available from: https://richtlijnendatabase.nl/richtlijn/borstkanker/adjuvante_systemische_therapie.html] Prediction models for cardiovascular disease risk in the general population: systematic review Shared decision making: really putting patients at the centre of healthcare The predictive accuracy of PREDICT: a personalized decision-making tool for southeast Asian women with breast cancer Impact of provision of cardiovascular disease risk estimates to healthcare professionals and patients: a systematic review revision and combination of prognostic survival models Incorporating progesterone receptor expression into the PREDICT breast prognostic model Inclusion of KI67 significantly improves performance of the PREDICT prognostication and prediction model for early breast cancer PREDICT plus: development and validation of a prognostic model for early breast cancer that includes HER2 Prognosis research strategy (PROGRESS) 3: prognostic model research Prognostic models for breast cancer: a systematic review Prediction models for patients with esophageal or gastric cancer: a systematic review and meta-analysis Reporting of prognostic clinical prediction models based on machine learning methods in oncology needs to be improved TRIPOD statement: a preliminary pre-post analysis of reporting and methods of prediction models Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement A systematic review finds prediction models for chronic kidney disease were poorly reported and often developed using inappropriate methods Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal Net reclassification improvement: computation and controversies: a literature review and clinician's guide The net reclassification index (NRI): a misleading measure of 
prediction improvement even with independent test data sets Net risk reclassification p values: valid or misleading A scoping review of interactive and personalized web-based clinical tools to support treatment decision making in breast cancer Moorthie S. What is clinical utility?: PHG Foundation - University of Cambridge. [Available from: https://www.phgfoundation.org/explainer/clinical-utility] step-by-step guide to interpreting decision curve analysis Prediction models for the risk of gestational diabetes: a systematic review Meta-analysis and aggregation of multiple published prediction models Validity of prediction models: when is a model clinically useful External validation is necessary in prediction research: a clinical example On criteria for evaluating models of absolute risk updating and impact of clinical prediction rules: a review Prognosis and prognostic research: application and impact of prognostic models in clinical practice Evaluating the prognostic value of new cardiovascular biomarkers Prognostic models: a methodological framework and review of models for breast cancer Traditional statistical methods for evaluating prediction models are uninformative as to clinical value: towards a decision analytic framework Assessing the performance of prediction models: a framework for traditional and novel measures Everything you always wanted to know about evaluating prediction models (but were too afraid to ask) and assessing the incremental value of a new (bio)marker Towards better clinical prediction models: seven steps for development and an ABCD for validation Risk prediction models: a framework for assessment External validation of a Cox prognostic model: principles and methods A new framework to enhance the interpretation of external validation studies of clinical prediction models Con: Most clinical risk scores are useless calibration: the Achilles heel of predictive analytics Key steps and common pitfalls in developing and validating risk models Methodological standards for the development and evaluation of clinical prediction rules: a review of the literature A framework for the evaluation of statistical prediction models Minimum sample size for external validation of a clinical prediction model with a continuous outcome External validation of prognostic models: what Minimum sample size calculations for external validation of a clinical prediction model with a time-to-event outcome Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review Translating clinical research into clinical practice: impact of using prediction rules to make decisions Assessing the incremental value of diagnostic and prognostic markers: a review and illustration Added predictive value of high-throughput molecular data to clinical data and its validation Assessing new biomarkers and predictive models for use in clinical practice: a clinician's guide A framework for quantifying net benefits of alternative prognostic models Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond Assessing the incremental predictive performance of novel biomarkers over standard predictors Framework for the impact analysis and implementation of clinical prediction rules (CPRs) Beyond diagnostic accuracy: the clinical utility of diagnostic tests Evaluating the impact of prediction models: lessons learned Criteria for evaluation of novel markers of cardiovascular risk: a scientific statement from the 
American Heart Association Ten steps towards improving prognosis research Good practice guidelines for the use of statistical regression models in economic evaluations A simple framework to identify optimal cost-effective risk thresholds for a single screen: comparison to decision curve analysis Updating methods improved the performance of a clinical prediction model in new patients Aggregating published prediction models with individual participant data: a comparison of different approaches A closed testing procedure to select an appropriate method for updating prediction models Updating risk prediction tools: a case study in prostate cancer Methods for updating a risk prediction model for cardiac surgery: a statistical primer Validation and updating of risk models based on multinomial logistic regression Improving prediction models with new markers: a comparison of updating strategies Individual participant data (IPD) meta-analyses of diagnostic and prognostic modeling studies: guidance on their use Improved prediction by dynamic modeling: an exploratory study in the adult cardiac surgery database of the Netherlands Association for Cardio-Thoracic Surgery Adaptation of clinical prediction models for application in local settings Dynamic models to predict health outcomes: current status and methodological challenges Updating clinical prediction models: an illustrative case study Comparison of dynamic updating strategies for clinical prediction models Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers The relationship between precision-recall and ROC curves; 2006 Evaluating a new marker for risk prediction using the test tradeoff: an update The summary test tradeoff: a new measure of the value of an additional risk prediction marker Evaluating a new marker for risk prediction: decision analysis to the rescue Two further applications of a model for binary regression Conditional logit analysis of qualitative choice behavior Regression modelling strategies for improved prognostic prediction A unified inference procedure for a class of measures to assess improvement in risk prediction systems with survival data Concordance probability and discriminatory power in proportional hazards regression Time-dependent ROC curves for censored survival data and a diagnostic marker Net reclassification improvement and integrated discrimination improvement require calibrated models: relevance from a marker and model perspective A note on the evaluation of novel biomarkers: do not rely on integrated discrimination improvement and net reclassification index Can machine-learning improve cardiovascular risk prediction using routine clinical data Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence Download references This work was financially supported by Health~Holland grant number LSHM19121 (https://www.health-holland.com) received by MKS the Netherlands Cancer Institute – Antoni van Leeuwenhoek Hospital Division of Psychosocial Research and Epidemiology The Netherlands Cancer Institute – Antoni van Leeuwenhoek Hospital critical revision of the manuscript; WS: design critical revision of the manuscript; EWS: design Description of methods and related information Overview of selected articles included in the review Summary of performance measures from the selected methodological literature Download citation DOI: 
https://doi.org/10.1186/s12874-022-01801-8 Metrics details Clinical prediction models (CPMs) are tools that compute the risk of an outcome given a set of patient characteristics and are routinely used to inform patients Although much hope has been placed on CPMs to mitigate human biases CPMs may potentially contribute to racial disparities in decision-making and resource allocation and scholars have called for eliminating race as a variable from CPMs others raise concerns that excluding race may exacerbate healthcare disparities and this controversy remains unresolved The Guidance for Unbiased predictive Information for healthcare Decision-making and Equity (GUIDE) provides expert guidelines for model developers and health system administrators on the transparent use of race in CPMs and mitigation of algorithmic bias across contexts developed through a 5-round modified Delphi process from a diverse 14-person technical expert panel (TEP) Deliberations affirmed that race is a social construct and that the goals of prediction are distinct from those of causal inference and emphasized: the importance of decisional context (e.g. shared decision-making versus healthcare rationing); the conflicting nature of different anti-discrimination principles (e.g. anticlassification versus antisubordination principles); and the importance of identifying and balancing trade-offs in achieving equity-related goals with race-aware versus race-unaware CPMs for conditions where racial identity is prognostically informative comprising 31 key items in the development and use of CPMs in healthcare and offers guidance for examining subgroup invalidity and using race as a variable in CPMs This GUIDE presents a living document that supports appraisal and reporting of bias in CPMs to support best practice in CPM development and use no direct guidance clarifies how prediction modelers should approach race as a candidate variable in CPMs nor how health systems and clinicians should consider the role of race in choosing and using CPMs either with individual patients or at the population level The purpose of this Guidance for Unbiased predictive Information for healthcare Decision-making and Equity (GUIDE) is to offer a set of practical recommendations to evaluate and address algorithmic bias (here defined as differential accuracy of CPMs across racial groups) and algorithmic fairness (here defined as clinical decision-making that does not systematically favor members of one protected class over another) with special attention to potential harms that may result from including or omitting race We approach this with a shared understanding that race is a social construct as well as an appreciation of the profound injuries that interpersonal and structural racism cause to individual and population health This guidance is meant to be responsive to widespread differences in health by race that are historically and structurally rooted which have been exacerbated by racial bias embedded in the U.S and offer a starting point for the development of best practices we provide consensus-based: (1) recommendations regarding the use of race in CPMs (2) guidance for model developers on identifying and addressing algorithmic bias (differential accuracy of CPMs by race) and (3) guidance for model developers and policy-makers on recognizing and mitigating algorithmic unfairness (differential access to care by race) Given the widespread impact of CPMs in healthcare the GUIDE is intended to provide a first step to assist CPM developers 
regulatory agencies and professional medical societies who share responsibility for use and implementation of CPMs Since different considerations apply where CPMs are either directly used to allocate scarce resources or used to align decisions with a patient’s own values and preferences separate guidelines were developed for these different contexts predictive effects of variables within a valid CPM may even have the opposite sign as the true causal effect “risk factors” measured in observational studies may associate with health outcomes for many reasons aside from direct causation Valid prediction only requires these associations are stable across other similarly selected population samples not that they correspond to causal effects causal modeling typically requires specification of a primary exposure variable-of-interest and a set of (often unverifiable) causal assumptions based on content knowledge external to the data we affirm that race is a social construct and can only cause outcomes indirectly through the health effects of racism it may be correlated with many unknown or poorly-measured variables that affect health outcomes (e.g. genetic ancestry) and might account for differences in outcomes in groups defined by self-identified race being an indirect cause of health outcomes via racism or acting as a proxy for other unknown/unmeasured causes of health outcomes) race is often empirically observed to be an important predictor of health outcomes Inappropriately racializes medicine: Herein for which there is now broad interdisciplinary consensus there is a long tradition of pseudoscientific biological determinism and racial essentialism that connects race to inherited biological distinctions—explaining or justifying differences in medical outcomes This perspective is seen as damaging to a decent and just society—creating a broad taboo against any use of ‘race’ that might be misconstrued to provide even indirect or accidental support for these racist notions Using race in CPMs may also serve to further entrench racialization and conflict with the goal of a post-racial future Though we know of no direct evidence of this incorporation of ‘race’ may undermine trust not only in prediction itself but more broadly in the medical system for patients of all races There is broad agreement that individuals with similar outcome risks should be treated similarly regardless of race We call this principle “equal treatment for equal risk.” When race has no prognostic information independent of relevant clinical characteristics since only characteristics contributing to prognosis are included in CPMs Controversy arises only when race is predictive of differences in outcome risk despite clinical characteristics that appear similar Omitting race systematically under-estimates diabetes risk in Black patients deprioritizing their care compared to Whites at similar risk Including race better aligns predicted with observed risks in Black patients supporting similar treatment for similar risk models restricted from using any prognostic candidate variable won’t be more accurate than models considering all available information the race-aware model may also be disparity-reducing compared to the race-unaware model If one were offering a lifestyle modification program to the top risk-quarter (>~10% diabetes-risk threshold) Black patients would comprise 31% of the treatment-prioritized group with a race-unaware model The race-unaware model would prioritize lower-risk White ahead of higher-risk Black patients 
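To make this trade-off concrete, the following is a minimal simulation sketch, not an analysis from the GUIDE or from the diabetes studies it cites. It fits a "group-aware" and a "group-unaware" logistic regression on synthetic data in which group membership carries prognostic information, then compares who would be offered a program restricted to the top risk quarter. All parameters (group share, coefficients, sample size) are invented for illustration.

```python
# Minimal simulation sketch (hypothetical data, not the GUIDE's analysis):
# when group membership is prognostic, a group-unaware model underestimates
# risk in the higher-burden group and selects fewer of its members at a
# fixed capacity, mirroring the race-aware vs race-unaware example above.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200_000
group = rng.binomial(1, 0.3, n)        # 1 = higher-burden group (30% of cohort)
x = rng.normal(0.0, 1.0, n)            # a measured clinical risk factor
logit = -2.8 + 0.8 * x + 1.0 * group   # true risk depends on x and on group
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

aware = LogisticRegression(max_iter=1000).fit(np.column_stack([x, group]), y)
unaware = LogisticRegression(max_iter=1000).fit(x.reshape(-1, 1), y)
p_aware = aware.predict_proba(np.column_stack([x, group]))[:, 1]
p_unaware = unaware.predict_proba(x.reshape(-1, 1))[:, 1]

# Offer the intervention to the top risk quarter under each model.
k = n // 4
sel_aware = np.argsort(-p_aware)[:k]
sel_unaware = np.argsort(-p_unaware)[:k]
print("higher-burden share among selected (aware):  ", round(group[sel_aware].mean(), 3))
print("higher-burden share among selected (unaware):", round(group[sel_unaware].mean(), 3))

# Subgroup calibration-in-the-large: mean predicted versus observed risk.
for g in (0, 1):
    m = group == g
    print(f"group {g}: observed {y[m].mean():.3f}, "
          f"aware {p_aware[m].mean():.3f}, unaware {p_unaware[m].mean():.3f}")
```

In this setup the unaware model is roughly correct on average but systematically low for the higher-burden group and high for the other, so the selected quarter contains a smaller share of higher-burden patients. That is exactly the mechanism the "equal treatment for equal risk" principle is meant to guard against.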
While the causes of excess risk in some minorities may be unclear this excess risk is no less important for decision-making than the risk associated with other variables in the model when Black people are found to be at higher-risk than White people leaving race out of risk calculations does not treat Black and White people equally—it systematically ignores those (unknown/unmeasured) causes of greater risk that are more common in Black than White people Label bias is a particular concern because this bias is not detectable with the conventional set of performance metrics that attend to model fit Table 4 provides recommendations for the use of race in CPMs in non-polar (shared decision-making) contexts where predictive accuracy is the paramount modeling priority The TEP considered how to balance anticlassification principles (which preclude use of race) and antisubordination principles (which may require use of race to prevent minoritized groups from being disadvantaged in some circumstances) Given the importance of accurate predictions to enabling patient autonomy in decision-making (Item 16) the TEP found that inclusion of race may be justified when the predictive effects are statistically robust and go beyond other ascertainable attributes (Item 17) The precise threshold where the statistical benefits of improved calibration will outweigh anticlassification principles may differ across clinical contexts This ‘prevalence-sensitivity’ can be shown in simulations using prediction models that are known to have no model invalidity (i.e. they correspond exactly to the data-generating process) The Table below provides an illustration where the predictive performance of the data-generating model (i.e. a model with no model invalidity) is measured across two groups with different burdens of the same risk factors we propose that good calibration is a more appropriate and useful measure of subgroup invalidity This was shown to result in predictions that systematically under-estimate need in Black versus White patients minoritized communities (including Black and Asian patients) are at higher-risk for diabetes and some cancers than White patients (e.g. 
even when the causes of a risk difference or disparity is incompletely understood it is often implausible to attribute this difference in risk to label bias the plausible direction of any label bias is in the opposite direction of the disparity and there are many other potential explanations available for the observed risk differences and other key issues such as biased training data and the unique concerns of different decision-making contexts Our GUIDE targets these gaps with a set of consensual premises and actionable recommendations these are fundamentally causal definitions of fairness which are challenging to satisfy in practice because causality is generally unidentifiable in observational data (without strong unverifiable assumptions) and because race might be inadvertently reconstructed through proxies even when not explicitly encoded particularly when high-dimensional machine learning approaches are applied we offer a pragmatic approach based on an assessment of observable outcomes that seeks to maximize benefits for the population (utilitarianism) and at the same time to reduce disparities (egalitarianism) Future work should encourage more routine use of variables for which race may be a proxy—such as social determinants of health or genetic ancestry; better collection of more representative training data; and evolution in how health systems populate electronic health records and other healthcare databases to ensure these data consistently reflect self-reporting We note that CMS is putting regulatory pressure behind the collection of data on social drivers of health with quality measures that require screening in five domains: food insecurity the GUIDE provides a framework for identifying understanding and deliberating about the trade-offs inherent in these issues when developing CPMs We present it to support those developing or implementing CPMs in their goal of providing unbiased predictions to support fair decision-making and for the broader community to better understand these issues The project was approved by the Tufts Health Sciences Institutional Review Board Informed consent was obtained from participants Experts were invited based on professional expertise a TEP co-chair (JKP or KL) alongside a professional facilitator (Mark Adkins PhD) moderated consensus-building and voting KL) presented the topic with illustrative cases uniquely developed for that meeting Using a 5-point scale (1-Strongly Disagree to 5-Strongly Agree experts were asked to rate their level of agreement with the item’s importance and feasibility of assessment the vote (rating) was carried out anonymously using the MeetingSphere software after which ratings and comments were shared with the TEP in real time Deliberation and discussion followed the first round of voting at each meeting ratings on items had to have “broad agreement” or exceed the pre-specified supermajority threshold of 75% of the TEP endorsing the item as “agree” or “strongly agree” (4 or 5) was required to prevent the majority from eroding the influence of minority voices without requiring strict unanimity for all items Items without broad agreement were always discussed and revised and TEP members could nominate additional items to be considered and improvement of items; dissenting views were acknowledged and incorporated where possible Revised items were then voted on a second time experts had the opportunity to refine items and revise their judgments prior to subsequent meetings where re-rating occurred All analyses of item scores 
and comments were performed independently by the professional facilitator using MeetingSphere discussed and agreed to the content and final wording of the guidelines The final GUIDE represents points of convergence across the TEP who held diverse opinions and approaches especially to mitigating bias in shared decision-making contexts and these data provided insight into the values and reasoning underlying the opinions of patient stakeholders pertaining to inclusion of race in CPMs Patient stakeholder feedback was presented to the TEP for incorporation in the final GUIDE during the final meeting Further information on research design is available in the Nature Research Reporting Summary linked to this article Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study New creatinine- and cystatin C-based equations to estimate GFR without race Reconsidering the consequences of using race to estimate kidney function Assessment of adherence to reporting guidelines by commonly used clinical prediction models from a single vendor: a systematic review Race and genetic ancestry in medicine—a time for reckoning with racism Will precision medicine move us beyond race How to act upon racism-not race-as a risk factor Racial disparities in low-risk prostate cancer Geographic distribution of racial differences in prostate cancer mortality Diabetes screening by race and ethnicity in the United States: equivalent body mass index and age thresholds Racial and ethnic bias in risk prediction models for colorectal cancer recurrence when race and ethnicity are omitted as predictors Projecting individualized absolute invasive breast cancer risk in African American women Race adjustments in clinical algorithms can help correct forracial disparities in data quality Clinical implications of removing race-corrected pulmonary function tests for African American patients requiring surgery for lung cancer Methods for using race and ethnicity in prediction models for lung cancer screening eligibility Using prediction-models to reduce persistent racial/ethnic disparities in draft 2020 USPSTF lung-cancer screening guidelines Patient-centered appraisal of race-free clinical risk assessment Adding a coefficient for race to the 4K score improves calibration for black men Using measures of race to make clinical predictions: decision making Equity in essence: a call for operationalising fairness in machine learning for healthcare Addressing racism in preventive services: a methods project for the U.S (Agency for Healthcare Research and Quality Research Protocol: impact of healthcare algorithms on racial and ethnic disparities in health and healthcare Agency for Healthcare Research and Quality Assessing Algorithmic Bias and Fairness in Clinical Prediction Models for Preventive Services: A Health Equity Methods Project for the U.S Office for Civil Rights) (Office for Civil Rights Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities Dissecting racial bias in an algorithm used to manage the health of populations (Center for Applied Artificial Intelligence Statement on principles for responsible algorithmic systems How to regulate evolving AI health algorithms CMS Innovation Center Tackles Implicit Bias In Health Affairs Forefront (Health Affairs Leveraging Affordable Care Act Section 1557 to address racism in clinical algorithms in Health Affairs Forefront (Health Affairs HHS proposes revised ACA 
anti-discrimination rule Prevention of bias and discrimination in clinical practice algorithms a new company forms to vet models and root out weaknesses The Supreme Court’s rulings on race neutrality threaten progress in medicine and health National health care leaders will develop AI code of conduct Centers for Medicare & Medicaid Services Office of the Secretary & Department of Health and Human Services Nondiscrimination in health programs and activities Blueprint for an AI bill of rights: making automated systems work for the American people White House Office of Science and Technology Policy) (United States Government Conceptualising fairness: three pillars for medical algorithms and health equity Reporting guideline for the early stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI Evaluation and mitigation of racial bias in clinical machine learning models: scoping review Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare Implications of predicting race variables from medical images and Interoperability: Certification Program Updates and Information Sharing (HTI-1) Proposed Rule Office of the National Coordinator for Health Information Technology) (Office of the National Coordinator for Health Information Technology Removing structural racism in pulmonary function testing-why nothing is ever easy When personalization harms performance: reconsidering the use of group attributes in prediction Reevaluating the role of race and ethnicity in diabetes screening and the equitable allocation of scarce COVID-19 treatments Equal treatment for equal risk: should race be included in allocation algorithms for Covid-19 therapies Reassessment of the role of race in calculating the risk for urinary tract infection: a systematic review and meta-analysis Prediction of vaginal birth after cesarean delivery in term gestations: a calculator without race and ethnicity Flawed racial assumptions in eGFR have care implications in CKD In Am J Manag Care (The American Journal of Managed Care Implications of race adjustment in lung-function equations An ethical analysis of clinical triage protocols and decision-making frameworks: what do the principles of justice and a disability rights approach demand of us PROBAST: a tool to assess the risk of bias and applicability of prediction model studies Guidance for developers of health research reporting guidelines An experimental application of the delphi method to the use of experts Potential biases in machine learning algorithms using electronic health record data Comparison of methods to reduce bias from clinical prediction models of postpartum depression Ensuring fairness in machine learning to advance health equity Implementing machine learning in health care - addressing ethical challenges Addressing bias in artificial intelligence in health care Updated guidance on the reporting of race and ethnicity in medical and science journals Qualitative Research & Evaluation Methods (SAGE Publications The Coding Manual for Qualitative Researchers (SAGE Publications Office of Information and Regulatory Affairs & Executive Office of the President Revisions to OMB’s statistical policy directive no and presenting federal data on race and ethnicity Fair prediction with disparate impact: a study of bias in recidivism prediction instruments Inherent trade-offs in the fair determination of risk scores The American civil rights tradition: anticlassification 
or antisubordination Reflection on modern methods: generalized linear models for prognosis and intervention-theory practice and implications for machine learning A scoping review of their conflation within current observational research The Table 2 fallacy: presenting and interpreting confounder and modifier coefficients Differences in the patterns of health care system distrust between blacks and whites Perspectives on racism in health care among black veterans with chronic kidney disease Prior experiences of racial discrimination and racial differences in health care system distrust An electronic health record-compatible model to predict personalized treatment effects from the diabetes prevention program: a cross-evidence synthesis approach using clinical trial and real-world data Racial and ethnic disparities in diagnosis and treatment: a review of the evidence and a consideration of causes in Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care (eds R.) 417–454 (National Academies Press (US) Income and cancer overdiagnosis—when too much care is harmful Download references Research reported in this publication was funded through a “Making a Difference” and Presidential Supplement Awards from The Greenwall Foundation (PI Kent) The views presented in this publication are solely the responsibility of the authors and do not necessarily represent the views of the Greenwall Foundation The Greenwall Foundation was not involved in the design of the study; the collection and interpretation of the data; and the decision to approve publication of the finished manuscript Departments of Occupational Therapy and Community Health Predictive Analytics and Comparative Effectiveness Center Center for Individualized Medicine Bioethics Tufts Clinical and Translational Science Institute D.M.K.) 
contributed to development of conclusions and reviewed and contributed significantly to the final manuscript The authors declare the following competing interests: Dr Duru declares no Competing Financial Interests but the following Competing Non-Financial Interests as a consultant for ExactCare Pharmacy® research funding from the Patient Centered Outcomes Research Institute (PCORI) the Centers for Disease Control and Prevention (CDC) and the National Institutes of Health (NIH) Kent declares no Competing Financial Interests but Competing Non-Financial Interests in research funding from the Greenwall Foundation Patient Centered Outcomes Research Institute (PCORI) Ladin declares no Competing Financial Interests but Competing Non-Financial Interests in research funding from Paul Teschan Research Fund #2021-08 Steyerberg declares no Competing Financial Interests but Competing Non-Financial Interests in funding from the EU Horizon program (4D Picture project Ustun declares no Competing Financial Interests but Competing Non-Financial Interests in research funding from the National Science Foundation IIS 2040880 the NIH Bridge2AI Center Grant U54HG012510 All other authors declare no Competing Financial or Non-Financial Interests Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations Download citation DOI: https://doi.org/10.1038/s41746-024-01245-y Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research Metrics details Clinical prediction models are widely used in health and medical research The area under the receiver operating characteristic curve (AUC) is a frequently used estimate to describe the discriminatory ability of a clinical prediction model The AUC is often interpreted relative to thresholds with “good” or “excellent” models defined at 0.7 These thresholds may create targets that result in “hacking” where researchers are motivated to re-analyse their data until they achieve a “good” result We extracted AUC values from PubMed abstracts to look for evidence of hacking We used histograms of the AUC values in bins of size 0.01 and compared the observed distribution to a smooth distribution from a spline The distribution of 306,888 AUC values showed clear excesses above the thresholds of 0.7 0.8 and 0.9 and shortfalls below the thresholds The AUCs for some models are over-inflated which risks exposing patients to sub-optimal clinical decision-making Decisions guided by model probabilities or categories may rule out low-risk patients to reduce unnecessary treatments or identify high-risk patients for additional monitoring If the model has good discrimination and gives estimated risks for all patients with the outcome that are higher than all patients without If the model discrimination is no better than a coin toss Qualitative descriptors of model performance for AUC thresholds between 0.5 and 1 have been published including “area under the receiver operating characteristic curve” or the acronyms “AUC” and “AUROC” We included all AUCs regardless of the study’s aim and therefore included model development and validation studies We did not consider other commonly reported metrics for evaluating clinical prediction models We examined abstracts published in PubMed because it is a large international database that includes most health and medical journals. To indicate its size, there were over 1.5 million abstracts published on PubMed in 2022. 
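Before turning to the data extraction, it helps to see how the qualitative rubrics mentioned above work in practice. The sketch below maps an AUC to a verbal label; the cut-offs and wording follow one commonly cited convention and are assumptions for illustration, not values taken from this study. The point is only that a hard cut-off turns a trivial numerical difference into a different verbal verdict, which is the incentive the study set out to detect.

```python
# Hedged illustration of a qualitative AUC rubric (one common convention;
# published rubrics differ). Not part of the study's own methods.
def describe_auc(auc):
    if not 0.5 <= auc <= 1.0:
        raise ValueError("expected an AUC between 0.5 and 1")
    if auc < 0.6:
        return "little better than chance"
    if auc < 0.7:
        return "poor"
    if auc < 0.8:
        return "acceptable / good"
    if auc < 0.9:
        return "excellent"
    return "outstanding"

for value in (0.69, 0.70, 0.79, 0.80):
    print(value, "->", describe_auc(value))
# 0.69 and 0.70 land in different categories despite a negligible
# difference in discrimination.
```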
The National Library of Medicine makes the PubMed data freely and easily available for research. We downloaded the entire database in XML format on 30 July 2022 from https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/. We started with all the available PubMed data. Entries with an empty abstract or an abstract of 10 words or fewer were excluded. We also excluded studies that often use area under the curve statistics to refer to dosages and volumes that are unrelated to prediction models (e.g. pharmacokinetic studies), as well as articles that were not original research. Our inclusion criterion was abstracts with one or more AUC values.

We created a text-extraction algorithm to find AUC values using the team’s expertise and trial and error. We validated the algorithm by randomly sampling 300 abstracts with a Medical Subject Heading (MeSH) of “Area under curve” that had an abstract available and quantifying the number of AUC values that were correctly extracted. We also examined randomly selected results from the algorithm that equalled the thresholds. We report the validation in more detail in the results, but note here that the algorithm could not reliably extract AUC values that were exactly 1; AUC values equal to 1 were therefore excluded. Challenges in extracting the AUC values from abstracts included the frequent use of long lists of statistics, including the sensitivity and specificity; unrelated area under the curve statistics from pharmacokinetic studies; references to AUC values as a threshold (e.g. “The AUC ranges between 0.5 and 1”); and the many different descriptors used. AUC values reported as a percent were converted to the 0 to 1 scale. We removed any AUC values that were less than 0 or greater than or equal to 1. We categorised each AUC value as a mean or as the lower or upper limit of a confidence interval; for example, the first value in “0.704 (95% CI 0.603 to 0.806)” would be a mean. For the specific examples from published papers in the results, we give the PubMed ID number (PMID) rather than citing the paper.

Our hypothesis was that there would be an excess of AUC values just above the thresholds of 0.7, 0.8 and 0.9. We used histogram bins defined by a lower threshold and an upper threshold that was +0.01 greater; for example, the bin (0.69, 0.70] included every AUC greater than 0.69 and less than or equal to 0.70. We excluded AUCs reported to only one decimal place (e.g. 0.7), as these results would create spikes in the histogram that were simply due to rounding. We do not know what the distribution of AUC values from the health and medical literature would look like if there was no AUC-hacking, but we are confident that it should be relatively smooth, with no inflexion points potentially caused by re-analysing the data to get a more publishable but inflated AUC. Because many abstracts gave multiple AUC values from competing models, we also plotted the distribution using the highest AUC value per abstract; this subgroup analysis examined whether the best presented models were often just above the thresholds. We also used a subgroup analysis restricted to AUC values from the results section of structured abstracts. This potentially increased the specificity of the extracted AUC values, because values mentioned in the methods and discussion sections were more likely to be general references to the AUC rather than results.

The flow chart of included abstracts is shown in Fig. 1. The number of examined abstracts was over 19 million, and 96,986 (0.5%) included at least one AUC value.
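The abstract-mining algorithm itself is not reproduced in the paper’s abstract, so the snippet below is a simplified, hypothetical regex sketch of the extraction task. It deliberately ignores the harder cases listed above (long statistic lists, percentages, confidence limits, pharmacokinetic AUCs) and is only meant to show the general approach.

```python
# Simplified, hypothetical sketch of AUC extraction from abstract text.
# The study used a purpose-built, validated algorithm; this regex will miss
# many non-standard presentations and does not separate means from CI limits.
import re

AUC_PATTERN = re.compile(
    r"\b(?:AUC|AUROC|area under the (?:receiver operating characteristic )?curve)"
    r"[^0-9]{0,40}?(0?\.\d{1,3})",
    flags=re.IGNORECASE,
)

def extract_auc_values(abstract):
    values = [float(m.group(1)) for m in AUC_PATTERN.finditer(abstract)]
    # Mirror the study's exclusions: keep values strictly between 0 and 1
    # (values of exactly 1 could not be extracted reliably and were dropped).
    return [v for v in values if 0 < v < 1]

example = "The model showed good discrimination (AUC 0.704, 95% CI 0.603 to 0.806)."
print(extract_auc_values(example))  # -> [0.704]
```

Anything beyond this toy pattern, such as handling values reported as percentages or attributing 0.603 and 0.806 to confidence limits, is exactly the kind of case the authors report checking by hand during validation.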
The use of AUC values has become more popular in recent years (see Additional file 2: Fig. S2). The median publication year for the AUC values was 2018, with a first to third quartile of 2015 to 2018. For abstracts with at least one AUC value, the median number of AUC values was 2, with a first to third quartile of 1 to 4 (see Additional file 3: Fig. S3). There was a long tail in the distribution of AUC values, with 1.1% of abstracts reporting 20 or more AUC values; these high numbers were often from abstracts that compared multiple models. The total number of included AUC values was 306,888. There were 92,529 (31%) values reported as lower or upper confidence limits, and the remainder were reported as means.

Fig. 2 shows the histogram of AUC mean values (top panel) and the residuals from a smooth fit to the histogram (bottom panel); the dotted line in the top panel shows the smooth fit. The distribution of the largest AUC mean value per abstract, excluding confidence intervals, is shown in Fig. 3, again as a histogram (top panel) with residuals from a smooth fit (bottom panel). The strong changes in the distribution at the thresholds observed in Fig. 2 remain. The distribution of AUC values published in PLOS ONE shows a similar pattern to the full sample, with many more AUC values just above the 0.8 threshold (see Additional file 6: Fig. S6).

For abstracts where either the algorithm or manual entry found one or more AUC values, we made a Bland–Altman plot of the number of AUC values extracted (see Additional file 7: Fig. S7). The algorithm missed more AUC values than the manual entry, a discrepancy that was generally due to non-standard presentations; this was acceptable, as we would rather lean towards missing valid AUC values than wrongly including invalid AUC values. We used a regression model to examine differences in the AUC values extracted by the algorithm and by manual entry. AUC values that were wrongly included by the algorithm were smaller on average than the AUC values that were correctly included, because the wrongly extracted values were often describing other aspects of the prediction model. The validation helped identify MeSH terms for pharmacokinetic studies, which were excluded from our main analysis. We also manually checked 100 randomly sampled abstracts that the algorithm identified as not having an AUC statistic and another 100 randomly sampled abstracts that the algorithm identified as having an AUC statistic. All abstracts identified as not having an AUC statistic were correctly classified (95% confidence interval for negative predictive value: 0.964 to 1.000), and all but one of the abstracts identified as having an AUC statistic were correctly classified (95% confidence interval for positive predictive value: 0.946 to 1.000). A larger gap between the thresholds and the values exceeding them would be stronger evidence of poor practice. To investigate the excess at (0.56, 0.57], we manually extracted the AUC values from 300 abstracts where our algorithm found an AUC value of 0.57 and another 300 from 0.58, a nearby comparison with no excess.
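The histogram-and-smooth-fit comparison reported in Figs. 2 and 3 can be sketched as follows. This is not the authors’ exact model: the AUC values below are fabricated placeholders so the example runs end-to-end, and the Poisson-style spline weighting is an assumption. With the real extracted values, positive residuals concentrated in the bins just above 0.7, 0.8 and 0.9 would correspond to the excesses described above.

```python
# Sketch of the histogram-versus-smooth-fit idea (placeholder data, not the
# study's extracted AUC values or its exact spline specification).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
auc_values = rng.beta(30, 9, 50_000)            # fabricated AUC-like values

edges = np.arange(0.50, 1.001, 0.01)            # bins of width 0.01
counts, edges = np.histogram(auc_values, bins=edges)
centres = (edges[:-1] + edges[1:]) / 2

# Smooth fit to the binned counts, roughly Poisson-weighted; residuals well
# above zero just past 0.70, 0.80 and 0.90 would indicate threshold excesses.
weights = 1 / np.sqrt(counts + 1)
spline = UnivariateSpline(centres, counts, w=weights, k=3)
residuals = counts - spline(centres)

for threshold in (0.70, 0.80, 0.90):
    just_above = np.isclose(centres, threshold + 0.005)
    print(f"residual just above {threshold:.2f}: {residuals[just_above][0]:+.1f}")
```

Because the placeholder values come from a smooth distribution, these residuals hover around zero; the point of the real analysis is that, for the published AUC values, they do not.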
The error proportions from the algorithm were relatively low (see Additional file 7: Table S3) indicating that the excess at 0.57 was not due to errors with relatively low AUC values under 0.75 described as “excellent” (PMID35222547) with the inflated AUCs values regressing to the mean The widespread use of these poor practices creates a biased evidence base and is misinforming health policy We did not examine other commonly reported performance metrics used to evaluate clinical prediction model performance It is possible that values such as model sensitivity and specificity may also be influenced by “acceptable” thresholds It is likely that the highest AUC value presented in the abstract is also the highest in the full text so the “best” model would be captured in the abstract and the “best” AUC value is the one most likely to be created by hacking In addition to hacking, publication bias likely also plays a role in the selection of AUC values, with higher values more likely to be accepted by peer reviewers and journal editors. Our subgroup analysis of PLOS ONE abstracts (Additional file 6: Fig S6) provides some evidence that the “hacking” pattern in AUC values is due to author behaviour not journal behaviour We used an automated algorithm that provided a large and generalisable sample but did not perfectly extract all AUC values we were not able to reliably extract AUC values of 1 and this is an important value as it is the best possible result and could be a target for hacking We believe that the errors and exclusions in the data are not large enough to change our key conclusion Clinical prediction models are growing in popularity likely because of increased patient data availability and accessible software tools to build models many published models have serious flaws in their design and presentation as the AUCs for some models have been over-inflated Publishing overly optimistic models risks exposing patients to sub-optimal clinical decision-making An urgent reset is needed in how clinical prediction models are built Actionable steps towards greater transparency are as follows: the wider use of protocols and registered reports Area under the receiver operating characteristic curve Clinical prediction models: diagnosis versus prognosis Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med. 1985;13(10):818–29. https://doi.org/10.1097/00003246-198510000-00009 Wynants L, van Smeden M, McLernon DJ, Timmerman D, Steyerberg EW, Calster BV. Three myths about risk thresholds for prediction models. BMC Med. 2019;17(1). https://doi.org/10.1186/s12916-019-1425-3 Search filters for finding prognostic and diagnostic prediction studies in Medline to enhance systematic reviews Hand DJ. Classifier technology and the illusion of progress. Stat Sci. 2006;21(1). https://doi.org/10.1214/088342306000000060 Collins GS, de Groot JA, Dutton S, Omar O, Shanyinde M, Tajar A, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol. 2014;14(1). https://doi.org/10.1186/1471-2288-14-40 Miller E, Grobman W. Prediction with conviction: a stepwise guide toward improving prediction and clinical care. BJOG. 2016;124(3):433. https://doi.org/10.1111/1471-0528.14187 Steyerberg EW, Uno H, Ioannidis JPA, van Calster B, Ukaegbu C, Dhingra T, et al. Poor performance of clinical prediction models: the harm of commonly applied methods. J Clin Epidemiol. 2018;98:133–43. 
https://doi.org/10.1016/j.jclinepi.2017.11.013 Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020:m441. https://doi.org/10.1136/bmj.m441 Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal. BMJ. 2020;369. https://doi.org/10.1136/bmj.m1328 Methodological conduct of prognostic prediction models developed using machine learning in oncology: a systematic review Clinical prediction models in psychiatry: a systematic review of two decades of progress and challenges Trends in the conduct and reporting of clinical prediction model development and validation: a systematic review Prognosis Research Strategy (PROGRESS) 3: prognostic model research Mandrekar JN. Receiver operating characteristic curve in diagnostic test assessment. J Thorac Oncol. 2010;5(9):1315–6. https://doi.org/10.1097/jto.0b013e3181ec173d Khouli RHE, Macura KJ, Barker PB, Habba MR, Jacobs MA, Bluemke DA. Relationship of temporal resolution to diagnostic performance for dynamic contrast enhanced MRI of the breast. J Magn Reson Imaging. 2009;30(5):999–1004. https://doi.org/10.1002/jmri.21947 Pitamberwale A, Mahmood T, Ansari AK, Ansari SA, Limgaokar K, Singh L, et al. Biochemical parameters as prognostic markers in severely Ill COVID-19 patients. Cureus. 2022. https://doi.org/10.7759/cureus.28594 Calster BV, Steyerberg EW, Wynants L, van Smeden M. There is no such thing as a validated prediction model. BMC Med. 2023;21(1). https://doi.org/10.1186/s12916-023-02779-w de Hond AAH, Steyerberg EW, van Calster B. Interpreting area under the receiver operating characteristic curve. Lancet Digit Health. 2022;4(12):e853–5. https://doi.org/10.1016/s2589-7500(22)00188-1 Fraser H, Parker T, Nakagawa S, Barnett A, Fidler F. Questionable research practices in ecology and evolution. PLoS ONE. 2018;13(7):1–16. https://doi.org/10.1371/journal.pone.0200303 John LK, Loewenstein G, Prelec D. Measuring the Prevalence of questionable research practices with incentives for truth telling. Psychol Sci. 2012;23(5):524–32. https://doi.org/10.1177/0956797611430953 Stefan AM, Schönbrodt FD. Big little lies: a compendium and simulation of p-hacking strategies. R Soc Open Sci. 2023;10(2):220346. https://doi.org/10.1098/rsos.220346 Altman DG, Lausen B, Sauerbrei W, Schumacher M. Dangers of using “optimal” cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst. 1994;86(11):829–35. https://doi.org/10.1093/jnci/86.11.829 Picard D. Torch.manual_seed(3407) is all you need: on the influence of random seeds in deep learning architectures for computer vision. CoRR. 2021. arXiv:2109.08203 An observational analysis of the trope “A p-value of\(< 0.05\) was considered statistically significant” and other cut-and-paste statistical methods Masicampo EJ, Lalande DR. A peculiar prevalence of p values just below .05. Q J Exp Psychol (Hove). 2012;65(11):2271–2279. https://doi.org/10.1080/17470218.2012.711335 Barnett AG, Wren JD. Examination of confidence intervals in health and medical journals from 1976 to 2019: an observational study. BMJ Open. 2019;9(11). https://doi.org/10.1136/bmjopen-2019-032506 Zwet EW, Cator EA. The significance filter, the winner’s curse and the need to shrink. Stat Neerl. 2021;75(4):437–52. https://doi.org/10.1111/stan.12241 Hussey I, Alsalti T, Bosco F, Elson M, Arslan RC. 
An aberrant abundance of Cronbach’s alpha values at .70. 2023. https://doi.org/10.31234/osf.io/dm8xn Regression modeling strategies: with applications to linear models R Core Team. R: a language and environment for statistical computing. Vienna; 2023. https://www.R-project.org/ Barnett AG. Code and data for our analysis of area under the curve values extracted from PubMed abstracts. 2023. https://doi.org/10.5281/zenodo.8275064 Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press; 2003. https://doi.org/10.1017/CBO9780511755453 Chiu K, Grundy Q, Bero L. ‘Spin’ in published biomedical literature: a methodological systematic review. PLoS Biol. 2017;15(9):e2002173. https://doi.org/10.1371/journal.pbio.2002173 Brodeur A, Cook N, Heyes A. Methods matter: p-hacking and publication bias in causal analysis in economics. Am Econ Rev. 2020;110(11):3634–60. https://doi.org/10.1257/aer.20190687 Adda J, Decker C, Ottaviani M. P-hacking in clinical trials and how incentives shape the distribution of results across phases. Proc Natl Acad Sci U S A. 2020;117(24):13386–92. https://doi.org/10.1073/pnas.1919906117 Rohrer JM, Tierney W, Uhlmann EL, DeBruine LM, Heyman T, Jones B, et al. Putting the self in self-correction: findings from the loss-of-confidence project. Perspect Psychol Sci. 2021;16(6):1255–69. https://doi.org/10.1177/1745691620964106 Moons KGM, Donders ART, Steyerberg EW, Harrell FE. Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example. J Clin Epidemiol. 2004;57(12):1262–70. https://doi.org/10.1016/j.jclinepi.2004.01.020 Chambers CD, Tzavella L. The past, present and future of Registered Reports. Nat Hum Behav. 2021;6(1):29–42. https://doi.org/10.1038/s41562-021-01193-7 Penders B. Process and bureaucracy: scientific reform as civilisation. Bull Sci Technol Soc. 2022;42(4):107–16. https://doi.org/10.1177/02704676221126388 Chan AW, Hróbjartsson A, Haahr MT, Gøtzsche PC, Altman DG. Empirical evidence for selective reporting of outcomes in randomized trials. JAMA. 2004;291(20):2457. https://doi.org/10.1001/jama.291.20.2457 Mathieu S. Comparison of registered and published primary outcomes in randomized controlled trials. JAMA. 2009;302(9):977. https://doi.org/10.1001/jama.2009.1242 Goldacre B, Drysdale H, Powell-Smith A, Dale A, Milosevic I, Slade E, et al. The COMPare Trials Project. 2016. www.COMPare-trials.org Schwab S, Janiaud P, Dayan M, Amrhein V, Panczak R, Palagi PM, et al. Ten simple rules for good research practice. PLoS Comput Biol. 2022;18(6):1–14. https://doi.org/10.1371/journal.pcbi.1010139 Assessing the performance of prediction models: a framework for some traditional and novel measures Vickers AJ, Calster BV, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016:i6. https://doi.org/10.1136/bmj.i6 Parsons R, Blythe RD, Barnett AG, Cramb SM, McPhail SM. predictNMB: an R package to estimate if or when a clinical prediction model is worthwhile. J Open Source Softw. 2023;8(84):5328. https://doi.org/10.21105/joss.05328 Stark PB, Saltelli A. Cargo-cult statistics and scientific crisis. Significance. 2018;15(4):40–3. https://doi.org/10.1111/j.1740-9713.2018.01174.x Christian K, ann Larkins J, Doran MR. We must improve conditions and options for Australian ECRs. Nat Hum Behav. 2023. 
https://doi.org/10.1038/s41562-023-01621-w Wang MQ, Yan AF, Katz RV. Researcher requests for inappropriate analysis and reporting: a U.S. survey of consulting biostatisticians. Ann Intern Med. 2018;169(8):554. https://doi.org/10.7326/m18-1230 Download references Thanks to the National Library of Medicine for making the PubMed data available for research Twitter handles: @nicolem\(\_\)white (Nicole White); @RexParsons8 (Rex Parsons); @GSCollins (Gary Collins) GSC was supported by Cancer Research UK (programme grant: C49297/A27294) Australian Centre for Health Services Innovation and Centre for Healthcare Transformation Rheumatology & Musculoskeletal Sciences All authors contributed to the interpretation of the results and critical revision of the manuscript The corresponding author attests that all listed authors meet the authorship criteria and that no others meeting the criteria have been omitted We used publicly available data that were published to be read and scrutinised by researchers and hence ethical approval was not required Examples of qualitative descriptors for AUC thresholds Number and proportion of abstracts with at least one AUC value over time Bar chart of the number of AUC values per abstract Distribution of AUC values and residuals from a smooth fit to the distribution using only AUC values that were in the results section of the abstract Histograms of AUC values that were lower or upper confidence limits and residuals from a smooth fit to the histograms Subgroup analysis of AUC values from the journal PLOS ONE Bland–Altman plot of the difference in the number of AUC values per abstract extracted manually and by the algorithm Box-plots of AUC values grouped by whether they were extracted by the algorithm or manual-check only Estimates from a linear regression model examining the differences in AUC values extracted by the algorithm and manual checking Proportion of correct AUC values from the algorithm for four selected AUC values Proportion of correct AUC values from the algorithm for two selected AUC values Download citation DOI: https://doi.org/10.1186/s12916-023-03048-6 Metrics details Baseline outcome risk can be an important determinant of absolute treatment benefit and has been used in guidelines for “personalizing” medical decisions We compared easily applicable risk-based methods for optimal prediction of individualized treatment effects We simulated RCT data using diverse assumptions for the average treatment effect the shape of its interaction with treatment (none and the magnitude of treatment-related harms (none or constant independent of the prognostic index) We predicted absolute benefit using: models with a constant relative treatment effect; stratification in quarters of the prognostic index; models including a linear interaction of treatment with the prognostic index; models including an interaction of treatment with a restricted cubic spline transformation of the prognostic index; an adaptive approach using Akaike’s Information Criterion We evaluated predictive performance using root mean squared error and measures of discrimination and calibration for benefit The linear-interaction model displayed optimal or close-to-optimal performance across many simulation scenarios with moderate sample size (N = 4,250; ~ 785 events) The restricted cubic splines model was optimal for strong non-linear deviations from a constant treatment effect particularly when sample size was larger (N = 17,000) The adaptive approach also required larger sample sizes These findings 
were illustrated in the GUSTO-I trial An interaction between baseline risk and treatment assignment should be considered to improve treatment effect predictions By assuming treatment effect is a function of baseline risk risk modeling methods impose a restriction on the shape of treatment effect heterogeneity With smaller sample sizes or limited information on effect modification can provide a good option for evaluating treatment effect heterogeneity with larger sample sizes and/or a limited set of well-studied strong effect modifiers treatment effect modeling methods can potentially result in a better bias-variance tradeoff the setting in which treatment effect heterogeneity is evaluated is crucial for the selection of the optimal approach even though treatment effect estimates at the risk subgroup level may be accurate these estimates may not apply to individual patients as homogeneity of treatment effects is assumed within risk strata With stronger overall treatment effect and larger variability in predicted risks patients assigned to the same risk subgroup may still differ substantially with regard to their benefits from treatment we aim to summarize and compare different risk-based models for predicting treatment effects We simulate different relations between baseline risk and treatment effects and also consider potential harms of treatment We illustrate the different models by a case study of predicting individualized effects of treatment for acute myocardial infarction in a large RCT We observe RCT data \(\left(Z,X,Y\right)\) where for each patient \({Z}_{i}=0,1\) is the treatment status \({Y}_{i}=0,1\) is the observed outcome and \({X}_{i}\) is a set of measured covariates Let \(\{{Y}_{i}\left(z\right),z=0,1\}\) denote the unobservable potential outcomes We observe \({Y}_{i}={Z}_{i}{Y}_{i}\left(1\right)+\left(1-{Z}_{i}\right){Y}_{i}\left(0\right)\) We are interested in predicting the conditional average treatment effect (CATE) Assuming that \(\left(Y\left(0\right),Y\left(1\right)\right)\perp Z|X\) comparing equally-sized treatment and control arms in terms of a binary outcome For each patient we generated 8 baseline covariates \({X}_{1},\dots ,{X}_{4}\sim N\left(0,1\right)\) and \({X}_{5},\dots ,{X}_{8}\sim B\left(1,0.2\right)\) Outcomes in the control arm were generated from Bernoulli variables with true probabilities following a logistic regression model including all baseline covariates \(P\left(Y\left(0\right)=1 | X=x\right)={\text{expit}}\left(l{p}_{0}\right)={e}^{l{p}_{0}}/\left(1+{e}^{l{p}_{0}}\right)\) with \(l{p}_{0}=l{p}_{0}\left(x\right)={x}^{t}\beta\) In the base scenarios coefficient values \(\beta\) were such that the control event rate was 20% and the discriminative ability of the true prediction model measured using Harrell’s c-statistic was 0.75 The c-statistic represents the probability that for a randomly selected discordant pair from the sample (patients with different outcomes) the prediction model assigns larger risk to the patient with the worse outcome For the simulations this was achieved by selecting \(\beta\) values such that the true prediction model would achieve a c-statistic of 0.75 in a simulated control arm with 500,000 patients We achieved a true c-statistic of 0.75 by setting \(\beta ={\left(-2.08,0.49,\dots ,0.49\right)}^{t}\) Outcomes in the treatment arm were first generated using 3 simple scenarios for a true constant odds ratio (OR): absent (OR = 1) moderate (OR = 0.8) or strong (OR = 0.5) constant relative treatment effect quadratic and 
non-monotonic deviations from constant treatment effects using: We compared different methods for predicting absolute treatment benefit that is the risk difference between distinct treatment assignments We use the term absolute treatment benefit to distinguish from relative treatment benefit that relies on the ratio of predicted risk under different treatment assignments Patients are stratified into equally-sized risk strata—in this case based on risk quartiles are estimated by the difference in event rate between control and treatment arm patients We considered this approach as a reference expecting it to perform worse than the other candidates as its objective is to provide an illustration of HTE rather than to optimize individualized benefit predictions we fitted a logistic regression model which assumes constant relative treatment effect (constant odds ratio) \(P\left(Y=1 | X=x,Z=z;\widehat{\beta }\right)={\text{expit}}\left({\widehat{lp}}_{0}+{\delta }_{1}z\right)\) absolute benefit is predicted from \(\tau \left(x;\widehat{\beta }\right)={\text{expit}}\left({\widehat{lp}}_{0}\right)-{\text{expit}}\left({\widehat{lp}}_{0}+{\delta }_{1}\right)\) where \({\delta }_{1}\) is the log of the assumed constant odds ratio and \({\widehat{lp}}_{0}={\widehat{lp}}_{0}\left(x;\widehat{\beta }\right)={x}^{t}\widehat{\beta }\) the linear predictor of the estimated baseline risk model we fitted a logistic regression model including treatment \(P\left(Y=1 | X=x,Z=z;\widehat{\beta }\right)={\text{expit}}\left({\delta }_{0}+{\delta }_{1}z+{\delta }_{2}{\widehat{lp}}_{0}+{\delta }_{3}z{\widehat{lp}}_{0}\right)\) Absolute benefit is then estimated from \(\tau \left(x;\widehat{\beta }\right)={\text{expit}}\left({\delta }_{0}+{\delta }_{2}{\widehat{lp}}_{0}\right)-{\text{expit}}\left({(\delta }_{0}+{\delta }_{1})+{(\delta }_{2}{+{\delta }_{3})\widehat{lp}}_{0}\right)\) We will refer to this method as the linear interaction approach we considered an adaptive approach using Akaike’s Information Criterion (AIC) for model selection we ranked the constant relative treatment effect model and 5 knots based on their AIC and selected the one with the lowest value The extra degrees of freedom were 1 (linear interaction) 3 and 4 (RCS models) for these increasingly complex interactions with the treatment effect We evaluated the predictive accuracy of the considered methods by the root mean squared error (RMSE): The observed benefits are regressed on the predicted benefits using a locally weighted scatterplot smoother (loess) The ICI-for-benefit is the average absolute difference between predicted and smooth observed benefit Values closer to 0 represent better calibration For each scenario we performed 500 replications within which all the considered models were fitted We simulated a super-population of size 500,000 for each scenario within which we calculated RMSE and discrimination and calibration for benefit of all the models in each replication We demonstrated the different methods using 30,510 patients with acute myocardial infarction (MI) included in the GUSTO-I trial 10,348 patients were randomized to tissue plasminogen activator (tPA) treatment and 20,162 were randomized to streptokinase The outcome of interest was 30-day mortality (total of 2,128 events) Predicted baseline risk is derived by setting the treatment indicator to 0 for all patients RMSE of the considered methods across 500 replications was calculated from a simulated super-population of size 500,000 The scenario with true constant relative 
treatment effect (panel A) had a true prediction c-statistic of 0.75 and a sample size of 4,250. The RMSE is also presented for strong linear (panel B) and non-monotonic (panel D) deviations from constant relative treatment effects. Panels on the right side present the true relations between baseline risk (x-axis) and absolute treatment benefit (y-axis); the 97.5 percentiles of the risk distribution are expressed by the boxplot on the top, and the 97.5 percentiles of the true benefit distributions by the boxplots on the side of the right-hand panel. Further figures show the RMSE of the considered methods across 500 replications calculated in simulated samples of size 17,000 rather than the 4,250 of Fig. 1 (RMSE again calculated on a super-population of size 500,000); the RMSE of the considered methods across 500 replications calculated in simulated samples of size 4,250; the discrimination for benefit of the considered methods across 500 replications calculated in simulated samples of size 4,250 using the c-statistic for benefit, which represents the probability that, from two randomly chosen matched patient pairs with unequal observed benefit, the pair with greater observed benefit also has the higher predicted benefit; and the calibration for benefit of the considered methods across 500 replications calculated in a simulated sample of size 500,000 (true prediction c-statistic of 0.75; sample size of 4,250). Our main conclusions remained unchanged in the sensitivity analyses where correlations between baseline characteristics were introduced (Supplement). The results from all individual scenarios can be explored online at https://mi-erasmusmc.shinyapps.io/HteSimulationRCT/, and all the code for the simulations can be found at https://github.com/mi-erasmusmc/HteSimulationRCT. In the GUSTO-I case study we used the derived prognostic index to fit a constant treatment effect model, a linear interaction model, and an RCS-3 model individualizing absolute benefit predictions; an adaptive approach with the 3 candidate models was also applied. Individualized absolute benefit predictions are presented as a function of baseline risk for the constant treatment effect approach, the linear interaction approach, and RCS smoothing using 3 knots, with risk-stratified estimates of absolute benefit within quartiles of baseline risk shown as reference. 95% confidence bands were generated using 10,000 bootstrap resamples in which the prediction model was refitted in each run to capture the uncertainty in baseline risk predictions; we also provide 95% confidence intervals for the average predicted risk within each baseline risk quarter over the 10,000 bootstrap samples.
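Before turning to the findings, the data-generating mechanism and the main candidate models described above can be sketched compactly in R. This is our own illustration under stated assumptions (base R plus the splines package, with `ns()` with 2 degrees of freedom standing in for a 3-knot restricted cubic spline, and RMSE evaluated on the trial sample rather than a super-population), not the authors’ code, which is available in the linked GitHub repository.

```r
library(splines)

# Base scenario: 8 covariates, ~20% control event rate, constant odds ratio for treatment.
simulate_trial <- function(n = 4250, or = 0.8, beta = c(-2.08, rep(0.49, 8))) {
  x   <- cbind(matrix(rnorm(n * 4), n, 4),              # X1-X4 ~ N(0, 1)
               matrix(rbinom(n * 4, 1, 0.2), n, 4))     # X5-X8 ~ Bernoulli(0.2)
  z   <- rbinom(n, 1, 0.5)                              # 1:1 randomization
  lp0 <- drop(beta[1] + x %*% beta[-1])                 # true control-arm linear predictor
  p   <- plogis(lp0 + z * log(or))                      # constant odds ratio scenario
  data.frame(y = rbinom(n, 1, p), z = z, x,
             true_benefit = plogis(lp0) - plogis(lp0 + log(or)))
}

set.seed(42)
dat <- simulate_trial()

# Prognostic index from a risk model fitted while ignoring treatment (for illustration).
risk_mod <- glm(y ~ . - z - true_benefit, family = binomial, data = dat)
dat$lp0  <- predict(risk_mod)

# Candidate models: constant relative effect (offset form, z numeric 0/1),
# linear interaction, and ns() with 2 df as a 3-knot RCS stand-in.
m_const  <- glm(y ~ 0 + z + offset(lp0), family = binomial, data = dat)
m_linear <- glm(y ~ lp0 * z,             family = binomial, data = dat)
m_rcs3   <- glm(y ~ ns(lp0, df = 2) * z, family = binomial, data = dat)

# Predicted absolute benefit: risk if untreated minus risk if treated.
benefit <- function(m, d) {
  predict(m, newdata = transform(d, z = 0), type = "response") -
  predict(m, newdata = transform(d, z = 1), type = "response")
}

# In-sample RMSE against the known true benefit (the paper uses a large super-population).
sapply(list(constant = m_const, linear = m_linear, rcs3 = m_rcs3),
       function(m) sqrt(mean((benefit(m, dat) - dat$true_benefit)^2)))
```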
The linear interaction and the RCS-3 models displayed very good performance under many of the considered simulation scenarios. The linear interaction model was optimal in cases with moderate sample sizes (4,250 patients; ~ 785 events) and moderately performing baseline risk prediction models; it was better calibrated for benefit and had better discrimination for benefit, even in scenarios with strong quadratic deviations from a constant relative treatment effect. In scenarios with true non-monotonic deviations the linear interaction model was outperformed by RCS-3, especially in the presence of treatment-related harms. Increasing the sample size or the prediction model’s discriminative ability favored RCS-3, especially in scenarios with strong non-linear deviations from a constant treatment effect. RCS-4 and RCS-5 were too flexible in all considered scenarios, showing increased variability of discrimination for benefit and worse calibration of benefit predictions. Even with larger sample sizes and strong quadratic or non-monotonic deviations, these more flexible methods did not outperform the simpler RCS-3 approach. Higher flexibility may only be helpful under more extreme patterns of HTE than the quadratic deviations considered here. Treating the RCS-3 interaction as the most complex approach considered may therefore often be reasonable. Our results can also be interpreted in terms of a bias-variance trade-off. The increasingly complex models considered allow for more degrees of freedom, which increases the variance of our absolute benefit estimates. However, this increased complexity did not always result in a substantial decrease in bias, especially with lower sample sizes and weaker treatment effects. In most scenarios the simpler linear interaction model achieved the best bias-variance balance and outperformed the more complex RCS methods, even in the presence of non-linearity in the true underlying relationship between baseline risk and treatment effect. In contrast, the simpler constant treatment effect model was often heavily biased and was outperformed by the other methods in the majority of the considered scenarios. Increasing the discriminative ability of the risk model reduced RMSE for all methods. Higher discrimination translates into higher variability of predicted risks, which allows the considered methods to better capture absolute treatment benefits; better risk discrimination also led to higher discrimination between those with low or high benefit (as reflected in values of the c-for-benefit). The adaptive approach had adequate median performance, following the “true” model in most scenarios. With smaller sample sizes it tended to miss the treatment–baseline risk interaction and selected simpler models (Supplement). This conservative behavior resulted in increased RMSE variability in these scenarios, especially with true strong linear or non-monotonic deviations. Hence, with smaller sample sizes the simpler linear interaction model may be a safer choice for predicting absolute benefits, especially in the presence of any suspected treatment-related harms. Even though the average error rates increased for all the considered methods due to the misspecification of the outcome model, the linear interaction model had the lowest error rates; the constant treatment effect model was often biased, especially with moderate or strong treatment-related harms. Future simulation studies could explore the effect of more extensive deviations from risk-based treatment effects. In all our simulation scenarios we assumed all covariates to be statistically independent, the effects of continuous covariates to be linear, and no interaction effects between covariates to be present. This can be viewed as a limitation of our extensive simulation study; however, as all our methods are based on the same fitted risk model, we do not expect these assumptions to significantly influence their relative performance. In conclusion, the linear interaction approach is a viable option with moderate sample sizes and/or moderately performing risk prediction models, when assuming a non-constant relative treatment effect is plausible. RCS-3 is a better option with more abundant sample size and when non-monotonic deviations from a constant relative treatment effect and/or substantial treatment-related harms are anticipated. Increasing the complexity of the RCS models by increasing the number of knots does not improve benefit prediction. Using AIC for model selection is attractive with larger sample size. The dataset supporting the conclusions of this article is available in the Vanderbilt University repository maintained by
the Biostatistics Department, https://hbiostat.org/data/gusto.rda Global Utilization of Streptokinase and Tissue plasminogen activator for Occluded coronary arteries Synergy Between PCI With Taxus and Cardiac Surgery A framework for the analysis of heterogeneity of treatment effect in patient-centered outcomes research Personalized evidence based medicine: predictive approaches to heterogeneous treatment effects Recursive partitioning for heterogeneous causal effects Some methods for heterogeneous treatment effect estimation in high dimensions Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Predictive approaches to heterogeneous treatment effects: a scoping review Assessing and reporting heterogeneity in treatment effects in clinical trials: a proposal Benefit and harm of intensive blood pressure treatment: Derivation and validation of risk models using data from the SPRINT and ACCORD trials Analysis of randomized comparative clinical trial data for personalized treatment selections Metalearners for estimating heterogeneous treatment effects using machine learning A robust method for estimating optimal treatment regimes Estimating Optimal Treatment Regimes from a Classification Perspective Simple subgroup approximations to optimal treatment regimes from randomized clinical trial data Regularized outcome weighted subgroup identification for differential treatment effects Models with interactions overestimated heterogeneity of treatment effects and were prone to treatment mistargeting A tutorial on individualized treatment effect prediction from randomized trials with a binary endpoint Simple risk stratification at admission to identify patients with reduced mortality from primary angioplasty Should Vitamin A injections to prevent bronchopulmonary dysplasia or death be reserved for high-risk infants Reanalysis of the National Institute of Child Health and Human Development Neonatal Research Network Randomized Trial Improving diabetes prevention with benefit based tailored treatment: risk based reanalysis of Diabetes Prevention Program The Predictive Approaches to Treatment effect Heterogeneity (PATH) statement The Predictive Approaches to Treatment effect Heterogeneity (PATH) Statement: explanation and elaboration Kent DM, Nelson J, Dahabreh IJ, Rothwell PM, Altman DG, Hayward RA. Risk and treatment effect heterogeneity: re-analysis of individual participant data from 32 large clinical trials. Int J Epidemiol. 2016;45(6):2075–88. 
https://doi.org/10.1093/ije/dyw118 using internally developed risk models to assess heterogeneity in treatment effects in clinical trials Endogenous stratification in randomized experiments Regression models in clinical studies: determining relationships between predictors and response The proposed `concordance-statistic for benefit’ provided a useful metric when modeling heterogeneous treatment effects The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models Selection of thrombolytic therapy for individual patients: development of a clinical model Clinical trials in acute myocardial infarction: should we adjust for baseline characteristics Can overall results of clinical trials be applied to all patients An evidence based approach to individualising treatment Treatment selections using risk–benefit profiles based on data from comparative randomized clinical trials with multiple endpoints Identification of biomarker-by-treatment interactions in randomized clinical trials with survival outcomes and high-dimensional spaces A bayesian approach to subgroup identification Athey S, Tibshirani J, Wager S. Generalized random forests. Annals Stat. 2019;47(2):1148–78. https://doi.org/10.1214/18-AOS1709 Estimating Individual Treatment Effect in Observational Data Using Random Forest Methods Anatomical and clinical characteristics to guide decision making between coronary artery bypass surgery and percutaneous coronary intervention for individual patients: development and validation of SYNTAX score II Redevelopment and validation of the SYNTAX score II to individualise decision making between percutaneous and surgical revascularisation in patients with complex coronary artery disease: secondary analysis of the multicentre randomised controlled SYNTAXES trial with external cohort validation Download references This work has been performed in the European Health Data and Evidence Network (EHDEN) project This project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement No 806968 The JU receives support from the European Union’s Horizon 2020 research and innovation program and EFPIA Institute for Clinical Research and Health Policy Studies created the software used in this work and ran the analysis; All authors interpreted the results The author(s) read and approved the final manuscript Ethics approval was not required as the empirical illustration of this study was based on anonymized work for a research group that received/receives unconditional research grants from Yamanouchi None of these relate to the content of this paper The remaining authors have disclosed that they do not have any potential conflicts of interest Download citation DOI: https://doi.org/10.1186/s12874-023-01889-6 Metrics details there is an emergent need to develop a robust prediction model for estimating an individual absolute risk for all-cause mortality so that relevant assessments and interventions can be targeted appropriately evaluate and validate (internally and externally) a risk prediction model allowing rapid estimations of an absolute risk of all-cause mortality in the following 10 years data came from English Longitudinal Study of Ageing study which comprised 9154 population-representative individuals aged 50–75 years 1240 (13.5%) of whom died during the 10-year follow-up Internal validation was carried out using Harrell’s optimism-correction procedure; external validation was carried out using Health 
and Retirement Study (HRS) which is a nationally representative longitudinal survey of adults aged ≥50 years residing in the United States Cox proportional hazards model with regularisation by the least absolute shrinkage and selection operator where optimisation parameters were chosen based on repeated cross-validation was employed for variable selection and model fitting sensitivity and specificity were determined in the development and validation cohorts The model selected 13 prognostic factors of all-cause mortality encompassing information on demographic characteristics The internally validated model had good discriminatory ability (c-index=0.74) specificity (72.5%) and sensitivity (73.0%) the model’s prediction accuracy remained within a clinically acceptable range (c-index=0.69 The main limitation of our model is twofold: 1) it may not be applicable to nursing home and other institutional populations and 2) it was developed and validated in the cohorts with predominately white ethnicity A new prediction model that quantifies absolute risk of all-cause mortality in the following 10-years in the general population has been developed and externally validated It has good prediction accuracy and is based on variables that are available in a variety of care and research settings This model can facilitate identification of high risk for all-cause mortality older adults for further assessment or interventions which are now included in clinical guidelines for therapeutic management a prediction model for all-cause mortality in older people can be used to communicate risk to individuals and their families (if appropriate) and guide strategies for risk reduction we used data from England to develop our mortality model and data from United States to externally validate it To ensure that the cohorts employed were as representative of the general populations as possible we did not limit them based on their help and health statuses this sample was followed-up every two years wave 1 formed our baseline and follow-up data were obtained from wave 6 (2012–2013) To limit the overriding influence of age in a “cohort of survivors” we excluded participants who were > 75 years old A more detailed description of the HRS sample is provided in Supplementary Materials For the purpose of validating our mortality model we included information on mortalities that occurred from 30 January 2004 to 1 August 2015 giving us a 10-year follow-up period which is in line with the derivation cohort To make the external sample more consistent with the derivation data we further limited it to those who were aged 50–75 years old The outcome was all-cause mortality that occurred from 2002 to 2003 through to 2013 which was ascertained from the National Health Service central register which captures all deaths occurring in the UK All participants included in this study provided written consent for linkage to their official records Survival time was defined as the period from baseline when all ELSA participants were alive to the date when an ELSA participant was reported to have died during the 10-year follow-up For those who did not die during follow-up the survival time was calculated using the period spanning from baseline until the end of the study A more detailed description of these methods is provided in the Supplementary Methods estimates their effects and introduces parsimony Cox-Lasso automatically performs variable selection and deals with collinearity Selection of the tuning parameter λ optimising the model 
performance is described below Calibration plot presenting agreement between the predicted and observed survival rates at 10-years as estimated by our newly developed model Nomogram for Cox-Lasso regression which enables calculating individual normalized prognostic indexes (PI given by the linear predictor line) for all-cause mortality in the following 10 years Coefficients are based on the Lasso-Cox model as estimated by the final model for the all-cause mortality The nomogram allows computing the normalized prognostic index (PI) for a new individual The PI is a single-number summary of the combined effects of a patient’s risk factors and is a common method of describing the risk for an individual the PI is a linear combination of the risk factors with the estimated regression coefficients as weights The exponentiated PI gives the relative risk of each participant in comparison with a baseline participant (in this context the baseline participant would have value 0 for all the continuous covariates and being at the reference category for the categorical ones) The PI is normalized by subtracting the mean PI it can be used as a first-stage screening aid that might prolong life-expectancy by alerting to an individual’s heightened risk profile and a need for more targeted evaluation and prevention It could also be used by non-professionals to improve self-awareness of their health status and by governmental and health organisations to decrease the burden of certain risk factors in the general population of older people the consideration of these factors will help identify high-risk groups who might otherwise be under-detected based on prognostic factors chosen through multiple sequential hypothesis testing Specificity of the externally validated 10-item index was also considerably lower (64.4%) compared to our externally validated model (70.5%) implying it is likely to falsely classify a higher proportion of older adults as high risk for all-cause mortality in the following 10 years using baseline variables reflects the real-life clinical information available to a physician and a participant when they need to make decisions on the likely risk of all-cause mortality for an individual during the next 10 years it would be of interest to include potential interaction with a smaller set of candidate predictors in the future studies Having employed modern statistical learning algorithms and addressed the weaknesses of previous models a new mortality model achieved good discrimination and calibration as shown by its performance in a separate validation cohort which are available by patient report in a variety of care and research settings It allows rapid estimations of an individual’s risk of all-cause mortality based on an individual risk profile These characteristics suggest that our model may be useful for clinical The English Longitudinal Study of Ageing (ELSA) was developed by a team of researchers based at University College London, the Institute for Fiscal Studies and the National Centre for Social Research. 
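A minimal sketch of this type of model is given below, assuming the glmnet and survival packages and an illustrative data frame `cohort`; the variable names are placeholders rather than the ELSA predictors, and the paper’s repeated cross-validation is reduced here to a single cv.glmnet run.

```r
# Hedged sketch of a Cox-Lasso fit with a cross-validated penalty and a normalized
# prognostic index (PI); 'cohort' and its columns are illustrative placeholders.
library(glmnet)
library(survival)

x <- model.matrix(~ age + sex + smoking + self_rated_health, data = cohort)[, -1]
y <- with(cohort, Surv(follow_up_years, died))

set.seed(2020)
cvfit <- cv.glmnet(x, y, family = "cox", alpha = 1, nfolds = 10)  # alpha = 1 -> lasso

coef(cvfit, s = "lambda.min")                           # variables retained by the lasso
pi_raw  <- drop(predict(cvfit, newx = x, s = "lambda.min", type = "link"))
pi_norm <- pi_raw - mean(pi_raw)                        # normalized PI, as in the nomogram
head(exp(pi_norm))                                      # relative risk vs. the mean profile
```

The exponentiated index is interpreted exactly as described above: a single-number summary of the combined risk factor effects, expressed relative to a reference profile.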
The datasets generated and/or analysed during the current study are available in UK Data Services and can be accessed at: https://discover.ukdataservice.ac.uk No administrative permissions were required to access these data Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis guidelines Prediction of coronary heart disease using risk factor categories Validation studies for models projecting the risk of invasive and total breast cancer incidence and evaluation of a new QRISK model to estimate lifetime risk of cardiovascular disease: cohort study using QResearch database Predicting 10-year mortality for older adults Development and validation of a prognostic index for 4-year mortality in older adults Development and validation of a prognostic index for 1-year mortality in older adults after hospitalization The development and validation of an index to predict 10-year mortality risk in a longitudinal cohort of older English adults Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets Predictive analytics in information systems research Trends in life expectancy and age-specific mortality in England and Wales in comparison with a set of 22 high-income countries: an analysis of vital statistics data Second Edition ed: Springer Nature Switzerland; 2019 Cohort profile: the English longitudinal study of ageing Cohort profile: the health and retirement study (HRS) A 10-year follow-up of the health and retirement study Development and validation of a prediction model to estimate the risk of liver cirrhosis in primary care patients with abnormal liver blood test results: protocol for an electronic health record study in clinical practice research Datalink Calculating the sample size required for developing a clinical prediction model MissForest—non-parametric missing value imputation for mixed-type data A Bayesian missing value estimation method for gene expression profile data The lasso method for variable selection in the Cox model Validation of prediction models based on lasso regression with multiply imputed data A selective overview of variable selection in high dimensional feature space The elements of statistical learning: data mining A review and suggested modifications of methodological standards Classifier technology and the illusion of Progress Multivariable prognostic models: issues in developing models Three myths about risk thresholds for prediction models The inconsistency of "optimal" cutpoints obtained using two criteria based on the receiver operating characteristic curve Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation Nomograms in oncology: more than meets the eye Comparisons of nomograms and urologists' predictions in prostate cancer Guidelines on preventing cardiovascular disease in clinical practice Long-term effects of wealth on mortality and self-rated health status Purpose in life is associated with mortality among community-dwelling older persons The association between self-rated health and mortality in a well-characterized sample of coronary artery disease patients Regularization and variable selection via the elastic net Screening for prediabetes using machine learning models Predicting urinary tract infections in the emergency department with machine learning Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints Using the outcome for imputation of 
missing predictor values was preferred Multiple imputation in the presence of high-dimensional data Estimation of ten-year risk of fatal cardiovascular disease in Europe: the SCORE project Cardiovascular risk prediction models for people with severe mental illness: results from the prediction and management of cardiovascular risk in people with severe mental illnesses (PRIMROSE) research program Download references The English Longitudinal Study of Ageing is funded by the National Institute on Aging (grant RO1AG7644) and by a consortium of UK government departments coordinated by the Economic and Social Research Council (ESRC) is further funded by the National Institute for Health Research (NIHR) (NIHR Post-Doctoral Fellowship - PDF-2018-11-ST2–020) DS and DA were part funded part funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London receive salary support from the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust the NIHR Maudsley BRC The views expressed in this publication are those of the authors and not necessarily those of the NHS the National Institute for Health Research or the Department of Health and Social Care The Health and Retirement Study is funded by the National Institute on Aging (NIA U01AG009740) and the US Social Security Administration M receive salary support from the National Institute on Aging (NIA U01AG009740) The sponsors had no role in the design and conduct of the study; collection and interpretation of the data; preparation or approval of the manuscript; and decision to submit the manuscript for publication Department of Behavioural Science and Health Department of Biostatistics & Health Informatics Experimental Biomedicine and Clinical Neuroscience (BIONEC) OA had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis AS and OA conceived the idea for the study RMC and JF conducted data preparation and management OA wrote the first draft of the manuscript DA and DS edited the manuscript and approved the final version The authors read and approved the final manuscript The London Multicentre Research Ethics Committee granted ethical approval for the ELSA (MREC/01/2/91) and informed consent was obtained from all participants This manuscript is approved by all authors for publication and is an editor of Psychological Medicine Journal All other authors declare that they have no conflict of interest Outlines a list of all variables considered in the analyses and whether they have been included or excluded from the model building Distribution of missing and observed variables included in the analyses in ELSA Sample calculations for Survival outcomes (Cox prediction models) Distributions of the variables at baseline before and after multiple imputations Apparent coefficients for the Cox-LASSO regression for all-cause mortality during the 10-year follow-up Apparent models’ performance in prediction the 10-year risk of all-cause mortality in older adults Optimism-corrected models’ performance in prediction the 10-year risk of all-cause mortality in older adults Apparent models’ discrimination in prediction the 10-year risk of all-cause mortality in older adults Internally validated though optimism-correction models’ discrimination for prediction the 10-year risk of all-cause mortality in older adults Histogram 
depicting distribution of prognostic index (PI) estimated based on 13 variables included in the model in the development cohort and external cohort The distribution of survival probabilities estimated based on 13 variables included in the model in the development and validation cohorts Distributions of the variables included in the final all-cause mortality model in derivation cohort (ELSA) and validation cohort (HRS) Download citation DOI: https://doi.org/10.1186/s12874-020-01204-7 Metrics details Recent evidence suggests that there is often substantial variation in the benefits and harms across a trial population We aimed to identify regression modeling approaches that assess heterogeneity of treatment effect within a randomized clinical trial We performed a literature review using a broad search strategy complemented by suggestions of a technical expert panel The approaches are classified into 3 categories: 1) Risk-based methods (11 papers) use only prognostic factors to define patient subgroups relying on the mathematical dependency of the absolute risk difference on baseline risk; 2) Treatment effect modeling methods (9 papers) use both prognostic factors and treatment effect modifiers to explore characteristics that interact with the effects of therapy on a relative scale These methods couple data-driven subgroup identification with approaches to prevent overfitting such as penalization or use of separate data sets for subgroup identification and effect estimation 3) Optimal treatment regime methods (12 papers) focus primarily on treatment effect modifiers to classify the trial population into those who benefit from treatment and those who do not we also identified papers which describe model evaluation methods (4 papers) Three classes of approaches were identified to assess heterogeneity of treatment effect including both simulations and empirical evaluations is required to compare the available methods in different settings and to derive well-informed guidance for their application in RCT analysis is the cornerstone of precision medicine; its goal is to predict the optimal treatments at the individual level accounting for an individual’s risk for harm and benefit outcomes In this scoping review [9] we aim to identify and categorize the variety of regression-based approaches for predictive heterogeneity of treatment effects analysis Predictive approaches to HTE analyses are those that provide individualized predictions of potential outcomes in a particular patient with one intervention versus an alternative or that can predict which of 2 or more treatments will be better for a particular patient taking into account multiple relevant patient characteristics We distinguish these analyses from the typical one-variable-at-a-time subgroup analyses that appear in forest plots of most major trial reports and from other HTE analyses which explore or confirm hypotheses regarding whether a specific covariate or biomarker modifies the effects of therapy To guide future work on individualizing treatment decisions we aimed to summarize the methodological literature on regression modeling approaches to predictive HTE analysis Titles, abstracts and full texts were retrieved and double-screened by six independent reviewers against eligibility criteria. Disagreements were resolved by group consensus in consultation with a seventh senior expert reviewer (DMK) in meetings. 
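As a concrete illustration of the first category, the simplest risk-based approach (risk stratification) can be sketched as follows, assuming a hypothetical trial data frame `rct` with outcome y, randomized treatment z, and baseline covariates; in practice an externally developed, or internally cross-validated, risk model blinded to treatment assignment is generally recommended.

```r
# Sketch of risk stratification: estimate baseline risk from prognostic factors only,
# cut the predictions into quarters, and report the absolute risk difference per
# quarter. 'rct' and its columns (y, z, x1-x3) are illustrative placeholders.
risk_mod  <- glm(y ~ x1 + x2 + x3, family = binomial, data = rct)
p_base    <- predict(risk_mod, type = "response")
rct$quart <- cut(p_base, breaks = quantile(p_base, probs = seq(0, 1, 0.25)),
                 include.lowest = TRUE, labels = paste0("Q", 1:4))

# Absolute risk difference (control minus treated) within each risk quarter
sapply(split(rct, rct$quart), function(d) mean(d$y[d$z == 0]) - mean(d$y[d$z == 1]))
```

Under a roughly constant relative effect, the absolute risk difference is expected to grow with baseline risk across the quarters, which is exactly the mathematical dependency that risk-based methods exploit.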
Treatment effect modeling methods use both the main effects of risk factors and covariate-by-treatment interaction terms (on the relative scale) to estimate individualized benefits. They can be used either for making individualized absolute benefit predictions or for defining patient subgroups with similar expected treatment benefits (Table 2 Publications included in the review from 1999 until 2019 Numbers inside the bars indicate the method-specific number of publications made in a specific year In a range of plausible scenarios evaluating HTE when considering binary endpoints simulations showed that studies were generally underpowered to detect covariate-by-treatment interactions but adequately powered to detect risk-by-treatment interactions even when a moderately performing prediction model was used to stratify patients risk stratification methods can detect patient subgroups that have net harm even when conventional methods conclude consistency of effects across all major subgroups Primarily binary or time-to-event outcomes were considered Researchers should demonstrate how relative and absolute risk reduction vary by baseline risk and test for HTE with interaction tests Externally validated prediction models should be used this approach may not be optimal for risk-based assessment of HTE where accurate ranking of risk predictions is of primary importance for the calibration of treatment benefit predictions Their proportional interactions model assumes that the effects of prognostic factors in the treatment arm are equal to their effects in the control arm multiplied by a constant Testing for an interaction along the linear predictor amounts to testing that the proportionality factor is equal to 1 If high risk patients benefit more from treatment (on the relative scale) and disease severity is determined by a variety of prognostic factors the proposed test results in greater power to detect HTE on the relative scale compared to multiplicity-corrected subgroup analyses Even though the proposed test requires a continuous response it can be readily implemented in large clinical trials with binary or time-to-event endpoints For model selection an all subsets approach combined with a modified Bonferroni correction method can be used This approach accounts for correlation among nested subsets of considered proportional interactions models thus allowing the assessment of all possible proportional interactions models while controlling for the familywise error rate They compared different Cox regression models for the prediction of treatment benefit: 1) a model without any risk factors; 2) a model with risk factors and a constant relative treatment effect; 3) a model with treatment a prognostic index and their interaction; and 4) a model including treatment interactions with all available prognostic factors fitted both with conventional and with penalized ridge regression Benefit predictions at the individual level were highly dependent on the modeling strategy with treatment interactions improving treatment recommendations under certain circumstances They compared 12 different approaches in a high-dimensional setting with survival outcomes Their methods ranged from a straightforward univariate approach as a baseline where Wald tests accounting for multiple testing were performed for each treatment-covariate interaction to different approaches for dealing with hierarchy of effects—whether they enforce the inclusion of the respective main effects if an interaction is selected—and also different 
magnitude of penalization of main and interaction effects by assigning outcomes into meaningful ordinal categories Overfitting can be avoided by randomly splitting the sample into two parts; the first part is used to select and fit ordinal regression models in both the treatment and the control arm the models that perform best in terms of a cross-validated estimate of concordance between predicted and unobservable true treatment difference— defined as the difference in probability of observing a worse outcome under control compared to treatment and the probability of observing a worse outcome under treatment compared to control—are used to define treatment benefit scores for patients Treatment effects conditional on the treatment benefit score are then estimated through a non-parametric kernel estimation procedure focusing on the identification of a subgroup that benefits from treatment They repeatedly split the sample population based on the first-stage treatment benefit scores and estimate the treatment effect in subgroups above different thresholds These estimates are plotted against the score thresholds to assess the adequacy of the selected scoring rule This method could also be used for the evaluation of different modeling strategies by selecting the one that identifies the largest subgroup with an effect estimate above a desired threshold They also start by fitting separate outcome models within treatment arms rather than using these models to calculate treatment benefit scores they imputed individualized absolute treatment effects defined as the difference between the observed outcomes and the expected counterfactual (potential) outcomes based on model predictions two separate regression models—one in each treatment arm—are fitted to the imputed treatment effects they combined these two regression models for a particular covariate pattern by taking a weighted average of the expected treatment effects the binary cadit is 1 when a treated patient has a good outcome or when an untreated patient does not the dependent variable implicitly codes treatment assignment and outcome simultaneously They first demonstrated that the absolute treatment benefit equals 2 × P(cadit = 1) − 1 and then they derived patient-specific treatment effect estimates by fitting a logistic regression model to the cadit A similar approach was described for continuous outcomes with the continuous cadit defined as − 2 and 2 times the centered outcome the outcome minus the overall average outcome The approach identifies single covariates likely to modify treatment effect along with the expected individualized treatment effect The authors also extended their methodology to include two covariates simultaneously allowing for the assessment of multivariate subgroups Real-valued (continuous or binary) are considered without considering censoring if common approaches for the assessment of model fit had been examined They argue that if adequately fitting outcome models had been thoroughly sought the extra modeling required for the robust methods of Zhang et al they recursively update non-parametric estimates of the treatment-covariate interaction function from baseline risk estimates and vice-versa until convergence The estimates of absolute treatment benefit are then used to restrict treatment to a contiguous sub-region of the covariate space Starting from continuous responses they generalized their methodology to binary and time-to-event outcomes Using LASSO regression to reduce the space of all possible combinations of 
Using LASSO regression to reduce the space of all possible combinations of covariates and their interaction with treatment to a limited number of covariate subsets, their approach selects the optimal subset of candidate covariates by assessing the increase in the expected response from assigning treatment based on the considered treatment effect model versus the expected response of treating everyone with the treatment found best from the overall RCT result. The considered criterion also penalizes models for their size, providing a tradeoff between model complexity and the increase in expected response. The method focuses solely on continuous outcomes; suggestions are made for its extension to binary outcomes. The GEM is defined as the linear combination of candidate effect modifiers, and the objective is to derive their individual weights. This is done by fitting linear regression models within treatment arms, where the independent variable is a weighted sum of the baseline covariates, while keeping the weights constant across treatment arms. The intercepts and slopes of these models, along with the individual covariate GEM contributions, are derived by maximizing the interaction effect in the GEM model, by maximizing the statistical significance of an F-test for the interaction effects, or by a combination of the previous two. The authors derived estimates that can be calculated analytically for the subgroup that is assigned treatment based on the OTR. Their methodology returns an estimate of the population-level effect of treating based on the OTR compared to treating no one. μ-risk metrics evaluate the ability of models to predict the outcome of interest conditional on treatment assignment; treatment effect is either explicitly modeled by treatment interactions or implicitly by developing separate models for each treatment arm. τ-risk metrics focus directly on absolute treatment benefit, even though absolute treatment benefit is itself unobservable. Value-metrics originate from OTR methods and evaluate the outcome in patients that were assigned to treatment in concordance with model recommendations. The method relies on the expression of disease-related harms and treatment-related harms on the same scale. The minimum absolute benefit required for a patient to opt for treatment (treatment threshold) can be viewed as the ratio of treatment-related harms and harms from disease-related events. Net benefit is then calculated as the difference between the decrease in the proportion of disease-related events and the proportion of treated patients multiplied by the treatment threshold; the latter quantity can be viewed as harms from treatment translated to the scale of disease-related harms. The net benefit of a considered prediction model at a specific treatment threshold can be derived from a patient subset where treatment received is congruent with treatment assigned based on predicted absolute benefits and the treatment threshold. The model's clinical relevance is derived by comparing its net benefit to the one of a treat-all policy. Assessing a model's ability to discriminate between patients with higher or lower benefits is challenging, since treatment benefits are unobservable in the individual patient (only one of two counterfactual potential outcomes can be observed). Under the assumption of uncorrelated counterfactual outcomes, the authors matched patients from different treatment arms by their predicted treatment benefit. The difference of the observed outcomes between the matched patient pairs (1: benefit; 0: no effect; −1: harm) acts as a proxy for the unobservable absolute treatment difference.
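A minimal sketch of the decision-analytic net-benefit comparison described above is given below. It assumes a 1:1 randomized trial, takes the decrease in disease-related events relative to treating no one as the "decrease in the proportion of disease-related events", estimates the event rate under each policy from the patient subset whose received treatment is congruent with the policy's recommendation, and compares a model-based policy with a treat-all policy. All data, thresholds, and variable names are hypothetical.

```python
# Illustrative net-benefit comparison for a treatment-benefit model (simulated data).
import numpy as np

rng = np.random.default_rng(1)
n = 20000
treat = rng.binomial(1, 0.5, size=n)
pred_benefit = rng.beta(2, 18, size=n)            # hypothetical predicted risk reductions
base_risk = np.clip(rng.beta(2, 8, size=n), 0.01, 0.99)
risk = np.where(treat == 1, np.clip(base_risk - pred_benefit, 0, 1), base_risk)
event = rng.binomial(1, risk)                     # disease-related event (1 = event)

def event_rate_under_policy(recommend_treat):
    """Estimate the event rate if everyone followed the policy, using the subset of
    patients whose randomized treatment is congruent with the recommendation."""
    congruent = treat == recommend_treat.astype(int)
    return event[congruent].mean()

def net_benefit(recommend_treat, threshold):
    """Net benefit = reduction in event proportion (here: versus treating no one)
    minus (proportion treated under the policy) * treatment threshold."""
    rate_none = event_rate_under_policy(np.zeros(n, dtype=bool))
    rate_policy = event_rate_under_policy(recommend_treat)
    return (rate_none - rate_policy) - recommend_treat.mean() * threshold

threshold = 0.05                                  # minimum benefit worth the treatment harms
model_policy = pred_benefit > threshold
print("net benefit, model-based policy:", round(net_benefit(model_policy, threshold), 4))
print("net benefit, treat-all policy:  ", round(net_benefit(np.ones(n, dtype=bool), threshold), 4))
```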
The c-statistic for benefit can then be defined on the basis of this tertiary outcome as the proportion of all possible pairs of patient pairs in which the patient pair observed to have greater treatment benefit was also predicted to do so. They link observed outcomes to unobservable quantities and derive posterior probability estimates of false inclusion or false exclusion in the final model for the considered covariates. Following the definition of an outcome-space sub-region that is considered beneficial, individualized posterior probabilities of belonging to that beneficial sub-region can be derived as a by-product of the proposed methodology. Risk modeling approaches (1) apply the overall relative treatment effect to predictions of baseline risk, while (2) risk stratification analyzes treatment effects within strata of predicted risk. This approach is straightforward to implement and may provide adequate assessment of HTE in the absence of strong prior evidence for potential effect modification. The approach might better be labeled 'benefit magnification', since benefit increases with higher baseline risk under a constant relative risk. Treatment effect modeling methods focus on predicting the absolute benefit of treatment through the inclusion of treatment-covariate interactions alongside the main effects of risk factors. However, modeling such interactions can result in serious overfitting of treatment benefit, especially in the absence of well-established treatment effect modifiers. Penalization methods such as LASSO regression, ridge regression, or a combination of the two (elastic net penalization) can be used as a remedy when predicting treatment benefits in other populations. Staging approaches, starting from possibly overfitted "working" models predicting absolute treatment benefits that can later be used to calibrate predictions in groups of similar treatment benefit, provide another alternative. While these approaches should yield well-calibrated personalized effect estimates when data are abundant, it is yet unclear how broadly applicable these methods are in conventionally sized RCTs, and the additional discrimination of benefit of these approaches compared to the less flexible risk modeling approaches remains uncertain. Simulations and empirical studies should be informative regarding these questions. Because prognostic factors do not affect the sign of the treatment effect, several OTR methods rely primarily on treatment effect modifiers. However, when treatments are associated with adverse events or treatment burdens (such as costs) that are not captured in the primary outcome—as is often the case—estimates of the magnitude of treatment effect are required to ensure that only patients above a certain expected net benefit threshold (i.e., outweighing the harms and burdens of therapy) are treated. In addition, these classification methods do not provide comparable opportunity for the incorporation of patient values and preferences for shared decision making, which prediction methods do. While there is an abundance of proposed methodological approaches, examples of clinical application of HTE prediction models remain quite rare. This may reflect the fact that all these approaches confront the same fundamental challenges. These challenges include the unobservability of individual treatment response, the curse of dimensionality from the large number of covariates, the lack of prior knowledge about the causal molecular mechanisms underlying variation in treatment effects and the relationship of these mechanisms to observable variables, and the very low power available to explore interactions.
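To make the penalization idea discussed above concrete, the sketch below fits an elastic-net penalized logistic regression containing main effects, a treatment indicator, and all treatment-covariate interactions, and then derives individualized absolute benefit as the difference between predicted risks under control and under treatment. The simulated data, penalty settings, and feature construction are illustrative assumptions, not a prescribed implementation; in practice covariates would be standardized and the penalty tuned by cross-validation.

```python
# Illustrative penalized treatment-effect model: main effects plus treatment-covariate
# interactions with elastic-net penalization (simulated data, hypothetical settings).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, p = 3000, 10
X = rng.normal(size=(n, p))
treat = rng.binomial(1, 0.5, size=n)

# True model: only covariate 0 modifies the treatment effect (for illustration)
lin = -1.0 + 0.6 * X[:, 0] - 0.4 * X[:, 1] + treat * (-0.3 - 0.5 * X[:, 0])
event = rng.binomial(1, 1 / (1 + np.exp(-lin)))

def design(X, treat):
    """Main effects, treatment indicator, and all treatment-covariate interactions."""
    return np.hstack([X, treat[:, None], X * treat[:, None]])

model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.9, C=0.5, max_iter=5000)
model.fit(design(X, treat.astype(float)), event)

# Predicted absolute benefit: risk under control minus risk under treatment
risk_treated = model.predict_proba(design(X, np.ones(n)))[:, 1]
risk_control = model.predict_proba(design(X, np.zeros(n)))[:, 1]
benefit_hat = risk_control - risk_treated
print("mean predicted benefit:", round(benefit_hat.mean(), 3))
```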
Because of these challenges, there might be very serious constraints on the usefulness of these methods as a class; while some methods may be shown to have theoretical advantages, the practical import of these theoretical advantages may not be ascertainable. It is uncertain whether any of these approaches will add value to the more conventional EBM approach of using an overall estimate of the main effect, or to the risk magnification approach of applying that relative estimate to a risk model. Our review is descriptive and did not compare the approaches for their ability to predict individualized treatment effects or to identify patient subgroups with similar expected treatment benefits. We identified a large number of methodological approaches for the assessment of heterogeneity of treatment effects in RCTs developed in the past 20 years, which we managed to divide into 3 broad categories. Extensive simulations along with empirical evaluations are required to assess those methods' relative performance under different settings and to derive well-informed guidance for their implementation. This may allow these novel methods to inform clinical practice and provide decision makers with reliable individualized information on the benefits and harms of treatments. While we documented an exuberance of new methods, we do note a marked dearth of comparative studies in the literature. Future research could shed light on advantages and drawbacks of methods in terms of predictive performance in different settings. Users' guides to the medical literature: II. How to use an article about therapy or prevention. A. Explanatory and pragmatic attitudes in therapeutical trials. Evidence based medicine: concerns of a clinical neurologist. Limitations of applying summary results of clinical trials to individual patients: the need for risk stratification. Enhancing the scoping study methodology: a large inter-professional team's experience with Arksey and O'Malley's framework. Harrell F. Viewpoints on Heterogeneity of Treatment Effect and Precision Medicine [Internet]. Statistical Thinking. 2018 [cited 2020 Jun 14]. Available from: https://fharrell.com/post/hteview/ Rothman K, Greenland S, Lash TL. Modern Epidemiology, 3rd Edition.
2007 31 [cited 2020 Jul 27]; Available from: https://www.rti.org/publication/modern-epidemiology-3rd-edition The predictive approaches to treatment effect heterogeneity (PATH) statement The predictive approaches to treatment effect heterogeneity (PATH) statement: explanation and elaboration Using group data to treat individuals: understanding heterogeneous treatment effects in the age of precision medicine and patient-centred evidence Estimating treatment effects for individual patients based on the results of randomised clinical trials Method for evaluating prediction models that apply the results of randomized trials to individual patients Profile-specific survival estimates: making reports of clinical trials more patient-relevant Selection of thrombolytic therapy for individual patients: development of a clinical model GUSTO-I Investigator Multivariable risk prediction can greatly enhance the statistical power of clinical trial subgroup analysis Implications of heterogeneity of treatment effect for reporting and analysis of randomized trials in critical care Using internally developed risk models to assess heterogeneity in treatment effects in clinical trials Risk and treatment effect heterogeneity: re-analysis of individual participant data from 32 large clinical trials Baseline characteristics predict risk of progression and response to combined medical therapy for benign prostatic hyperplasia (BPH) Improving diabetes prevention with benefit based tailored treatment: risk based reanalysis of diabetes prevention program Multistate Model to Predict Heart Failure Hospitalizations and All-Cause Mortality in Outpatients With Heart Failure With Reduced Ejection Fraction: Model Derivation and External Validation Explicit inclusion of treatment in prognostic modeling was recommended in observational and randomized settings A multivariate test of interaction for use in clinical trials Assessing heterogeneity of treatment effect in a clinical trial with the proportional interactions model Percutaneous coronary intervention versus coronary-artery bypass grafting for severe coronary artery disease Estimates of absolute treatment benefit for individual patients required careful modeling of statistical interactions Benefit and harm of intensive blood pressure treatment: derivation and validation of risk models using data from the SPRINT and ACCORD trials Action to Control Cardiovascular Risk in Diabetes Study Group Effects of intensive glucose lowering in type 2 diabetes Treatment selections using risk-benefit profiles based on data from comparative randomized clinical trials with multiple endpoints Effectively selecting a target population for a future comparative study Post hoc subgroups in clinical trials: anathema or analytics A Bayesian approach to subgroup identification Performance guarantees for individualized treatment rules Reader reaction to “a robust method for estimating optimal treatment regimes” by Zhang et al Estimating optimal treatment regimes from a classification perspective A simple method for estimating interactions between a treatment and a large number of covariates and combining moderators of treatment on outcome after randomized clinical trials: a parametric approach A novel approach for developing and interpreting treatment moderator profiles in randomized clinical trials Advancing personalized medicine: application of a novel statistical method to identify treatment moderators in the coordinated anxiety learning and management study Variable selection for qualitative 
interactions in personalized medicine while controlling the family-wise error rate Generated effect modifiers (GEM’s) in randomized clinical trials Statistical Inference For The Mean Outcome Under A Possibly Non-Unique Optimal Treatment Strategy Targeted learning of the mean outcome under an optimal dynamic treatment rule Inference about the expected performance of a data-driven dynamic treatment regime Evaluating the impact of treating the optimal subgroup Discussion of “Dynamic treatment regimes: Technical challenges and applications” Schuler A, Baiocchi M, Tibshirani R, Shah N. A comparison of methods for model selection when estimating individual treatment effects. arXiv:180405146 [cs, stat] [Internet]. 2018 13 [cited 2020 Jun 14]; Available from: http://arxiv.org/abs/1804.05146 The proposed “concordance-statistic for benefit” provided a useful metric when modeling heterogeneous treatment effects Bayesian variable selection with joint modeling of categorical and survival outcomes: an application to individualizing chemotherapy treatment in advanced colorectal cancer Measuring the performance of markers for guiding treatment decisions The Fundamental Difficulty With Evaluating the Accuracy of Biomarkers for Guiding Treatment Assessing treatment-selection markers using a potential outcomes framework Statistical and practical considerations for clinical evaluation of predictive biomarkers Harrell F. EHRs and RCTs: Outcome Prediction vs. Optimal Treatment Selection [Internet]. Statistical Thinking. 2017 [cited 2020 Jun 14]. Available from: https://fharrell.com/post/ehrs-rcts/ Estimating individualized treatment rules using outcome weighted learning Doubly robust learning for estimating individualized treatment with censored data Louizos C, Shalit U, Mooij J, Sontag D, Zemel R, Welling M. Causal Effect Inference with Deep Latent-Variable Models. arXiv:170508821 [cs, stat] [Internet]. 
2017 [cited 2020 Jun 14]; Available from: http://arxiv.org/abs/1705.08821 Use of open access platforms for clinical trial data Clinical research data sharing: what an open science world means for researchers involved in evidence synthesis Overview and experience of the YODA Project with clinical trial data sharing after 5 years Download references We acknowledge support from the Innovative Medicines Initiative (IMI) and helpful comments from Victor Talisa Data Scientist from the University of Pittsburgh and writing for this work were supported by a Patient Centered Outcomes Research Institute (PCORI) contract the Predictive Analytics Resource Center [SA.Tufts.PARC.OSCO.2018.01.25] Predictive Analytics and Comparative Effectiveness (PACE) Center Institute for Clinical Research and Health Policy Studies (ICRHPS) DK and DVK contributed to the conception and design of the work DK and DVK contributed to the acquisition of the data DK and DVK contributed to the interpretation of the data All authors have approved the submitted version Download citation DOI: https://doi.org/10.1186/s12874-020-01145-1 Metrics details The number of clinician burnouts is increasing and has been linked to a high administrative burden Automatic speech recognition (ASR) and natural language processing (NLP) techniques may address this issue by creating the possibility of automating clinical documentation with a “digital scribe” We reviewed the current status of the digital scribe in development towards clinical practice and present a scope for future research We performed a literature search of four scientific databases (Medline and Arxiv) and requested several companies that offer digital scribes to provide performance data We included articles that described the use of models on clinical conversational data either automatically or manually transcribed three described ASR models for clinical conversations The other 17 articles presented models for entity extraction or summarization of clinical conversations Two studies examined the system’s clinical validity and usability while the other 18 studies only assessed their model’s technical validity on the specific NLP task The most promising models use context-sensitive word embeddings in combination with attention-based neural networks the studies on digital scribes only focus on technical validity while companies offering digital scribes do not publish information on any of the research phases Future research should focus on more extensive reporting iteratively studying technical validity and clinical validity and usability and investigating the clinical utility of digital scribes This digital scribe uses techniques such as automatic speech recognition (ASR) and natural language processing (NLP) to automate (parts of) clinical documentation The proposed structure for a digital scribe includes a microphone that records a conversation an ASR system that transcribes this conversation and a set of NLP models to extract or summarize relevant information and present it to the physician or use the extracted information for diagnosis support A scoping review of current evidence is needed to determine the current status of the digital scribe and to make recommendations for future research researchers can find a suitable dataset or collect data themselves Researchers should also check if the dataset contains any unintended bias or underrepresented groups the researchers should prospectively study the model in clinical practice the model might run in clinical practice without 
showing the output to the end-users; end-users analyze the output to identify any errors, and a prospective study can be set up to determine clinical impact. The purpose of the present study is to perform a scoping review of the literature and contact companies on the current status of digital scribes in healthcare. Which methods are being used to develop (part of) a digital scribe? Have any of these methods been evaluated in clinical practice? These companies were requested to provide unpublished performance data for their digital scribe. Our definition of a digital scribe is any system that uses a clinical conversation as input and automatically extracts information that can be used to generate an encounter note. We included articles that describe the performance of either ASR or NLP on clinical conversational data. A clinical conversation was defined as a conversation—in real life or via chat—between at least one patient and one healthcare professional. Because ASR and NLP are different fields of expertise and will often be described in separate studies, we chose to include studies that only focused on part of a digital scribe. Studies that described NLP models that were not aimed at creating an encounter note but that extracted information for research purposes were excluded, as were articles written in any language other than English. Because of the rapidly evolving research field and the time lag for publications, preprints and workshop proceedings were also eligible. Two reviewers (… and S.A.C.) independently screened all articles on title and abstract using the inclusion and exclusion criteria. The selected articles were assessed for eligibility by reading the full text. The first reviewer extracted information from the included articles and the unpublished data provided by companies; the second reviewer verified the extracted information. The following aspects were extracted and assessed: …. The four phases of article selection following the PRISMA-ScR statement. We were unable to obtain performance data from other companies. None of the studies investigated the clinical utility. WER: this metric counts the number of substitutions, deletions, and insertions in the automatic transcript relative to the gold standard transcript. F1 score: the F1 score is the harmonic mean between the precision (or positive predictive value) and the recall (or sensitivity). ROUGE: this is a score that measures the similarity between the automatic summary and the gold standard summary, based on overlapping unigrams (ROUGE-1), bigrams (ROUGE-2), or the longest common subsequence (ROUGE-L). The ROUGE-L score considers sentence-level structure, while the ROUGE-1 and ROUGE-2 scores only examine if a uni- or bigram occurs in both the automatic and gold standard summary. Scope of the different aspects and techniques of the included digital scribes. Highest F1 scores per entity extraction task. One study33 tested their classification model on manually transcribed data and automatically transcribed data. The model performed better on the manually transcribed data, with a difference in F1 score ranging from 0.03 to 0.06, although they did not mention if the difference was significant. They formed 18 disadvantaged and advantaged groups based on gender; there was a statistically significant difference in favor of the advantaged group. The main reason for the disparity is a difference in the type of medical visit: "blood" is a strong lexical cue to classify a sentence as important for the "Plan" section of the summary, but this word is said less often in conversations with Asian patients. The F1 score for the latter study was 0.61, limiting the comparability with the other studies.
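The sketch below implements the three evaluation metrics from their textbook definitions, as a rough illustration of what the reported numbers measure; published studies typically rely on dedicated packages (e.g. jiwer for WER or rouge-score for ROUGE), and the example strings are invented.

```python
# Illustrative implementations of WER, F1, and ROUGE-N (toy inputs, not a benchmark).
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / len(r)

def f1(precision, recall):
    """Harmonic mean of precision (positive predictive value) and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)

def rouge_n(reference, summary, n=1):
    """ROUGE-N recall: overlapping n-grams / n-grams in the gold-standard summary.
    (ROUGE-L instead scores the longest common subsequence.)"""
    ngrams = lambda words: [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    ref, hyp = ngrams(reference.split()), ngrams(summary.split())
    overlap = sum(min(ref.count(g), hyp.count(g)) for g in set(ref))
    return overlap / len(ref)

print(wer("patient reports chest pain since monday",
          "patient reports chest pains since monday"))   # one substitution -> ~0.17
print(f1(precision=0.80, recall=0.70))                    # ~0.75
print(rouge_n("start lisinopril 10 mg daily", "start lisinopril 10 mg", n=1))
```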
When using the same model with automatically extracted noteworthy utterances, physicians found that 80% of the summaries included "all" or "most" relevant facts. The study did not specify which parts were deemed relevant or not, or if the model missed specific information. DeepScribe did not provide information on the models used for summarization, but included how often a summary needed to be adjusted in practice. They report that 77% of their summaries do not need modification by a medical scribe before being sent to the physician, and that 74% of their summaries do not need modification from a medical scribe or a physician before being accepted as part of the patient's record. Attention-based neural networks: these models specifically take the sequence of the words into account, passing only the relevant subset of the input to the next layer, and have an attention mechanism to focus on the relevant parts of the input. The span-attribute tagging (SAT) model first identifies the relevant parts of the text and then classifies those relevant parts into symptoms that are or are not present. The relation-span-attribute tagging model (R-SAT) is a variant of the SAT that focuses on relations between attributes. The added value of a PGNet is that it has the ability to generate new words or copy words from the text. This scoping review provides an overview of the current state of the development. Although the digital scribe is still in an early research phase, there appears to be a substantial research body testing various techniques in different settings. The first results are promising: state-of-the-art models are trained on vast corpora of annotated clinical conversations. Although the performance of these models varies per task, the results give a clear view of which tasks and which models yield high performance. Reports of clinical validity and usability remain scarce. These approaches are promising new ways to decrease the WER, but what is most important is whether the WER is good enough to extract all the relevant information. The NLP models trained on manually transcribed data outperform those trained on automatically transcribed data, which means there is room for improvement of the WER. For the NLP tasks, the diverseness in both tasks and underlying models was large. The classification models focused mainly on extracting metadata, such as relevance or structure induction of an utterance, and used various models ranging from logistic regression to neural networks. The entity extraction models were more homogeneous in the underlying models but extracted many different entities, whereas the summarization task was mostly uniform. One notable aspect of the NLP tasks overall is the use of word embeddings. Only one study did not use word embeddings, but this was a study from 2006, when context-sensitive word embeddings were not yet available. All the other studies were published after 2019 and used various word embeddings as input. The introduction of context-sensitive word embeddings has been essential for extracting entities and summarizing clinical conversations. More specific extraction tasks led to better performance than more general tasks such as extracting symptoms and their properties. An explanation for this is the heterogeneity in phrasing: these properties can be phrased in various ways, whereas more specific entities will be much more homogeneous in phrasing, and this homogeneity leads to many more annotations per entity. This is in line with earlier findings, which describe the decrease in neural networks' performance with increased input length. With an attention mechanism, the model knows which parts of the text are important for its task.
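The toy example below illustrates the attention idea referred to above: relevance scores between a decoder query and the encoder states of the transcript tokens are turned into a softmax distribution, and the weighted summary is passed on. Dimensions and vectors are random placeholders; real digital-scribe models use trained transformer or pointer-generator architectures rather than this hand-rolled sketch.

```python
# Toy scaled dot-product attention over the hidden states of a short transcript.
import numpy as np

rng = np.random.default_rng(3)
seq_len, d = 6, 8                                # 6 input tokens, 8-dimensional states
encoder_states = rng.normal(size=(seq_len, d))   # one vector per transcript token
query = rng.normal(size=d)                       # decoder state at the current step

scores = encoder_states @ query / np.sqrt(d)     # relevance score per input position
weights = np.exp(scores - scores.max())
weights /= weights.sum()                         # softmax: attention distribution

context = weights @ encoder_states               # weighted summary passed to the next layer
print("attention weights:", np.round(weights, 2))
print("context vector shape:", context.shape)

# A pointer-generator additionally mixes this attention distribution (copying words
# from the transcript) with a generation distribution over the output vocabulary,
# controlled by a learned gate p_gen in [0, 1].
```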
Adding attention not only improves performance; it also decreases the amount of training data needed, which is useful in a field such as healthcare, where gathering large datasets can be challenging. Reporting on the annotation process was often limited, including how studies dealt with ambiguity and labeling errors; it would also have been interesting to include error analyses to investigate the models' blind spots. We believe it is vital to improve the ASR for clinical conversations further and to use the resulting transcripts as input for NLP models. A remarkable finding was that most studies used manually transcribed conversations as input to their NLP model. These manual transcripts may outperform automatically transcribed conversations regarding data quality, leading to an overestimation of the results. Moreover, NLP models that require manual transcription may increase administrative burden when implemented in clinical practice. Reporting guidelines for clinical artificial intelligence should be the basis for reporting on digital scribes as well. A next step is studying clinical validity and usability, where physicians qualitatively analyze the model's output; these results lead to new insights for improving technical validity. Studying these two research phases iteratively leads to a solution that is well-suited for clinical practice. These studies should be the starting point for researchers and developers working on a digital scribe. The current work is the first effort to review all available literature on developing a digital scribe. We believe our search strategy was complete, leading to a comprehensive and focused scope of the digital scribe's current research body. By contacting companies, we create a broader overview than just the digital scribe's scientific status, although this means we have to trust the company in providing us with legitimate data. We hope this review is an encouragement for other companies to study their digital scribes scientifically. One limitation is the small amount of journal papers included in this review, as opposed to the amount of Arxiv preprints and workshop proceedings. These types of papers are often refereed very loosely; however, only including journal papers would not lead to a complete scope of this quickly evolving field. Contacting various digital scribe companies was a first step towards gaining insight into implemented digital scribes and their performance on the different ASR and NLP tasks. Although only one company provided data, we believe it is a valuable addition to this review: it indicates that their implemented digital scribe does not differ significantly in techniques or performance from the included studies' models, while already saving physicians' time, and it highlights the gap between research and practice. The studies published by companies all describe techniques that are not part of a fully functional digital scribe (yet); none of the companies offering digital scribes have published about the technical validity. Although the digital scribe field has only recently started to accelerate, the presented techniques achieve promising results, while companies offering digital scribes do not publish on any of the research phases. Any data generated or analyzed are included in this article and the Supplementary Information files; aggregate data analyzed in this study are available from the corresponding author on reasonable request. Changes in burnout and satisfaction with work-life integration in physicians and the general US working population between 2011 and 2017. Taking Action Against Clinician Burnout: A Systems Approach to Professional Well-Being (The National Academies Press). Tethered to the EHR: primary care physician workload assessment using EHR event log data and time-motion observations. Allocation of physician time in ambulatory practice: a time and motion study in 4 specialties. Electronic health record logs indicate that physicians split time evenly between seeing patients and desktop medicine. The impact of
administrative burden on academic physicians “It’s like texting at the dinner table”: a qualitative analysis of the impact of electronic health records on patient-physician interaction in hospitals Electronic health record effects on work-life balance and burnout within the I3 population collaborative Physician stress and burnout: the impact of health information technology Impact of scribes on physician satisfaction and charting efficiency: a randomized controlled trial Association of medical scribes in primary care with physician workflow and patient experience Challenges of developing a digital scribe to reduce clinical documentation burden Ambient clinical intelligence: the exam of the future has arrived. Nuance Communications (2019). Available at: https://www.nuance.com/healthcare/ambient-clinical-intelligence.html Amazon comprehend medical. Amazon Web Services, Inc (2018). Available at: https://aws.amazon.com/comprehend/medical/ Robin Healthcare | automated clinic notes, coding and more. Robin Healthcare (2019). Available at: https://www.robinhealthcare.com Reimagining clinical documentation with artificial intelligence PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation Speech recognition for medical conversations properties and their relations from clinical conversations of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) 4979–4990 (Association for Computational Linguistics Joint speech recognition and speaker diarization via sequence transduction Extracting relevant information from physician-patient dialogues for automated clinical note taking of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI) 65–74 (Association for Computational Linguistics A systematic comparison of contemporary automatic speech recognition engines for conversational clinical speech 683–689 (American Medical Informatics Association Automatic analysis of medical dialogue in the home hemodialysis domain: structure induction and summarization Automatically charting symptoms from patient-physician conversations using machine learning Medication regimen extraction from medical conversations of International Workshop on Health Intelligence of the 34th AAAI Conference on Artificial Intelligence (Association for Computational Linguistics The medical scribe: corpus development and model performance analyses of the 12th Language Resources and Evaluation Conference (European Language Resources Association summarize: global summarization of medical dialogue by exploiting local structures In Findings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 3755–3763 (Association for Computational Linguistics Topic-aware pointer-generator networks for summarizing spoken conversations IEEE Automatic Speech Recognition Understanding Workshop 2019 Extracting Structured Data from Physician-Patient Conversations by Predicting Noteworthy Utterances (eds) Explainable AI in Healthcare and Medicine vol 914 (Springer International Publishing Generating SOAP notes from doctor-patient conversations MedFilter: improving extraction of task-relevant utterances through integration of discourse structure and ontological knowledge of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 7781–7797 (Association for Computational Linguistics Towards an automated SOAP note: classifying utterances from medical 
conversations Towards fairness in classifying medical conversations into SOAP sections In To be presented at AAAI 2021 Workshop: Trustworthy AI for Healthcare (AAAI Press Weakly supervised medication regimen extraction from medical conversations of the 3rd Clinical Natural Language Processing Workshop 178–193 (Association for Computational Linguistics Towards understanding ASR error correction for medical conversations of the First Workshop on Natural Language Processing for Medical Conversations 7–11 (Association for Computational Linguistics Generating medical reports from patient-doctor conversations using sequence-to-sequence models 22–30 (Association for Computational Linguistics Extracting symptoms and their status from clinical conversations of the 57th Annual Meeting of the Association for Computational Linguistics 915–925 (Association for Computational Linguistics DeepScribe - AI-Powered Medical Scribe. DeepScribe (2020). Available at: https://www.deepscribe.ai Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension Transcribing videos | Cloud speech-to-text documentation. Google Cloud (2016). Available at: https://cloud.google.com/speech-to-text/docs/video-model Watson speech to text - Overview. IBM (2021). Available at: https://www.ibm.com/cloud/watson-speech-to-text Kaldi ASR. Kaldi (2015). Available at: https://kaldi-asr.org mozilla/DeepSpeech. GitHub (2020). Available at: https://github.com/mozilla/DeepSpeech Speech-to-text: automatic speech recognition | Google Cloud. Google Cloud (2016). Available at: https://cloud.google.com/speech-to-text Jhu aspire system: robust LVCSR with TDNNs In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) Deliberation model based two-pass end-to-end speech recognition In IEEE International Conference on Acoustics Neural machine translation by jointly learning to align and translate Learning phrase representations using RNN encoder–decoder for statistical machine translation of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1724–1734 (Association for Computational Linguistics MINIMAR (MINimum Information for Medical AI Reporting): developing reporting standards for artificial intelligence in health care The practical implementation of artificial intelligence technologies in medicine Do no harm: a roadmap for responsible machine learning for health care Envisioning an artificial intelligence documentation assistant for future primary care consultations: a co-design study with general practitioners Identifying relevant information in medical conversations to summarize a clinician-patient encounter Regulatory frameworks for development and evaluation of artificial intelligence–based diagnostic imaging models: summary and recommendations Gender and dialect bias in YouTube’s automatic captions of the First ACL Workshop on Ethics in Natural Language Processing 53–59 (Association for Computational Linguistics DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence Sequence to sequence learning with neural networks of the 27th International Conference on Neural Information Processing Systems (NIPS) 2 Get to the point: summarization with pointer-generator networks of the 55th Annual Meeting of the Association for Computational Linguistics 1073–1083 (Association for Computational Linguistics Distributed representations of words and phrases and their compositionality of 
the 26th International Conference on Neural Information Processing Systems (NIPS) 2 of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) 1 2227–2237 (Association for Computational Linguistics BERT: pre-training of deep bidirectional transformers for language understanding of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) 4171–4186 (Association for Computational Linguistics Download references Department of Information Technology & Digital Innovation Department of Quality & Patient Safety contributed to design and critical revision of the manuscript All authors gave their final approval and accepted accountability for all aspects of the work Download citation DOI: https://doi.org/10.1038/s41746-021-00432-5 Metrics details Report cards on the health care system increasingly report provider-specific performance on indicators that measure the quality of health care delivered A natural reaction to the publishing of hospital-specific performance on a given indicator is to create ‘league tables’ that rank hospitals according to their performance many indicators have been shown to have low to moderate rankability meaning that they cannot be used to accurately rank hospitals Our objective was to define conditions for improving the ability to rank hospitals by combining several binary indicators with low to moderate rankability Monte Carlo simulations to examine the rankability of composite ordinal indicators created by pooling three binary indicators with low to moderate rankability We considered scenarios in which the prevalences of the three binary indicators were 0.05 and 0.25 and the within-hospital correlation between these indicators varied between − 0.25 and 0.90 Creation of an ordinal indicator with high rankability was possible when the three component binary indicators were strongly correlated with one another (the within-hospital correlation in indicators was at least 0.5) When the binary indicators were independent or weakly correlated with one another (the within-hospital correlation in indicators was less than 0.5) the rankability of the composite ordinal indicator was often less than at least one of its binary components The rankability of the composite indicator was most affected by the rankability of the most prevalent indicator and the magnitude of the within-hospital correlation between the indicators Pooling highly-correlated binary indicators can result in a composite ordinal indicator with high rankability the composite ordinal indicator may have lower rankability than some of its constituent components It is recommended that binary indicators be combined to increase rankability only if they represent the same concept of quality of care or length of stay) or a process of care (e.g. discharge prescribing of evidence-based medications in specific patient populations) that is used to assess the quality of health care A common practice is to report hospital-specific means of health care indicators (e.g. 
the proportion of patients who died in each hospital or mean length of stay) Crude (or unadjusted) or risk-adjusted estimates of hospital performance on specific indicators can be reported They found that rankability ranged from 0.01 for patients with osteoarthritis undergoing total hip arthroplasty/total knee arthroplasty to 0.71 following hospitalization for stroke A question when developing indicators for assessing quality of health care is whether several binary indicators reflecting outcomes of increasing severity which individually have poor to moderate rankability can be combined into an ordinal indicator to increase rankability The objective of the current study was to examine how the rankability of composite ordinal indicators compared to the rankabilities of the component binary indicators The paper is structured as follows: In Section 2 we provide background and formally define rankability we conduct a series of Monte Carlo simulations to examine the relationship between the rankability of a binary indicator and the intraclass correlation coefficient (ICC) of that indicator across hospitals (as a measure of the between-hospital variation) we conduct a series of Monte Carlo simulations to examine the relationship between the rankability of a composite ordinal indicator and the rankabilities of the individual binary indicators from which it was formed in Section 5 we summarize our findings and place them in the context of the existing literature Let Y denote a binary indicator that is used to assess the performance of a health care provider (e.g. we will refer to the hospital as the provider but the methods are equally applicable to other healthcare providers (e.g. physicians or health care administrative regions) Yij = 1 denote that the indicator was positive or present (e.g. the patient died or SSI occurred) for the ith patient at the jth hospital while Yij = 0 denotes that the indicator was negative for this patient (e.g. the patient did not die or SSI did not occur) Let Xij denote a vector of covariates measured on the ith patient at the jth hospital (e.g. 
A random effects logistic regression model can be fit to model the variation in the indicator; we used the above definition because it appears to be the most frequently used definition in the context of multilevel analysis. Instead of fitting a random effects model to model variation in the indicator, one could replace the hospital-specific random effects by fixed hospital effects, where there are k − 1 indicator or dummy variables to represent the fixed effects of the k hospitals. Let sj denote the standard error of the estimated hospital effect for the jth hospital; these standard errors denote the precision with which the hospital-specific fixed effects are estimated. The rankability relates the total variation from the random effects model to the uncertainty of the individual hospital effects from the fixed effects model. It can be interpreted as the proportion of the variation between hospitals that is not due to chance. We conducted a series of Monte Carlo simulations to examine the relationship between ICC and the rankability of a single binary indicator. Let X and Y denote a continuous risk score and a binary indicator. The following random effects model relates the continuous risk score to the presence of the binary indicator: \( \operatorname{logit}\left(\Pr \left({Y}_{ij}=1\right)\right)={\alpha}_{0j}+{\alpha}_1{X}_{ij} \), where the hospital-specific random effects follow a normal distribution: \( {\alpha}_{0j}\sim N\left({\alpha}_0,{\tau}^2\right) \). Here α0 determines the overall prevalence of the binary indicator, and α1 determines the magnitude of the strength of the relationship between the risk score and the presence of the binary indicator. Fixing the standard deviation of the random effects distribution at \( \tau =\pi \sqrt{\frac{\mathrm{ICC}}{3\left(1-\mathrm{ICC}\right)}} \) will result in a model with the desired value of the ICC. We then simulated a binary outcome for the indicator from a Bernoulli distribution with subject-specific parameter Pr(Yij = 1). We designed the simulations so that hospital volume was fixed across hospitals; this was done to remove any effect of varying hospital volume on rankability. We allowed the following three factors to vary: (i) the ICC; (ii) the average intercept (α0); and (iii) the fixed slope (α1). The ICC was allowed to take on 13 values from 0 to 0.24 in increments of 0.02; these values were selected as they range from no effect of clustering (ICC = 0) to a strong effect of clustering. The average intercept was allowed to take on four values: −3, …. The fixed slope was allowed to take on three values: −0.25, …. We thus considered 156 different scenarios. In each of the 156 different scenarios we simulated 100 datasets; in each simulated dataset we estimated the rankability of the binary indicator using the methods described in Section 2 (rankability was estimated using the estimated variance of the random effects), and we then computed the average rankability across the 100 simulated datasets for that scenario. The simulations were conducted using the R statistical programming language (version 3.5.1). The random effects logistic regression models were fit using frequentist methods using the glmer function from the lme4 package for R.
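The sketch below illustrates the quantities used in this first simulation: a target ICC is converted to the random-intercept standard deviation τ via τ = π√(ICC/(3(1 − ICC))), hospital-clustered binary outcomes are simulated, and rankability is approximated as the share of between-hospital variation not attributable to chance. The original simulations were run in R with glmer/lme4; this Python version is only illustrative, the rankability formula shown (τ² divided by τ² plus the median squared standard error of the hospital effects) is one common formulation from the cited literature rather than a formula stated in the text, and the standard errors here are for crude, unadjusted hospital log-odds.

```python
# Illustrative simulation of a hospital-clustered binary indicator at a target ICC,
# with an approximate rankability calculation (assumptions noted in the lead-in).
import numpy as np

rng = np.random.default_rng(4)
icc = 0.10
tau = np.pi * np.sqrt(icc / (3 * (1 - icc)))    # SD of hospital random intercepts
n_hosp, n_per_hosp = 100, 200
alpha0 = -2.0                                   # average intercept (overall prevalence)

alpha_j = rng.normal(alpha0, tau, size=n_hosp)  # hospital-specific intercepts
p_j = 1 / (1 + np.exp(-alpha_j))
events = rng.binomial(n_per_hosp, p_j)          # events per hospital

# Standard error of each hospital's estimated log-odds (crude fixed-effect analogue)
se_j = np.sqrt(1 / np.maximum(events, 0.5) + 1 / np.maximum(n_per_hosp - events, 0.5))

rankability = tau**2 / (tau**2 + np.median(se_j**2))
print(f"tau = {tau:.3f}, approximate rankability = {rankability:.2f}")
```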
We used an extensive series of Monte Carlo simulations to examine whether combining three binary indicators into an ordinal indicator resulted in an ordinal indicator with greater rankability compared to that of its binary components. We examined scenarios with three binary indicators: Y1, Y2, and Y3. The following three random effects models relate an underlying continuous risk factor to the presence of each of the three binary indicators; we assumed that the hospital-specific random effects followed a normal distribution: \( {\alpha}_{0 kj}\sim N\left({\alpha}_{0k},{\tau}_{kk}^2\right) \). We assumed that the distribution of the triplet of hospital-specific random effects followed a multivariate normal distribution. This indicator would have an overall prevalence of 25% by construction. The ordinal indicator in our study was defined as: …. Thus, a subject had the most severe/serious level of the composite ordinal indicator (5) if the most serious of the binary indicators (Y1) was present, regardless of whether or not any of the other two indicators had occurred. A subject had the least severe/serious level of the composite ordinal indicator (1) if none of the binary indicators was present. We computed the rankability of the ordinal indicator. The mean rankability of each of the three binary indicators and the one ordinal indicator was determined over 100 iterations for each scenario. For each of the 16 combinations of the above two factors we considered three different sets of rankability values for the three binary indicators. The ordinal logistic regression model was fit using the polr function from the MASS package, while the random effects ordinal logistic regression model was fit using the clmm function from the ordinal package for R. … and third binary indicators across the 48 scenarios were 0.05, …; … and third binary indicators across the 48 scenarios were 0.36 (range 0.22 to 0.43). Rankability of binary and ordinal indicators, restricting the analysis to those scenarios in which the correlation between hospital-specific random effects was less than or equal to 0.5. The use of 100 replications in each of the 48 scenarios in the Monte Carlo simulations allowed us to estimate rankability with relatively good precision. For each scenario and for each of the indicators, we computed the standard deviation of the rankability across the 100 replications for that scenario. The mean standard deviation of the rankability of the first binary indicator was 0.067 across the 48 scenarios (ranging from 0.062 to 0.074). The mean standard deviation of the rankability of the second binary indicator was 0.058 across the 48 scenarios (ranging from 0.046 to 0.069). The mean standard deviation of the rankability of the third binary indicator was 0.056 across the 48 scenarios (ranging from 0.037 to 0.072). The mean standard deviation of the rankability of the composite ordinal indicator was 0.057 across the 48 scenarios (ranging from 0.032 to 0.078). Rankability of binary and ordinal indicators (equal prevalences). We conducted a series of simulations to examine whether combining three binary indicators reflecting outcomes with increasing severity, which individually had low or moderate rankability, could produce an ordinal indicator with high rankability. We found that this was feasible when the three binary indicators had at least moderate rankability and were strongly correlated with one another. When the binary indicators were independent or weakly correlated with one another, the rankability of the composite ordinal indicator was often less than that of at least one of its binary components. There is an increasing interest in many countries and jurisdictions in reporting on the quality and outcomes of health care delivery. Public reporting of hospital-specific performance on indicators of health care quality can lead to the production of 'league tables' in which hospitals are ranked according to their performance. The rankability of an indicator denotes its ability to allow for the accurate ranking of hospitals.
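As a small illustration of the three-indicator set-up described earlier in this section, the sketch below draws correlated hospital-specific random intercepts from a multivariate normal distribution, simulates three binary indicators of increasing severity, and pools them into a composite ordinal indicator that takes its highest category whenever the most severe indicator is present and its lowest when none is present. The number of ordinal levels (four), the prevalences, τ, and the within-hospital correlation are assumptions chosen for illustration and do not reproduce the article's exact design, which was implemented in R.

```python
# Illustrative construction of a composite ordinal indicator from three correlated
# binary indicators with hospital-level random effects (assumed parameter values).
import numpy as np

rng = np.random.default_rng(5)
n_hosp, n_per_hosp = 100, 200
tau = 0.6                     # SD of hospital random intercepts (same for all indicators)
rho = 0.7                     # within-hospital correlation of the three random effects
alpha0 = np.array([-3.0, -2.5, -2.0])   # intercepts: Y1 (most severe) is rarest

cov = tau**2 * (np.full((3, 3), rho) + (1 - rho) * np.eye(3))
re = rng.multivariate_normal(alpha0, cov, size=n_hosp)   # hospital-specific intercepts

hospital = np.repeat(np.arange(n_hosp), n_per_hosp)
p = 1 / (1 + np.exp(-re[hospital]))                      # (n_patients, 3) probabilities
Y = rng.binomial(1, p)                                   # Y[:, 0] = most severe indicator

# Composite ordinal indicator: highest category whenever the most severe indicator is
# present (regardless of the others), lowest category when none is present.
ordinal = np.select([Y[:, 0] == 1, Y[:, 1] == 1, Y[:, 2] == 1], [4, 3, 2], default=1)
print("category frequencies:", np.bincount(ordinal)[1:] / len(ordinal))
```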
Many indicators have been shown to have poor to moderate rankability. Our focus was on pooling binary indicators reflecting outcomes of increasing severity to create a composite ordinal indicator that described a gradient from lowest (least severe/serious) to highest (most severe/serious). We did not consider other methods of creating composite indicators, such as summing up the number of positive binary indicators. Such an approach would not necessarily preserve the ordering of severity present in the individual indicators. For instance, given three indicators of differing severity (e.g. death, readmission, and prolonged length of hospital stay), a subject who died (and who was not readmitted and who had a short length of hospital stay) and a subject who had a long hospital stay (but who did not die and who was not readmitted) would both have one positive indicator, yet they would have very different severity of the underlying binary indicators. Our composite ordinal indicator reflects this ordering of severity/seriousness, while counting the number of positive indicators would not. Our research has shown that rankability is increased when individual indicators are combined with other indicators with which they are highly correlated. Individual indicators underlying the same concepts of quality of care can thereby be combined to produce a more reliable ranking, with the added advantage of showing a more complete picture of quality of care. Indicators that are not correlated might represent other important quality domains, although their limited rankability should be taken into account in the interpretation of potential differences between hospitals. The finding that combining binary outcomes that are negatively correlated into an ordinal outcome decreases rankability is a result of violation of the proportional odds assumption. The proportional odds model assumes that the effect of the parameter of interest (in this case the hospital-specific random effects) on the outcome is comparable across the cut-offs of the ordinal scale. If the binary indicators are not correlated, this assumption is not satisfied: when a specific hospital has a low mortality rate (meaning a negative random effect estimate at one cut-off) but a high readmission rate (a positive random effect estimate at another cut-off), these random effect estimates average out. This reduces the variation of the hospital-specific random effects. To obtain a composite ordinal indicator with high rankability, the proportional odds assumption must therefore be met to some extent. In order for a composite indicator to provide information on which a hospital can take action, it would be reasonable to combine indicators that address aspects of health care quality for the same set of patients (e.g.
that pertain to the same surgical procedure or to the treatment of the same set of patients) Identifying indicators that satisfy these requirements may be challenging in some settings when binary indicators have low to moderate within-hospital correlation It is recommended that related binary indicators be combined in order to increase rankability which reflects that they represent the same concept of quality of care The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request Cardiac Surgery in New Jersey in 2002: A Consumer Report Annual Report of the California Hospital Outcomes Project California Office of Statewide Health Planning and Development Adult Coronary Artery Bypass Graft Surgery in the Commonwealth of Massachusetts: Fiscal Year 2010 Report Pennsylvania Health Care Cost Containment Council Consumer Guide to Coronary Artery Bypass Graft Surgery Focus on heart attack in Pennsylvania: research methods and results Coronary artery bypass graft surgery in New York State 1989-1991 the Cardiac Care Network Steering Committee Outcomes of Coronary Artery Bypass Surgery in Ontario Cardiovascular Health and Services in Ontario: An ICES Atlas Toronto: Institute for Clinical Evaluative Sciences; 1999 Acute Myocardial Infarction Outcomes in Ontario League Tables and Their Limitations: Statistical Issues in Comparisons of Institutional Performance van Houwelingen, H. C., Brand, R., and Louis, T. A. Empirical Bayes Methods for Monitoring Health Care Quality https://www.lumc.nl/sub/3020/att/EmpiricalBayes.pdf (Accessed May 8 Lingsma HF, Eijkemans MJ, Steyerberg EW. Incorporating natural variation into IVF clinic league tables: The Expected Rank. BMC.Med.Res.Methodol. 2009;9:53. https://doi.org/10.1186/1471-2288-9-53 Dimick JB, Staiger DO, Birkmeyer JD. Ranking hospitals on surgical mortality: the importance of reliability adjustment. Health ServRes. 2010;45(6 Pt 1):1614–29. https://doi.org/10.1111/j.1475-6773.2010.01158.x Verburg IW, de Keizer NF, Holman R, Dongelmans D, de Jonge E, Peek N. Individual and clustered Rankability of ICUs according to case-mix-adjusted mortality. Crit Care Med. 2016;44(5):901–9. https://doi.org/10.1097/CCM.0000000000001521 Reliability adjustment: a necessity for trauma center ranking and benchmarking Henneman D, van Bommel AC, Snijders A, Snijders HS, Tollenaar RA, Wouters MW, Fiocco M. Ranking and rankability of hospital postoperative mortality rates in colorectal cancer surgery. Ann.Surg. 2014;259(5):844–9. https://doi.org/10.1097/SLA.0000000000000561 van Dishoeck AM, Koek MB, Steyerberg EW, van Benthem BH, Vos MC, Lingsma HF. Use of surgical-site infection rates to rank hospital performance across several types of surgery. Br.J.Surg. 2013;100(5):628–36. https://doi.org/10.1002/bjs.9039 Lingsma HF, Steyerberg EW, Eijkemans MJ, Dippel DW, Scholte Op Reimer WJ, van Houwelingen HC. Comparing and ranking hospitals based on outcome: results from the Netherlands stroke survey. QJM. 2010;103(2):99–108. https://doi.org/10.1093/qjmed/hcp169 van Dishoeck AM, Lingsma HF, Mackenbach JP, Steyerberg EW. Random variation and rankability of hospitals using outcome indicators. BMJ Qual.Saf. 2011;20(10):869–74. https://doi.org/10.1136/bmjqs.2010.048058 Lawson EH, Ko CY, Adams JL, Chow WB, Hall BL. Reliability of evaluating hospital quality by colorectal surgical site infection type. Ann.Surg. 2013;258(6):994–1000. 
https://doi.org/10.1097/SLA.0b013e3182929178 Hofstede SN, Ceyisakar IE, Lingsma HF, Kringos DS, Marang-van de Mheen PJ. Ranking hospitals: do we gain reliability by using composite rather than individual indicators? BMJ Qual.Saf. 2019;28(2):94–102. https://doi.org/10.1136/bmjqs-2017-007669 Roozenbeek B, Lingsma HF, Perel P, Edwards P, Roberts I, Murray GD, Maas AI, Steyerberg EW. The added value of ordinal analysis in clinical trials: an example in traumatic brain injury. Crit Care. 2011;15(3):R127. https://doi.org/10.1186/cc10240 Bath PM, Gray LJ, Collier T, Pocock S, Carpenter J. Can we improve the statistical analysis of stroke trials? Statistical reanalysis of functional outcomes in stroke trials. Stroke. 2007;38(6):1911–5. https://doi.org/10.1161/STROKEAHA.106.474080 Multilevel analysis: an introduction to basic and advanced multilevel modeling Partitioning variation in generalised linear multilevel models Wu S, Crespi CM, Wong WK. Comparison of methods for estimating the intraclass correlation coefficient for binary responses in cancer prevention cluster randomized trials. Contemp.Clin.Trials. 2012;33(5):869–80. https://doi.org/10.1016/j.cct.2012.05.004 Download references which is funded by an annual grant from the Ontario Ministry of Health and Long-Term Care (MOHLTC) results and conclusions reported in this paper are those of the authors and are independent from the funding sources No endorsement by ICES or the Ontario MOHLTC is intended or should be inferred This research was supported by operating grant from the Canadian Institutes of Health Research (CIHR) (MOP 86508) Austin is supported in part by a Mid-Career Investigator award from the Heart and Stroke Foundation of Ontario and PM contributed to the design of the simulations PA coded the simulations and conducted the statistical analyses and PM contributed to revising the manuscript and PM read and approved the final manuscript The study consisted of Monte Carlo simulations that used simulated data No ethics approval or consent to participate was necessary Consent for publication was not required as only simulated data were used Download citation DOI: https://doi.org/10.1186/s12874-019-0769-x Metrics details This article has been updated Computed tomography (CT) is presently a standard procedure for the detection of distant metastases in patients with oesophageal or gastric cardia cancer We aimed to determine the additional diagnostic value of alternative staging investigations We included 569 oesophageal or gastric cardia cancer patients who had undergone CT neck/thorax/abdomen Sensitivity and specificity were first determined at an organ level (results of investigations and then at a patient level (results for combinations of investigations) considering that the detection of distant metastases is a contraindication to surgery we compared three strategies for each organ: CT alone CT plus another investigation if CT was negative for metastases (one-positive scenario) and CT plus another investigation if CT was positive but requiring that both were positive for a final positive result (two-positive scenario) life expectancy and quality adjusted life years (QALYs) were compared between different diagnostic strategies CT showed sensitivities for detecting metastases in celiac lymph nodes which was higher than the sensitivities of US abdomen (44% for celiac lymph nodes and 65% for liver metastases) US neck showed a higher sensitivity for the detection of malignant supraclavicular lymph nodes than CT (85 vs 28%) sensitivity for 
If only CT was performed, the sensitivity for detecting distant metastases was 66% and the specificity was 95%. A higher sensitivity (86%) was achieved when US neck was added to CT (one-positive scenario). This strategy resulted in lower costs compared with CT only, at an almost similar (quality-adjusted) life expectancy. Slightly higher specificities (97–99%) were achieved if liver and/or lung metastases found on CT were confirmed by US abdomen or chest X-ray, but these strategies had only slightly higher QALYs. The combination of CT neck/thorax/abdomen and US neck was most cost-effective for the detection of metastases in patients with oesophageal or gastric cardia cancer, whereas performing CT only resulted in a lower sensitivity for metastasis detection and higher costs. The limited additional value of the other staging investigations may be due to the low number of M1b celiac lymph nodes detected in this series. It remains to be determined whether the application of positron emission tomography will further increase the sensitivity and specificity of metastasis detection without jeopardising costs and QALYs.

The presence of distant metastases from oesophageal or gastric cardia cancer is usually investigated by more than one modality. CT neck/thorax/abdomen is a standard investigation, but it is unclear whether additional investigations such as US neck, US abdomen, EUS, and chest X-ray are also necessary for assessing the presence of distant metastases in these patients. We therefore aimed to determine the diagnostic value of US neck, US abdomen, EUS, and chest X-ray in addition to CT in patients with oesophageal or gastric cardia cancer. We evaluated these diagnostic procedures both at an organ level and at a patient level for the detection of metastases. The assumption was that the finding of distant metastases in patients with oesophageal or gastric cardia cancer would eliminate the option of a curative surgical treatment.

We used a prospectively collected database with information on 1088 patients with oesophageal or gastric cardia cancer who were diagnosed and treated between January 1994 and October 2003 at the Erasmus MC – University Medical Center Rotterdam. Data that were collected included general patient characteristics; information that was not present in the database but necessary for this study was obtained from the electronic hospital information system. We assessed which preoperative investigations had been performed in these 1088 patients. FNA was performed if the result could change the treatment decision; if multiple suspicious lesions were present, FNA of the most suspicious lesion was performed. The results of the investigations were compared with the gold standard, which was postoperative pathological TNM stage or a radiological finding in the relevant organ with ≥6 months of follow-up. Where applicable, the US neck or abdomen was repeated to determine whether a lesion could be found using the CT information and to evaluate whether FNA could be performed; we did not use the results of this repeated investigation but used the result of the initial US neck or abdomen. The numbers of false-positive and false-negative results of CT for the detection of metastases in the various organs were calculated.

The combined results were calculated twice. First, the result was considered positive for metastases if at least one of the two investigations performed for a particular organ was positive, and negative if both investigations were negative (one-positive scenario). This strategy uses the possible additional diagnostic information of the second investigation in case of a negative CT; if CT is positive, the result of the other investigation is irrelevant, because the final result will remain positive irrespective of the result of the other investigation. Second, the result was considered positive only if both CT and the other investigation were positive, and negative if at least one of the investigations was negative (two-positive scenario). This strategy uses an additional diagnostic investigation to confirm a positive CT finding; if CT is negative, performing another investigation is unnecessary, because the final result will remain negative irrespective of the result of the other investigation.
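The two combination rules translate directly into simple Boolean logic. The sketch below is illustrative only (the test results and gold standard are hypothetical, not study data); it shows how organ-level results under the one-positive and two-positive scenarios can be derived and scored.

```python
# Minimal sketch (hypothetical data, not from the study): combining CT with a
# second investigation for one organ under the two scenarios described above.
import numpy as np

ct     = np.array([1, 0, 1, 0, 1])   # hypothetical CT results (1 = metastasis suspected)
second = np.array([1, 1, 0, 0, 0])   # hypothetical second investigation (e.g. US)

one_positive = ct | second           # positive if at least one test is positive
two_positive = ct & second           # positive only if both tests are positive

def sens_spec(pred, truth):
    """Sensitivity and specificity of a 0/1 prediction against a 0/1 gold standard."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    se = pred[truth == 1].mean()
    sp = 1 - pred[truth == 0].mean()
    return se, sp

truth = np.array([1, 1, 0, 0, 1])    # hypothetical gold standard
print(sens_spec(one_positive, truth), sens_spec(two_positive, truth))
```

By construction, the one-positive rule can only increase sensitivity relative to CT alone, while the two-positive rule can only increase specificity.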
The numbers of false-positive and false-negative results were also calculated for the combination of CT and the other investigation.

In addition to analyses at the organ level, we considered analyses at the patient level. For these, we assessed whether distant metastases (M1b) were present in any of the investigated sites (liver, lung, supraclavicular lymph nodes, celiac lymph nodes) and whether a curative oesophageal resection should have been performed or not on the basis of combinations of staging investigations, using the data of the 264 patients who had undergone all investigations. The assumption was that an oesophageal resection should only be performed if no distant metastases are detected. Analogous to the organ-level analyses, CT alone and the one-positive and two-positive scenarios were considered for each of the four organ sites, so that 81 different combinations of investigations were possible (3 strategies for each of 4 organs, 3^4 = 81). Sensitivities and specificities for the detection of distant metastases at the patient level were calculated for each combination. We plotted sensitivity against one-specificity in a receiver operating characteristic (ROC) curve for a visual comparison of the accuracy of combinations of staging investigations, using the data of the 264 patients who had undergone all investigations. Sensitivity is the proportion of patients who are correctly identified as having distant metastases (true-positive results), and one-specificity is the proportion of patients in whom the gold standard is negative for distant metastases but who are incorrectly identified as positive by the staging investigation (false-positive results). ROC curves were made for the detection of distant metastases (M1b) with CT and with the combination of CT and another investigation (both the two-positive and one-positive scenarios) in one organ, whereas in the other organs we only included the CT result.

To assess whether both CT and US abdomen should be performed to determine whether liver metastases were present, we compared three different strategies: (1) the combination of CT and US abdomen in the two-positive scenario for the liver, and CT for the other organs; (2) the combination of CT and US abdomen in the one-positive scenario for the liver, and CT for the other organs; (3) CT for all organs. All P-values were based on two-sided tests of significance, and a P-value <0.05 was considered statistically significant.

Life expectancy was assumed to be 2.41 and 1.00 years for local/regional disease with and without resection, respectively, and 0.42 and 0.37 years for distant disease with and without resection. QALYs were estimated to be 1.45 and 0.70 for local/regional disease with and without resection, respectively. A cost-effectiveness plane was constructed in which the differences in costs between strategies (Δ costs) were plotted against the differences in QALYs (Δ QALY). Costs were expressed per $1000 (k$) for easier interpretation.
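The decision-analytic comparison can be sketched as follows. This is not the authors' model: the prevalence of distant metastases, the investigation and surgery costs, and the QALY values for distant disease are placeholder assumptions; only the QALY values for local/regional disease (1.45 with and 0.70 without resection) and the sensitivities and specificities of the CT-only and CT-plus-US-neck strategies are taken from the text.

```python
# Minimal decision-analytic sketch (assumptions flagged; not the paper's model):
# expected QALYs and costs per patient for a staging strategy, given its
# patient-level sensitivity/specificity. A positive staging result rules out resection.
def expected_outcomes(se, sp, prev, test_cost):
    q_loc_res, q_loc_nores = 1.45, 0.70      # from the text (local/regional disease)
    q_dist_res, q_dist_nores = 0.30, 0.28    # placeholder assumptions (distant disease)
    surgery_cost = 20.0                      # assumed cost of resection, in k$

    # Negatives are operated on: false negatives and true negatives undergo resection.
    p_resect = prev * (1 - se) + (1 - prev) * sp

    qaly = (prev * (se * q_dist_nores + (1 - se) * q_dist_res)
            + (1 - prev) * (sp * q_loc_res + (1 - sp) * q_loc_nores))
    cost = test_cost + p_resect * surgery_cost   # in k$
    return qaly, cost

# Hypothetical comparison: reference = CT + US neck (one-positive), alternative = CT only.
prev = 0.25                                   # assumed prevalence of distant metastases
q_ref, c_ref = expected_outcomes(se=0.86, sp=0.949, prev=prev, test_cost=0.55)
q_alt, c_alt = expected_outcomes(se=0.66, sp=0.95,  prev=prev, test_cost=0.40)
print(f"dQALY = {q_alt - q_ref:+.3f}, dCost = {c_alt - c_ref:+.2f} k$")
```

Plotting the (Δ QALY, Δ costs) pair of each strategy against the reference strategy yields the points of a marginal cost-effectiveness plane of the kind described above.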
In Table 1, patient and tumour characteristics are shown for all 569 patients who had undergone both CT neck/thorax/abdomen and at least one other investigation, for the 264 patients who had undergone all investigations, and for the 305 patients who had undergone some diagnostic investigations. χ2 testing revealed that the differences between the patients with all (n=264) or some (n=305) diagnostic investigations were not statistically significant.

In Table 2, the gold standard diagnoses are shown per organ. Positive gold standard diagnoses were confirmed by FNA or resection in the majority of cases (92/135), whereas such confirmation could not be used in the remaining cases. A reason for this was that several patients had two or more suspicious lesions and FNA had already been performed for one of these lesions, which confirmed the presence of a distant metastasis; FNA of the other suspicious lesions was therefore not indicated in these patients.

Sensitivity for the detection of liver metastases was higher for CT than for US abdomen, but this difference was not statistically significant (73 vs 65%, P=0.63; Table 3). Sensitivity for celiac lymph node metastases was higher for CT than for US abdomen (69 vs 44%). Sensitivity for supraclavicular lymph node metastases was higher for US neck than for CT (85 vs 28%). Sensitivity for lung metastases was slightly higher for CT than for chest X-ray, but this difference was not statistically significant (90 vs 68%). The results of EUS for the detection of malignant celiac lymph nodes were inferior to those of CT and US abdomen; EUS was therefore considered less relevant for the detection of distant metastases and was not included in the patient-level part of the analyses.

ROC curves for the detection of metastases with CT and with the combination of CT and another investigation (one-positive and two-positive scenarios) in one organ, with only the CT result included for the other organs, compared against the gold standard. Legend: CT only; combination of CT and another investigation for the investigated region, with a positive result if at least one investigation is positive (one-positive); with a positive result only if both investigations are positive (two-positive).

The sensitivity for detecting distant metastases was 66% and the specificity was 95% if only CT was performed for all organs (Table 4). Higher sensitivities and specificities could be obtained by the addition of one or more other staging investigations. The highest sensitivity was 86%, which could be obtained with 12 of the 81 different combinations of staging investigations; for 6 of these combinations the specificity was only slightly lower than that of CT alone (94.9%). The lowest number of investigations for a sensitivity of 86% and a specificity of 94.9% was required for the combination of CT plus US neck for the detection of supraclavicular lymph node metastases (one-positive scenario) and CT only for the detection of metastases in celiac lymph nodes, liver, and lung. A slightly higher specificity of 97% was achieved by the addition of US abdomen for liver metastases (two-positive scenario), and specificity increased further when chest X-ray (two-positive scenario) for the detection of lung metastases was also added. Sensitivity declined with increasing specificity, meaning that more patients would have undergone a curative treatment option in the presence of distant metastases (more false-negative results). The addition of US abdomen for the detection of malignant celiac lymph nodes did not result in better results; however, only 3/264 patients had M1b celiac lymph nodes, whereas 49 other patients had M1a celiac lymph nodes, which did not preclude a resection. The average results obtained from the data with imputation of missing values (n=569) were roughly equal to the results obtained from the complete data of patients who had undergone all staging investigations (n=264; Table 4).
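For the patient-level analysis, the 81 combinations can be enumerated programmatically. The sketch below uses randomly generated stand-in data (not the study dataset) purely to illustrate the bookkeeping: three strategies per organ site, with a patient classified as positive if any site is positive under the chosen strategy.

```python
# Minimal sketch (hypothetical data): enumerating the 81 strategy combinations
# (3 strategies for each of 4 organ sites) and computing patient-level
# sensitivity/specificity against a gold standard for distant metastases.
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
n_pat = 264
organs = ["celiac", "liver", "supraclavicular", "lung"]

ct     = {o: rng.integers(0, 2, n_pat) for o in organs}   # hypothetical CT results
second = {o: rng.integers(0, 2, n_pat) for o in organs}   # hypothetical second test
truth  = rng.integers(0, 2, n_pat)                        # hypothetical gold standard

def organ_result(o, strategy):
    if strategy == "ct_only":
        return ct[o]
    if strategy == "one_positive":
        return ct[o] | second[o]
    return ct[o] & second[o]          # "two_positive"

results = []
for combo in product(["ct_only", "one_positive", "two_positive"], repeat=len(organs)):
    patient_pos = np.zeros(n_pat, dtype=np.int64)
    for o, strat in zip(organs, combo):
        patient_pos = patient_pos | organ_result(o, strat)
    se = patient_pos[truth == 1].mean()
    sp = 1 - patient_pos[truth == 0].mean()
    results.append((combo, se, sp))

print(len(results), "combinations evaluated")   # 81
```

Plotting each combination's sensitivity against one-specificity reproduces the ROC-style comparison of staging strategies used in the study.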
Marginal cost-effectiveness plane calculated in patients with oesophageal or gastric cardia cancer who had undergone all staging investigations (n=264) and using the five completed data sets (n=569). The combination of CT and US neck for the detection of supraclavicular lymph node metastases (one-positive scenario) and CT for the detection of metastases in celiac lymph nodes, liver, and lung was considered the reference strategy. CT=computed tomography; CXR=chest X-ray; QALY=quality-adjusted life year; USa=ultrasound abdomen; USn=ultrasound neck.

Surgery is presently the only established curative treatment option for patients with oesophageal or gastric cardia cancer, but it carries a substantial risk of morbidity and mortality. Adequate staging is therefore of utmost importance to select patients without distant metastases for surgery. In this study, we assessed which traditional staging investigations should be performed in patients with oesophageal or gastric cardia cancer to determine whether distant metastases were present and, consequently, whether a curative resection should be performed. Our findings demonstrated that performing CT only was not sensitive enough for the detection of distant metastases. The addition of US neck to CT for the detection of supraclavicular lymph node metastases resulted in the highest sensitivity. A slightly higher specificity (fewer false positives) could be achieved by also adding US abdomen or chest X-ray, but this required that both CT and these investigations were positive for metastases to define the result as positive (two-positive scenario). A higher specificity would, however, result in a decline in sensitivity and consequently in more resections in patients with distant metastases. We recognise that requiring two staging procedures to be positive is not a common clinical strategy, although a confirmatory investigation is sometimes already used to verify a suspicion of metastases on CT. In the cost-effectiveness analysis, no single combination of investigations was more cost-effective than the combination of CT and US neck.

We included only patients who had undergone CT neck/thorax/abdomen and one or more other investigations; however, no statistically significant differences were found within the whole group of patients (n=569) according to whether all or some investigations had been performed. The optimal strategy to stage patients with oesophageal or gastric cardia cancer is not automatically the combination of CT and US neck, as the sensitivities and specificities of combinations of investigations largely depend on the quality of the staging investigations in a centre; this quality is determined by both the experience of the investigator and the quality of the equipment. Further studies are needed to determine the exact role of PET in the staging of oesophageal or gastric cardia cancer.

In conclusion, the combination of CT neck and US neck for the detection of supraclavicular lymph node metastases and CT thorax/abdomen for the detection of metastases in celiac lymph nodes, liver, and lung is a cost-effective strategy for the detection of distant metastases in patients with oesophageal or gastric cardia cancer. US abdomen and chest X-ray have only limited additional value in the detection of distant metastases in these patients. These staging investigations should only be performed for specific indications in patients with oesophageal or gastric cardia cancer, as the treatment decision is not improved in most patients if these investigations are added to the diagnostic work-up. The role of EUS for the detection of distant metastases also seems to be limited, which may be particularly due to the low number of M1b celiac lymph nodes in the present study.
We are grateful to Mrs Conny Vollebregt for collecting the data of the database. EPMV was funded by a grant from the 'Doelmatigheidsonderzoek' fund of Erasmus MC Rotterdam. Department of Gastroenterology and Hepatology, Erasmus MC – University Medical Center Rotterdam.