BACKGROUND: We investigate which procedure selects the most trustworthy predictive model to explain the effect of an intervention and support decision-making.
METHODS: We study a large variety of model selection procedures in practical settings: finite samples and no theoretical assumption of well-specified models. Beyond standard cross-validation or internal validation procedures, we also study elaborate causal risks. These build proxies of the causal error using "nuisance" reweighting so that it can be computed on the observed data. We evaluate whether empirically estimated nuisances, which are necessarily noisy, add noise to model selection, and we compare different metrics for causal model selection in an extensive empirical study comprising a simulation and three health-care datasets built on real covariates.
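As an illustration of such a reweighted proxy (the notation below is ours, not part of the abstract): for an outcome $y$, a binary treatment $a$, covariates $x$, a candidate outcome model $f$, and an estimated propensity score $\hat e(x) \approx \mathbb{P}(A=1 \mid X=x)$, one propensity-weighted version of the mean squared error can be sketched as
$$
\widehat{\mu\text{-risk}}_{IPW}(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \left(\frac{a_i}{\hat e(x_i)} + \frac{1-a_i}{1-\hat e(x_i)}\right)\big(y_i - f(x_i, a_i)\big)^2 ,
$$
where the weights use the estimated propensity score as a nuisance so that the observed-data error mimics the causal error.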
RESULTS: Among all metrics, the mean squared error, classically used to evaluate predictive models, performs worst. Reweighting it with a propensity score does not bring much improvement in most cases. On average, the $R\text{-risk}$, which uses as nuisances a model of the mean outcome and a propensity-score model, leads to the best performance. Nuisance corrections are best estimated with flexible estimators such as a super learner.
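A sketch of the $R\text{-risk}$ under the same illustrative notation (with $\hat m(x) \approx \mathbb{E}[Y \mid X=x]$ the estimated mean outcome, $\hat e(x)$ the estimated propensity score, and $\hat\tau_f(x) = f(x,1) - f(x,0)$ the treatment effect implied by the candidate model $f$):
$$
\widehat{R\text{-risk}}(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \Big[\big(y_i - \hat m(x_i)\big) - \big(a_i - \hat e(x_i)\big)\,\hat\tau_f(x_i)\Big]^2 ,
$$
so that both nuisances, the mean-outcome model and the propensity score, enter the evaluation metric.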
CONCLUSIONS: When predictive models are used to explain the effect of an intervention, they must be evaluated with different procedures than in standard predictive settings, using the $R\text{-risk}$ from causal inference.