Generalized Stepwise Regression for Prediction Models in Clustered Data

Generalized stepwise regression for obtaining a prediction model that is validated with (stepwise) internal-external cross-validation, in order to obtain adequate performance across data sets. Requires data from individuals in multiple studies.

Usage

metapred(
  data,
  strata,
  formula,
  estFUN = "glm",
  scope = NULL,
  retest = FALSE,
  max.steps = 1000,
  center = FALSE,
  recal.int = FALSE,
  cvFUN = NULL,
  cv.k = NULL,
  metaFUN = NULL,
  meta.method = NULL,
  predFUN = NULL,
  perfFUN = NULL,
  genFUN = NULL,
  selFUN = "which.min",
  gen.of.perf = "first",
  ...
)

Arguments

data: data.frame containing the data. Note that metapred removes observations with missing data listwise for all variables in formula and scope, to ensure that the same data is used in each model in each step. The outcome variable should be numeric or coercible to numeric by as.numeric().
strata: Character. Name of the strata (e.g. studies or clusters) variable.
formula: formula of the first model to be evaluated. metapred will start at formula and update it using terms of scope. Defaults to full main effects model, where the first column in data is assumed to be the outcome and all remaining columns (except strata) predictors. See formula for formulas in general.
estFUN: Function for estimating the model in the first stage. Currently "lm", "glm" and "logistfirth" are supported.
scope: formula. The difference between formula and scope defines the range of models examined in the stepwise search. Defaults to NULL, which leads to the intercept-only model. If scope is nested in formula (as with the default intercept-only scope), backward selection is applied. If formula is nested in scope, forward selection is applied. If they are equal, no stepwise selection is applied.
retest: Logical. Should added (removed) terms be retested for removal (addition)? TRUE implies bi-directional stepwise search.
max.steps: Integer. Maximum number of steps (additions or removals of terms) to take. Defaults to 1000, which is essentially as many as it takes. 0 implies no stepwise selection.
center: Logical. Should numeric predictors be centered around the cluster mean?
recal.int: Logical. Should the intercept be recalibrated in each validation?
cvFUN: Cross-validation method, on the study (i.e. cluster or stratum) level. Options are "l1o" for leave-one-out cross-validation (default), "bootstrap" for bootstrap, or "fixed" for one or more data sets which are only used for validation. A user-written function may be supplied as well.
cv.k: Parameter for cvFUN. For cvFUN="bootstrap", this is the number of bootstraps. For cvFUN="fixed", this is a vector of the indices of the (sorted) data sets. Not used for cvFUN="l1o".
metaFUN: Function for computing the meta-analytic coefficient estimates in two-stage MA. By default, rma.uni from the metafor package is used. Default settings are univariate random effects, estimated with "REML". The method can be passed through the meta.method argument.
meta.method: Name of method for meta-analysis. Default is "REML". For more options see rma.uni.
predFUN: Function for predicting new values. Defaults to the predicted probability of the outcome, using the link function of glm() or lm().
perfFUN: Function for computing the performance of the prediction models. Default: mean squared error (perfFUN="mse", aka Brier score for binomial outcomes). Other options are "var.e" (variance of prediction error), "auc" (area under the curve), "cal_int" (calibration intercept), "cal_slope" (multiplicative calibration slope) and "cal_add_slope" (additive calibration slope), or a list of these, where only the first is used for model selection.
genFUN: Function or list of named functions for computing generalizability of the performance. Default: rema, the summary statistic of a random effects meta-analysis. Choose "rema_tau" for the heterogeneity estimate of a random effects meta-analysis, "abs_mean" for the (absolute) mean, or coefficient_of_variation for the coefficient of variation. If a list of these is supplied, only the first is used for model selection.
selFUN: Function for selecting the best model. Default: "which.min", i.e. the lowest value for genFUN. Should be set to "which.max" if high values for genFUN indicate a good model.
gen.of.perf: For which performance measures should generalizability measures be computed? "first" (default) for only the first. "respective" for matching the generalizability measure to the performance measure on the same location in the list. "factorial" for applying all generalizability measures to all performance estimates.
...: To pass arguments to estFUN (e.g. family = "binomial"), or to other FUNctions.
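
The roles of cvFUN, perfFUN, genFUN and selFUN can be illustrated with a minimal base-R sketch of leave-one-study-out (internal-external) cross-validation. It is not metapred's implementation: the simulated data, the helper iecv_mse() and all object names are hypothetical, and for brevity the training studies are pooled in a single glm() fit rather than combined by a two-stage meta-analysis.

```r
# Sketch only: simulated data, hypothetical names, one-stage pooling of training studies.
set.seed(1)
studies <- paste0("s", 1:4)
dat <- do.call(rbind, lapply(studies, function(s) {
  x <- rnorm(200)
  z <- rnorm(200)                               # pure noise predictor
  lp <- rnorm(1, mean = -1, sd = 0.3) + 0.8 * x # study-specific intercept
  data.frame(y = rbinom(200, 1, plogis(lp)), x = x, z = z, study = s)
}))

# Leave each study out in turn (like cvFUN = "l1o"), fit on the other studies,
# and compute the mean squared error in the held-out study (like perfFUN = "mse").
iecv_mse <- function(formula, data, strata) {
  ids <- as.character(unique(data[[strata]]))
  sapply(ids, function(s) {
    train <- data[data[[strata]] != s, ]
    test  <- data[data[[strata]] == s, ]
    fit   <- glm(formula, family = binomial, data = train)
    p     <- predict(fit, newdata = test, type = "response")
    obs   <- test[[all.vars(formula)[1]]]
    mean((obs - p)^2)
  })
}

candidates <- list(full = y ~ x + z, reduced = y ~ x)
perf <- lapply(candidates, iecv_mse, data = dat, strata = "study")

# Summarise performance across held-out studies (a simple mean stands in for genFUN)
gen <- sapply(perf, mean)

# Pick the candidate with the lowest summary value (like selFUN = "which.min")
best <- names(gen)[which.min(gen)]
```

Here perf holds one performance value per held-out study for each candidate model, gen reduces these to one number per model, and best names the model with the lowest value.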

Value

A list of class metapred, containing the final model in global.model, and the stepwise tree of estimates of the coefficients, performance measures and generalizability measures in stepwise.

Details

Use subset.metapred to obtain an individual prediction model from a metapred object.

Note that formula.changes is currently unordered; it does not represent the order of changes in the stepwise procedure.

metapred is still under development; use with care.

References

Debray TPA, Moons KGM, Ahmed I, Koffijberg H, Riley RD. A framework for developing, implementing, and evaluating clinical prediction models in an individual participant data meta-analysis. Stat Med. 2013;32(18):3158-80.

de Jong VMT, Moons KGM, Eijkemans MJC, Riley RD, Debray TPA. Developing more generalizable prediction models from pooled studies and large clustered data sets. Stat Med. 2021;40(15):3533-59.

Riley RD, Tierney JF, Stewart LA. Individual participant data meta-analysis: a handbook for healthcare research. Hoboken, NJ: Wiley; 2021. ISBN: 978-1-119-33372-2.

Schmid CH, Stijnen T, White IR. Handbook of meta-analysis. First edition. Boca Raton: Taylor and Francis; 2020. ISBN: 978-1-315-11940-3.

Author

Valentijn de Jong <Valentijn.M.T.de.Jong@gmail.com>

Examples

data(DVTipd)

if (FALSE) {
# Explore heterogeneity in intercept and association of 'ddimdich'
# (glmer() comes from the lme4 package)
library(lme4)
glmer(dvt ~ 0 + cluster + (ddimdich|study), family = binomial(), data = DVTipd)
}

# Scope
f <- dvt ~ histdvt + ddimdich + sex + notraum

# Internal-external cross-validation of a pre-specified model 'f'
fit <- metapred(DVTipd, strata = "study", formula = f, scope = f, family = binomial)
fit
#> Call: metapred(data = DVTipd, strata = "study", formula = f, scope = f, 
#>     family = binomial)
#> 
#> Started with model:
#> dvt ~ histdvt + ddimdich + sex + notraum
#> <environment: 0x559b1011bad8>
#> 
#> Generalizability:
#>   unchanged
#> 1 0.1484983
#> 
#> Cross-validation stopped after 0 steps, as no changes were requested. Final model:
#> Meta-analytic model of prediction models estimated in 4 strata. Coefficients: 
#> (Intercept)     histdvt    ddimdich         sex     notraum 
#>  -4.1180636   0.6174010   1.6962441   0.9647970   0.3761707 

# Let's try to simplify model 'f' in order to improve its external validity
metapred(DVTipd, strata = "study", formula = f, family = binomial)
#> Call: metapred(data = DVTipd, strata = "study", formula = f, family = binomial)
#> 
#> Started with model:
#> dvt ~ histdvt + ddimdich + sex + notraum
#> <environment: 0x559b1011bad8>
#> 
#> Generalizability:
#>   unchanged
#> 1 0.1484983
#> 
#> Generalizability:
#>   ddimdich   histdvt notraum      sex
#> 1 0.136086 0.1375105 0.12977 0.141173
#> 
#> Continued with model:
#> dvt ~ histdvt + ddimdich + sex
#> <environment: 0x559b1011bad8>
#> 
#> Generalizability:
#>    ddimdich   histdvt       sex
#> 1 0.1366828 0.1279623 0.1319755
#> 
#> Continued with model:
#> dvt ~ ddimdich + sex
#> <environment: 0x559b1011bad8>
#> 
#> Generalizability:
#>    ddimdich       sex
#> 1 0.1355548 0.1303254
#> 
#> Cross-validation stopped after 3 steps, as no improvement was possible. Final model:
#> Meta-analytic model of prediction models estimated in 4 strata. Coefficients: 
#> (Intercept)    ddimdich         sex 
#>  -3.6187987   1.7130967   0.8784071 
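
The printed steps can be retraced by hand. The sketch below assumes that each column in a "Generalizability" table is the value obtained after removing that term, and that a step is taken only if the best candidate improves on the current model (lower is better, as with selFUN = "which.min"). The helper choose_step() is hypothetical and not part of metapred; the numbers are copied from the output above.

```r
# Hypothetical helper: return the term whose removal gives the lowest value,
# or NA if no candidate improves on the current model.
choose_step <- function(candidates, current) {
  best <- which.min(candidates)
  if (candidates[best] < current) names(candidates)[best] else NA_character_
}

# Values copied from the printed output
start <- 0.1484983
step1 <- c(ddimdich = 0.136086, histdvt = 0.1375105, notraum = 0.12977, sex = 0.141173)
step2 <- c(ddimdich = 0.1366828, histdvt = 0.1279623, sex = 0.1319755)
step3 <- c(ddimdich = 0.1355548, sex = 0.1303254)

drop1 <- choose_step(step1, start)         # term removed in step 1
drop2 <- choose_step(step2, step1[drop1])  # term removed in step 2
drop3 <- choose_step(step3, step2[drop2])  # NA: no further improvement
```

Under these assumptions the sketch removes notraum, then histdvt, and then stops, which matches the final model dvt ~ ddimdich + sex.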

# We can also try to build a generalizable model from scratch

if (FALSE) {
# Some additional examples:
metapred(DVTipd, strata = "study", formula = dvt ~ 1, scope = f, family = binomial) # Forwards
metapred(DVTipd, strata = "study", formula = f, scope = f, family = binomial) # no selection
metapred(DVTipd, strata = "study", formula = f, max.steps = 0, family = binomial) # no selection
metapred(DVTipd, strata = "study", formula = f, recal.int = TRUE, family = binomial)
metapred(DVTipd, strata = "study", formula = f, meta.method = "REML", family = binomial)
}
# By default, metapred assumes the first column is the outcome.
newdat <- data.frame(dvt=0, histdvt=0, ddimdich=0, sex=1, notraum=0)
fitted <- predict(fit, newdata = newdat)
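
As a rough check, the prediction for newdat can be approximated by hand. This sketch assumes that the default prediction is the inverse logit of the linear predictor built from the meta-analytic coefficients, in line with the description of predFUN; it is not taken from metapred's source. The coefficients are copied from the printed output of 'fit', and the object names beta, X, lp and p_hat are illustrative.

```r
# Coefficients copied from the printed output of 'fit'
beta <- c("(Intercept)" = -4.1180636, histdvt = 0.6174010, ddimdich = 1.6962441,
          sex = 0.9647970, notraum = 0.3761707)

newdat <- data.frame(dvt = 0, histdvt = 0, ddimdich = 0, sex = 1, notraum = 0)

# Design matrix for the pre-specified model, with columns matched to the coefficients
X <- model.matrix(dvt ~ histdvt + ddimdich + sex + notraum, data = newdat)

lp    <- drop(X %*% beta[colnames(X)])  # linear predictor on the logit scale
p_hat <- plogis(lp)                     # predicted probability of the outcome
```

If the assumption holds, p_hat should be close to the value returned by predict(fit, newdata = newdat).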