Skip to contents

Generalized stepwise regression for obtaining a prediction model that is validated with (stepwise) internal-external cross-validation, in or to obtain adequate performance across data sets. Requires data from individuals in multiple studies.

Usage

metapred(
  data,
  strata,
  formula,
  estFUN = "glm",
  scope = NULL,
  retest = FALSE,
  max.steps = 1000,
  center = FALSE,
  recal.int = FALSE,
  cvFUN = NULL,
  cv.k = NULL,
  metaFUN = NULL,
  meta.method = NULL,
  predFUN = NULL,
  perfFUN = NULL,
  genFUN = NULL,
  selFUN = "which.min",
  gen.of.perf = "first",
  ...
)

Arguments

data

data.frame containing the data. Note that metapred removes observations with missing data listwise for all variables in formula and scope, to ensure that the same data is used in each model in each step. The outcome variable should be numeric or coercible to numeric by as.numeric().

strata

Character to specify the name of the strata (e.g. studies or clusters) variable

formula

formula of the first model to be evaluated. metapred will start at formula and update it using terms of scope. Defaults to full main effects model, where the first column in data is assumed to be the outcome and all remaining columns (except strata) predictors. See formula for formulas in general.

estFUN

Function for estimating the model in the first stage. Currently "lm", "glm" and "logistfirth" are supported.

scope

formula. The difference between formula and scope defines the range of models examined in the stepwise search. Defaults to NULL, which leads to the intercept-only model. If scope is not nested in formula, this implies backwards selection will be applied (default). If scope is nested in formula, this implies forward selection will be applied. If equal, no stepwise selection is applied.

retest

Logical. Should added (removed) terms be retested for removal (addition)? TRUE implies bi-directional stepwise search.

max.steps

Integer. Maximum number of steps (additions or removals of terms) to take. Defaults to 1000, which is essentially as many as it takes. 0 implies no stepwise selection.

center

logical. Should numeric predictors be centered around the cluster mean?

recal.int

Logical. Should the intercept be recalibrated in each validation?

cvFUN

Cross-validation method, on the study (i.e. cluster or stratum) level. "l1o" for leave-one-out cross-validation (default). "bootstrap" for bootstrap. Or "fixed", for one or more data sets which are only used for validation. A user written function may be supplied as well.

cv.k

Parameter for cvFUN. For cvFUN="bootstrap", this is the number of bootstraps. For cvFUN="fixed", this is a vector of the indices of the (sorted) data sets. Not used for cvFUN="l1o".

metaFUN

Function for computing the meta-analytic coefficient estimates in two-stage MA. By default, rma.uni, from the metafor package is used. Default settings are univariate random effects, estimated with "REML". Method can be passed trough the meta.method argument.

meta.method

Name of method for meta-analysis. Default is "REML". For more options see rma.uni.

predFUN

Function for predicting new values. Defaults to the predicted probability of the outcome, using the link function of glm() or lm().

perfFUN

Function for computing the performance of the prediction models. Default: mean squared error (perfFUN="mse", aka Brier score for binomial outcomes).Other options are "var.e" (variance of prediction error), "auc" (area under the curve), "cal_int" (calibration intercept), and "cal_slope" (multiplicative calibration slope) and "cal_add_slope" (additive calibration slope), or a list of these, where only the first is used for model selection.

genFUN

Function or list of named functions for computing generalizability of the performance. Default: rema, summary statistic of a random effects meta-analysis. Choose "rema_tau" for heterogeneity estimate of a random effects meta-analysis, genFUN="abs_mean" for (absolute) mean, coefficient_of_variation for the coefficient of variation. If a list containing these, only the first is used for model selection.

selFUN

Function for selecting the best method. Default: lowest value for genFUN. Should be set to "which.max" if high values for genFUN indicate a good model.

gen.of.perf

For which performance measures should generalizability measures be computed? "first" (default) for only the first. "respective" for matching the generalizability measure to the performance measure on the same location in the list. "factorial" for applying all generalizability measures to all performance estimates.

...

To pass arguments to estFUN (e.g. family = "binomial"), or to other FUNctions.

Value

A list of class metapred, containing the final model in global.model, and the stepwise tree of estimates of the coefficients, performance measures, generalizability measures in stepwise.

Details

Use subset.metapred to obtain an individual prediction model from a metapred object.

Note that formula.changes is currently unordered; it does not represent the order of changes in the stepwise procedure.

metapred is still under development, use with care.

References

Debray TPA, Moons KGM, Ahmed I, Koffijberg H, Riley RD. A framework for developing, implementing, and evaluating clinical prediction models in an individual participant data meta-analysis. Stat Med. 2013;32(18):3158-80.

de Jong VMT, Moons KGM, Eijkemans MJC, Riley RD, Debray TPA. Developing more generalizable prediction models from pooled studies and large clustered data sets. Stat Med. 2021;40(15):3533–59.

Riley RD, Tierney JF, Stewart LA. Individual participant data meta-analysis: a handbook for healthcare research. Hoboken, NJ: Wiley; 2021. ISBN: 978-1-119-33372-2.

Schmid CH, Stijnen T, White IR. Handbook of meta-analysis. First edition. Boca Raton: Taylor and Francis; 2020. ISBN: 978-1-315-11940-3.

See also

forest.metapred for generating a forest plot of prediction model performance

Author

Valentijn de Jong <Valentijn.M.T.de.Jong@gmail.com>

Examples

data(DVTipd)

if (FALSE) { # \dontrun{
# Explore heterogeneity in intercept and assocation of 'ddimdich'
glmer(dvt ~ 0 + cluster + (ddimdich|study), family = binomial(), data = DVTipd)
} # }

# Scope
f <- dvt ~ histdvt + ddimdich + sex + notraum

# Internal-external cross-validation of a pre-specified model 'f'
fit <- metapred(DVTipd, strata = "study", formula = f, scope = f, family = binomial)
fit
#> Call: metapred(data = DVTipd, strata = "study", formula = f, scope = f, 
#>     family = binomial)
#> 
#> Started with model:
#> dvt ~ histdvt + ddimdich + sex + notraum
#> <environment: 0x55b0cdf1b1e8>
#> 
#> Generalizability:
#>   unchanged
#> 1 0.1484983
#> 
#> Cross-validation stopped after 0 steps, as no changes were requested. Final model:
#> Meta-analytic model of prediction models estimated in 4 strata. Coefficients: 
#> (Intercept)     histdvt    ddimdich         sex     notraum 
#>  -4.1180636   0.6174010   1.6962441   0.9647970   0.3761707 

# Let's try to simplify model 'f' in order to improve its external validity
# Estimate multiple performance measures and generalizability functions simultaneously.
# The first performance and generalizabilty functions are used for model selection.
fit2 <- metapred(DVTipd, strata = "study", formula = f, scope = f, family = binomial, 
perfFUN = list("mse", "auc", "calibration_intercept", "calibration_slope"),
genFUN = list("rema", "rema.tau"),
gen.of.perf = "factorial")
#> Waiting for profiling to be done...
#> Waiting for profiling to be done...
#> Waiting for profiling to be done...
#> Waiting for profiling to be done...

# Use perf() to get performance estimates per stratum. 0 to get all. 
# Performance measures may also be selected by name.
perf(fit2, perfFUN = 0) 
#> $mse
#>         val.strata  estimate         se          var      ci.lb     ci.ub
#> b, c, d          a 0.1382770 0.02183757 0.0004768795 0.09511596 0.1814381
#> a, c, d          b 0.1796124 0.02738163 0.0007497535 0.12529467 0.2339302
#> a, b, d          c 0.1235316 0.02782818 0.0007744077 0.06840936 0.1786539
#> a, b, c          d 0.1515685 0.01585625 0.0002514208 0.12020968 0.1829273
#>             measure   n   class
#> b, c, d Performance 146 mp.perf
#> a, c, d Performance 102 mp.perf
#> a, b, d Performance 116 mp.perf
#> a, b, c Performance 136 mp.perf
#> 
#> $auc
#>         val.strata  estimate         se         var     ci.lb     ci.ub
#> b, c, d          a 0.7724868 0.04382224 0.001920388 0.6865968 0.8583768
#> a, c, d          b 0.8261728 0.03796369 0.001441242 0.7517654 0.9005803
#> a, b, d          c 0.6528125 0.06875306 0.004726984 0.5180590 0.7875660
#> a, b, c          d 0.6945565 0.07988254 0.006381220 0.5379896 0.8511233
#>             measure   n class
#> b, c, d Performance 146   auc
#> a, c, d Performance 102   auc
#> a, b, d Performance 116   auc
#> a, b, c Performance 136   auc
#> 
#> $calibration_intercept
#>         val.strata   estimate        se        var       ci.lb      ci.ub
#> b, c, d          a  0.5489667 0.2231726 0.04980600  0.09306444  0.9711619
#> a, c, d          b  0.9479472 0.2408201 0.05799434  0.46157190  1.4090583
#> a, b, d          c  1.0462912 0.2799524 0.07837332  0.46250622  1.5668693
#> a, b, c          d -1.7469612 0.3184523 0.10141187 -2.42009937 -1.1610224
#>             measure   n class
#> b, c, d Performance 146 recal
#> a, c, d Performance 102 recal
#> a, b, d Performance 116 recal
#> a, b, c Performance 136 recal
#> 
#> $calibration_slope
#>         val.strata  estimate        se        var       ci.lb     ci.ub
#> b, c, d          a 1.3964493 0.3573210 0.12767827  0.68995356 2.1029449
#> a, c, d          b 1.8168089 0.4361715 0.19024558  0.95298610 2.6806318
#> a, b, d          c 0.5415312 0.3451164 0.11910532 -0.14389220 1.2269546
#> a, b, c          d 0.4283140 0.2316077 0.05364211 -0.03258457 0.8892125
#>             measure   n   class
#> b, c, d Performance 146 mp.perf
#> a, c, d Performance 102 mp.perf
#> a, b, d Performance 116 mp.perf
#> a, b, c Performance 136 mp.perf
#> 

# Use gen() to get generalizability estimates. 0 to get all. 
# Generalizability estimates may also be selected by name.
gen(fit2, genFUN = 0)
#>         rema     rema.tau         rema     rema.tau         rema     rema.tau 
#> 1.484983e-01 2.147531e-06 7.493582e-01 9.149179e-02 2.093243e-01 1.616255e+00 
#>         rema     rema.tau 
#> 9.897523e-01 3.148730e-01 

# Use ma() to perform a (new) meta-analysis of performance estimates.
ma(fit2, perfFUN = "auc")
#>         est        se     ci.lb     ci.ub       tau2   se.tau2     pi.lb
#> 1 0.7493582 0.2086202 0.6061769 0.8530988 0.09149179 0.1454303 0.3808868
#>       pi.ub method test
#> 1 0.9356058   REML knha
ma(fit2, perfFUN = "calibration_intercept")
#>         est        se     ci.lb    ci.ub     tau2  se.tau2     pi.lb    pi.ub
#> 1 0.2093243 0.6535412 -1.870536 2.289184 1.616255 1.378178 -5.941169 6.359817
#>   method test
#> 1   REML knha
ma(fit2, perfFUN = "calibration_slope")
#>         est        se       ci.lb    ci.ub     tau2   se.tau2     pi.lb   pi.ub
#> 1 0.9897523 0.3311921 -0.06424865 2.043753 0.314873 0.3528566 -1.813786 3.79329
#>   method test
#> 1   REML knha

# We can also try to build a generalizable model from scratch

if (FALSE) { # \dontrun{
# Some additional examples:
metapred(DVTipd, strata = "study", formula = dvt ~ 1, scope = f, family = binomial) # Forwards
metapred(DVTipd, strata = "study", formula = f, scope = f, family = binomial) # no selection
metapred(DVTipd, strata = "study", formula = f, max.steps = 0, family = binomial) # no selection
metapred(DVTipd, strata = "study", formula = f, recal.int = TRUE, family = binomial)
metapred(DVTipd, strata = "study", formula = f, meta.method = "REML", family = binomial)
} # }
# By default, metapred assumes the first column is the outcome.
newdat <- data.frame(dvt=0, histdvt=0, ddimdich=0, sex=1, notraum=0)
fitted <- predict(fit, newdata = newdat)