Imputation via Joint Modelling

Thomas Debray, PhD

It is common to assume that all data follow a multivariate Normal or Student-t distribution:

\((x_1, x_2, y) \sim \mathrm{MVN}(\mu, \Sigma)\)

For now, we consider that

  • The mean vector \(\mu\) is known
  • The covariance matrix \(\Sigma\) is known

How to use the conditional mean?

  • \(\mu_1^*\), the conditional mean of \(x_1\) given \(x_2\) and \(y\), is the most likely value for \(x_1\), and can thus be used as imputation.

  • \(\mu_1^*\) is equal to the mean of \(x_1\) plus an adjustment.

    • If \(x_1\) is only weakly correlated with \(x_2\) and \(y\), the adjustment is small and the best guess for \(x_1\) is close to the mean of \(x_1\).
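
For reference, the adjustment can be written out using the standard result for the multivariate Normal distribution. The notation below is introduced here for illustration and does not appear on the slides: \(z = (x_2, y)\) collects the observed variables, \(\mu_z\) is their mean vector, \(\Sigma_{1z}\) holds the covariances between \(x_1\) and \(z\), and \(\Sigma_{zz}\) is the covariance matrix of \(z\).

\[
\mu_1^* = \mathrm{E}(x_1 \mid x_2, y) = \mu_1 + \Sigma_{1z} \, \Sigma_{zz}^{-1} \, (z - \mu_z)
\]

When \(\Sigma_{1z}\) is close to zero, the second term vanishes and \(\mu_1^*\) reduces to \(\mu_1\).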

Multivariate distribution of patients presenting with lower respiratory tract infections in primary care:

  • Age, years: mean = 52, SD = 16
  • C-reactive protein: mean = 53, SD = 62
  • Temperature, °C: mean = 37.5, SD = 0.8
  • Cor(Age, CRP) = 0.09
  • Cor(Age, Temperature) = -0.15
  • Cor(CRP, Temperature) = 0.25
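
As a minimal sketch, the numbers above can be turned into a covariance matrix and used to compute a conditional mean. The patient (age 70, temperature 39.0 °C, CRP missing) is hypothetical and chosen only for illustration, and the variable names are assumptions.

```python
import numpy as np

# Means, SDs and correlations as listed above (order: age, CRP, temperature)
mu = np.array([52.0, 53.0, 37.5])
sd = np.array([16.0, 62.0, 0.8])
corr = np.array([
    [1.00, 0.09, -0.15],
    [0.09, 1.00, 0.25],
    [-0.15, 0.25, 1.00],
])

# Covariance matrix: Sigma[i, j] = corr[i, j] * sd[i] * sd[j]
sigma = corr * np.outer(sd, sd)

# Hypothetical patient: age 70, temperature 39.0, CRP missing
target = 1                 # index of CRP
others = [0, 2]            # indices of age and temperature
observed = np.array([70.0, 39.0])

# Regression weights of CRP on (age, temperature): Sigma_zz^{-1} Sigma_z1
weights = np.linalg.solve(sigma[np.ix_(others, others)], sigma[others, target])

# Conditional mean = mean of CRP plus the adjustment
crp_conditional_mean = mu[target] + weights @ (observed - mu[others])
print(round(float(crp_conditional_mean), 1))
```

Both weights are positive here, so a patient who is older and warmer than average receives an imputed CRP above the population mean of 53.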

How to use the conditional variance?

  • \(\mathrm{Var}(x_1 \mid x_2, y)\) quantifies the uncertainty that remains about \(x_1\) once \(x_2\) and \(y\) are known, and can be used to draw multiple imputations. In particular, we can sample an imputed value from a Normal distribution with mean \(\mu_1^*\) and variance \(\mathrm{Var}(x_1 \mid x_2, y)\).

  • \(\mathrm{Var}(x_1 \mid x_2, y)\) is equal to the variance of \(x_1\) minus an adjustment. If \(x_1\) is only weakly correlated with \(x_2\) and \(y\), the adjustment is small and the variance of the imputed values for \(x_1\) is close to the variance of \(x_1\) in the original population.
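
The following sketch shows one way to compute the conditional mean and variance and then draw several imputed values. It relies on the standard multivariate Normal result that the conditional variance equals the variance of the target minus \(\Sigma_{1z} \Sigma_{zz}^{-1} \Sigma_{z1}\). The function name `conditional_distribution` and the two-variable toy example are assumptions made for illustration.

```python
import numpy as np

def conditional_distribution(mu, sigma, target, observed_values):
    """Conditional mean and variance of one variable given all the others.

    mu: mean vector; sigma: covariance matrix;
    target: index of the variable to impute;
    observed_values: values of the remaining variables, in index order.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    others = [j for j in range(len(mu)) if j != target]
    sigma_to = sigma[target, others]            # Cov(target, others)
    sigma_oo = sigma[np.ix_(others, others)]    # Cov(others, others)
    weights = np.linalg.solve(sigma_oo, sigma_to)
    deviation = np.asarray(observed_values, dtype=float) - mu[others]
    cond_mean = mu[target] + weights @ deviation
    # Variance of the target minus the part explained by the other variables
    cond_var = sigma[target, target] - weights @ sigma_to
    return cond_mean, cond_var

# Toy example: two standardized variables with correlation 0.8
rng = np.random.default_rng(seed=1)
m, v = conditional_distribution([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]],
                                target=0, observed_values=[1.5])
# Five imputed values for the same missing entry
draws = rng.normal(loc=m, scale=np.sqrt(v), size=5)
```

In the toy example the conditional variance is 1 - 0.8² = 0.36, so the draws scatter less widely than the marginal distribution of the variable.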

So, how to generate an imputed dataset?

An iterative procedure is needed:

  • Estimate \(\mu\) and \(\Sigma\)
  • Use \(\mu\) and \(\Sigma\) to impute the missing values
  • Update estimates of \(\mu\) and \(\Sigma\) using the imputed values
  • Continue until estimates of \(\mu\) and \(\Sigma\) stabilize

This approach is known as the Gibbs sampler

A natural choice for the initial estimates of \(\mu\) and \(\Sigma\) is to derive them from the complete cases (records without missing values) only.
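
The iterative procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not a full implementation: missing values are coded as NaN, the hypothetical function `impute_mvn` starts from complete-case estimates, and the mean vector and covariance matrix are simply re-estimated from the completed data in each cycle. A proper Gibbs sampler would instead draw them from their posterior distribution.

```python
import numpy as np

def impute_mvn(data, n_cycles=200, seed=0):
    """Simplified iterative imputation under a multivariate Normal model."""
    rng = np.random.default_rng(seed)
    x = np.array(data, dtype=float)
    missing = np.isnan(x)
    complete_rows = ~missing.any(axis=1)

    # Initial estimates from the complete cases only
    mu = x[complete_rows].mean(axis=0)
    sigma = np.cov(x[complete_rows], rowvar=False)

    for _ in range(n_cycles):
        # Impute: draw each row's missing part given its observed part
        for i in np.flatnonzero(missing.any(axis=1)):
            m, o = missing[i], ~missing[i]
            if o.any():
                w = np.linalg.solve(sigma[np.ix_(o, o)], sigma[np.ix_(o, m)])
                cond_mean = mu[m] + w.T @ (x[i, o] - mu[o])
                cond_cov = sigma[np.ix_(m, m)] - sigma[np.ix_(m, o)] @ w
                cond_cov = (cond_cov + cond_cov.T) / 2  # guard against rounding
            else:
                cond_mean, cond_cov = mu, sigma  # nothing observed in this row
            x[i, m] = rng.multivariate_normal(cond_mean, cond_cov)

        # Update: re-estimate mean vector and covariance matrix
        mu = x.mean(axis=0)
        sigma = np.cov(x, rowvar=False)

    return x, mu, sigma
```

Calling `impute_mvn` several times with different seeds would give several completed datasets. Observed values are never altered, only the entries that were missing.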

Illustration of the Gibbs sampler

  • We impute missing values using the new estimates for \(\mu\) and \(\Sigma\), and re-estimate \(\mu\) and \(\Sigma\).
  • We iterate this procedure many times.
  • Eventually, we should obtain reliable estimates for \(\mu\) and \(\Sigma\), and thus also obtain imputations that properly reflect their uncertainty.

How to ensure that we are sampling from the posterior distribution?

  • Allow for sufficient imputation cycles!

  • Repeat the whole process from different starting points

    • E.g., estimate the initial version of \(\mu\) and \(\Sigma\) in a dataset where all missing values have been replaced by a random value
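
One possible way to generate such a random starting point is sketched below. The slides do not specify how the random values are chosen. Sampling them from the observed values of the same variable is an assumption made here, as are the function name `random_start` and the coding of missing values as NaN. Each variable is assumed to have at least one observed value.

```python
import numpy as np

def random_start(data, seed):
    """Initial mean vector and covariance matrix from a randomly filled dataset."""
    rng = np.random.default_rng(seed)
    x = np.array(data, dtype=float)
    for j in range(x.shape[1]):
        miss = np.isnan(x[:, j])
        observed = x[~miss, j]
        # Assumption: fill each gap with a randomly chosen observed value
        x[miss, j] = rng.choice(observed, size=int(miss.sum()))
    return x.mean(axis=0), np.cov(x, rowvar=False)

# Different seeds give different starting points for separate runs
toy = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0], [2.0, 1.0]])
starts = [random_start(toy, seed=s) for s in range(3)]
```

If runs started from such different points end up with similar estimates, that supports the conclusion that enough imputation cycles were used.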

Final considerations

  • Normality assumptions may not always be realistic

    • Non-continuous data
    • Skewed data
    • Nonlinear data
    • Clustered data
  • Several extensions have been proposed to accommodate these situations