Imputation via Joint Modelling

Thomas Debray, PhD

It is common to assume that all data follow a multivariate Normal or Student-t distribution:

\((x_1, x_2, y) \sim \mathrm{MVN}(\mu, \Sigma)\)

For now, we consider that

  • The mean vector μ is known
  • The covariance matrix Σ is known

How to use the conditional mean?

  • \(πœ‡_1^βˆ—\) is the most likely value for \(π‘₯_1\), and can thus be used as imputation.

  • \(πœ‡_1^βˆ—\) is equal to the mean of \(π‘₯_1\) plus an adjustment.

    • If there is little correlation between the predictors and outcomes, then the best guess for \(π‘₯_1\) is the mean of \(π‘₯_1\).
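
For reference, the adjustment comes from the standard conditioning result for the multivariate Normal distribution. Writing \(\mathbf{z} = (x_2, y)\) for the observed variables and partitioning \(\mu\) and \(\Sigma\) accordingly:

\[
\mu_1^* = \mathrm{E}(x_1 \mid \mathbf{z}) = \mu_1 + \Sigma_{1z} \Sigma_{zz}^{-1} (\mathbf{z} - \mu_z)
\]

The term \(\Sigma_{1z} \Sigma_{zz}^{-1} (\mathbf{z} - \mu_z)\) is the adjustment: it vanishes when \(x_1\) is uncorrelated with the observed variables, leaving only the marginal mean \(\mu_1\).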

Example

Multivariate distribution of patients presenting with lower respiratory tract infections in primary care:

  • Age, years: mean = 52, SD = 16
  • C-reactive protein (CRP): mean = 53, SD = 62
  • Temperature, °C: mean = 37.5, SD = 0.8
  • Cor(Age, CRP) = 0.09
  • Cor(Age, Temperature) = -0.15
  • Cor(CRP, Temperature) = 0.25
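
As a worked sketch, the conditional mean for a missing CRP value can be computed from the figures above using standard MVN conditioning. The specific patient values (age 70 years, temperature 38.5 °C) are hypothetical and chosen only for illustration:

```python
import numpy as np

# Means, SDs, and correlations from the example above (order: Age, CRP, Temperature)
mu = np.array([52.0, 53.0, 37.5])
sd = np.array([16.0, 62.0, 0.8])
R = np.array([[ 1.00, 0.09, -0.15],
              [ 0.09, 1.00,  0.25],
              [-0.15, 0.25,  1.00]])
Sigma = np.outer(sd, sd) * R  # covariance matrix from SDs and correlations

# Hypothetical patient: CRP (index 1) missing; Age = 70 and Temperature = 38.5 observed
m, o = [1], [0, 2]                       # missing / observed indices
z = np.array([70.0, 38.5])
S_oo = Sigma[np.ix_(o, o)]
S_mo = Sigma[np.ix_(m, o)]
mu_star = mu[m] + S_mo @ np.linalg.solve(S_oo, z - mu[o])
print(round(float(mu_star[0]), 1))       # conditional mean of CRP, about 83.0
```

Note how the adjustment pulls the imputation well above the marginal mean of 53, driven mainly by the elevated temperature.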

How to use the conditional variance?

  • \(\mathrm{Var}(x_1 \mid x_2, y)\) quantifies the remaining uncertainty about \(x_1\) once \(x_2\) and \(y\) are known, and can be used to draw multiple imputations. In particular, we can sample an imputed value from a Normal distribution with mean \(\mu_1^*\) and variance \(\mathrm{Var}(x_1 \mid x_2, y)\).

  • \(\mathrm{Var}(x_1 \mid x_2, y)\) is equal to the variance of \(x_1\) minus an adjustment. If there is little correlation between the predictors and outcomes, then the variance of imputed values for \(x_1\) is equal to the variance of \(x_1\) in the original population.
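
Continuing the hypothetical CRP example (age 70 years, temperature 38.5 °C), the conditional variance can be computed analogously and then used to draw multiple imputations; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

# Same construction as before (order: Age, CRP, Temperature)
mu = np.array([52.0, 53.0, 37.5])
sd = np.array([16.0, 62.0, 0.8])
R = np.array([[ 1.00, 0.09, -0.15],
              [ 0.09, 1.00,  0.25],
              [-0.15, 0.25,  1.00]])
Sigma = np.outer(sd, sd) * R

m, o = [1], [0, 2]                       # CRP missing; Age and Temperature observed
z = np.array([70.0, 38.5])               # hypothetical patient values
S_oo, S_mo = Sigma[np.ix_(o, o)], Sigma[np.ix_(m, o)]
mu_star = mu[m] + S_mo @ np.linalg.solve(S_oo, z - mu[o])
var_star = Sigma[np.ix_(m, m)] - S_mo @ np.linalg.solve(S_oo, S_mo.T)

# Draw five multiple imputations from N(mu_star, Var(x1 | x2, y))
imps = rng.normal(mu_star[0], np.sqrt(var_star[0, 0]), size=5)
print(round(float(np.sqrt(var_star[0, 0])), 1))  # conditional SD, about 59.5
```

Because the correlations are weak, the conditional SD (about 59.5) is only slightly below the marginal SD of 62: most of the original variability remains in the imputed values.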

So, how to generate an imputed dataset?

An iterative procedure is needed:

  • Estimate μ and Σ
  • Use μ and Σ to impute the missing values
  • Update the estimates of μ and Σ using the imputed values
  • Continue until the estimates of μ and Σ stabilize

This approach is known as the Gibbs sampler.

A natural choice for the initial estimates of μ and Σ is to derive them directly from the complete cases only.
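
The steps above can be sketched as a small NumPy routine. This is an illustrative implementation only, assuming every row has at least one observed value, not a production imputation tool:

```python
import numpy as np

def joint_model_impute(X, n_iter=50, rng=None):
    """Iteratively impute missing values (np.nan) assuming a joint MVN model."""
    rng = np.random.default_rng(rng)
    X = X.copy()
    miss = np.isnan(X)
    # Initial fill: replace missing values with the observed column means
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.nonzero(miss)[1])
    for _ in range(n_iter):
        # 1. (Re-)estimate mu and Sigma from the completed data
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False)
        # 2. Re-impute each incomplete row from its conditional distribution
        for i in np.nonzero(miss.any(axis=1))[0]:
            m = np.nonzero(miss[i])[0]      # missing columns in row i
            o = np.nonzero(~miss[i])[0]     # observed columns in row i
            S_oo = Sigma[np.ix_(o, o)]
            S_mo = Sigma[np.ix_(m, o)]
            mu_c = mu[m] + S_mo @ np.linalg.solve(S_oo, X[i, o] - mu[o])
            S_c = Sigma[np.ix_(m, m)] - S_mo @ np.linalg.solve(S_oo, S_mo.T)
            X[i, m] = rng.multivariate_normal(mu_c, S_c)
    return X
```

Each pass re-estimates μ and Σ from the completed data and then redraws the missing entries, so the imputations gradually reflect both the conditional means and the conditional variances.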

Illustration of the Gibbs sampler

  • We impute the missing values using the new estimates for μ and Σ, and re-estimate μ and Σ.
  • We iterate this procedure many times.
  • Eventually, we should obtain reliable estimates for μ and Σ, and thus also obtain imputations that properly reflect their uncertainty.

How to ensure that we end up in the posterior distribution?

  • Allow for sufficient imputation cycles!

  • Repeat the whole process from different starting points

    • E.g., estimate the initial versions of μ and Σ in a dataset where all missing values have been replaced by a random value

Final considerations

  • Normality assumptions may not always be realistic

    • Non-continuous data
    • Skewed data
    • Nonlinear data
    • Clustered data
  • Several extensions have been proposed to accommodate these situations