Mastering Multiple Imputations using R. Part I: Grouped imputations

Intro

Multiple imputation is one of the great ideas in statistical science. The technique is simple, elegant and powerful. It is simple because it fills the holes in the data with plausible values. It is elegant because the uncertainty about the unknown data is coded in the data itself. And it is powerful because it can solve “other” problems that are actually missing data problems in disguise (Stef van Buuren).

Multiple imputation has become a standard for the treatment of missing data. Unlike conventional imputation methods (mean, median imputation etc.) that merely replace the missing values with a single value, “Rubin’s (1987) multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute” (see Yang C. Yuan). According to Wikipedia, “the primary method of multiple imputation is multiple imputation by chained equations (MICE).” In R, the MICE method is implemented by the mice package.

In this blog post, I show how to run multiple imputation on a numeric variable that is grouped by another variable indicating which measurement each value represents. The grouping is done with the “bygroup” imputation method of the miceadds package and works similarly to the group_by() function of the dplyr package.

Data and packages

library(tidyverse)
library(mice)
library(miceadds)
library(labelled)

The dataset we use in this blog post comes with the mice package and is called brandsma. The dataset contains “data from 4106 pupils attending 216 schools. This dataset includes all pupils and schools with missing data.” (see the mice package documentation). With the following code, we select the variables we need for multiple imputation and for fitting statistical models. The variables are:

  • sch (School number)
  • pup (Pupil ID)
  • sex (Sex of pupil)
  • min (Minority member)
  • iqv (IQ verbal)
  • iqp (IQ performal)
  • lpr (language score PRE)
  • lpo (language score POST)
  • apr (Arithmetic score PRE)
  • apo (Arithmetic score POST)
df.RAW <- mice::brandsma %>%
  select(sch, pup, sex, min, iqv, iqp, lpr, lpo, apr, apo) %>%
  filter(!is.na(sex)) %>%
  set_variable_labels(.labels = c(
    "School number", "Pupil ID",
    "Sex of pupil", "Minority member",
    "IQ verbal", "IQ performal",
    "language score PRE", "language score POST",
    "Arithmetic score PRE", "Arithmetic score POST"
  ))
glimpse(df.RAW)
## Observations: 4,096
## Variables: 10
## $ sch <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ pup <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, …
## $ sex <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ min <int> 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0,…
## $ iqv <dbl> -1.3535094, 2.1464906, 3.1464906, 2.6464906, -2.3535094, -0.8535094, -3.8535094, -2…
## $ iqp <dbl> -3.72274979, 3.27725021, 1.27391680, -1.05608313, -0.05608313, -1.05608313, -4.3900…
## $ lpr <dbl> 33, 44, 36, 36, 33, 29, 19, 22, 20, 44, 34, 31, 18, 36, 31, 34, 31, 32, 23, 20, 27,…
## $ lpo <dbl> NA, 50, 46, 45, 33, 46, 20, 30, 30, 57, 36, 36, 29, 40, 41, 47, 33, 37, 29, 26, 37,…
## $ apr <dbl> 10, 18, 14, 12, 10, 13, 8, 8, 7, 17, 10, 14, 11, 10, 10, 12, 9, 13, 9, 9, 7, 16, 16…
## $ apo <dbl> 12, 30, 24, 19, 24, 26, 9, 13, 13, 30, 23, 22, 19, 23, 18, 22, 15, 21, 13, 12, 11, …
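
As a quick optional check before reshaping, we can count the missing values per variable; mice also offers md.pattern() for a more detailed view of which variables are missing together.

colSums(is.na(df.RAW))   # number of missing values per selected variable
mice::md.pattern(df.RAW) # missing-data patterns across variables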

Using the pivot_longer() function of the tidyr package, we transform the original data into a long table. The values of the quantitative variables are stored in the new variable VALUE, and the ENDPOINT variable indicates to which original variable each value belongs. Furthermore, we create a new variable (IMPUTED) that flags the missing values which need to be imputed. In addition, we convert the variables sex and min into factors with meaningful labels.

df.RAW <- df.RAW %>%
  pivot_longer(names_to = 'ENDPOINT', values_to = 'VALUE', cols = c(iqv:apo)) %>%
  mutate(IMPUTED = is.na(VALUE),
         sex = factor(sex, levels = c(0, 1), labels = c('f', 'm')),
         min = factor(min, levels = c(0, 1), labels = c('n', 'y')))
glimpse(df.RAW)
## Observations: 24,576
## Variables: 7
## $ sch      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pup      <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, …
## $ sex      <fct> m, m, m, m, m, m, m, m, m, m, m, m, f, f, f, f, f, f, f, f, f, f, f, f, f, f, …
## $ min      <fct> y, y, y, y, y, y, y, y, y, y, y, y, n, n, n, n, n, n, y, y, y, y, y, y, n, n, …
## $ ENDPOINT <chr> "iqv", "iqp", "lpr", "lpo", "apr", "apo", "iqv", "iqp", "lpr", "lpo", "apr", "…
## $ VALUE    <dbl> -1.35350942, -3.72274979, 33.00000000, NA, 10.00000000, 12.00000000, 2.1464905…
## $ IMPUTED  <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…

Specifying the imputation model

# dry run with zero iterations: no values are imputed yet, we only extract
# the default method vector and predictor matrix to modify them below
init <- mice(df.RAW, maxit = 0)
meth <- init$method
predM <- init$predictorMatrix

# groupwise imputation of variable VALUE
meth["VALUE"] <- "bygroup"

# remove as predictor
predM[, c('IMPUTED')] <- 0
# specify name of the grouping variable ('ENDPOINT') and imputation method ('pmm')
group <- list( "VALUE" = "ENDPOINT" )
imputationFunction <- list("VALUE" = "pmm" )
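
As a quick sanity check, the modified method vector and predictor matrix can be inspected before running the imputations:

meth["VALUE"]        # should now be "bygroup"
predM[, "IMPUTED"]   # should be all zeros, i.e. IMPUTED is no longer a predictor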

Running multiple imputations

## grouped imputation

mids_endpoint <- df.RAW %>%
  mice(.,
       method = meth,
       predictorMatrix = predM,
       m = 3,
       seed = 2020,
       printFlag = FALSE,
       group = group,
       imputationFunction = imputationFunction 
  )
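
Before working with the imputed data, it is good practice to check whether the imputation chains have converged and whether the imputed values look plausible. A minimal sketch using the standard mice diagnostics (the trace plot shows the chain means and standard deviations per iteration):

plot(mids_endpoint)                  # trace plots for the imputed variable
densityplot(mids_endpoint, ~ VALUE)  # observed vs. imputed distribution of VALUE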

Creating complete data

df.COMPLETE <- mice::complete(mids_endpoint, include = FALSE) %>% 
  mutate(ENDPOINT = factor(ENDPOINT, levels = c('iqv', 'iqp', 'apr', 'apo', 'lpr', 'lpo')))
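
Note that complete() with its default action returns a single completed dataset (the first imputation), which is sufficient for the plot below. For actual analyses, the usual mice workflow is to fit the model on every imputed dataset with with() and to pool the results across imputations with pool(). A minimal sketch, where the model (language post-test score regressed on sex and minority status) is chosen purely for illustration:

# fit the illustrative model on each of the m = 3 imputed datasets
fit <- with(mids_endpoint,
            lm(VALUE ~ sex + min, subset = ENDPOINT == 'lpo'))

# pool the coefficients and standard errors with Rubin's rules
summary(pool(fit))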

Plotting observed and imputed values

ggplot(df.COMPLETE, aes(x = ENDPOINT, y = VALUE)) +
  geom_jitter(aes(colour = IMPUTED, alpha = IMPUTED), size = 0.5) +
  geom_violin(alpha=0) +
  scale_colour_manual(values = c('grey', 'red3')) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(colour = NULL, alpha = NULL, x = NULL, y = NULL,
       title = 'The distribution of observed and imputed values',
       subtitle = 'Red dots indicate imputed values') 

Figure: The distribution of observed and imputed values; red dots indicate imputed values.
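
To compare the imputed values across all m imputations rather than just the single completed dataset plotted above, mice’s built-in stripplot() can be used as an alternative (a minimal sketch):

# VALUE by imputation number (.imp = 0 are the observed data;
# imputed values are drawn in red by mice's default colour scheme)
stripplot(mids_endpoint, VALUE ~ .imp, pch = 20, cex = 0.5)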

Author: norbert

Biometrician at Clinical Trial Centre, Leipzig University (GER), with degrees in sociology (MA) and public health (MPH).

2 thoughts on “Mastering Multiple Imputations using R. Part I: Grouped imputations”

  1. Nice! Must admit I haven’t worked with the mice package yet, only read about it. Should give it a try some time. So far I have sometimes used caret to impute missing values. This can be done via random forests or knn (k nearest neighbors), which can lead to plausible results, taking other variables into account.

    Recently stumbled upon the naniar package by Nick Tierney to visualize missing data – very neat, great ideas behind that approach (e.g. geom_miss_point). In the vignette, the simputation package is recommended for imputing missing values, and it states clearly that mean/median imputation is almost never a good idea:
    https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html
