# Mastering Multiple Imputations using R. Part I: Grouped imputations

## Intro

Multiple imputation is one of the great ideas in statistical science. The technique is simple, elegant and powerful. It is simple because it fills the holes in the data with plausible values. It is elegant because the uncertainty about the unknown data is coded in the data itself. And it is powerful because it can solve “other” problems that are actually missing data problems in disguise (Stef van Buuren).

Multiple imputation has become a standard for the treatment of missing data. Unlike conventional imputation methods (mean, median imputation etc.) that merely replace the missing values with a single value, “Rubin’s (1987) multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute” (see Yang C. Yuan). According to Wikipedia, “the primary method of multiple imputation is multiple imputation by chained equations (MICE).” In R, the MICE method is implemented by the `mice` package.

In this blog post, I show how to conduct multiple imputation on a numeric variable grouped by another variable indicating the meaning of the numeric values. The “grouping” is done using a function of the `miceadds` package and works similar to the `group_by()` function of the `dplyr` package.

## Data and packages

```library(tidyverse)
library(mice)
library(labelled)
```

The dataset we use in this blog post comes along with the `mice` package and is called `brandsma`. the dataset contains “data from 4106 pupils attending 216 schools. This dataset includes all pupils and schools with missing data.” (see `mice` package description). With the following code, we select the variables we need for multiple imputations and for calculating statistical models. The variables are:

• `sch` (School number)
• `pup` (Pupil ID)
• `sex` (Sex of pupil)
• `min` (Minority member)
• `iqv` (IQ verbal)
• `iqp` (IQ performal)
• `lpr` (language score PRE)
• `lpo` (language score POST)
• `apr` (Arithmetic score PRE)
• `apo` (Arithmetic score POST)
```df.RAW <- mice::brandsma %>%
select(sch, pup, sex, min, iqv, iqp, lpr, lpo, apr, apo) %>%
filter(!is.na(sex)) %>%
set_variable_labels(.labels = c(
"School number", "Pupil ID",
"Sex of pupil", "Minority member",
"IQ verbal", "IQ performal",
"language score PRE", "language score POST",
"Arithmetic score PRE", "Arithmetic score POST"
))
glimpse(df.RAW)
```
```## Observations: 4,096
## Variables: 10
## \$ sch <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## \$ pup <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, …
## \$ sex <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## \$ min <int> 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0,…
## \$ iqv <dbl> -1.3535094, 2.1464906, 3.1464906, 2.6464906, -2.3535094, -0.8535094, -3.8535094, -2…
## \$ iqp <dbl> -3.72274979, 3.27725021, 1.27391680, -1.05608313, -0.05608313, -1.05608313, -4.3900…
## \$ lpr <dbl> 33, 44, 36, 36, 33, 29, 19, 22, 20, 44, 34, 31, 18, 36, 31, 34, 31, 32, 23, 20, 27,…
## \$ lpo <dbl> NA, 50, 46, 45, 33, 46, 20, 30, 30, 57, 36, 36, 29, 40, 41, 47, 33, 37, 29, 26, 37,…
## \$ apr <dbl> 10, 18, 14, 12, 10, 13, 8, 8, 7, 17, 10, 14, 11, 10, 10, 12, 9, 13, 9, 9, 7, 16, 16…
## \$ apo <dbl> 12, 30, 24, 19, 24, 26, 9, 13, 13, 30, 23, 22, 19, 23, 18, 22, 15, 21, 13, 12, 11, …
```

Using the `pivot_longer()` function of the `tidyr` package, we transform the original data into a long table. The values of the quantitative variables are stored in the new variable `VALUE`. The `ENDPOINT` variable indicates to which variable each value belongs. Furthermore, we create a new variable (IMPUTED) which indicates missing values that need to be imputed. In addition, we assign value labels to the variables `sex` and `min`.

```df.RAW <- df.RAW %>%
pivot_longer(names_to = 'ENDPOINT', values_to = 'VALUE', cols = c(iqv:apo)) %>%
mutate(IMPUTED = is.na(VALUE),
sex = factor(sex, levels = c(0, 1), labels = c('f', 'm')),
min = factor(min, levels = c(0, 1), labels = c('n', 'y')))
glimpse(df.RAW)
```
```## Observations: 24,576
## Variables: 7
## \$ sch      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## \$ pup      <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, …
## \$ sex      <fct> m, m, m, m, m, m, m, m, m, m, m, m, f, f, f, f, f, f, f, f, f, f, f, f, f, f, …
## \$ min      <fct> y, y, y, y, y, y, y, y, y, y, y, y, n, n, n, n, n, n, y, y, y, y, y, y, n, n, …
## \$ ENDPOINT <chr> "iqv", "iqp", "lpr", "lpo", "apr", "apo", "iqv", "iqp", "lpr", "lpo", "apr", "…
## \$ VALUE    <dbl> -1.35350942, -3.72274979, 33.00000000, NA, 10.00000000, 12.00000000, 2.1464905…
## \$ IMPUTED  <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
```

## Specifying the imputation model

```init <- mice(df.RAW, maxit = 0)
meth <- init\$method
predM <- init\$predictorMatrix

# groupwise imputation of variable VALUE
meth["VALUE"] <- "bygroup"

# remove as predictor
predM[, c('IMPUTED')] <- 0
# specify name of the grouping variable ('ENDPOINT') and imputation method ('pmm')
group <- list( "VALUE" = "ENDPOINT" )
imputationFunction <- list("VALUE" = "pmm" )
```

## Running multiple imputations

```## grouped imputation

mids_endpoint <- df.RAW %>%
mice(.,
method = meth,
predictorMatrix = predM,
m = 3,
seed = 2020,
printFlag = FALSE,
group = group,
imputationFunction = imputationFunction
)
```

## Creating complete data

```df.COMPLETE <- mice::complete(mids_endpoint, include = FALSE) %>%
mutate(ENDPOINT = factor(ENDPOINT, levels = c('iqv', 'iqp', 'apr', 'apo', 'lpr', 'lpo')))
```

## Plotting observed and imputed values

```ggplot(df.COMPLETE, aes(x = ENDPOINT, y = VALUE)) +
geom_jitter(aes(colour = IMPUTED, alpha = IMPUTED), size = 0.5) +
geom_violin(alpha=0) +
scale_colour_manual(values = c('grey', 'red3')) +
theme_bw() +
theme(legend.position = 'none') +
labs(colour = NULL, alpha = NULL, x = NULL, y = NULL,
title = 'The distribution of observed and imputed values',
subtitle = 'Red dots indicate imputed values')
```

## Author: norbert

Biometrician at Clinical Trial Centre, Leipzig University (GER), with degrees in sociology (MA) and public health (MPH).

## 2 thoughts on “Mastering Multiple Imputations using R. Part I: Grouped imputations”

1. Nice! Must admit I haven’t worked with the mice package yet, only read about it. Should give it a try some time. So far I sometimes used caret to impute missings. Can be done via random forests or knn (k nearest neighbors), which can lead to plausible results, taking other variables into account.

Recently stumbled upon the naniar package by Nick Tierney to visualize missing data – very neat, great ideas behind that approach (e. g. geom_miss_point). In the vignette, the simputation package is recommended for imputing missing values. States clearly that mean / median imputation is almost never a good idea …
https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html

1. Thanks, Wolf. I hope I’ll find some time to complete this blog post soon. 🙂

This site uses Akismet to reduce spam. Learn how your comment data is processed.