Intro
Multiple imputation is one of the great ideas in statistical science. The technique is simple, elegant and powerful. It is simple because it fills the holes in the data with plausible values. It is elegant because the uncertainty about the unknown data is coded in the data itself. And it is powerful because it can solve “other” problems that are actually missing data problems in disguise (Stef van Buuren).
Multiple imputation has become a standard for the treatment of missing data. Unlike conventional imputation methods (mean, median imputation etc.) that merely replace the missing values with a single value, “Rubin’s (1987) multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute” (see Yang C. Yuan). According to Wikipedia, “the primary method of multiple imputation is multiple imputation by chained equations (MICE).” In R, the MICE method is implemented by the mice
package.
…
In this blog post, I show how to conduct multiple imputation on a numeric variable grouped by another variable indicating the meaning of the numeric values. The “grouping” is done using a function of the miceadds
package and works similar to the group_by()
function of the dplyr
package.
Data and packages
library(tidyverse) library(mice) library(miceadds) library(labelled)
The dataset we use in this blog post comes along with the mice
package and is called brandsma
. the dataset contains “data from 4106 pupils attending 216 schools. This dataset includes all pupils and schools with missing data.” (see mice
package description). With the following code, we select the variables we need for multiple imputations and for calculating statistical models. The variables are:
sch
(School number)pup
(Pupil ID)sex
(Sex of pupil)min
(Minority member)iqv
(IQ verbal)iqp
(IQ performal)lpr
(language score PRE)lpo
(language score POST)apr
(Arithmetic score PRE)apo
(Arithmetic score POST)
df.RAW <- mice::brandsma %>% select(sch, pup, sex, min, iqv, iqp, lpr, lpo, apr, apo) %>% filter(!is.na(sex)) %>% set_variable_labels(.labels = c( "School number", "Pupil ID", "Sex of pupil", "Minority member", "IQ verbal", "IQ performal", "language score PRE", "language score POST", "Arithmetic score PRE", "Arithmetic score POST" )) glimpse(df.RAW)
## Observations: 4,096 ## Variables: 10 ## $ sch <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,… ## $ pup <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, … ## $ sex <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… ## $ min <int> 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0,… ## $ iqv <dbl> -1.3535094, 2.1464906, 3.1464906, 2.6464906, -2.3535094, -0.8535094, -3.8535094, -2… ## $ iqp <dbl> -3.72274979, 3.27725021, 1.27391680, -1.05608313, -0.05608313, -1.05608313, -4.3900… ## $ lpr <dbl> 33, 44, 36, 36, 33, 29, 19, 22, 20, 44, 34, 31, 18, 36, 31, 34, 31, 32, 23, 20, 27,… ## $ lpo <dbl> NA, 50, 46, 45, 33, 46, 20, 30, 30, 57, 36, 36, 29, 40, 41, 47, 33, 37, 29, 26, 37,… ## $ apr <dbl> 10, 18, 14, 12, 10, 13, 8, 8, 7, 17, 10, 14, 11, 10, 10, 12, 9, 13, 9, 9, 7, 16, 16… ## $ apo <dbl> 12, 30, 24, 19, 24, 26, 9, 13, 13, 30, 23, 22, 19, 23, 18, 22, 15, 21, 13, 12, 11, …
Using the pivot_longer()
function of the tidyr
package, we transform the original data into a long table. The values of the quantitative variables are stored in the new variable VALUE
. The ENDPOINT
variable indicates to which variable each value belongs. Furthermore, we create a new variable (IMPUTED) which indicates missing values that need to be imputed. In addition, we assign value labels to the variables sex
and min
.
df.RAW <- df.RAW %>% pivot_longer(names_to = 'ENDPOINT', values_to = 'VALUE', cols = c(iqv:apo)) %>% mutate(IMPUTED = is.na(VALUE), sex = factor(sex, levels = c(0, 1), labels = c('f', 'm')), min = factor(min, levels = c(0, 1), labels = c('n', 'y'))) glimpse(df.RAW)
## Observations: 24,576 ## Variables: 7 ## $ sch <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ pup <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, … ## $ sex <fct> m, m, m, m, m, m, m, m, m, m, m, m, f, f, f, f, f, f, f, f, f, f, f, f, f, f, … ## $ min <fct> y, y, y, y, y, y, y, y, y, y, y, y, n, n, n, n, n, n, y, y, y, y, y, y, n, n, … ## $ ENDPOINT <chr> "iqv", "iqp", "lpr", "lpo", "apr", "apo", "iqv", "iqp", "lpr", "lpo", "apr", "… ## $ VALUE <dbl> -1.35350942, -3.72274979, 33.00000000, NA, 10.00000000, 12.00000000, 2.1464905… ## $ IMPUTED <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
Specifying the imputation model
init <- mice(df.RAW, maxit = 0) meth <- init$method predM <- init$predictorMatrix # groupwise imputation of variable VALUE meth["VALUE"] <- "bygroup" # remove as predictor predM[, c('IMPUTED')] <- 0 # specify name of the grouping variable ('ENDPOINT') and imputation method ('pmm') group <- list( "VALUE" = "ENDPOINT" ) imputationFunction <- list("VALUE" = "pmm" )
Running multiple imputations
## grouped imputation mids_endpoint <- df.RAW %>% mice(., method = meth, predictorMatrix = predM, m = 3, seed = 2020, printFlag = FALSE, group = group, imputationFunction = imputationFunction )
Creating complete data
df.COMPLETE <- mice::complete(mids_endpoint, include = FALSE) %>% mutate(ENDPOINT = factor(ENDPOINT, levels = c('iqv', 'iqp', 'apr', 'apo', 'lpr', 'lpo')))
Plotting observed and imputed values
ggplot(df.COMPLETE, aes(x = ENDPOINT, y = VALUE)) + geom_jitter(aes(colour = IMPUTED, alpha = IMPUTED), size = 0.5) + geom_violin(alpha=0) + scale_colour_manual(values = c('grey', 'red3')) + theme_bw() + theme(legend.position = 'none') + labs(colour = NULL, alpha = NULL, x = NULL, y = NULL, title = 'The distribution of observed and imputed values', subtitle = 'Red dots indicate imputed values')