How to combine box and jitter plots using R and ggplot2

R makes it easy to combine different kinds of plots into one overall graph. This can be useful for visualizing both basic summary statistics (median, quartiles, etc.) and the distribution of a variable. Moreover, so-called cut-off values can be added to the graph.

In this blog post, I show how to combine box and jitter plots using the ggplot2 package.

First of all, we need to install and load the R packages required for the following steps. Since we want to do the installation and loading with the pacman package, we first check whether pacman itself has been installed already. If not, it will be installed; if it has, require() simply loads it (line 1). Furthermore, we need the R packages ggplot2 and Hmisc. For these, the p_load() function checks whether the packages have been installed already and either installs and loads them or just loads them (line 2).

if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, Hmisc)

In a second step, we create three random variables (var.scale, var.group, var.cutoff) with n=300.

  • var.scale is a numeric variable with a mean value of about 50 and a standard deviation of about 17.
  • var.group is a factor variable comprising the groups male and female.
  • var.cutoff is derived from var.scale using predefined cut-off values (0–40 = low, 41–60 = medium, >60 = high).
var.scale <- round(rnorm(300, 50, 17))
var.group <- rbinom(300, 1, .5)
var.group <- factor(var.group, 
                     levels = c(0:1), 
                     labels = c("male", "female"))

var.cutoff <- ifelse(var.scale <= 40, 1, 
                     ifelse(var.scale > 40 & var.scale <= 60, 2, 3))

var.cutoff <- factor(var.cutoff, 
                     levels = c(3:1), 
                     labels = c("high", "medium", "low"))
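
The nested ifelse() calls above can also be written with base R's cut() function. The following is only a sketch of that alternative, not part of the original workflow: the seed value 42 and the name var.cutoff2 are assumptions made for illustration, so the resulting numbers will differ from the output shown in this post.

```r
# Assumption: a fixed seed (42 is arbitrary) to make the example reproducible.
set.seed(42)
var.scale <- round(rnorm(300, 50, 17))

# Approach used in this post: nested ifelse() followed by factor().
var.cutoff <- ifelse(var.scale <= 40, 1,
                     ifelse(var.scale <= 60, 2, 3))
var.cutoff <- factor(var.cutoff,
                     levels = 3:1,
                     labels = c("high", "medium", "low"))

# Alternative: cut() with right-closed intervals
# (-Inf, 40], (40, 60] and (60, Inf].
var.cutoff2 <- cut(var.scale,
                   breaks = c(-Inf, 40, 60, Inf),
                   labels = c("low", "medium", "high"))

# Reverse the level order so it matches the factor above (high, medium, low).
var.cutoff2 <- factor(var.cutoff2, levels = c("high", "medium", "low"))

# Check that both approaches put every observation into the same category.
same.category <- all(as.character(var.cutoff) == as.character(var.cutoff2))
print(same.category)
```

With cut(), adding a further category only requires one more break and one more label instead of another nested ifelse().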

The describe() function of the Hmisc package returns some basic descriptive statistics for each variable.

Hmisc::describe(var.scale)
## var.scale 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     300       0      71       1   51.25   24.00   30.90   41.00   50.00 
##     .75     .90     .95 
##   63.25   70.00   76.00 
## 
## lowest :   8  10  14  16  17, highest:  85  97 100 102 104
Hmisc::describe(var.group)
## var.group 
##       n missing  unique 
##     300       0       2 
## 
## male (141, 47%), female (159, 53%)
Hmisc::describe(var.cutoff)
## var.cutoff 
##       n missing  unique 
##     300       0       3 
## 
## high (87, 29%), medium (141, 47%), low (72, 24%)
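
If Hmisc is not available, comparable summaries can be obtained with base R. This is a minimal sketch under assumptions: the data are simulated again with an arbitrary seed (42), and the object names scale.mean, scale.quantiles, group.counts and group.shares are made up for this example. The quantiles may differ slightly from those reported by describe(), depending on the quantile definition used.

```r
# Assumption: arbitrary seed, so the values differ from the output above.
set.seed(42)
var.scale <- round(rnorm(300, 50, 17))
var.group <- factor(rbinom(300, 1, .5),
                    levels = 0:1,
                    labels = c("male", "female"))

# Mean and the same set of percentiles that describe() lists.
scale.mean <- mean(var.scale)
scale.quantiles <- quantile(var.scale,
                            probs = c(.05, .10, .25, .50, .75, .90, .95))

# Absolute and relative frequencies (in percent) per group.
group.counts <- table(var.group)
group.shares <- round(100 * prop.table(group.counts))

print(scale.mean)
print(scale.quantiles)
print(group.counts)
print(group.shares)
```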

Since the ggplot2 package requires the variables to be in a data frame, we have to create a new data frame df comprising our predefined variables using the data.frame() function.

df <- data.frame(var.scale, var.cutoff, var.group)

The functions xlab(), ylab() and ggtitle() define the axis labels and the plot title.

Box plots are created using the geom_boxplot() function, with width specifying the boxes' width :-).

Jitter plots are created using the geom_jitter() function. In addition, the colour, position and size of the dots are specified.

ggplot(df) +
  xlab("Group") +
  ylab("Scale") +
  ggtitle("Combination of Box and Jitter Plot") + 
  geom_boxplot(aes(var.group, var.scale), 
               width=0.5) + 
  geom_jitter(aes(var.group, var.scale, colour = var.cutoff), 
              position = position_jitter(width = .15, height = 0),
              size=2) +
  scale_y_continuous(limits = c(0, 110),
                     breaks = seq(0, 110, 10)) +
  scale_color_manual(name="Legend", 
                     values=c("red", "blue3", "green3")) 

[Figure: Combination of Box and Jitter Plot]
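
To get a feeling for what horizontal jittering does, the idea can be sketched in base R: each point is shifted by a random amount drawn uniformly between -width and +width around its group position. This is a conceptual illustration only, not ggplot2's internal code; the seed, the sample size of 10 and the names x, w and x.jittered are assumptions for this example.

```r
# Assumption: arbitrary seed and a small sample, just for illustration.
set.seed(42)
var.group <- factor(rbinom(10, 1, .5),
                    levels = 0:1,
                    labels = c("male", "female"))

# On a discrete axis the groups sit at the positions 1, 2, ...
x <- as.integer(var.group)

# Horizontal jitter of width w: uniform noise between -w and +w.
w <- 0.15
x.jittered <- x + runif(length(x), -w, w)

# Every point stays within w of its group position.
print(round(cbind(x, x.jittered), 2))
```

Because the noise is bounded by w, points of neighbouring groups cannot overlap as long as w is smaller than 0.5.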

Finally, the Y-axis and the legend are formatted using the functions scale_y_continuous() and scale_color_manual(), which are the last two calls in the code above.
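
One thing to keep in mind: values outside the limits passed to scale_y_continuous() are treated as missing and dropped from the plot (ggplot2 issues a warning), which also affects the box plot statistics. A simple way to avoid this is to derive the limits from the data. The following is a sketch under assumptions: the seed and the name y.limits are made up for this example, and rounding to multiples of 10 is just one possible choice.

```r
# Assumption: arbitrary seed, so the values differ from the output above.
set.seed(42)
var.scale <- round(rnorm(300, 50, 17))

# Round the observed range outwards to the nearest multiples of 10.
y.limits <- c(floor(min(var.scale) / 10) * 10,
              ceiling(max(var.scale) / 10) * 10)

# Count the observations that would fall outside these limits.
n.outside <- sum(var.scale < y.limits[1] | var.scale > y.limits[2])

print(y.limits)
print(n.outside)
```

The resulting vector could then be used as scale_y_continuous(limits = y.limits, breaks = seq(y.limits[1], y.limits[2], 10)).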

About norbert

I am post doc at the Department of Medical Psychology and Sociology, Leipzig University (GER), with degrees in sociology (MA) and public health (MPH).
This entry was posted in Visualizing Data.

2 Responses to How to combine box and jitter plots using R and ggplot2

  1. Wolf says:

    Hi Norbert,
    great post! I like this type of diagram because it is really informative (good “ink to information”-relation). Used it as well for a client, and came across the following issue:
    I noticed some data points were plotted twice. It took me quite some time to figure out why. In one subgroup there were just two outliers, but four outliers appeared in the plot. geom_jitter plots all the points, regardless whether they are outliers or not. geom_boxplot plots outliers, regardless of what geom_jitter does.
    The solution was to turn off outliers in the geom_boxplot call:

    geom_boxplot(outlier.shape = NA)
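
    The suggestion from this comment can be sketched as a complete, reduced example. The outlier settings of geom_boxplot() are ordinary arguments rather than aesthetics, so they go outside aes(). The simulated data, the seed and the object name p are assumptions for illustration; colours, labels and axis formatting from the post are left out to keep it short.

```r
library(ggplot2)

# Assumption: arbitrary seed and simulated data as in the post.
set.seed(42)
df <- data.frame(
  var.scale = round(rnorm(300, 50, 17)),
  var.group = factor(rbinom(300, 1, .5),
                     levels = 0:1,
                     labels = c("male", "female"))
)

# Box plot without outlier points; the jitter layer draws every
# observation exactly once (horizontal jitter only).
p <- ggplot(df, aes(var.group, var.scale)) +
  geom_boxplot(width = 0.5, outlier.shape = NA) +
  geom_jitter(width = 0.15, height = 0, size = 2)

print(p)
```

    With outlier.shape = NA the box plot layer no longer draws outliers, so no observation appears twice.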
