How to combine box and jitter plots using R and ggplot2

R makes it easy to combine different kinds of plots into one overall graph. This is useful for visualizing both basic summary statistics (median, quartiles etc.) and the distribution of a variable. Moreover, so-called cut-off values can be added to the graph.

In this blog post, I show how to combine box and jitter plots using the `ggplot2` package.

First of all, we need to install and load the R packages required for the following steps. Since we want to do the installation and loading using the `pacman` package, we first check whether `pacman` itself has been installed already. If not, it will be installed (line 1). Furthermore, we need the R packages `ggplot2` and `Hmisc`. Here, the `p_load` function checks whether these packages have been installed already and either installs and loads them or just loads them (line 2).

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, Hmisc)
```

In a second step, we create three random variables (var.scale, var.group, var.cutoff) with n=300.

• var.scale is a numeric variable with a mean value of about 50 and a standard deviation of about 17.
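• var.group is a factor variable comprising the groups male and female.
• var.cutoff was calculated based on var.scale using predefined cut-off values (0 to 40: low, 41 to 60: medium, above 60: high).

```r
# Numeric scale: 300 draws from N(50, 17), rounded to integers
var.scale <- round(rnorm(300, 50, 17))

# Group membership: 0/1 with equal probability, labelled male/female
var.group <- rbinom(300, 1, .5)
var.group <- factor(var.group,
                    levels = c(0:1),
                    labels = c("male", "female"))

# Cut-off categories: <= 40 low (1), 41-60 medium (2), > 60 high (3)
var.cutoff <- ifelse(var.scale <= 40, 1,
                     ifelse(var.scale > 40 & var.scale <= 60, 2, 3))

# Levels in descending order so that "high" is listed first
var.cutoff <- factor(var.cutoff,
                     levels = c(3:1),
                     labels = c("high", "medium", "low"))
```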
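
The nested `ifelse()` calls above can also be written with base R's `cut()`. The following is only a minimal sketch and not the method used in this post; the small vector `x` and the name `cutoff.alt` are made up for illustration.

```r
# Hypothetical example values, chosen to sit on and around the cut-offs
x <- c(8, 40, 41, 60, 61, 104)

# Intervals are right-closed by default: (-Inf, 40], (40, 60], (60, Inf)
cutoff.alt <- cut(x,
                  breaks = c(-Inf, 40, 60, Inf),
                  labels = c("low", "medium", "high"))

# Reverse the level order to match var.cutoff (high, medium, low)
cutoff.alt <- factor(cutoff.alt, levels = c("high", "medium", "low"))

table(cutoff.alt)
```

With `cut()`, adding a further category only requires one more break and one more label instead of another nested `ifelse()`.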

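The `describe()` function of the `Hmisc` package returns basic descriptive statistics: the number of observations, missing and unique values, the mean and selected quantiles for numeric variables, and frequencies for factors.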

```r
Hmisc::describe(var.scale)
```
```
## var.scale
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##     300       0      71       1   51.25   24.00   30.90   41.00   50.00
##     .75     .90     .95
##   63.25   70.00   76.00
##
## lowest :   8  10  14  16  17, highest:  85  97 100 102 104
```
```r
Hmisc::describe(var.group)
```
```
## var.group
##       n missing  unique
##     300       0       2
##
## male (141, 47%), female (159, 53%)
```
```r
Hmisc::describe(var.cutoff)
```
```
## var.cutoff
##       n missing  unique
##     300       0       3
##
## high (87, 29%), medium (141, 47%), low (72, 24%)
```
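
If `Hmisc` is not at hand, base R gives a similar overview. This is a minimal sketch under a few assumptions: it generates its own data, the names `scale.demo` and `group.demo` are hypothetical, and the seed `123` is arbitrary, so the numbers will differ from the output above.

```r
set.seed(123)  # arbitrary seed, only so that this sketch is reproducible

scale.demo <- round(rnorm(300, 50, 17))
group.demo <- factor(rbinom(300, 1, .5),
                     levels = 0:1,
                     labels = c("male", "female"))

# Minimum, quartiles, mean and maximum of the numeric variable
summary(scale.demo)
sd(scale.demo)

# Absolute and relative (percent) frequencies of the factor
table(group.demo)
round(prop.table(table(group.demo)) * 100)

# Median of the scale within each group
tapply(scale.demo, group.demo, median)
```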

Since the `ggplot2` package requires the variables to be in a data frame, we have to create a new data frame df comprising our predefined variables using the `data.frame()` function.

```r
df <- data.frame(var.scale, var.cutoff, var.group)
```

Using the functions `xlab()`, `ylab()` and `ggtitle()`, axis labels and plot title will be defined.

Box plots will be created using the `geom_boxplot()` function, with `width` specifying the boxes' width :-).

Jitter plots will be created using the `geom_jitter()` function. The `colour` aesthetic maps the colour of the dots to the cut-off category, while `position` and `size` control how far the dots are jittered and how large they are.
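
Because `geom_jitter()` draws every observation, any outlier that `geom_boxplot()` marks is drawn a second time. The following is a minimal sketch of the two layers on their own, with the box plot's outlier markers switched off via `outlier.shape = NA`. The data frame `demo`, its column names and the seed are assumptions made for illustration only.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed for a reproducible sketch
demo <- data.frame(
  group = factor(rep(c("male", "female"), each = 50)),
  scale = round(rnorm(100, 50, 17))
)

# Mapping x and y once in ggplot() lets both layers share them
p <- ggplot(demo, aes(group, scale)) +
  # outlier.shape = NA hides the box plot's own outlier markers
  geom_boxplot(width = 0.5, outlier.shape = NA) +
  # horizontal jitter only, so the y values stay exact
  geom_jitter(width = 0.15, height = 0, size = 2)

print(p)
```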

Finally, we format both the Y-axis and the legend using the functions `scale_y_continuous()` and `scale_color_manual()`.

```r
ggplot(df) +
  xlab("Group") +
  ylab("Scale") +
  ggtitle("Combination of Box and Jitter Plot") +
  geom_boxplot(aes(var.group, var.scale),
               width = 0.5) +
  # height = 0: jitter horizontally only, so the scale values stay exact
  geom_jitter(aes(var.group, var.scale, colour = var.cutoff),
              position = position_jitter(width = .15, height = 0),
              size = 2) +
  # limits cover the full range of the breaks
  scale_y_continuous(limits = c(0, 110),
                     breaks = seq(0, 110, 10)) +
  # colours follow the factor levels: high, medium, low
  scale_color_manual(name = "Legend",
                     values = c("red", "blue3", "green3"))
```

Biometrician at Clinical Trial Centre, Leipzig University (GER), with degrees in sociology (MA) and public health (MPH).

2 Responses to How to combine box and jitter plots using R and ggplot2

1. Wolf says:

Hi Norbert,
great post! I like this type of diagram because it is really informative (good “ink to information”-relation). Used it as well for a client, and came across the following issue:
I noticed some data points were plotted twice. It took me quite some time to figure out why. In one subgroup there were just two outliers, but four outliers appeared in the plot. geom_jitter plots all the points, regardless whether they are outliers or not. geom_boxplot plots outliers, regardless of what geom_jitter does.
The solution was to turn off outliers in the geom_boxplot call:

geom_boxplot(outlier.color = NA)

• koehlern says:

That’s a good hint. Thanks very much!
