R makes it easy to combine different kinds of plots into one overall graph. This can be useful for visualizing both basic summary statistics (median, quartiles, etc.) and the distribution of a variable at the same time. Moreover, so-called cut-off values can be added to the graph.
In this blog post, I show how to combine box and jitter plots using the ggplot2 package.
First of all, we need to install and load the R packages required for the following steps. Since we want to do the installation and loading using the pacman package, we need to check whether this package has been installed already. If not, it will be installed; if yes, it will just be loaded (line 1). Furthermore, we need the R packages ggplot2 and Hmisc. This time, the p_load function checks whether these packages have been installed already and either installs and loads them or just loads them (line 2). Because p_load is called with the pacman:: prefix, this works even if pacman was only just installed and is not attached yet.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, Hmisc)
In a second step, we create three random variables (var.scale, var.group, var.cutoff) with n=300.
- var.scale is a numeric variable with a mean value of about 50 and a standard deviation of about 17.
- var.group is a factor variable comprising the groups male and female.
- var.cutoff is derived from var.scale using predefined cut-off values (0–40 = low, 41–60 = medium, >60 = high).
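# Scale: 300 normally distributed values (mean 50, SD 17), rounded to integers
var.scale <- round(rnorm(300, 50, 17))
# Group: random 0/1 draws, labelled as male/female
var.group <- rbinom(300, 1, .5)
var.group <- factor(var.group, levels = 0:1, labels = c("male", "female"))
# Cut-off: 1 = low (<= 40), 2 = medium (41-60), 3 = high (> 60)
var.cutoff <- ifelse(var.scale <= 40, 1, ifelse(var.scale > 40 & var.scale <= 60, 2, 3))
var.cutoff <- factor(var.cutoff, levels = 3:1, labels = c("high", "medium", "low"))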
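As a side note, the nested ifelse() calls can also be written with base R's cut() function. The following is only a minimal sketch, not the approach used in this post: the set.seed() call and the names cutoff.cut and cutoff.ifelse are my own assumptions, added so that the example is reproducible and self-contained.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
var.scale <- round(rnorm(300, 50, 17))

# cut() uses right-closed intervals by default: (-Inf, 40], (40, 60], (60, Inf)
cutoff.cut <- cut(var.scale,
                  breaks = c(-Inf, 40, 60, Inf),
                  labels = c("low", "medium", "high"))

# Reverse the level order to match the high/medium/low ordering used above
cutoff.cut <- factor(cutoff.cut, levels = c("high", "medium", "low"))

# Nested ifelse() version, for comparison
cutoff.ifelse <- ifelse(var.scale <= 40, 1, ifelse(var.scale <= 60, 2, 3))
cutoff.ifelse <- factor(cutoff.ifelse, levels = 3:1,
                        labels = c("high", "medium", "low"))

# Both approaches should assign every value to the same category
all(as.character(cutoff.cut) == as.character(cutoff.ifelse))
```

With more than three categories, cut() tends to stay readable, while nested ifelse() calls quickly become hard to follow.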
The describe() function of the Hmisc package returns some basic descriptive statistics for each variable.
Hmisc::describe(var.scale)
## var.scale
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##     300       0      71       1   51.25   24.00   30.90   41.00   50.00
##     .75     .90     .95
##   63.25   70.00   76.00
##
## lowest :   8  10  14  16  17, highest:  85  97 100 102 104
Hmisc::describe(var.group)
## var.group
##       n missing  unique
##     300       0       2
##
## male (141, 47%), female (159, 53%)
Hmisc::describe(var.cutoff)
## var.cutoff
##       n missing  unique
##     300       0       3
##
## high (87, 29%), medium (141, 47%), low (72, 24%)
Since the ggplot2 package requires the variables to be in a data frame, we have to create a new data frame df comprising our predefined variables using the data.frame() function.
df <- data.frame(var.scale, var.cutoff, var.group)
Using the functions xlab(), ylab() and ggtitle(), axis labels and plot title will be defined.
Box plots will be created using the geom_boxplot() function, with width specifying the boxes' width :-).
Jitter plots will be created using the geom_jitter() function. In addition, specifications have been made for the colour, position and size of the dots.
ggplot(df) +
  xlab("Group") +
  ylab("Scale") +
  ggtitle("Combination of Box and Jitter Plot") +
  geom_boxplot(aes(var.group, var.scale), width = 0.5) +
  # height = 0: jitter horizontally only, so the dots keep their true y values
  geom_jitter(aes(var.group, var.scale, colour = var.cutoff),
              position = position_jitter(width = .15, height = 0), size = 2) +
  # limits reach 110 so that values above 100 are not dropped from the plot
  scale_y_continuous(limits = c(0, 110), breaks = seq(0, 110, 10)) +
  scale_color_manual(name = "Legend", values = c("red", "blue3", "green3"))
Finally, we are going to format both the y-axis and the legend using the functions scale_y_continuous() and scale_color_manual().
Hi Norbert,
great post! I like this type of diagram because it is really informative (good "ink-to-information" ratio). I used it for a client as well and came across the following issue:
I noticed some data points were plotted twice. It took me quite some time to figure out why. In one subgroup there were just two outliers, but four outliers appeared in the plot. geom_jitter plots all the points, regardless of whether they are outliers or not. geom_boxplot plots outliers, regardless of what geom_jitter does.
The solution was to turn off outliers in the geom_boxplot call:
geom_boxplot(outlier.color = NA)
That’s a good hint. Thanks very much!
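Putting this hint together with the plot from the post, a self-contained version might look roughly like the sketch below. A few things here are assumptions rather than part of the original code: the fixed seed, storing the plot in an object called p, mapping x and y once in ggplot(), and using outlier.shape = NA, which is the option the ggplot2 documentation describes for hiding outliers when raw data points are drawn on top of a box plot.

```r
library(ggplot2)

set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible

# Rebuild the example data along the lines of the post
var.scale <- round(rnorm(300, 50, 17))
var.group <- factor(rbinom(300, 1, .5), levels = 0:1,
                    labels = c("male", "female"))
var.cutoff <- ifelse(var.scale <= 40, 1, ifelse(var.scale <= 60, 2, 3))
var.cutoff <- factor(var.cutoff, levels = 3:1,
                     labels = c("high", "medium", "low"))
df <- data.frame(var.scale, var.cutoff, var.group)

p <- ggplot(df, aes(var.group, var.scale)) +
  # outlier.shape = NA: the box plot draws no outlier dots of its own,
  # so each observation appears only once (via geom_jitter)
  geom_boxplot(width = 0.5, outlier.shape = NA) +
  # height = 0: horizontal jitter only, the y values stay untouched
  geom_jitter(aes(colour = var.cutoff),
              position = position_jitter(width = .15, height = 0),
              size = 2) +
  scale_color_manual(name = "Legend",
                     values = c("red", "blue3", "green3")) +
  xlab("Group") +
  ylab("Scale") +
  ggtitle("Combination of Box and Jitter Plot")

p  # print the plot
```

The outliers are still visible in this version, because geom_jitter draws every observation; they are just no longer drawn a second time by geom_boxplot.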