How to plot animated maps with gganimate

Intro

Leipzig is the largest city in the federal state of Saxony, Germany. Since about 2009, Leipzig has been one of the fastest-growing German cities. Between 2006 and 2016, its population grew from about 506,000 to 580,000 inhabitants. The largest part of this growth can be attributed to inward migration.

In this blog post, I show how to create an animated map visualizing the migration balance for each of Leipzig's 63 districts (“Ortsteile”).

The data

The data I visualize in this blog post are provided by the local statistical bureau (Amt für Statistik und Wahlen). They can be downloaded as a .csv file from the bureau's website. The shapefile I use is not publicly available and can be purchased from the bureau for about 25 euros.

R packages

The following code will load and/or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(readr, rgdal, ggplot2, reshape2)
pacman::p_load_current_gh("dgrtwo/gganimate")

While the packages readr and rgdal are required for importing the .csv and the shapefile, ggplot2 and gganimate are needed to plot and animate the data. The reshape2 package is required for transforming data frames from wide to long format.

After importing the .csv file containing the migration data, we add a key variable (id) to identify each of Leipzig's 63 districts. Using the melt-function of the reshape2 package, we then reshape the data frame from wide to long format.

# import csv file
df.migrat <- read_csv2("./Data_LE/Bevölkerungsbewegung_Wanderungen_Wanderungssaldo.csv",
                      col_types = cols(`2015` = col_number()),
                      n_max = 63,
                      locale = locale(encoding = "latin1"),
                      trim_ws = TRUE)
# add id variable
df.migrat$id <- c(1:63)
# reorder columns: move the id variable next to district
df.migrat <- df.migrat[,c(1,18,2:17)]
# change name of first variable
colnames(df.migrat)[1] <- 'district'
# convert from wide to long
df.migrat <- reshape2::melt(df.migrat, id.vars=c("district","id"))
# change name of third variable
colnames(df.migrat)[3] <- 'year'

With the next code snippet, we import the shapefile using the readOGR-function of the rgdal package. The object we receive (sh.file) is of class SpatialPolygonsDataFrame. After adding a key variable, we transform this object into an ordinary data frame using the fortify-function of the ggplot2 package. Furthermore, we join both data frames by the key variable id. Before we plot the data, we create a new variable (co) categorizing the migration balance variable (value).

# import shapefile
sh.file <- readOGR("ot.shp", layer="ot")
# add id variable
sh.file$id <- c(1:63)
# Create a data frame
sh.file <- fortify(sh.file, region = 'id')
# Merge data frames by id variable
mydata <- merge(sh.file, df.migrat, by='id')
# create a categorized variable
mydata$co <- ifelse(mydata$value >= 1000, 4,
                   ifelse(mydata$value >= 0, 3,
                                ifelse(mydata$value >= -1000, 2, 1)))
# define levels and labels
mydata$co  <- factor(mydata$co, levels = c(4:1),
                     labels = c('> 1000', '0-1000', '-1000-0', '< -1000'))
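
One caveat worth flagging (my addition, not from the original post): merge() does not preserve row order, while ggplot2 draws each polygon in row order. If district borders come out scrambled, re-sorting by the order column created by fortify, within each year, usually fixes it:

# defensive re-sort: restore the polygon point order within each year
mydata <- mydata[order(mydata$year, mydata$order), ]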

Moreover, we define a theme suitable for plotting maps with ggplot2. The following theme is a customized version of a theme I found on Stack Overflow.

theme_opts <- list(theme(panel.grid.minor = element_blank(),
                         panel.grid.major = element_blank(),
                         panel.background = element_blank(),
                         plot.background = element_blank(),
                         panel.border = element_blank(),
                         axis.line = element_blank(),
                         axis.text.x = element_blank(),
                         axis.text.y = element_blank(),
                         axis.ticks = element_blank(),
                         axis.title.x = element_blank(),
                         axis.title.y = element_blank(),
                         legend.position="right",
                         plot.title = element_text(size=16)))

Plotting the data

Finally, we can plot the data. The first plot (static) visualizes the migration data for the whole period of time (2000–2015).

ggplot(mydata, aes(x = long, y = lat, group = group, fill=value)) +
  theme_opts + 
  coord_equal() +
  geom_polygon(color = 'grey') +
  labs(title = "Migration balance between 2000 and 2015",
       subtitle = "Static Map",
       caption = 'Data source: Stadt Leipzig, Amt für Statistik und Wahlen, 2015') +
  scale_fill_distiller(name='Migration\nbalance', direction = 1, palette='RdYlGn')

[Plot: static map of migration balances, 2000-2015]

To create an animated plot, we add the frame aesthetic provided by the gganimate package. With interval = 5, we define the time (in seconds) between the “slides”.

p <- ggplot(mydata, aes(x = long, y = lat, group = group, fill=value, frame=year)) +
  theme_opts + 
  coord_equal() +
  geom_polygon(color = 'grey') +
  labs(title = "Migration balance in year: ",
       subtitle = "Animated Map",
       caption = 'Data source: Stadt Leipzig, Amt für Statistik und Wahlen, 2015') +
  scale_fill_distiller(name='Migration\nbalance', direction = 1, palette='RdYlGn')

gg_animate(p, interval = 5) 
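
To save the animation to disk, the dgrtwo version of gg_animate could also write directly to a file, inferring the output format from the file extension (ImageMagick required). Treat the call below as a sketch; the file name is my own:

gg_animate(p, "leipzig_migration.gif", interval = 5)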


Text mining PubMed using R – a case study

Intro

For a couple of years, I've been doing research in the field of psycho-oncology, which basically is “concerned both with the effects of cancer on a person's psychological health as well as the social and behavioral factors that may affect the disease process of cancer and/or the remission of it.” The most influential psycho-oncological journal is named “Psycho-Oncology”. The journal was founded in 1992 and is published monthly by Wiley-Blackwell.

Data

In this blog post, I'll play around with some bibliographic data from the PubMed database. With the following query, we retrieve bibliographic information for every paper published in Psycho-Oncology from 1997 up to today.

("Psycho-oncology"[Jour])

As the following screenshot shows, our search query returned 2785 entries. The search results can be downloaded and saved as a .csv file by clicking on the arrow right beside Send to and choosing a destination (File) and format (CSV).
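
As an aside, the same query can also be run from within R; the following is a sketch using the rentrez package and is not part of the original workflow:

# count the hits for the journal query via the NCBI API
library(rentrez)
res <- entrez_search(db = "pubmed", term = '"Psycho-oncology"[Jour]', retmax = 0)
res$count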

Fig. 1: Screenshot from PubMed

R packages

The following code will load and/or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(readr, stringr, ggplot2, qdap, qdapRegex)

Import

The data can be easily imported into R using the read_csv() function from the readr package.

# import .csv file
mydata <- read_csv("pubmed_result.csv")

Data manipulation

With the next code snippet, we do some data manipulation to make our data frame tidy. We start by removing the last row of the data frame, which contains data about a paper published in 1995, then remove superfluous headlines and entries without an author, and finally select the variables we need for the analysis. The str_detect() function stems from the stringr package.

# remove the only article from 1995 (the last row)
mydata <- mydata[-nrow(mydata), ]
# remove headlines
mydata <- subset(mydata, str_detect(URL, 'URL')==FALSE)
# remove entries without author
mydata <- subset(mydata, str_detect(Description, 'Author A.')==FALSE)
mydata <- subset(mydata, str_detect(Description, '\\[No authors listed\\]')==FALSE)
# Select variables
mydata <- mydata[, c('Title', 'Description', 'ShortDetails')]

In the next step, we extract the year of publication from the ShortDetails variable. Based on the year variable, we create a variable named time merging the years into four periods of time (five-year sub-periods between 1997 and 2016). Furthermore, we calculate the number of authors for each paper using the str_count function of the stringr package. Since each author except the last is followed by a comma, we just need to count the number of commas and add 1.

# Return Year of publication
mydata$Year <- as.numeric(str_sub(mydata$ShortDetails, 
                                  nchar(mydata$ShortDetails)-3, 
                                  nchar(mydata$ShortDetails)))
mydata$ShortDetails <- NULL
colnames(mydata) <- c('title', 'authors', 'year')
mydata$time <- ifelse(mydata$year <= 2001, 1,
                      ifelse(mydata$year >= 2002 & mydata$year <=2006, 2,
                             ifelse(mydata$year >= 2007 & mydata$year <=2011, 3, 4)))
mydata$year <- as.factor(mydata$year)
# Categorize years
mydata$time <- factor(mydata$time, levels = c(1:4),
                      labels = c('1997-2001', '2002-2006', '2007-2011', '2012-2016'))
# Calculate number of authors
mydata$auth.n <- str_count(mydata$authors, '\\,') + 1

Before we can plot these data, we need to create four more data frames:

  • df.pup containing the number of papers published each year between 1997 and 2016,
  • df.auth.1 containing the mean number of authors per paper and year,
  • df.auth.2 containing the mean number of authors per paper and time period,
  • df.auth.3 containing the maximum number of authors per paper and year.

df.pup <- data.frame(table(mydata$year))
df.auth.1 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$year), mean))
df.auth.2 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$time), mean))
df.auth.3 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$year), max))

Now, we can plot the data using the ggplot2 package.

ggplot(df.pup, aes(x=Var1, y=Freq, fill=0-Freq)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=Freq), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.pup$Freq)+10), breaks = seq(0, max(df.pup$Freq)+10, by=20)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 2: Number of papers published each year",
       caption = "Data source: PubMed")

[Plot: Fig. 2, number of papers published each year]

Except for a few years, there has been a continuous rise in the number of publications. Since January 2013, Psycho-Oncology has been available exclusively online. Thus, the number of annual papers published is not restricted by requirements of the printed version anymore. Consequently, there was an enormous rise in the number of publications in 2013. However, in the following year (2014) the number of publications declined quite remarkably. The reason for this decline may be a lack of submissions.

ggplot(df.auth.1, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, 10), breaks = seq(0,10, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 3: Mean number of authors per paper and year",
       caption = "Data source: PubMed")

[Plot: Fig. 3, mean number of authors per paper and year]

According to figures 3 and 4, the number of authorships per article has been rising over the whole time period. This is in accordance with the general trend towards a rise in the number of personal authors per citation.

ggplot(df.auth.2, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, 10), breaks = seq(0,10, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 4: Mean number of authors per paper and time period",
       caption = "Data source: PubMed")

[Plot: Fig. 4, mean number of authors per paper and time period]

Between 1997 and 2016, the mean number of authors per publication has increased from about four to six. The enormous rise in the maximum number of authors (see Fig. 5) supports the idea of an inflation in the number of authors per paper.

ggplot(df.auth.3, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.auth.3$x)+5), breaks = seq(0, max(df.auth.3$x)+5, by=5)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 5: Max number of authors per paper and year",
      caption = "Data source: PubMed")

[Plot: Fig. 5, max number of authors per paper and year]

Text mining

Before we analyze the data, some janitorial work needs to be done. We start with (1) defining a list of stopwords, (2, 3) removing non-word characters, (4, 5) filling the spaces in two word groups we want to keep together, and (6) removing the defined stopwords from the variable containing the citation titles.

sw <- c(Top200Words, 'among') # qdap
mydata$txt <- rm_non_words(mydata$title) # qdapRegex
mydata$txt <- str_replace_all(mydata$txt, "'", "") # stringr
keeps <- c('quality of life', 'health related')
mydata$txt <- space_fill(mydata$txt, keeps, sep = "~~") # qdap
mydata$txt <- rm_stop(mydata$txt, sw,
                      separate = FALSE) # qdap

In the next step, we generate a distribution table for all words in the citation titles. The resulting data frame (df.words) consists of three variables: interval (citation title terms), freq (term frequency), and percent (percentage of the term).

df.words <- dist_tab(breaker(mydata$txt)) # qdap
df.words <- df.words[,c(1, 2, 4)]
df.words <- subset(df.words, percent > 0.5)
df.words <- df.words[order(df.words$percent),] 
df.words$interval <- factor(df.words$interval, 
                            levels=df.words$interval,
                            labels=df.words$interval, 
                            ordered=TRUE) 
kable(head(df.words))
       interval     freq   percent
666    coping        141      0.51
2407   prostate      140      0.51
789    depression    143      0.52
1335   health        147      0.53
2678   review        150      0.54
2190   patient       154      0.56

With the next R code, we create a list l.txt_time containing four sub-lists, one per time period.

# group titles by time period
l.txt_time <- by(mydata$txt, mydata$time, breaker)
str(l.txt_time)
## List of 4
##  $ 1997-2001: chr [1:2160] "relationship" "between" "trait" "anxiety" ...
##  $ 2002-2006: chr [1:3608] "measurement" "coping" "stress" "responses" ...
##  $ 2007-2011: chr [1:7310] "novel" "stop" "multidisciplinary" "clinic" ...
##  $ 2012-2016: chr [1:14606] "relationship" "between" "objectively" "measured" ...
##  - attr(*, "dim")= int 4
##  - attr(*, "dimnames")=List of 1
##   ..$ mydata$time: chr [1:4] "1997-2001" "2002-2006" "2007-2011" "2012-2016"
##  - attr(*, "call")= language by.default(data = mydata$txt, INDICES = mydata$time, FUN = breaker)
##  - attr(*, "class")= chr "by"

The following code snippet converts these four lists into separate data frames, each containing the most frequent citation title terms for one of the periods of time.

# time span 1997-2001
df.t1 <- dist_tab(l.txt_time[[1]])
df.t1 <- df.t1[,c(1, 2, 4)]
df.t1 <- subset(df.t1, percent > 0.2)
df.t1 <- df.t1[order(df.t1$percent),] 
df.t1$interval <- factor(df.t1$interval, 
                         levels=df.t1$interval,
                         labels=df.t1$interval, 
                         ordered=TRUE) 
df.t1$year <- levels(mydata$time)[1]
# time span 2002-2006
df.t2 <- dist_tab(l.txt_time[[2]])
df.t2 <- df.t2[,c(1, 2, 4)]
df.t2 <- subset(df.t2, percent > 0.2)
df.t2 <- df.t2[order(df.t2$percent),] 
df.t2$interval <- factor(df.t2$interval, levels=df.t2$interval,
                         labels=df.t2$interval, ordered=TRUE) 
df.t2$year <- levels(mydata$time)[2]
# time span 2007-2011
df.t3 <- dist_tab(l.txt_time[[3]])
df.t3 <- df.t3[,c(1, 2, 4)]
df.t3 <- subset(df.t3, percent > 0.3)
df.t3 <- df.t3[order(df.t3$percent),] 
df.t3$interval <- factor(df.t3$interval, levels=df.t3$interval,
                         labels=df.t3$interval, ordered=TRUE) 
df.t3$year <- levels(mydata$time)[3]
# time span 2012-2016
df.t4 <- dist_tab(l.txt_time[[4]])
df.t4 <- df.t4[,c(1, 2, 4)]
df.t4 <- subset(df.t4, percent > 0.3)
df.t4 <- df.t4[order(df.t4$percent),] 
df.t4$interval <- factor(df.t4$interval, levels=df.t4$interval,
                         labels=df.t4$interval, ordered=TRUE) 
df.t4$year <- levels(mydata$time)[4]
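
The four blocks above differ only in their input, cutoff, and label, so a small helper function could remove the duplication. A sketch (function and argument names are my own):

make_freq_df <- function(words, cutoff, label) {
  d <- dist_tab(words)[, c(1, 2, 4)]  # keep interval, freq, percent
  d <- subset(d, percent > cutoff)
  d <- d[order(d$percent), ]
  d$interval <- factor(d$interval, levels = d$interval, ordered = TRUE)
  d$year <- label
  d
}
# e.g.: df.t1 <- make_freq_df(l.txt_time[[1]], 0.2, levels(mydata$time)[1])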

Eventually, we create a data frame (df.time) containing the same variables as the data frame df.words and a variable (year) defining the period of time the paper was published.

# merging
df.time <- rbind(df.t1, df.t2, df.t3, df.t4)
df.time <- merge(df.time, df.words, by='interval')
df.time <- df.time[,1:4]
colnames(df.time) <- c('term', 'freq', 'percent', 'year')
df.time$year <- as.factor(df.time$year)

Finally, we plot the most frequent citation title terms over four periods of time.

ggplot(df.time, aes(x=term, y=percent, fill=year)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(percent,1)), size = 4, 
            fontface=2, hjust =  -0.3, vjust = 0.4) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', 
                     limits=c(0, max(df.time$percent)+1), 
                     breaks = seq(0, max(df.time$percent)+1, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  coord_flip() +
  facet_grid(. ~ year) +
  labs(title = "Fig. 6: Most frequent terms in time",
       caption = "Data source: PubMed")

[Plot: Fig. 6, most frequent terms in time]

Above all, the terms survivors and distress were increasingly used over time, which reflects a growing interest in so-called psychological distress and in cancer survivorship.


How to import a bunch of Excel files with multiple sheets

The Problem

I have recently been conducting an evaluation study in the field of social work with elderly people. To provide advice to elderly people regarding age-related questions about housing, home care, etc., 10 offices for senior citizens were founded. Each of the offices is requested to document its activities (number of persons, events, etc.) on a monthly basis. The data need to be entered into a prestructured Excel table. Since the offices started working about 2.5 years ago, I needed to handle 300 Excel sheets (30 months * 10 offices).

In a first step, I decided to create one Excel file for every office, each containing 30 sheets (one sheet per month). While the files are named after the offices (abbreviated to 4 characters), the sheets are named after the following pattern: YYYY.MM (year followed by month).

The Solution

Since I did not find a solution for my problem with the packages I usually use to import Excel files into R (xlsx, readxl), I searched the internet for help. Fortunately, I found the paper “How to import and merge many Excel files; each with multiple sheets of data for statistical analysis.” by Jon Starkweather. The paper is really worth reading and gives a very comprehensive description of the subject matter. The following code snippets stem from Starkweather's paper.

In a first step, we have to load the following packages:

library(rJava)
library(XLConnect, pos = 4)

In a second step, we define the file type we want to import (.xls), save the sheet names of the Excel files into a new vector called sheet.names (since the sheet names are identical in all files, we may extract them from any of the 10 files), and create another vector (e.names) containing the names of the variables we want to import (in this case 28).

file.names <- list.files(pattern='*.xls')
sheet.names <- getSheets(loadWorkbook('Name.xls'))
e.names <- paste0(rep('v', 28), c(1:28))

In a third step, we create a data frame with 28 variables, named
v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27, v28
and one row containing NAs only.

data.1 <- data.frame(matrix(rep(NA,length(e.names)),
                            ncol = length(e.names)))
names(data.1) <- e.names

Finally, we use 2 for-loops to import all the files and sheets and bind them to a data frame we can use for analysis.

for (i in 1:length(file.names)) {
    # load the workbook for office i
    wb <- loadWorkbook(file.names[i])
    for (j in 1:length(sheet.names)) {
        # read sheet j (one month), skipping the first column
        ss <- readWorksheet(wb, sheet.names[j], startCol = 2, header = TRUE)
        condition <- rep(sheet.names[j], nrow(ss))  # month (sheet name)
        sub.id <- rep(file.names[i], nrow(ss))      # office (file name)
        s.frame <- seq(1:nrow(ss))                  # row counter within sheet
        df.1 <- data.frame(sub.id, condition, s.frame, ss)
        names(df.1) <- e.names
        data.1 <- rbind(data.1, df.1)               # append to master data frame
        rm(ss, condition, s.frame, sub.id, df.1)
    }
    rm(wb)
}
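
One detail worth noting (my addition, not from Starkweather's paper): the row of NAs used to initialize data.1 survives both loops, so it should be dropped before analysis.

# drop the all-NA seed row created when data.1 was initialized
data.1 <- data.1[-1, ]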

In the mentioned paper, Jon Starkweather elaborates in detail on what each line of each for-loop is doing.


How to add a background image to ggplot2 graphs

When producing so-called infographics, it is rather common to use an image rather than a mere grid as the background. In this blog post, I will show how to use a background image with ggplot2.

Packages required

The following code will load and/or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(jpeg, png, ggplot2, grid, neuropsychology)

Choosing the data

The data set I will be using in this blog post is named diamonds and part of the ggplot2 package. It contains information about – surprise, surprise – diamonds, e.g. price and cut (Fair, Good, Very Good, Premium, Ideal). Using the tapply-function, we create a table returning the maximum prices per cut. Since we need the data to be organized in a data frame, we must transform the table using the data.frame-function.

mydata <- data.frame(price = tapply(diamonds$price, diamonds$cut, max))
mydata$cut <- rownames(mydata)
cut         price
Fair        18574
Good        18788
Very Good   18818
Premium     18823
Ideal       18806

Importing the background image

The file format of the background image we will be using in this blog post is JPG. Since the image imitates a blackboard, we name it “blackboard.jpg”. The image file is imported using the readJPEG-function of the jpeg package and saved into an object named image.

image <- jpeg::readJPEG("blackboard.jpg")

To import other image file formats, different packages and functions must be used. The next code snippet shows how to import PNG images.

image <- png::readPNG("blackboard.png")

Drawing the plot

In the next step, we actually draw a bar chart with a background image. To make blackboard.jpg the background image, we need to combine the annotation_custom-function of the ggplot2 package and the rasterGrob-function of the grid package.

ggplot(mydata, aes(cut, price, fill = -price)) +
  ggtitle("Bar chart with background image") +
  scale_fill_continuous(guide = FALSE) +
  annotation_custom(rasterGrob(image, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                               -Inf, Inf, -Inf, Inf) +
  geom_bar(stat="identity", position = "dodge", width = .75, colour = 'white') +
  scale_y_continuous('Price in $', limits = c(0, max(mydata$price) + max(mydata$price) / 4)) +
  scale_x_discrete('Cut') +
  geom_text(aes(label = round(price), ymax = 0), size = 7, fontface = 2, 
            colour = 'white', hjust = 0.5, vjust = -1) 

[Plot: bar chart with background image]

Adding opacity

Using the specification alpha = 0.5, we add 50% opacity to the bars. alpha ranges between 0 and 1, with higher values indicating greater opacity.

ggplot(mydata, aes(cut, price, fill = -price)) +
  theme_neuropsychology() +
  ggtitle("Bar chart with background image") +
  scale_fill_continuous(guide = FALSE) +
  annotation_custom(rasterGrob(image, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                               -Inf, Inf, -Inf, Inf) +
  geom_bar(stat="identity", position = "dodge", width = .75, colour = 'white', alpha = 0.5) +
  scale_y_continuous('Price in $', limits = c(0, max(mydata$price) + max(mydata$price) / 4)) +
  scale_x_discrete('Cut') +
  geom_text(aes(label = round(price), ymax = 0), size = 7, fontface = 2, 
            colour = 'white', hjust = 0.5, vjust = -1) 

[Plot: bar chart with background image, 50% opacity]

The recently published R package neuropsychology contains a theme named theme_neuropsychology(). This theme may be used to get bigger axis titles as well as bigger axis and legend text.


How to number and reference tables and figures in R Markdown files

Introduction

R Markdown is a great tool to make research results reproducible. However, in scientific research papers or reports, tables and figures usually need to be numbered and referenced. Unfortunately, R Markdown has no “native” method to number and reference table and figure captions. The recently published bookdown package makes it very easy to number and reference tables and figures (Link). However, since bookdown uses LaTeX functionality, R Markdown files created with bookdown cannot be converted into MS Word (.docx) files.

In this blog post, I will explain how to number and reference tables and figures in R Markdown files using the captioner package.

Packages required

The following code will load and/or install the R packages required for this blog post. The dataset I will be using is named bundesligR and is part of the bundesligR package. It contains “all final tables of Germany's highest football league, the Bundesliga” (Link).

if (!require("pacman")) install.packages("pacman")
pacman::p_load(knitr, captioner, bundesligR, stringr)

In the first code snippet, we create a table using the kable function of the knitr package. With caption we can specify a simple table caption. As we can see, the caption will not be numbered and, thus, cannot be referenced in the document.

knitr::kable(bundesligR::bundesligR[c(1:6), c(2,3,11,10)],
             align = c('c', 'l', 'c', 'c'),
             caption = "German Bundesliga: Final Table 2015/16, Position 1-6")
Position   Team                        Points   GD
   1       FC Bayern Muenchen            88     63
   2       Borussia Dortmund             78     48
   3       Bayer 04 Leverkusen           60     16
   4       Borussia Moenchengladbach     55     17
   5       FC Schalke 04                 52      2
   6       1. FSV Mainz 05               50      4

Table numbering

Thanks to Alathea Letaw's captioner package, we can number tables and figures.
In a first step, we define a function named table_nums and apply it to each table's name and caption. Furthermore, we may also define a prefix (Tab. for tables and Fig. for figures).

table_nums <- captioner::captioner(prefix = "Tab.")

tab.1_cap <- table_nums(name = "tab_1", 
                        caption = "German Bundesliga: Final Table 2015/16, Position 7-12")
tab.2_cap <- table_nums(name = "tab_2", 
                        caption = "German Bundesliga: Final Table 2015/16, Position 12-18")

The next code snippet combines both inline code and a code chunk. With fig.cap = tab.1_cap, we specify the caption of the first table. It is important to separate the inline code and the code chunk; otherwise, the numbering won't work.
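
The source of that snippet is not rendered in this post, so the following is only a sketch of what it might have looked like; the chunk contents are an assumption based on the first kable call:

`r tab.1_cap`

```{r echo=FALSE}
knitr::kable(bundesligR::bundesligR[c(7:12), c(2,3,11,10)],
             align = c('c', 'l', 'c', 'c'),
             row.names = FALSE)
```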

Tab. 1: German Bundesliga: Final Table 2015/16, Position 7-12

Position   Team               Points   GD
   7       Hertha BSC           50      0
   8       VfL Wolfsburg        45     -2
   9       1. FC Koeln          43     -4
  10       Hamburger SV         41     -6
  11       FC Ingolstadt 04     40     -9
  12       FC Augsburg          38    -10

Table referencing

Since we have obtained a numbered table, it should also be possible to reference it. However, we cannot just use the inline code table_nums('tab_1'). Otherwise, we'll get the following output:

[1] “Tab. 1: German Bundesliga: Final Table 2015/16, Position 7-12”

In order to return the desired output (prefix Tab. and table number), I have written the function f.ref. Using a regular expression, the function returns all characters of the table_nums('tab_1') output located before the first colon.

f.ref <- function(x) {
  stringr::str_extract(table_nums(x), "[^:]*")
}

When we apply this function to tab_1, the inline code returns the following result:

As we can see in f.ref("tab_1"), the Berlin based football club Hertha BSC had position seven in the final table.

As we can see in Tab. 1, the Berlin based football club Hertha BSC had position seven in the final table.

Just to make the table complete, Tab. 2 shows positions 13 to 18 of the final Bundesliga table.

Tab. 2: German Bundesliga: Final Table 2015/16, Position 12-18

knitr::kable(bundesligR::bundesligR[c(13:18), c(2,3,11,10)],
             align = c('c', 'l', 'c', 'c'),
             row.names = FALSE)
Position   Team                   Points   GD
  13       Werder Bremen            38    -15
  14       SV Darmstadt 98          38    -15
  15       TSG 1899 Hoffenheim      37    -15
  16       Eintracht Frankfurt      36    -18
  17       VfB Stuttgart            33    -25
  18       Hannover 96              25    -31

And what about figures?

Figures can be numbered and referenced following the same principle.
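
A minimal sketch (names and caption are my own): we create a second captioner object with the prefix Fig. and an analogous reference helper.

figure_nums <- captioner::captioner(prefix = "Fig.")
fig.1_cap <- figure_nums(name = "fig_1",
                         caption = "Points per team, Bundesliga 2015/16")
f.ref.fig <- function(x) {
  stringr::str_extract(figure_nums(x), "[^:]*")
}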


How to combine box and jitter plots using R and ggplot2

R makes it easy to combine different kinds of plots into one overall graph. This may be useful to visualize both basic measures of central tendency (median, quartiles etc.) and the distribution of a certain variable. Moreover, so called cut-off values can be added to the graph.

In this blog post, I show how to combine box and jitter plots using the ggplot2 package.

First of all, we need to install and load the R packages required for the following steps. Since we want to do the installation and loading using the pacman package, we need to check whether this package has been installed already. If not, it will be installed and loaded; if yes, it will just be loaded (line 1). Furthermore, we need the R packages ggplot2 and Hmisc. This time, the p_load function checks whether these packages have been installed already and either installs and loads or just loads them (line 2).

if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, Hmisc)

In a second step, we create three random variables (var.scale, var.group, var.cutoff) with n=300.

  • var.scale is a numeric variable with a mean value of about 50 and a standard deviation of about 17.
  • var.group is a factor variable comprising the groups male and female.
  • var.cutoff was calculated from var.scale using predefined cut-off values (0-40 = low, 41-60 = medium, >60 = high).

var.scale <- round(rnorm(300, 50, 17))
var.group <- rbinom(300, 1, .5)
var.group <- factor(var.group, 
                     levels = c(0:1), 
                     labels = c("male", "female"))

var.cutoff <- ifelse(var.scale <= 40, 1, 
                     ifelse(var.scale > 40 & var.scale <= 60, 2, 3))

var.cutoff <- factor(var.cutoff, 
                     levels = c(3:1), 
                     labels = c("high", "medium", "low"))

The describe() function of the Hmisc package returns some basic measures of central tendency.

Hmisc::describe(var.scale)
## var.scale 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     300       0      71       1   51.25   24.00   30.90   41.00   50.00 
##     .75     .90     .95 
##   63.25   70.00   76.00 
## 
## lowest :   8  10  14  16  17, highest:  85  97 100 102 104
Hmisc::describe(var.group)
## var.group 
##       n missing  unique 
##     300       0       2 
## 
## male (141, 47%), female (159, 53%)
Hmisc::describe(var.cutoff)
## var.cutoff 
##       n missing  unique 
##     300       0       3 
## 
## high (87, 29%), medium (141, 47%), low (72, 24%)

Since the ggplot2 package requires the variables to be in a data frame, we have to create a new data frame df comprising our predefined variables using the data.frame() function.

df <- data.frame(var.scale, var.cutoff, var.group)

Using the functions xlab(), ylab() and ggtitle(), axis labels and plot title will be defined.

Box plots will be created using the geom_boxplot() function, with width specifying the boxes' width :-).

Jitter plots will be created using the geom_jitter() function. In addition, specifications have been made for the colour, position, and size of the dots.

ggplot(df) +
  xlab("Group") +
  ylab("Scale") +
  ggtitle("Combination of Box and Jitter Plot") + 
  geom_boxplot(aes(var.group, var.scale), 
               width=0.5) + 
  geom_jitter(aes(var.group, var.scale, colour = var.cutoff), 
              position = position_jitter(width = .15, height=-0.7),
              size=2) +
  scale_y_continuous(limits=c(0, 101), 
                     breaks = seq(0, 110, 10)) +
  scale_color_manual(name="Legend", 
                     values=c("red", "blue3", "green3")) 

[Plot: combined box and jitter plot]

Finally, both the Y-axis and the legend are formatted using the functions scale_y_continuous() and scale_color_manual().


How to use R for matching samples (propensity score)

According to Wikipedia, propensity score matching (PSM) is a “statistical matching technique that attempts to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment”. In a broader sense, propensity score analysis assumes that an unbiased comparison between samples can only be made when the subjects of both samples have similar characteristics. Thus, PSM can not only be used as “an alternative method to estimate the effect of receiving treatment when random assignment of treatments to subjects is not feasible” (Thavaneswaran 2008). It can also be used for the comparison of samples in epidemiological studies. Let's give an example:

Health-related quality of life (HRQOL) is considered an important outcome in cancer therapy. One of the most frequently used instruments to measure HRQOL in cancer patients is the core quality-of-life questionnaire of the European Organisation for Research and Treatment of Cancer. The EORTC QLQ-C30 is a 30-item instrument comprising five functioning scales, nine symptom scales, and one scale measuring global quality of life (GQoL). All scales have a score range between 0 and 100. While high scores on the symptom scales indicate a high burden of symptoms, high scores on the functioning scales and on the GQoL scale indicate better functioning and quality of life, respectively.

However, without having any reference point, it is difficult if not impossible to interpret the scores. Fortunately, the EORTC QLQ-C30 questionnaire was used in several general population surveys. Therefore, patient scores may be compared against scores of the general population. This makes it far easier to decide whether the burden of symptoms or functional impairments can be attributed to cancer (treatment) or not. PSM can be used to make both patient and population samples comparable by matching for relevant demographic characteristics like age and sex.

In this blog post, I show how to do PSM using R. A more comprehensive PSM guide can be found under: A Step-by-Step Guide to Propensity Score Matching in R.

Creating two random dataframes

Since we don't want to use real-world data in this blog post, we need to simulate the data. This can easily be done using the wakefield package.

In a first step, we create a dataframe named df.patients. We want the dataframe to contain specifications of age and sex for 250 patients. The patients' age shall be between 30 and 78 years. Furthermore, 70% of patients shall be male.

set.seed(1234)
df.patients <- r_data_frame(n = 250, 
                            age(x = 30:78, 
                                name = 'Age'), 
                            sex(x = c("Male", "Female"), 
                                prob = c(0.70, 0.30), 
                                name = "Sex"))
df.patients$Sample <- as.factor('Patients')

The summary-function returns some basic information about the dataframe created. As we can see, the mean age of the patient sample is 53.7 and roughly 70% of the patients are male (69.2%).

summary(df.patients)
##       Age            Sex           Sample   
##  Min.   :30.00   Male  :173   Patients:250  
##  1st Qu.:42.00   Female: 77                 
##  Median :54.00                              
##  Mean   :53.71                              
##  3rd Qu.:66.00                              
##  Max.   :78.00

In a second step, we create another dataframe named df.population. We want this dataframe to comprise the same variables as df.patients, with different specifications. At 18 to 80 years, the age range of the population shall be wider than in the patient sample, and the proportions of female and male subjects shall be equal.

set.seed(1234)
df.population <- r_data_frame(n = 1000, 
                              age(x = 18:80, 
                                  name = 'Age'), 
                              sex(x = c("Male", "Female"), 
                                  prob = c(0.50, 0.50), 
                                  name = "Sex"))
df.population$Sample <- as.factor('Population')

The following table shows the sample's mean age (49.5 years) and the proportion of men (48.5%) and women (51.5%).

summary(df.population)
##       Age            Sex             Sample    
##  Min.   :18.00   Male  :485   Population:1000  
##  1st Qu.:34.00   Female:515                    
##  Median :50.00                                 
##  Mean   :49.46                                 
##  3rd Qu.:65.00                                 
##  Max.   :80.00

Merging the dataframes

Before we match the samples, we need to merge both dataframes. Based on the variable Sample, we create a new variable named Group (type logical) and a further variable (Distress) containing information about the individuals' level of distress. The Distress variable is created using the age-function of the wakefield package. As we can see, women will have higher levels of distress.

mydata <- rbind(df.patients, df.population)
mydata$Group <- as.logical(mydata$Sample == 'Patients')
mydata$Distress <- ifelse(mydata$Sex == 'Male', age(nrow(mydata), x = 0:42, name = 'Distress'),
                                                age(nrow(mydata), x = 15:42, name = 'Distress'))

When we compare the distribution of age and sex in both samples, we discover significant differences:

pacman::p_load(tableone)
table1 <- CreateTableOne(vars = c('Age', 'Sex', 'Distress'), 
                         data = mydata, 
                         factorVars = 'Sex', 
                         strata = 'Sample')
table1 <- print(table1, 
                printToggle = FALSE, 
                noSpaces = TRUE)
kable(table1[,1:3],  
      align = 'c', 
      caption = 'Table 1: Comparison of unmatched samples')
                       Patients        Population      p
n                      250             1000
Age (mean (sd))        53.71 (13.88)   49.46 (18.33)   0.001
Sex = Female (%)       77 (30.8)       515 (51.5)      <0.001
Distress (mean (sd))   22.86 (11.38)   25.13 (11.11)   0.004

Furthermore, the level of distress seems to be significantly higher in the population sample.

Matching the samples

Now that we have completed the preparation and inspection of the data, we are going to match the two samples using the matchit-function of the MatchIt package. The argument method = "nearest" specifies that the nearest-neighbor method will be used. Other matching methods are exact matching, subclassification, optimal matching, genetic matching, and full matching (method = c("exact", "subclass", "optimal", "genetic", "full")). The argument ratio = 1 indicates a one-to-one matching approach: for each case in the patient sample, exactly one case in the population sample will be matched. Please also note that the Group variable needs to be logical (TRUE vs. FALSE).

set.seed(1234)
match.it <- matchit(Group ~ Age + Sex, data = mydata, method="nearest", ratio=1)
a <- summary(match.it)

For further data presentation, we save the output of the summary-function into a variable named a.

After matching the samples, the size of the population sample was reduced to the size of the patient sample (n=250; see table 2).

kable(a$nn, digits = 2, align = 'c', 
      caption = 'Table 2: Sample sizes')
            Control   Treated
All          1000       250
Matched       250       250
Unmatched     750         0
Discarded       0         0

The following output shows that the distributions of the variables Age and Sex are nearly identical after matching.

kable(a$sum.matched[c(1,2,4)], digits = 2, align = 'c', 
      caption = 'Table 3: Summary of balance for matched data')
            Means Treated   Means Control   Mean Diff
distance         0.23            0.23          0.00
Age             53.71           53.65          0.06
SexMale          0.69            0.69          0.00
SexFemale        0.31            0.31          0.00

The distributions of propensity scores can be visualized using the plot-function, which is part of the MatchIt package.

plot(match.it, type = 'jitter', interactive = FALSE)

[Plot: jitter plot of propensity scores for matched and unmatched units]

Saving the matched samples

Finally, the matched samples will be saved into a new dataframe named df.match.

df.match <- match.data(match.it)[1:ncol(mydata)]
rm(df.patients, df.population)

Eventually, we can check whether the differences in the level of distress between both samples are still significant.

pacman::p_load(tableone)
table4 <- CreateTableOne(vars = c('Age', 'Sex', 'Distress'), 
                         data = df.match, 
                         factorVars = 'Sex', 
                         strata = 'Sample')
table4 <- print(table4, 
                printToggle = FALSE, 
                noSpaces = TRUE)
kable(table4[,1:3],  
      align = 'c', 
      caption = 'Table 4: Comparison of matched samples')
                       Patients        Population      p
n                      250             250
Age (mean (sd))        53.71 (13.88)   53.65 (13.86)   0.961
Sex = Female (%)       77 (30.8)       77 (30.8)       1.000
Distress (mean (sd))   22.86 (11.38)   24.13 (11.88)   0.222

With a p-value of 0.222, Student's t-test does not indicate significant differences anymore. Thus, PSM helped to avoid a Type I error (a false positive).
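
For the record, the comparison can also be run directly; the following sketch should reproduce the p-value up to the equal-variance assumption used by tableone:

# two-sample t-test comparing distress between the matched samples
t.test(Distress ~ Sample, data = df.match)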


PS 1: The packages used in this blog post can be loaded/installed using the following code:

pacman::p_load(knitr, wakefield, MatchIt, tableone, captioner)

PS 2: Thanks very much to my colleague Katharina Kuba for telling me about the MatchIt package.
