My first R package: streetsofle

Intro

A couple of days ago I published my first R package on GitHub. I’ve named the package streetsofle (standing for streets of Leipzig) because it includes a data set containing the street directory of the German city of Leipzig. The street directory was published by the Statistical Bureau of Leipzig and may be downloaded as a PDF file from the following website.

Leipzig is divided into 10 greater administrative districts (“Stadtbezirke”) and 63 smaller local districts (“Ortsteile”). The names of both smaller and greater districts can be found under the following link.

Figure 1: Map of Leipzig


Furthermore, the Leipzig area is covered by 34 postal codes (“Postleitzahlen”). That is:

##  [1] "04103" "04105" "04107" "04109" "04129" "04155" "04157" "04158"
##  [9] "04159" "04177" "04178" "04179" "04205" "04207" "04209" "04229"
## [17] "04249" "04275" "04277" "04279" "04288" "04289" "04299" "04315"
## [25] "04316" "04317" "04318" "04319" "04328" "04329" "04347" "04349"
## [33] "04356" "04357"
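The list above is simply the set of distinct postal codes in the data. A minimal sketch of how such a list can be reproduced, using a toy stand-in for streetsofle$plz:

```r
# toy stand-in for streetsofle$plz (the real vector holds 3391 entries)
plz <- c("04105", "04103", NA, "04103", "04107")
# distinct postal codes, sorted, with missing values dropped
sort(unique(na.omit(plz)))
# → "04103" "04105" "04107"
```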

Installation

if (!require("devtools")) install.packages("devtools")
devtools::install_github("nrkoehler/streetsofle")

The Dataset

The data frame streetsofle contains 3391 observations and 8 variables. That is:

  • plz: postal code,
  • street_key: street identification number,
  • street_name: street name,
  • street_num: a list of street numbers,
  • bz_key: identification number for greater districts (‘Stadtbezirke’),
  • bz_name: names of greater districts (‘Stadtbezirke’),
  • ot_key: identification number for smaller districts (‘Ortsteile’),
  • ot_name: names of smaller districts (‘Ortsteile’).

Since street sections without addresses are usually not covered by postal codes, the variable plz contains some missing values (n=169).
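Missing postal codes can be counted with is.na(); a quick sketch on a toy vector (on the real data, sum(is.na(streetsofle$plz)) yields the 169 mentioned above):

```r
plz <- c("04103", NA, "04105", NA)  # toy stand-in for streetsofle$plz
sum(is.na(plz))  # → 2
```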

Next steps

Writing functions

Within the next couple of weeks I will write some functions to analyse the street directory data. The next code snippet, for example, shows how to calculate the number of smaller districts (‘Ortsteile’) traversed by each street.

f <- function(x){length(unique(x))}
df <- data.frame(ot_num = tapply(streetsofle$ot_name, streetsofle$street_name, f))
psych::headTail(df)
##                      ot_num
## Aachener Straße           1
## Abrahamstraße             1
## Abtnaundorfer Straße      1
## Achatstraße               1
## ...                     ...
## Zwergmispelstraße         1
## Zwetschgenweg             1
## Zwickauer Straße          3
## Zwiebelweg                1

The following table shows the number of smaller districts traversed by the streets of Leipzig.

no. of districts   no. of streets
               1             2721
               2              207
               3               49
               4               11
               5                4
               6                2
               7                2
              10                1

While 91% of the streets don’t traverse any district border, one street traverses the borders of 10 districts.
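The frequency table above can be derived from the df computed in the previous snippet; a sketch with a toy version of df:

```r
# toy stand-in: one row per street, ot_num = number of districts traversed
df <- data.frame(ot_num = c(1, 1, 2, 1, 3, 1))
table(df$ot_num)                 # streets per number of districts
prop.table(table(df$ot_num))     # shares (cf. the 91% figure above)
```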

Shiny Web-App

Based on these functions I’m planning to write a Shiny Web-App providing several search functions.

Posted in Fav R Packages

How to scrape, import and visualize Telegram chats

Intro

Telegram is a cloud-based and cross-platform instant messaging service. Unlike WhatsApp, Telegram clients exist not only for mobile devices but also for desktop operating systems. In addition, there is also a web based client.

In this blog post, I show how to import Telegram chats into R and how to plot a chat using the qdap package.

R packages

The following code will load and/or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(readr, qdap, lubridate)

Scraping the data

A very straightforward way to save Telegram chats is to use the Chrome extension Save Telegram Chat History. On Quora, Stacy Wu explains how to use it:

  • Visit https://web.telegram.org.
  • Select a peer you want to get the chat history from.
  • Load the messages.
  • Select all text (Ctrl+A) and copy it (Ctrl+C).
  • Paste the text into a text editor (Ctrl+V) and save to a local file.

I have saved the chat to a local .csv file. Since the first two lines contain non-tabular information, they need to be removed manually. Undesired line breaks must be removed manually as well.
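Instead of editing the file by hand, the two non-tabular lines can also be skipped at import time with readr's skip argument; a sketch on an inline string (the header lines and column layout are assumptions):

```r
library(readr)
raw <- "saved chat history\npeer: Other\n20.01.2017 21:10:14,Me: Hello\n"
# skip = 2 drops the two non-tabular header lines before parsing
mydata <- read_csv(I(raw), col_names = FALSE, skip = 2)
nrow(mydata)  # → 1
```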

Importing the data

The following line of code shows how to import the csv file into R using the read_csv() function of the readr package.

mydata <- readr::read_csv("telegram.csv", col_names=FALSE)

Data wrangling

After importing the file, our data frame consists of two string variables: X1 containing information about the day and time of the conversations, and X2 containing the names of the persons involved in the chat as well as the chat text. With the following lines of code we create four new variables:

  • day containing the dates of the chats,
  • time containing the times of day of the chats,
  • person containing the names of the persons involved in the chat,
  • txt containing the actual chat text.
mydata$day <- stringr::str_sub(mydata$X1, 1, 10)
mydata$day <- lubridate::dmy(mydata$day)
mydata$time <- stringr::str_sub(mydata$X1, 12, 19)
mydata$time <- lubridate::hms(mydata$time)
mydata$person <- stringr::str_extract(mydata$X2, "[^:]*")
mydata$person <- factor(mydata$person, levels = unique(mydata$person), labels = c('Me', 'Other'))
mydata$txt <- gsub(".*:\\s*","", mydata$X2)
mydata <- mydata[, c(3:6)]
head(mydata, 2)
## # A tibble: 2 × 4
##          day         time person   txt
##       <date> <S4: Period> <fctr> <chr>
## 1 2017-01-20  21H 10M 14S     Me Hello
## 2 2017-01-20  21H 11M 42S  Other  Huhu

Gradient word cloud

Since the chat involves only two persons, I decided to plot it as a gradient word cloud, a visualization technique developed by Tyler Rinker. The gradient_cloud() function I use in the next code snippet is part of his qdap package. Gradient word clouds “color words with a gradient based on degree of usage between two individuals”.

gradient_cloud(mydata$txt, mydata$person, title = "Gradient word cloud of Telegram chat")


The chat I have plotted is very short and, thus, not very telling. I wonder how it will look in a couple of months.

Posted in Text Mining, Visualizing Data

How to plot animated maps with gganimate

Intro

Leipzig is the largest city in the federal state of Saxony, Germany. Since about 2009, Leipzig has been one of the fastest growing German cities. Between 2006 and 2016, the population grew from about 506,000 to 580,000 inhabitants. The largest part of the population growth can be attributed to inward migration.

In this blog post, I show how to create an animated map of Leipzig visualizing the migration balances for each of Leipzig's 63 districts (“Ortsteile”).

The data

The data I will visualize in this blog post are provided by the local Statistical Bureau. They can be downloaded as a .csv file from the following website. The shapefile I use is not publicly available and can be purchased for about 25 Euro from the Statistical Bureau.

R packages

The following code will load and/or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(readr, rgdal, ggplot2, reshape2)
pacman::p_load_current_gh("dgrtwo/gganimate")

While the packages readr and rgdal are required for importing the .csv and the shapefile, ggplot2 and gganimate are needed to plot and animate the data. The reshape2 package is required for transforming data frames from wide to long format.

After importing the .csv file containing the migration data, we add a key variable (id) to identify each of Leipzig's 63 districts. Using the melt-function of the reshape2 package, we reshape the data frame from wide into long form.

# import csv file
df.migrat <- read_csv2("./Data_LE/Bevölkerungsbewegung_Wanderungen_Wanderungssaldo.csv",
                      col_types = cols(`2015` = col_number()),
                      n_max = 63,
                      locale = locale(encoding = "latin1"),
                      trim_ws = TRUE)
# add id variable
df.migrat$id <- c(1:63)
# drop variables not needed
df.migrat <- df.migrat[,c(1,18,2:17)]
# change name of first variable
colnames(df.migrat)[1] <- 'district'
# convert from wide to long
df.migrat <- reshape2::melt(df.migrat, id.vars=c("district","id"))
# change name of third variable
colnames(df.migrat)[3] <- 'year'

With the next code snippet, we import the shapefile using the readOGR-function of the rgdal package. The object we receive (sh.file) is of class SpatialPolygonsDataFrame. After adding a key variable, we transform this object into an ordinary data frame using the fortify-function of the ggplot2 package. Furthermore, we join both data frames by the key variable id. Before we plot the data, we create a new variable (co) categorizing the migration balance variable (value).

# import shapefile
sh.file <- readOGR("ot.shp", layer="ot")
# add id variable
sh.file$id <- c(1:63)
# Create a data frame
sh.file <- fortify(sh.file, region = 'id')
# Merge data frames by id variable
mydata <- merge(sh.file, df.migrat, by='id')
# create a categorized variable
mydata$co <- ifelse(mydata$value >= 1000, 4,
                   ifelse(mydata$value >= 0, 3,
                                ifelse(mydata$value >= -1000, 2, 1)))
# define levels and labels
mydata$co  <- factor(mydata$co, levels = c(4:1),
                     labels = c('> 1000', '0-1000', '-1000-0', '< -1000'))
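The nested ifelse() calls above can be written more compactly with cut(); a sketch on a toy vector (intervals closed on the left, matching the >= comparisons above):

```r
value <- c(-1500, -200, 300, 2000)   # toy stand-in for mydata$value
co <- cut(value,
          breaks = c(-Inf, -1000, 0, 1000, Inf),
          labels = c('< -1000', '-1000-0', '0-1000', '> 1000'),
          right = FALSE)            # left-closed intervals: [a, b)
# reverse the level order to match the factor defined above
co <- factor(co, levels = rev(levels(co)))
as.character(co)  # → "< -1000" "-1000-0" "0-1000" "> 1000"
```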

Moreover, we define a theme suitable for plotting maps with ggplot2. The following theme is a customized version of a theme I found on Stack Overflow.

theme_opts <- list(theme(panel.grid.minor = element_blank(),
                         panel.grid.major = element_blank(),
                         panel.background = element_blank(),
                         plot.background = element_blank(),
                         panel.border = element_blank(),
                         axis.line = element_blank(),
                         axis.text.x = element_blank(),
                         axis.text.y = element_blank(),
                         axis.ticks = element_blank(),
                         axis.title.x = element_blank(),
                         axis.title.y = element_blank(),
                         legend.position="right",
                         plot.title = element_text(size=16)))

Plotting the data

Finally, we can plot the data. The first plot (static) visualizes the migration data for the whole period of time (2000–2015).

ggplot(mydata, aes(x = long, y = lat, group = group, fill=value)) +
  theme_opts + 
  coord_equal() +
  geom_polygon(color = 'grey') +
  labs(title = "Migration balance between 2000 and 2015",
       subtitle = "Static Map",
       caption = 'Data source: Stadt Leipzig, Amt für Statistik und Wahlen, 2015') +
  scale_fill_distiller(name='Migration\nbalance', direction = 1, palette='RdYlGn')


For the animated plot, we add the frame argument provided by the gganimate package. With interval = 5 we define the time (in seconds) between the “slides”.

p <- ggplot(mydata, aes(x = long, y = lat, group = group, fill=value, frame=year)) +
  theme_opts + 
  coord_equal() +
  geom_polygon(color = 'grey') +
  labs(title = "Migration balance in year: ",
       subtitle = "Animated Map",
       caption = 'Data source: Stadt Leipzig, Amt für Statistik und Wahlen, 2015') +
  scale_fill_distiller(name='Migration\nbalance', direction = 1, palette='RdYlGn')

gg_animate(p, interval = 5) 

Posted in Visualizing Data

Text mining PubMed using R – a case study

Intro

For a couple of years, I've been doing research in the field of psycho-oncology, which basically is “concerned both with the effects of cancer on a person's psychological health as well as the social and behavioral factors that may affect the disease process of cancer and/or the remission of it.” The most influential journal in the field is named “Psycho-Oncology”. It was founded in 1992 and is published monthly by Wiley-Blackwell.

Data

In this blog post, I'll play around with some bibliographic data from the PubMed database. With the following query, we retrieve bibliographic information for every paper published in Psycho-Oncology from 1997 up to today.

("Psycho-oncology"[Jour])

As the following screenshot shows, our search query returned 2785 entries. The search results can be downloaded and saved to a .csv file by clicking on the arrow right beside Send to, then choosing a destination (File) and format (CSV).

Fig. 1: Screenshot from PubMed

R packages

The following code will load and/or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(readr, stringr, ggplot2, qdap, qdapRegex, knitr)  # knitr is needed for kable() below

Import

The data can be easily imported into R using the read_csv() function from the readr package.

# import .csv file
mydata <- read_csv("pubmed_result.csv")

Data manipulation

With the next code snippet, we do some data manipulation in order to make our data frame tidy. We start by removing the last row of the data frame, which contains a paper published in 1995, then remove superfluous headlines and entries without authors, and select the variables we need for analysis. The str_detect() function stems from the stringr package.

# remove the only article from 1995 (the last row)
mydata <- mydata[-nrow(mydata), ]
# remove headlines
mydata <- subset(mydata, str_detect(URL, 'URL')==FALSE)
# remove entries without author
mydata <- subset(mydata, str_detect(Description, 'Author A.')==FALSE)
mydata <- subset(mydata, str_detect(Description, '\\[No authors listed\\]')==FALSE)
# Select variables
mydata <- mydata[, c('Title', 'Description', 'ShortDetails')]

In the next step, we extract the year of publication from the ShortDetails variable. Based on the year variable, we create a variable named time merging the years into four periods of time (five-year sub-periods between 1997 and 2016). Furthermore, we calculate the number of authors for each paper using the str_count function of the stringr package. Since each author except for the last is followed by a comma, we just need to count the number of commas and add 1.

# Return Year of publication
mydata$Year <- as.numeric(str_sub(mydata$ShortDetails, 
                                  nchar(mydata$ShortDetails)-3, 
                                  nchar(mydata$ShortDetails)))
mydata$ShortDetails <- NULL
colnames(mydata) <- c('title', 'authors', 'year')
mydata$time <- ifelse(mydata$year <= 2001, 1,
                      ifelse(mydata$year >= 2002 & mydata$year <=2006, 2,
                             ifelse(mydata$year >= 2007 & mydata$year <=2011, 3, 4)))
mydata$year <- as.factor(mydata$year)
# Categorize years
mydata$time <- factor(mydata$time, levels = c(1:4),
                      labels = c('1997-2001', '2002-2006', '2007-2011', '2012-2016'))
# Calculate number of authors
mydata$auth.n <- str_count(mydata$authors, '\\,') + 1

Before we can plot these data, we need to create four more data frames:

  • df.pup containing the number of papers published each year between 1997 and 2016,
  • df.auth.1 containing the mean number of authors per paper and year,
  • df.auth.2 containing the mean number of authors per paper and time period,
  • df.auth.3 containing the maximum number of authors per paper and year.
df.pup <- data.frame(table(mydata$year))
df.auth.1 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$year), mean))
df.auth.2 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$time), mean))
df.auth.3 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$year), max))

Now, we can plot the data using the ggplot2 package.

ggplot(df.pup, aes(x=Var1, y=Freq, fill=0-Freq)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=Freq), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.pup$Freq)+10), breaks = seq(0, max(df.pup$Freq)+10, by=20)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 2: Number of papers published each year",
       caption = "Data source: PubMed")


Except for a few years, there has been a continuous rise in the number of publications. Since January 2013, Psycho-Oncology has been available exclusively online. Thus, the number of annual papers published is not restricted by requirements of the printed version anymore. Consequently, there was an enormous rise in the number of publications in 2013. However, in the following year (2014) the number of publications declined quite remarkably. The reason for this decline may be a lack of submissions.

ggplot(df.auth.1, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, 10), breaks = seq(0,10, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 3: Mean number of authors per paper and year",
       caption = "Data source: PubMed")


According to figures 3 and 4, the number of authorships per article has been rising over the whole time period. This is in accordance with the general trend towards a rise in the number of personal authors per citation.

ggplot(df.auth.2, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, 10), breaks = seq(0,10, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 4: Mean number of authors per paper and time period",
       caption = "Data source: PubMed")


Between 1997 and 2016, the mean number of authors per publication has increased from about four to six. The enormous rise in the maximum number of authors (see Fig. 5) supports the idea of an inflation in the number of authors per paper.

ggplot(df.auth.3, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.auth.3$x)+5), breaks = seq(0, max(df.auth.3$x)+5, by=5)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 5: Max number of authors per paper and year",
      caption = "Data source: PubMed")


Text mining

Before we analyze the data, some janitorial work needs to be done.
We start with (1) defining a list of stopwords, (2, 3) removing non-word characters, (4, 5) filling the spaces in two word groups we want to keep together, and (6) removing the defined stopwords from the variable containing the citation titles.

sw <- c(Top200Words, 'among') # qdap
mydata$txt <- rm_non_words(mydata$title) # qdapRegex
mydata$txt <- str_replace_all(mydata$txt, "'", "") # stringr
keeps <- c('quality of life', 'health related')
mydata$txt <- space_fill(mydata$txt, keeps, sep = "~~") # qdap
mydata$txt <- rm_stop(mydata$txt, sw,
                      separate = FALSE) # qdap

In the next step, we generate a distribution table for all words of the citation titles. The resulting data frame (df.words) consists of three variables: interval (citation title terms), freq (term frequency), and percent (percentage of the term).

df.words <- dist_tab(breaker(mydata$txt)) # qdap
df.words <- df.words[,c(1, 2, 4)]
df.words <- subset(df.words, percent > 0.5)
df.words <- df.words[order(df.words$percent),] 
df.words$interval <- factor(df.words$interval, 
                            levels=df.words$interval,
                            labels=df.words$interval, 
                            ordered=TRUE) 
kable(head(df.words))
     interval     freq   percent
666  coping        141      0.51
2407 prostate      140      0.51
789  depression    143      0.52
1335 health        147      0.53
2678 review        150      0.54
2190 patient       154      0.56

With the next code snippet, we create a list l.txt_time containing four sub-lists, one per time period.

# group titles by time period
l.txt_time <- by(mydata$txt, mydata$time, breaker)
str(l.txt_time)
## List of 4
##  $ 1997-2001: chr [1:2160] "relationship" "between" "trait" "anxiety" ...
##  $ 2002-2006: chr [1:3608] "measurement" "coping" "stress" "responses" ...
##  $ 2007-2011: chr [1:7310] "novel" "stop" "multidisciplinary" "clinic" ...
##  $ 2012-2016: chr [1:14606] "relationship" "between" "objectively" "measured" ...
##  - attr(*, "dim")= int 4
##  - attr(*, "dimnames")=List of 1
##   ..$ mydata$time: chr [1:4] "1997-2001" "2002-2006" "2007-2011" "2012-2016"
##  - attr(*, "call")= language by.default(data = mydata$txt, INDICES = mydata$time, FUN = breaker)
##  - attr(*, "class")= chr "by"

The following code snippet converts these four lists into separate data frames, each containing the most frequent citation-title terms of the respective period of time.

# time span 1997-2001
df.t1 <- dist_tab(l.txt_time[[1]])
df.t1 <- df.t1[,c(1, 2, 4)]
df.t1 <- subset(df.t1, percent > 0.2)
df.t1 <- df.t1[order(df.t1$percent),] 
df.t1$interval <- factor(df.t1$interval, 
                         levels=df.t1$interval,
                         labels=df.t1$interval, 
                         ordered=TRUE) 
df.t1$year <- levels(mydata$time)[1]
# time span 2002-2006
df.t2 <- dist_tab(l.txt_time[[2]])
df.t2 <- df.t2[,c(1, 2, 4)]
df.t2 <- subset(df.t2, percent > 0.2)
df.t2 <- df.t2[order(df.t2$percent),] 
df.t2$interval <- factor(df.t2$interval, levels=df.t2$interval,
                         labels=df.t2$interval, ordered=TRUE) 
df.t2$year <- levels(mydata$time)[2]
# time span 2007-2011
df.t3 <- dist_tab(l.txt_time[[3]])
df.t3 <- df.t3[,c(1, 2, 4)]
df.t3 <- subset(df.t3, percent > 0.3)
df.t3 <- df.t3[order(df.t3$percent),] 
df.t3$interval <- factor(df.t3$interval, levels=df.t3$interval,
                         labels=df.t3$interval, ordered=TRUE) 
df.t3$year <- levels(mydata$time)[3]
# time span 2012-2016
df.t4 <- dist_tab(l.txt_time[[4]])
df.t4 <- df.t4[,c(1, 2, 4)]
df.t4 <- subset(df.t4, percent > 0.3)
df.t4 <- df.t4[order(df.t4$percent),] 
df.t4$interval <- factor(df.t4$interval, levels=df.t4$interval,
                         labels=df.t4$interval, ordered=TRUE) 
df.t4$year <- levels(mydata$time)[4]
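The four blocks above differ only in the cutoff and the period label, so they can be collapsed into a helper function; a self-contained sketch (prep_period is a name I introduce here; it expects a data frame shaped like the dist_tab() output used above):

```r
# helper: filter, order and tag one term-frequency table
# (expects columns interval, freq, percent, as returned by qdap::dist_tab)
prep_period <- function(d, cutoff, period) {
  d <- subset(d, percent > cutoff)
  d <- d[order(d$percent), ]
  d$interval <- factor(d$interval, levels = d$interval, ordered = TRUE)
  d$year <- period
  d
}

# toy stand-in for one dist_tab() result
d1 <- data.frame(interval = c("coping", "cancer", "stress"),
                 freq = c(10, 30, 5),
                 percent = c(0.4, 1.2, 0.1))
prep_period(d1, 0.2, "1997-2001")
```

With such a helper, the four blocks reduce to an lapply() over l.txt_time (with cutoffs c(0.2, 0.2, 0.3, 0.3)) followed by do.call(rbind, ...).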

Next, we create a data frame (df.time) containing the same variables as the data frame df.words plus a variable (year) defining the period of time in which the paper was published.

# merging
df.time <- rbind(df.t1, df.t2, df.t3, df.t4)
df.time <- merge(df.time, df.words, by='interval')
df.time <- df.time[,1:4]
colnames(df.time) <- c('term', 'freq', 'percent', 'year')
df.time$year <- as.factor(df.time$year)

Finally, we plot the most frequent citation title terms over four periods of time.

ggplot(df.time, aes(x=term, y=percent, fill=year)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(percent,1)), size = 4, 
            fontface=2, hjust =  -0.3, vjust = 0.4) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', 
                     limits=c(0, max(df.time$percent)+1), 
                     breaks = seq(0, max(df.time$percent)+1, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  coord_flip() +
  facet_grid(. ~ year) +
  labs(title = "Fig. 6: Most frequent terms in time",
       caption = "Data source: PubMed")


Above all, the terms survivors and distress were used increasingly over time, which reflects a growing interest in so-called psychological distress and in cancer survivorship.

Posted in Bibliometrics, Text Mining

How to import a bunch of Excel files with multiple sheets

The Problem

I have recently been conducting an evaluation study in the field of social work with elderly people. To provide advice to elderly people on age-related questions about housing, home care etc., 10 offices for senior citizens were founded. Each office is requested to document its activities (number of persons, events etc.) on a monthly basis. The data need to be entered into a prestructured Excel table. Since the offices started working about 2.5 years ago, I needed to handle 300 Excel sheets (30 months * 10 offices).

In a first step, I decided to create one Excel file for every office, each containing 30 sheets (one sheet per month). While the files are named after the offices (abbreviated to 4 characters), the sheets are named according to the pattern YYYY.MM (year followed by month).

The Solution

Since I did not find a solution to my problem with the packages I usually use to import Excel files into R (xlsx, readxl), I searched the internet for help. Fortunately, I found the paper “How to import and merge many Excel files; each with multiple sheets of data for statistical analysis.” by Jon Starkweather. The paper is well worth reading and gives a very comprehensive description of the subject matter. The following code snippets stem from Starkweather's paper.

In a first step, we have to load the following packages:

library(rJava)
library(XLConnect, pos = 4)

In a second step, we define the file type we want to import (.xls), save the sheet names of the Excel files into a new vector called sheet.names (since the sheet names in each of the files are identical, we may extract them from any of the 10 files) and create another vector (e.names) containing the names for the variables we want to import (in this case 28).

file.names <- list.files(pattern = '\\.xls$') # pattern is a regular expression, not a glob
sheet.names <- getSheets(loadWorkbook('Name.xls'))
e.names <- paste0(rep('v', 28), c(1:28))

In a third step, we create a data frame with 28 variables, named v1 to v28, and one row containing NAs only.

data.1 <- data.frame(matrix(rep(NA,length(e.names)),
                            ncol = length(e.names)))
names(data.1) <- e.names

Finally, we use two nested for-loops to import all files and sheets and bind them into a data frame we can use for analysis.

for (i in 1:length(file.names)) {
    wb <- loadWorkbook(file.names[i])
    for (j in 1:length(sheet.names)) {
        ss <- readWorksheet(wb, sheet.names[j], startCol = 2, header = TRUE)
        condition <- rep(sheet.names[j], nrow(ss))
        sub.id <- rep(file.names[i], nrow(ss))
        s.frame <- seq(1:nrow(ss))
        df.1 <- data.frame(sub.id, condition, s.frame, ss)
        names(df.1) <- e.names
        data.1 <- rbind(data.1, df.1)
        rm(ss, condition, s.frame, sub.id, df.1)
    }
    rm(wb)
}
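One detail worth noting: data.1 was seeded with a single all-NA row, which is still sitting at position 1 after the loops; it can simply be dropped before analysis:

```r
# drop the all-NA initialization row (it is always the first row)
data.1 <- data.1[-1, ]
```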

In the mentioned paper, Jon Starkweather elaborates in detail on what each line of each for-loop is doing.

Posted in Data Management

How to add a background image to ggplot2 graphs

When producing so-called infographics, it is rather common to use an image rather than a mere grid as background. In this blog post, I will show how to use a background image with ggplot2.

Packages required

The following code will load and/or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(jpeg, png, ggplot2, grid, neuropsychology)

Choosing the data

The data set I will be using in this blog post is named diamonds and part of the ggplot2 package. It contains information about – surprise, surprise – diamonds, e.g. price and cut (Fair, Good, Very Good, Premium, Ideal). Using the tapply-function, we create a table returning the maximum prices per cut. Since we need the data to be organized in a data frame, we must transform the table using the data.frame-function.

mydata <- data.frame(price = tapply(diamonds$price, diamonds$cut, max))
mydata$cut <- rownames(mydata)
cut         price
Fair        18574
Good        18788
Very Good   18818
Premium     18823
Ideal       18806

Importing the background image

The background image we will be using in this blog post is a JPG file. Since the image imitates a blackboard, it is named “blackboard.jpg”. The image file is imported using the readJPEG-function of the jpeg package and saved into an R object.

imgage <- jpeg::readJPEG("blackboard.jpg")

To import other image file formats, different packages and functions must be used. The next code snippet shows how to import PNG images.

image <- png::readPNG("blackboard.png")

Drawing the plot

In the next step, we actually draw a bar chart with a background image. To make blackboard.jpg the background image, we combine the annotation_custom-function of the ggplot2 package with the rasterGrob-function of the grid package.

ggplot(mydata, aes(cut, price, fill = -price)) +
  ggtitle("Bar chart with background image") +
  scale_fill_continuous(guide = FALSE) +
  annotation_custom(rasterGrob(imgage, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                               -Inf, Inf, -Inf, Inf) +
  geom_bar(stat="identity", position = "dodge", width = .75, colour = 'white') +
  scale_y_continuous('Price in $', limits = c(0, max(mydata$price) + max(mydata$price) / 4)) +
  scale_x_discrete('Cut') +
  geom_text(aes(label = round(price), ymax = 0), size = 7, fontface = 2, 
            colour = 'white', hjust = 0.5, vjust = -1) 


Adding opacity

Using the specification alpha = 0.5, we add 50% opacity to the bars. alpha ranges between 0 and 1, with higher values indicating greater opacity.

ggplot(mydata, aes(cut, price, fill = -price)) +
  theme_neuropsychology() +
  ggtitle("Bar chart with background image") +
  scale_fill_continuous(guide = FALSE) +
  annotation_custom(rasterGrob(imgage, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                               -Inf, Inf, -Inf, Inf) +
  geom_bar(stat="identity", position = "dodge", width = .75, colour = 'white', alpha = 0.5) +
  scale_y_continuous('Price in $', limits = c(0, max(mydata$price) + max(mydata$price) / 4)) +
  scale_x_discrete('Cut') +
  geom_text(aes(label = round(price), ymax = 0), size = 7, fontface = 2, 
            colour = 'white', hjust = 0.5, vjust = -1) 


The recently published R package neuropsychology contains a theme named theme_neuropsychology(). This theme may be used to get bigger axis titles as well as bigger axis and legend text.

Posted in Visualizing Data

How to number and reference tables and figures in R Markdown files

Introduction

R Markdown is a great tool to make research results reproducible. However, in scientific research papers or reports, tables and figures usually need to be numbered and referenced. Unfortunately, R Markdown has no “native” method to number and reference table and figure captions. The recently published bookdown package makes it very easy to number and reference tables and figures (Link). However, since bookdown uses LaTeX functionality, R Markdown files created with bookdown cannot be converted into MS Word (.docx) files.

In this blog post, I will explain how to number and reference tables and figures in R Markdown files using the captioner package.

Packages required

The following code will load and/or install the R packages required for this blog post. The data set I will be using is named bundesligR and part of the bundesligR package. It contains “all final tables of Germany's highest football league, the Bundesliga”.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(knitr, captioner, bundesligR, stringr)

In the first code snippet, we create a table using the kable function of the knitr package. With caption we can specify a simple table caption. As we can see, the caption will not be numbered and, thus, cannot be referenced in the document.

knitr::kable(bundesligR::bundesligR[c(1:6), c(2,3,11,10)],
             align = c('c', 'l', 'c', 'c'),
             caption = "German Bundesliga: Final Table 2015/16, Position 1-6")
Position   Team                        Points   GD
    1      FC Bayern Muenchen            88     63
    2      Borussia Dortmund             78     48
    3      Bayer 04 Leverkusen           60     16
    4      Borussia Moenchengladbach     55     17
    5      FC Schalke 04                 52      2
    6      1. FSV Mainz 05               50      4

Table numbering

Thanks to Alathea Letaw's captioner package, we can number tables and figures.
In a first step, we create a function named table_nums and apply it to each table's name and caption. Furthermore, we may define a prefix (Tab. for tables and Fig. for figures).

table_nums <- captioner::captioner(prefix = "Tab.")

tab.1_cap <- table_nums(name = "tab_1", 
                        caption = "German Bundesliga: Final Table 2015/16, Position 7-12")
tab.2_cap <- table_nums(name = "tab_2", 
                        caption = "German Bundesliga: Final Table 2015/16, Position 13-18")

The next code snippet combines both inline code and a code chunk. With fig.cap = tab.1_cap, we specify the caption of the first table. It is important to separate inline code and code chunk. Otherwise the numbering won't work.

Tab. 1: German Bundesliga: Final Table 2015/16, Position 7-12

Position   Team               Points   GD
    7      Hertha BSC           50      0
    8      VfL Wolfsburg        45     -2
    9      1. FC Koeln          43     -4
   10      Hamburger SV         41     -6
   11      FC Ingolstadt 04     40     -9
   12      FC Augsburg          38    -10

Table referencing

Since we now have a numbered table, it should also be possible to reference it. However, we cannot just use the inline code table_nums('tab_1'). Otherwise, we will get the following output:

[1] “Tab. 1: German Bundesliga: Final Table 2015/16, Position 7-12”

In order to return the desired output (prefix Tab. and table number), I have written the function f.ref. Using a regular expression, the function returns all characters of the table_nums('tab_1') output located before the first colon.

f.ref <- function(x) {
  stringr::str_extract(table_nums(x), "[^:]*")
}

When we apply this function to tab_1, the inline code returns the following result:

As we can see in f.ref("tab_1"), the Berlin based football club Hertha BSC had position seven in the final table.

As we can see in Tab. 1, the Berlin based football club Hertha BSC had position seven in the final table.

Just to make the table complete, Tab. 2 shows positions 13 to 18 of the final Bundesliga table.

Tab. 2: German Bundesliga: Final Table 2015/16, Position 13-18

knitr::kable(bundesligR::bundesligR[c(13:18), c(2,3,11,10)],
             align = c('c', 'l', 'c', 'c'),
             row.names = FALSE)
Position   Team                  Points   GD
   13      Werder Bremen           38    -15
   14      SV Darmstadt 98         38    -15
   15      TSG 1899 Hoffenheim     37    -15
   16      Eintracht Frankfurt     36    -18
   17      VfB Stuttgart           33    -25
   18      Hannover 96             25    -31

And what about figures?

Figures can be numbered and referenced following the same principle.
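For instance, mirroring the table setup above, a figure counter and caption might look like this (the names fig_1, f.ref_fig and the caption text are illustrative):

```r
figure_nums <- captioner::captioner(prefix = "Fig.")
fig.1_cap <- figure_nums(name = "fig_1",
                         caption = "Points of the top Bundesliga teams")
# reference helper analogous to f.ref(), reusing the same regex idea:
# keep everything before the first colon, i.e. the prefix and number
f.ref_fig <- function(x) {
  stringr::str_extract(figure_nums(x), "[^:]*")
}
```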

Posted in Data Management