How to import a bunch of Excel files with multiple sheets

The Problem

I have been recently conducting an evaluation study in the field of social work with elderly people. With the purpose to provide advice to elderly peoply regarding age-related questions about housing, home care etc, 10 offices for senior citizens were founded. Each of the offices is requested to document its activities (number of persons, events etc.) on a monthly basis. The data need to be entered in a prestructured Excel table. Since the offices started working about 2.5 years ago, I needed to handle 300 Excel sheets (30 months * 10 offices).

In a first step, I decided to create one Excel file for every office each containing 30 sheets (one sheet per month). While the files are named after the offices (abbreviated with 4 characters), the sheets are named after the following pattern: YYYY.MM (year followed by month).

The Solution

Since I did not find a solution for my problem with the packages I usually use to import Excel files into R (xlsx, readxl), I searched the internet for help. Fortunatelly, I found the paper “How to import and merge many Excel files; each with multiple sheets of data for statistical analysis.” by Jon Starkweather. The paper is really worth reading and gives a very comprehensive description on the subject matter. The following code snippets stem from Starkweather's paper.

In a first step, we have to load the following packages:

library(rJava)
library(XLConnect, pos = 4)

In a second step, we define the file type, we want to import (.xls), save the sheet names of the Excel files into a new vector called sheet.names (since the sheet names in each of the files are identical, we may extract them from any of the 10 files) and create another vector (e.names) containing the names for the variables we want to import (in this case 28).

file.names <- list.files(pattern='*.xls')
sheet.names <- getSheets(loadWorkbook('Name.xls'))
e.names <- paste0(rep('v', 28), c(1:28))

In a thirt step, we create a data frame with 28 variables, named
v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27, v28
and one row containing NAs only.

data.1 <- data.frame(matrix(rep(NA,length(e.names)),
                            ncol = length(e.names)))
names(data.1) <- e.names

Finally, we use 2 for-loops to import all the files and sheets and bind them to a data frame we can use for analysis.

for (i in 1:length(file.names)) {
    wb <- loadWorkbook(file.names[i])
    for (j in 1:length(sheet.names)) {
        ss <- readWorksheet(wb, sheet.names[j], startCol = 2, header = TRUE)
        condition <- rep(sheet.names[j], nrow(ss))
        sub.id <- rep(file.names[i], nrow(ss))
        s.frame <- seq(1:nrow(ss))
        df.1 <- data.frame(sub.id, condition, s.frame, ss)
        names(df.1) <- e.names
        data.1 <- rbind(data.1, df.1)
        rm(ss, condition, s.frame, sub.id, df.1)
    }
    rm(wb)
}

In the mentioned paper, Jon Starkweather elaborates in detail on what each line of each for-loop is doing.

Posted in Data Management | Tagged , | Leave a comment

How to add a background image to ggplot2 graphs

When producing so called infographics, it is rather common to use images rather than a mere grid as background. In this blog post, I will show how to use a background image with ggplot2.

Packages required

The following code will install load and / or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(jpeg, png, ggplot2, grid, neuropsychology)

Choosing the data

The data set I will be using in this blog post is named diamonds and part of the ggplot2 package. It contains information about – surprise, surprise – diamonds, e.g. price and cut (Fair, Good, Very Good, Premium, Ideal). Using the tapply-function, we create a table returning the maximum prices per cut. Since we need the data to be organized in a data frame, we must transform the table using the data.frame-function.

mydata <- data.frame(price = tapply(diamonds$price, diamonds$cut, max))
mydata$cut <- rownames(mydata)
cut price
Fair 18574
Good 18788
Very Good 18818
Premium 18823
Ideal 18806

Importing the background image

The file format of the background image we will be using in this blog post is JPG. Since the image imitates a blackboard, we name it “blackboard.jpg”. The image file must be imported using the readJPEG-function of the jpeg package. The imported image will be saved into an object named image.

imgage <- jpeg::readJPEG("blackboard.jpg")

To import other image file formats, different packages and functions must be used. The next code snippet shows how to import PNG images.

image <- png::readPNG("blackboard.png")

Drawing the plot

In the next step, we actually draw a bar chart with a backgriund image. To make blackboard.jpg the background image, we need to combine the annotation_custom-function of the ggplot2 package and the rasterGrob-function of the grid package.

ggplot(mydata, aes(cut, price, fill = -price)) +
  ggtitle("Bar chart with background image") +
  scale_fill_continuous(guide = FALSE) +
  annotation_custom(rasterGrob(imgage, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                               -Inf, Inf, -Inf, Inf) +
  geom_bar(stat="identity", position = "dodge", width = .75, colour = 'white') +
  scale_y_continuous('Price in $', limits = c(0, max(mydata$price) + max(mydata$price) / 4)) +
  scale_x_discrete('Cut') +
  geom_text(aes(label = round(price), ymax = 0), size = 7, fontface = 2, 
            colour = 'white', hjust = 0.5, vjust = -1) 

plot of chunk plot1

Adding opacity

Using the specification alpha = 0.5, we add 50% opacity to the bars. alpha ranges between 0 and 1, with higher values indicating greater opacity.

ggplot(mydata, aes(cut, price, fill = -price)) +
  theme_neuropsychology() +
  ggtitle("Bar chart with background image") +
  scale_fill_continuous(guide = FALSE) +
  annotation_custom(rasterGrob(imgage, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                               -Inf, Inf, -Inf, Inf) +
  geom_bar(stat="identity", position = "dodge", width = .75, colour = 'white', alpha = 0.5) +
  scale_y_continuous('Price in $', limits = c(0, max(mydata$price) + max(mydata$price) / 4)) +
  scale_x_discrete('Cut') +
  geom_text(aes(label = round(price), ymax = 0), size = 7, fontface = 2, 
            colour = 'white', hjust = 0.5, vjust = -1) 

plot of chunk plot2

The recently published R package neuropsychology contains a theme named theme_neuropsychology(). This theme may be used to get bigger axis titles as well as bigger axis and legend text.

Posted in Visualizing Data | Tagged | Leave a comment

How to number and reference tables and figures in R Markdown files

Introduction

R Markdown is a great tool to make research results reproducible. However, in scientific research papers or reports, tables and figures usually need to be numbered and referenced. Unfortunately, R Markdown has no “native” method to number and reference table and figure captions. The recently published bookdown package makes it very easy to number and reference tables and figures (Link). However, since bookdown uses LaTex functionality, R Markdown files created with bookdown cannot be converted into MS Word (.docx) files.

In this blog post, I will explain how to number and reference tables and figures in R Markdown files using the captioner package.

Packages required

The following code will install load and / or install the R packages required for this blog post. The dataset I will be using in this blog post is named bundesligR and part of the bundesligR package. It contains “all final tables of Germany's highest football league, the Bundesliga” Link.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(knitr, captioner, bundesligR, stringr)

In the first code snippet, we create a table using the kable function of the knitr package. With caption we can specify a simple table caption. As we can see, the caption will not be numbered and, thus, cannot be referenced in the document.

knitr::kable(bundesligR::bundesligR[c(1:6), c(2,3,11,10)],
             align = c('c', 'l', 'c', 'c'),
             caption = "German Bundesliga: Final Table 2015/16, Position 1-6")
Position Team Points GD
1 FC Bayern Muenchen 88 63
2 Borussia Dortmund 78 48
3 Bayer 04 Leverkusen 60 16
4 Borussia Moenchengladbach 55 17
5 FC Schalke 04 52 2
6 1. FSV Mainz 05 50 4

Table numbering

Thanks to Alathea Letaw's captioner package, we can number tables and figures.
In a first step, we define a function named table_nums and apply it to the tables' name and caption. We define both table name and table caption. Furthermore, we may also define a prefix (Tab. for tables and Fig. for figures).

table_nums <- captioner::captioner(prefix = "Tab.")

tab.1_cap <- table_nums(name = "tab_1", 
                        caption = "German Bundesliga: Final Table 2015/16, Position 7-12")
tab.2_cap <- table_nums(name = "tab_2", 
                        caption = "German Bundesliga: Final Table 2015/16, Position 12-18")

The next code snippet combines both inline code and a code chunk. With fig.cap = tab.1_cap, we specify the caption of the first table. It is important to separate inline code and code chunk. Otherwise the numbering won't work.

Tab. 1: German Bundesliga: Final Table 2015/16, Position 7-12

Position Team Points GD
7 Hertha BSC 50 0
8 VfL Wolfsburg 45 -2
9 1. FC Koeln 43 -4
10 Hamburger SV 41 -6
11 FC Ingolstadt 04 40 -9
12 FC Augsburg 38 -10

Table referencing

Since we have received a numbered table, it should also be possible to reference the table. However, we can not just
use the inline code table_nums('tab_1'). Otherwise, we wi'll get the following output:

[1] “Tab. 1: German Bundesliga: Final Table 2015/16, Position 7-12”

In order to return the desired output (prefix Tab. and table number), I have written the function f.ref. Using a regular expression, the function returns all characters of the table_nums('tab_1') output located before the first colon.

f.ref <- function(x) {
  stringr::str_extract(table_nums(x), "[^:]*")
}

When we apply this function to tab_1, the inline code returns the following result:

As we can see in f.ref("tab_1"), the Berlin based football club Hertha BSC had position seven in the final table.

As we can see in Tab. 1, the Berlin based football club Hertha BSC had position seven in the final table.

Just to make the table complete, Tab. 2 shows positions 13 to 18 of the final Bundesliga table.

Tab. 2: German Bundesliga: Final Table 2015/16, Position 12-18

knitr::kable(bundesligR::bundesligR[c(13:18), c(2,3,11,10)],
             align = c('c', 'l', 'c', 'c'),
             row.names = FALSE)
Position Team Points GD
13 Werder Bremen 38 -15
14 SV Darmstadt 98 38 -15
15 TSG 1899 Hoffenheim 37 -15
16 Eintracht Frankfurt 36 -18
17 VfB Stuttgart 33 -25
18 Hannover 96 25 -31

And what about figures?

Figures can be numbered and referenced following the same principle.

Posted in Data Management | Tagged , , | Leave a comment

How to combine box and jitter plots using R and ggplot2

R makes it easy to combine different kinds of plots into one overall graph. This may be useful to visualize both basic measures of central tendency (median, quartiles etc.) and the distribution of a certain variable. Moreover, so called cut-off values can be added to the graph.

In this blog post, I show how to combine box and jitter plots using the ggplot2 package.

First of all, we need to install and load the R packages required for the following steps. Since we want to do the installation and loading using the pacman package, we need to check whether this package has been installed already. If not, it will be installed and loaded. If yes, it will just be loaded (line 1). Furthermore we need the R packages ggplot2 and Hmisc. This time, the p_load function checks whether these packages have been installed already and either installs and loads or just loads them (line 2).

if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, Hmisc)

In a second step, we create three random variables (var.scale, var.group, var.cutoff) with n=300.

  • var.scale is a numeric variable with a mean value of about 50 and a standard deviation of about 17.
  • var.group is a factor variable comprising the groups male dnd female.
  • var.cutoff was calculated based on var.scale using predefined cut-off values (0 – 40 == low, 41 –60 = medium, >60 == high).
var.scale <- round(rnorm(300, 50, 17))
var.group <- rbinom(300, 1, .5)
var.group <- factor(var.group, 
                     levels = c(0:1), 
                     labels = c("male", "female"))

var.cutoff <- ifelse(var.scale <= 40, 1, 
                     ifelse(var.scale > 40 & var.scale <= 60, 2, 3))

var.cutoff <- factor(var.cutoff, 
                     levels = c(3:1), 
                     labels = c("high", "medium", "low"))

The describe() function of the Hmisc package returns some basic measures of central tendency.

Hmisc::describe(var.scale)
## var.scale 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     300       0      71       1   51.25   24.00   30.90   41.00   50.00 
##     .75     .90     .95 
##   63.25   70.00   76.00 
## 
## lowest :   8  10  14  16  17, highest:  85  97 100 102 104
Hmisc::describe(var.group)
## var.group 
##       n missing  unique 
##     300       0       2 
## 
## male (141, 47%), female (159, 53%)
Hmisc::describe(var.cutoff)
## var.cutoff 
##       n missing  unique 
##     300       0       3 
## 
## high (87, 29%), medium (141, 47%), low (72, 24%)

Since the ggplot2 package requires the variables to be in a data frame, we have to create a new data frame df comprising our predefined variables using the data.frame() function.

df <- data.frame(var.scale, var.cutoff, var.group)

Using the functions xlab(), ylab() and ggtitle(), axis labels and plot title will be defined.

Box plots will be created using the geom_boxplot() function, with width specifying the boxes' width :-).

Jitter plots will be created using the geom_jitter() function. In addition, specifications have been made for colour and position and size of the dots.

ggplot(df) +
  xlab("Group") +
  ylab("Scale") +
  ggtitle("Combination of Box and Jitter Plot") + 
  geom_boxplot(aes(var.group, var.scale), 
               width=0.5) + 
  geom_jitter(aes(var.group, var.scale, colour = var.cutoff), 
              position = position_jitter(width = .15, height=-0.7),
              size=2) +
  scale_y_continuous(limits=c(0, 101), 
                     breaks = seq(0, 110, 10)) +
  scale_color_manual(name="Legend", 
                     values=c("red", "blue3", "green3")) 

plot of chunk plot

Finally, we are going to format both Y-axis and legend using the functions scale_y_continuous() and scale_color_manual().

Posted in Visualizing Data | Tagged | 2 Comments

How to use R for matching samples (propensity score)

According to Wikipedia, propensity score matching (PSM) is a “statistical matching technique that attempts to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment”. In a broader sense, propensity score analysis assumes that an unbiased comparison between samples can only be made when the subjects of both samples have similar characteristics. Thus, PSM can not only be used as “an alternative method to estimate the effect of receiving treatment when random assignment of treatments to subjects is not feasible” (Thavaneswaran 2008). It can also be used for the comparison of samples in epidemiological studies. Let's give an example:

Health-related quality of life (HRQOL) is considered an important outcome in cancer therapy. One of the most frequently used instruments to measure HRQOL in cancer patients is the core quality-of-life questionnaire of the European Organisation for Research and Treatment of Cancer. The EORTC QLQ-C30 is a 30-item instrument comprised of five functioning scales, nine symptom scales and one scale measuring Global quality of life. All scales have a score range between 0 and 100. While high scores of the symptom scales indicate a high burden of symptoms, high scores of the functioning scales and on the GQoL scale indicate better functioning resp. quality of life.

However, without having any reference point, it is difficult if not impossible to interpret the scores. Fortunately, the EORTC QLQ-C30 questionnaire was used in several general population surveys. Therefore, patient scores may be compared against scores of the general population. This makes it far easier to decide whether the burden of symptoms or functional impairments can be attributed to cancer (treatment) or not. PSM can be used to make both patient and population samples comparable by matching for relevant demographic characteristics like age and sex.

In this blog post, I show how to do PSM using R. A more comprehensive PSM guide can be found under: A Step-by-Step Guide to Propensity Score Matching in R.

Creating two random dataframes

Since we don't want to use real-world data in this blog post, we need to emulate the data. This can be easily done using the Wakefield package.

In a first step, we create a dataframe named df.patients. We want the dataframe to contain specifications of age and sex for 250 patients. The patients' age shall be between 30 and 78 years. Furthermore, 70% of patients shall be male.

set.seed(1234)
df.patients <- r_data_frame(n = 250, 
                            age(x = 30:78, 
                                name = 'Age'), 
                            sex(x = c("Male", "Female"), 
                                prob = c(0.70, 0.30), 
                                name = "Sex"))
df.patients$Sample <- as.factor('Patients')

The summary-function returns some basic information about the dataframe created. As we can see, the mean age of the patient sample is 53.7 and roughly 70% of the patients are male (69.2%).

summary(df.patients)
##       Age            Sex           Sample   
##  Min.   :30.00   Male  :173   Patients:250  
##  1st Qu.:42.00   Female: 77                 
##  Median :54.00                              
##  Mean   :53.71                              
##  3rd Qu.:66.00                              
##  Max.   :78.00

In a second step, we create another dataframe named df.population. We want this dataframe to comprise the same variables as df.patients with different specifications. With 18 to 80, the age-range of the population shall be wider than in the patient sample and the proportion of female and male patients shall be the same.

set.seed(1234)
df.population <- r_data_frame(n = 1000, 
                              age(x = 18:80, 
                                  name = 'Age'), 
                              sex(x = c("Male", "Female"), 
                                  prob = c(0.50, 0.50), 
                                  name = "Sex"))
df.population$Sample <- as.factor('Population')

The following table shows the sample's mean age (49.5 years) and the proportion of men (48.5%) and women (51.5%).

summary(df.population)
##       Age            Sex             Sample    
##  Min.   :18.00   Male  :485   Population:1000  
##  1st Qu.:34.00   Female:515                    
##  Median :50.00                                 
##  Mean   :49.46                                 
##  3rd Qu.:65.00                                 
##  Max.   :80.00

Merging the dataframes

Before we match the samples, we need to merge both dataframes. Based on the variable Sample, we create a new variable named Group (type logic) and a further variable (Distress) containing information about the individuals' level of distress. The Distress variable is created using the age-function of the Wakefield package. As we can see, women will have higher levels of distress.

mydata <- rbind(df.patients, df.population)
mydata$Group <- as.logical(mydata$Sample == 'Patients')
mydata$Distress <- ifelse(mydata$Sex == 'Male', age(nrow(mydata), x = 0:42, name = 'Distress'),
                                                age(nrow(mydata), x = 15:42, name = 'Distress'))

When we compare the distribution of age and sex in both samples, we discover significant differences:

pacman::p_load(tableone)
table1 <- CreateTableOne(vars = c('Age', 'Sex', 'Distress'), 
                         data = mydata, 
                         factorVars = 'Sex', 
                         strata = 'Sample')
table1 <- print(table1, 
                printToggle = FALSE, 
                noSpaces = TRUE)
kable(table1[,1:3],  
      align = 'c', 
      caption = 'Table 1: Comparison of unmatched samples')
Patients Population p
n 250 1000
Age (mean (sd)) 53.71 (13.88) 49.46 (18.33) 0.001
Sex = Female (%) 77 (30.8) 515 (51.5) <0.001
Distress (mean (sd)) 22.86 (11.38) 25.13 (11.11) 0.004

Furthermore, the level of distress seems to be significantly higher in the population sample.

Matching the samples

Now, that we have completed preparation and inspection of data, we are going to match the two samples using the matchit-function of the MatchIt package. The method command method="nearest" specifies that the nearest neighbors method will be used. Other matching methods are exact matching, subclassification, optimal matching, genetic matching, and full matching (method = c("exact", "subclass", "optimal", ""genetic", "full")). The ratio command ratio = 1 indicates a one-to-one matching approach. With regard to our example, for each case in the patient sample exactly one case in the population sample will be matched. Please also note that the Group variable needs to be logic (TRUE vs. FALSE).

set.seed(1234)
match.it <- matchit(Group ~ Age + Sex, data = mydata, method="nearest", ratio=1)
a <- summary(match.it)

For further data presentation, we save the output of the summary-function into a variable named a.

After matching the samples, the size of the population sample was reduced to the size of the patient sample (n=250; see table 2).

kable(a$nn, digits = 2, align = 'c', 
      caption = 'Table 2: Sample sizes')
Control Treated
All 1000 250
Matched 250 250
Unmatched 750 0
Discarded 0 0

The following output shows, that the distributions of the variables Age and Sex are nearly identical after matching.

kable(a$sum.matched[c(1,2,4)], digits = 2, align = 'c', 
      caption = 'Table 3: Summary of balance for matched data')
Means Treated Means Control Mean Diff
distance 0.23 0.23 0.00
Age 53.71 53.65 0.06
SexMale 0.69 0.69 0.00
SexFemale 0.31 0.31 0.00

The distributions of propensity scores can be visualized using the plot-function which is part of the MatchIt package .

plot(match.it, type = 'jitter', interactive = FALSE)

plot of chunk plot

Saving the matched samples

Finally, the matched samples will be saved into a new dataframe named df.match.

df.match <- match.data(match.it)[1:ncol(mydata)]
rm(df.patients, df.population)

Eventually, we can check whether the differences in the level of distress between both samples are still significant.

pacman::p_load(tableone)
table4 <- CreateTableOne(vars = c('Age', 'Sex', 'Distress'), 
                         data = df.match, 
                         factorVars = 'Sex', 
                         strata = 'Sample')
table4 <- print(table4, 
                printToggle = FALSE, 
                noSpaces = TRUE)
kable(table4[,1:3],  
      align = 'c', 
      caption = 'Table 4: Comparison of matched samples')
Patients Population p
n 250 250
Age (mean (sd)) 53.71 (13.88) 53.65 (13.86) 0.961
Sex = Female (%) 77 (30.8) 77 (30.8) 1.000
Distress (mean (sd)) 22.86 (11.38) 24.13 (11.88) 0.222

With a p-value of 0.222, Student's t-test does not indicate significant differences anymore. Thus, PSM helped to avoid an alpha mistake.


PS 1: The packages used in this blog post can be loaded/installed using the following code:

pacman::p_load(knitr, wakefield, MatchIt, tableone, captioner)

PS 2: Thanks very much to my colleague Katharina Kuba for for telling me about the MatchIt package.

Posted in Indroduction | Tagged , | Leave a comment

How to parse Evernote export files (.enex) using R

Evernote is a “cross-platform […] app designed for note taking, organizing, and archiving” (Wikipedia). All notes can be tagged and exported. I'm using Evernote, above all, to save and tag interesting blog posts related to R.

plot of chunk logo

In this blog post, I show how to import and parse an exported Evernote file with R.

Exporting the data from Evernote

In a first step, I've exported all of my notes tagged with 'R':

  • Open the Evernote client;
  • Select all notes to be exported;
  • Go to 'File' > 'Export';
  • Select option 'Export as a file in ENEX format (.enex)' from the format options box;
  • Name the file 'Evernote.enex' and save it into your RStudio project folder.

Importing the data into R

Since the '.enex' file has xml properties, the 'Evernote.enex' file can be imported using the XML package. Because of its structure, the imported file cannot be transformed into a dataframe right away. Instead, we need to transform it into a list (using the XML::xmlToList function).

library(XML)
xmlfile <- xmlParse("Evernote.enex")
xmllist <- xmlToList(xmlfile, addAttributes = FALSE)

In the following section, I show how to create a dataframe based on the xmllist object.

Building a data frame

First, we generate an empty data frame. The number of rows (262) is determined by the number of elements in the xmllist object and the number of columns is set to zero.

mydata <- data.frame(matrix(NA, ncol = 0, nrow = length(xmllist)))
dim(mydata)

[1] 262 0

Second, we read the names of the note titles and save it into a variable called title which is part of our data frame mydata.

for (i in 1:length(xmllist)){
  mydata$title[i] <- unlist(xmllist[[i]]['title'])
}

head(mydata$title, 10)

[1] “Network visualization in R with the igraph package | Rules of Reason”
[2] “More debate analysis with R”
[3] “Analyzing networks of characters in 'Love Actually' – Variance Explained”
[4] “Web scraping in R”
[5] “Color Quantization in R”
[6] “Zellingenach: A visual exploration of the spatial patterns in the endings of German town and village names in R | rud.is”
[7] “Waterfall plots – what and how?”
[8] “Sentiment Analysis on Donald Trump using R and Tableau | DataScience+”
[9] “Version 0.9 of timeline on CRAN”
[10] “Date Formats in R”

In a next step, we obtain the dates the notes were created. In order to receive a variable of the date class, the variable 'create' must be formated. Using the stringr package, we extract year, month and day and save it into the same variable.

for (i in 1:nrow(mydata)){
  mydata$created[i] <- xmllist[[i]]['created']
}


mydata$created <- as.Date(paste0(stringr::str_sub(mydata$created, 1, 4), 
                                 '-', 
                                 stringr::str_sub(mydata$created, 5, 6), 
                                 '-',
                                 stringr::str_sub(mydata$created, 7, 8)))

head(mydata$created, 5)

[1] “2016-01-06” “2016-01-06” “2016-01-05” “2016-01-05” “2016-01-04”

Furthermore, the http addresses of the notes can be read like this:

for (i in 1:nrow(mydata)){
  mydata$www[i] <- xmllist[[i]]['note-attributes']
}

mydata$www <- unlist(qdapRegex::ex_url(mydata$www,
                        trim=TRUE,
                        clean=TRUE,
                        extract=TRUE))

mydata$www <- stringr::str_sub(mydata$www, 1, nchar(mydata$www)-2)

head(mydata$www)

[1] “https://rulesofreason.wordpress.com/2012/11/05/network-visualization-in-r-with-the-igraph-package/
[2] “http://www.r-bloggers.com/more-debate-analysis-with-r/
[3] “http://varianceexplained.org/r/love-actually-network/
[4] “http://cpsievert.github.io/slides/web-scraping/#1
[5] “http://blog.ryanwalker.us/2016/01/color-quantization-in-r.html
[6] “http://rud.is/b/2016/01/03/zellingenach-a-visual-exploration-of-the-spatial-patterns-in-the-endings-of-german-town-and-village-names-in-r/

Finally, we want to read the tags and save them into a variable. Since the number of tags differs between the notes, we have to assess the number of tags for each note:

# number of tags
for (i in 1:nrow(mydata)){
  mydata$num.tag[i] <- length(which(names(xmllist[[i]])=="tag"))
}

head(mydata$num.tag, 20)

[1] 2 2 3 2 2 3 2 5 2 3 3 2 2 3 3 2 2 3 3 3

Since we want to save each tag into a single variable, we need to know the maximum number of tags.

tag.num <- max(mydata$num.tag)
tag.num

[1] 5

With the next code snippet we add three variables to our dataframe: both the position of the first and last tag as numeric variables and a variable (of class list) containing the positions of all tags.

# position of first tag
for (i in 1:nrow(mydata)){
  mydata$pos.1[i] <- which(names(xmllist[[i]])=="tag")[1]
}
# position of last tag
mydata$pos.2 <- mydata$pos.1 + mydata$num.tag - 1
# position of tags
for (i in 1:nrow(mydata)){
  mydata$pos.all[i] <- list(c(mydata$pos.1[i]:mydata$pos.2[i]))
}
# remove pos.1 and pos.2
mydata$pos.1 <- NULL
mydata$pos.2 <- NULL

Since we don't need the variables pos.1 and pos.2 for further processing, we remove them from our dataframe.

In the next step, we create 5 empty variables that will later on contain the tag names.

# create 5 new columns
num.col <- ncol(mydata) 
for (i in (ncol(mydata) + 1):(ncol(mydata) + tag.num)){
  mydata[, i] <- NA
  colnames(mydata)[i] <- paste0('tag.', i - num.col)
}

The following code snipped intents to write the tag names into the variables tag.1 to tag.5.

for (j in (num.col + 1):ncol(mydata)){
  for (i in 1:nrow(mydata)){
    mydata[i, j]  <- xmllist[[i]][mydata$pos.all[[i]][j - num.col]][[1]]
  }}

However, evaluating the code returns the following error message:

Error in '[<-.data.frame'('*tmp*', i, j, value = NULL) :
replacement has length zero

Has anybody got an idea how to get the preceding code snippet working? I'd appreciate every piece of advice.

Thus, I decided to write one loop for each of the five variables. This is definetely not best practice, but it works.

# 1st tag
for (i in 1:nrow(mydata)){
  mydata$tag.1[i]  <- xmllist[[i]][mydata$pos.all[[i]][1]][1]
}
# 2nd tag
for (i in 1:nrow(mydata)){
  mydata$tag.2[i]  <- xmllist[[i]][mydata$pos.all[[i]][2]][1]
}
# 3rd tag
for (i in 1:nrow(mydata)){
  mydata$tag.3[i]  <- xmllist[[i]][mydata$pos.all[[i]][3]][1]
}
# 4th tag
for (i in 1:nrow(mydata)){
  mydata$tag.4[i]  <- xmllist[[i]][mydata$pos.all[[i]][4]][1]
}
# 5th tag
for (i in 1:nrow(mydata)){
  mydata$tag.5[i]  <- xmllist[[i]][mydata$pos.all[[i]][5]][1]
}

In the following step, we define a function (source) replacing NULL by NA and apply this function to each of the five tag variables:

# define function
nullToNA <- function(x) {
  x[sapply(x, is.null)] <- NA
  return(x)
}

# apply function
for (i in (num.col+1):ncol(mydata)){
  for (j in 1:nrow(mydata)){
  mydata[j, i] <- nullToNA(mydata[j, i])
}}

Finally, we paste the values of the five tag variables into a single variable named tags. To do this, we use the paste2 function of the qdap package. Since we don't need the variables tag.1 to tag.5 for further processing, we remove them from the dataframe using the select function of the dplyr package.

mydata$tags <- qdap::paste2(mydata[(num.col+1):ncol(mydata)], 
                            sep = ", ", 
                            handle.na = TRUE, 
                            trim = TRUE)

mydata <- dplyr::select(mydata, -starts_with('tag.'))
mydata$pos.all <- NULL

The final dataframe consists of the following variables:

  • title containing the titles of the notes;
  • created containing the dates the notes were created;
  • www containing the notes' http addresses;
  • num.tag containing the number of tags for each note;
  • tags containing the tag names.

The following table gives an impression about how our final dataframe looks like.

knitr::kable(head(mydata), align = c('l', 'c', 'l', 'c', 'c'))
title created www num.tag tags
Network visualization in R with the igraph package | Rules of Reason 2016-01-06 https://rulesofreason.wordpress.com/2012/11/05/network-visualization-in-r-with-the-igraph-package/ 2 network analysis, R, NA, NA, NA
More debate analysis with R 2016-01-06 http://www.r-bloggers.com/more-debate-analysis-with-r/ 2 text mining, R, NA, NA, NA
Analyzing networks of characters in 'Love Actually' – Variance Explained 2016-01-05 http://varianceexplained.org/r/love-actually-network/ 3 network analysis, text mining, R, NA, NA
Web scraping in R 2016-01-05 http://cpsievert.github.io/slides/web-scraping/#1 2 webscraping, R, NA, NA, NA
Color Quantization in R 2016-01-04 http://blog.ryanwalker.us/2016/01/color-quantization-in-r.html 2 R, image processing, NA, NA, NA
Zellingenach: A visual exploration of the spatial patterns in the endings of German town and village names in R | rud.is 2016-01-04 http://rud.is/b/2016/01/03/zellingenach-a-visual-exploration-of-the-spatial-patterns-in-the-endings-of-german-town-and-village-names-in-r/ 3 text mining, geo, R, NA, NA

The packages used in this blog post can be loaded/installed using the following code:

pacman::p_load(XML, knitr, dplyr, qdap, stringr)

The xmllist object may be downloaded as an .RData file under the following link.

In one of my next blog posts, I will show how to analyse the tags.

Posted in Data Management | Tagged , | Leave a comment

RMarkdown: How to format tables and figures in .docx files

In research, we usually publish the most important findings in tables and figures. When writing research papers using Rmarkdown (*.Rmd), we have several options to format the output of the final MS Word document (.docx).
Tables can be formated using either the knitr package’s kable() function or several functions of the pander package.
Figure sizes can be determined in the chunk options, e.g.

{r name_of_chunk, fig.height=8, fig.width=12}.

However, options for customizing tables and figures are rather limited in Rmarkdown. Thus, I usually customize tables and figures in the final MS Word document.

In this blog post, I show how to quickly format tables and figures in the final MS Word document using a macro). MS Word macros are written in VBA (Visual Basic for Applications) and can be accessed from a menu list or from the toolbar and run by simply clicking. There are loads of tutorials explaining how to write a macro for MS Word, e.g http://www.addictivetips.com/microsoft-office/create-macros-in-word-2010/.

The following two macros are very helpful to format drafts. Since I want drafts to be as compact as possible, tables and figures should not to be too space consuming.

The first macro called FormatTables customizes the format of all tables of the active MS Word document. With wdTableFormatGrid2, we use a table style predefined in MS Word. A list of other table styles can be found under the follwing link. Furthermore, we define font name (Arial) and font size (8 pt), space before (6 pt) and after (10 pt) the table. Finally, the row height is set to 18 pt exactly.

Sub FormatTables()

 Dim tbl As Table
    For Each tbl In ActiveDocument.Tables
         tbl.AutoFormat wdTableFormatGrid2
         tbl.Range.Font.Name = "Arial"
         tbl.Range.Font.Size = 8
         tbl.Range.ParagraphFormat.SpaceBefore = 6
         tbl.Range.ParagraphFormat.SpaceAfter = 10
         tbl.Range.Cells.SetHeight RowHeight:=18, HeightRule:=wdRowHeightExactly

    Next

End Sub

The second macro called FormatFigures merely reduces the size of all figures in the active MS Word document to 45% of its original size.

Sub FormatFigures()

Dim shp As InlineShape


For Each shp In ActiveDocument.InlineShapes
    shp.ScaleHeight = 45
    shp.ScaleWidth = 45
Next

End Sub

Please also see my blog post RMarkdown: How to insert page breaks in a MS Word document.

Posted in Data Management | Tagged , , | Leave a comment