How to fix problems after updating to R 3.4.1 on Linux Mint 18.1 / Ubuntu 16.04

The Problem

After updating R to version 3.4.1 (“Single Candle”) on Linux Mint 18.1, RStudio could no longer find my installed R packages, and it was not possible to install R packages into some other directory either.

The Solution

Fortunately, this problem could be solved very easily. I just needed to tell R again where to find the package library for version 3.4. On my 64-bit machine, I first had to open the Renviron file using xed (the Linux Mint text editor).

sudo xed /usr/lib/R/etc/Renviron 

Second, I had to add the following line and then save and close the file.

R_LIBS_USER=${R_LIBS_USER-'~/R/x86_64-pc-linux-gnu-library/3.4'}

That was it. 🙂
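To check from within R that the setting was picked up, the following minimal sketch can be run after restarting R or RStudio. It uses base R only; the paths printed depend on the machine, so no output is shown here.

```r
# Value of the environment variable set in Renviron (empty string if unset)
user_lib <- Sys.getenv("R_LIBS_USER")
print(user_lib)

# Library trees R currently searches; the user library only shows up here
# if the directory actually exists
lib_paths <- .libPaths()
print(lib_paths)

# Does the user library directory exist? path.expand() resolves the '~'
print(dir.exists(path.expand(user_lib)))
```

If the last line prints FALSE, creating the directory by hand and restarting R is usually enough for it to appear in .libPaths().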


How to calculate Odds Ratios in Case-Control Studies using R

Intro

In June 2017 I started working at the Clinical Trial Centre Leipzig at Leipzig University. Since my knowledge of statistics is rather poor, my employer offered me the chance to attend some seminars in Medical Biometry at the University of Heidelberg. The first seminar I attended was called “Basics of Epidemiology”. On the first day, we learned how to calculate so-called odds ratios in case-control studies using a simple pocket calculator.

In this blog post, I will show how to calculate a simple odds ratio with its 95% CI using R.

Data simulation

The data I'm using in this blog post were simulated using the wakefield package. The following code returns a data frame with 2 binary variables (Exposition and Disease) and 1,000 cases.

library(wakefield)

mydata <- data.frame(Exposition = group(n = 1000, x = c('yes', 'no'), 
                             prob = c(0.75, 0.25)),
                     Disease = group(n = 1000, x = c('yes', 'no'), 
                             prob = c(0.75, 0.25)))
dim(mydata)
## [1] 1000    2
head(mydata)
##   Exposition Disease
## 1        yes     yes
## 2         no      no
## 3         no     yes
## 4        yes     yes
## 5        yes     yes
## 6        yes     yes

Based on this data frame, we calculate a table showing how many patients with and without exposure did or did not develop the disease.

tab <- table(mydata$Exposition, mydata$Disease)
tab
##      
##       yes  no
##   yes 569 210
##   no  163  58

Odds Ratio Calculation

To find out whether the odds of developing the disease are significantly higher in exposed patients, we need to calculate the odds ratio and its 95% CI.

The following function will return a data frame containing these values.

# return the odds ratio with its 95% CI (Wald method on the log scale)
# x: 2 x 2 table; x[1] to x[4] are its cells in column-wise order
f <- function(x) {
  # cross-product ratio, kept unrounded for the interval calculation
  or <- (x[1] * x[4]) / (x[2] * x[3])
  # standard error of log(OR)
  se <- sqrt(1/x[1] + 1/x[2] + 1/x[3] + 1/x[4])
  cil <- exp(log(or) - 1.96 * se)
  ciu <- exp(log(or) + 1.96 * se)
  # round only for display and return a one-row data frame
  data.frame(CI_95_lower = round(cil, 2),
             OR = round(or, 2),
             CI_95_upper = round(ciu, 2))
}

Now, we can apply the function to our table tab.

df.or <- f(tab)
knitr::kable(df.or, align = 'c')

CI_95_lower    OR    CI_95_upper
   0.69       0.96      1.35

As the results indicate, exposed patients do not have significantly higher odds of developing the disease than unexposed patients: the odds ratio is close to 1 and its 95% CI includes 1.
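As a cross-check, the calculation can be retraced step by step from the four cell counts of the table above. The sketch below assumes the usual Wald interval on the log scale, exp(log(OR) ± z · SE) with SE = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22); the variable names are made up for illustration, and intermediate values are deliberately left unrounded.

```r
# Cell counts from the table above (rows: exposure, columns: disease)
n11 <- 569  # exposed, diseased
n12 <- 210  # exposed, not diseased
n21 <- 163  # not exposed, diseased
n22 <- 58   # not exposed, not diseased

# Odds ratio as the cross-product ratio
or <- (n11 * n22) / (n12 * n21)

# Standard error of log(OR)
se_log_or <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)

# 95% interval: exact normal quantile instead of the rounded 1.96
z  <- qnorm(0.975)
ci <- exp(log(or) + c(-1, 1) * z * se_log_or)

print(round(c(lower = ci[1], OR = or, upper = ci[2]), 3))
```

Rounding only at the very end avoids small shifts in the interval bounds that can occur when an already rounded odds ratio is fed into the CI formula.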


How to parse Citavi files using R

Intro

In academic writing, the use of reference management software has become essential. A couple of years ago, I started using the free open-source software Zotero. However, most of my workmates at Leipzig University work with Citavi, a commercial program that is widely used at German, Austrian, and Swiss universities. The current version, Citavi 5, was released in April 2015.


In this blog post, I show how to import Citavi files into R.

Packages required

Citavi organizes references in SQLite databases. Although Citavi files end with .ctv5 rather than .sqlite, they may be parsed as sqlite files. For reproducing the code of this blog post, the following R packages are required.

library(RSQLite)
library(DBI)
library(tidyverse)
library(stringr)

Data

Import

A connection to the database myrefs.ctv5 may be established using the dbConnect() function of the DBI and the SQLite() function of the RSQLite package.

# connect to Citavi file
con <- DBI::dbConnect(RSQLite::SQLite(), 'myrefs.ctv5')

A list of all tables may be returned using the dbListTables() function of the DBI package. Our database contains 38 tables.

# get a list of all tables
dbListTables(con) 

[1] “Annotation” “Category”
[3] “CategoryCategory” “Changeset”
[5] “Collection” “DBVersion”
[7] “EntityLink” “Group”
[9] “ImportGroup” “ImportGroupReference”
[11] “Keyword” “KnowledgeItem”
[13] “KnowledgeItemCategory” “KnowledgeItemCollection”
[15] “KnowledgeItemGroup” “KnowledgeItemKeyword”
[17] “Library” “Location”
[19] “Periodical” “Person”
[21] “ProjectSettings” “ProjectUserSettings”
[23] “Publisher” “Reference”
[25] “ReferenceAuthor” “ReferenceCategory”
[27] “ReferenceCollaborator” “ReferenceCollection”
[29] “ReferenceEditor” “ReferenceGroup”
[31] “ReferenceKeyword” “ReferenceOrganization”
[33] “ReferenceOthersInvolved” “ReferencePublisher”
[35] “ReferenceReference” “SeriesTitle”
[37] “SeriesTitleEditor” “TaskItem”

Creating data frames

For reading the tables of the database, we need the dbGetQuery() function of the DBI package. Each table contains a number of variables. In case we want to save all variables into a data frame, we select them using the asterisk.

df.author <- dbGetQuery(con,'select * from Person')

In case we are only interested in a couple of variables, we need to specify their names separated by commas.

df.author <- dbGetQuery(con,'select ID, FirstName, LastName, Sex from Person')
df.keyword <- dbGetQuery(con,'select ID, Name from Keyword')
df.refs <- dbGetQuery(con,'select ID, Title, Year, Abstract, CreatedOn, ISBN, PageCount, PlaceOfPublication, ReferenceType from Reference')
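To try these queries without a Citavi file at hand, the same DBI calls can be run against a small in-memory SQLite database. The sketch below is only an illustration: the table content is made up, and the object names con_demo and df_demo are hypothetical. It also shows dbListFields(), which lists the column names of a table before a query is written, and dbDisconnect(), which closes the connection again.

```r
library(DBI)

# In-memory database standing in for a Citavi file (made-up content)
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con_demo, "Person",
             data.frame(FirstName = c("Ada", "Alan"),
                        LastName  = c("Lovelace", "Turing"),
                        Sex       = c(1L, 2L)))

# Inspect the column names of a table before writing a query
fields <- dbListFields(con_demo, "Person")
print(fields)

# Select a subset of columns, as in the queries above
df_demo <- dbGetQuery(con_demo, "select FirstName, LastName from Person")
print(df_demo)

# Close the connection when done
dbDisconnect(con_demo)
```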

Data wrangling

In the next step, we try to join our data frames using the dplyr package.

mydata <- df.refs %>%
  left_join(df.author, by = 'ID') %>%
    left_join(df.keyword, by = 'ID') 
## Error in eval(expr, envir, enclos): Can't join on 'ID' x 'ID' because of incompatible types (list / list)

However, R returns an error message disclosing that the data frames cannot be joined by their ID variable because of incompatible types.

When we take a closer look at the ID variables, we see that they are organized as a list containing 1603 elements of type raw. Moreover, each list element consists of 16 bytes, which R prints as two-digit hexadecimal values.

typeof(df.author$ID)
## [1] "list"
str(df.author$ID[[1]])
##  raw [1:16] 41 5b 18 51 ...
length(df.author$ID)
## [1] 1603

In order to be able to join our data frames, we need to convert the type of the ID variables from list to character. Furthermore, we collapse the 16 bytes of each list element into a single string separated by hyphens.

# df.author
for (i in 1:nrow(df.author)){
  df.author$ID[[i]] <- as.character(df.author$ID[[i]])
  df.author$ID[[i]] <- str_c(df.author$ID[[i]], sep = "", collapse = "-")
}
df.author$ID <- unlist(df.author$ID)
# df.keyword
for (i in 1:nrow(df.keyword)){
  df.keyword$ID[[i]] <- as.character(df.keyword$ID[[i]])
  df.keyword$ID[[i]] <- str_c(df.keyword$ID[[i]], sep = "", collapse = "-")
}
df.keyword$ID <- unlist(df.keyword$ID)
# df.refs
for (i in 1:nrow(df.refs)){
  df.refs$ID[[i]] <- as.character(df.refs$ID[[i]])
  df.refs$ID[[i]] <- str_c(df.refs$ID[[i]], sep = "", collapse = "-")
}
df.refs$ID <- unlist(df.refs$ID)
typeof(df.refs$ID)
## [1] "character"
head(df.refs$ID)
## [1] "34-26-a7-db-e2-79-ac-4c-ad-21-45-91-37-3f-e9-b0"
## [2] "75-e6-c1-fd-65-0a-82-48-a8-c8-e9-80-2a-79-db-2c"
## [3] "0b-8f-73-86-82-0a-c8-48-84-f7-d2-7a-04-df-12-65"
## [4] "7a-79-2b-1e-4a-ef-ae-40-ae-06-d5-bd-5d-e9-a7-cb"
## [5] "12-3c-0e-70-a2-2a-f9-48-ad-e4-b7-7e-85-63-97-96"
## [6] "bc-ae-06-78-89-48-df-47-88-b6-5f-92-20-9b-4c-7c"
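The three loops above repeat the same conversion for each data frame. As a sketch of a more compact alternative, the conversion can be wrapped in a small helper and applied with vapply(). The helper name raw_to_id and the toy IDs are made up for illustration; real Citavi IDs have 16 bytes rather than 4.

```r
# Collapse one raw vector into a hyphen-separated hex string
raw_to_id <- function(x) paste(as.character(x), collapse = "-")

# Toy list column with two made-up 4-byte IDs
ids <- list(as.raw(c(0x41, 0x5b, 0x18, 0x51)),
            as.raw(c(0x34, 0x26, 0xa7, 0xdb)))

# vapply() applies the helper to every element and returns a character vector
ids_chr <- vapply(ids, raw_to_id, character(1))
print(ids_chr)
```

Applied to the data frames of this post, this would read df.author$ID <- vapply(df.author$ID, raw_to_id, character(1)), and likewise for df.keyword and df.refs.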

Joining the data frames

Finally, our data frames may be joined by their ID variables.

mydata <- df.refs %>%
  left_join(df.author, by = 'ID') %>%
    left_join(df.keyword, by = 'ID') 

Results

Our final data frame contains 1066 cases and 13 variables, that is:

colnames(mydata)
##  [1] "ID"                 "Title"              "Year"              
##  [4] "Abstract"           "CreatedOn"          "ISBN"              
##  [7] "PageCount"          "PlaceOfPublication" "ReferenceType"     
## [10] "FirstName"          "LastName"           "Sex"               
## [13] "Name"
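One caveat, offered as an assumption rather than a verified fact about the Citavi schema: in relational databases of this kind, the ID columns of Reference, Person and Keyword are usually independent primary keys, so a join on ID alone may leave the author and keyword columns empty. The table list above contains link tables such as ReferenceAuthor and ReferenceKeyword, which suggests that references and persons are connected through them. The following sketch shows the idea on made-up miniature tables; the column names ReferenceID and PersonID are hypothetical and should be checked with dbListFields() first.

```r
# Made-up miniature tables; ReferenceID and PersonID are assumed column names
refs    <- data.frame(ID = c("r1", "r2"), Title = c("Paper A", "Paper B"),
                      stringsAsFactors = FALSE)
persons <- data.frame(ID = c("p1", "p2"), LastName = c("Meier", "Schulz"),
                      stringsAsFactors = FALSE)
ref_author <- data.frame(ReferenceID = c("r1", "r1", "r2"),
                         PersonID    = c("p1", "p2", "p2"),
                         stringsAsFactors = FALSE)

# Step 1: attach the link table to the references
step1 <- merge(refs, ref_author, by.x = "ID", by.y = "ReferenceID",
               all.x = TRUE)

# Step 2: attach the persons via the PersonID key
persons_renamed <- setNames(persons, c("PersonID", "LastName"))
joined <- merge(step1, persons_renamed, by = "PersonID", all.x = TRUE)

# One row per reference-author pair
print(joined[order(joined$ID), c("ID", "Title", "LastName")])
```

With dplyr, the same two steps can be written as left_join() calls with named by arguments, for example by = c("ID" = "ReferenceID").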

How to install and use the hexSticker package

Intro

A couple of days ago, the hexSticker package was published on CRAN. The package provides some functions to plot hexagon stickers that may be used to promote R packages. As described on GitHub, the stickers can be plotted either using base R's plotting function, the lattice package or the ggplot2 package. Moreover, it is also possible to plot image files.

Since I found it quite demanding to install the hexSticker package on a current Linux OS (Linux Mint 18.1), I decided to write a short tutorial explaining how to install and use the package on Ubuntu-based Linux operating systems.

Linux packages required

In a first step, we need to open the terminal to install the following software packages. While texinfo is required to build R packages from source, libudunits2-dev, fftw-dev and mffm-fftw1 are needed to install some R packages the hexSticker package depends on (ggforce, fftwtools).

sudo apt-get install texinfo libudunits2-dev fftw-dev mffm-fftw1 libfftw3-dev libtiff5-dev

R packages required

Recently, the fftwtools package was added to CRAN. Thus, it can be installed the usual way:

install.packages('fftwtools', dependencies = TRUE)

Finally, the EBImage package must be installed from the Bioconductor repository and the packages ggimage, ggforce and hexSticker must be installed from CRAN.

source("https://bioconductor.org/biocLite.R")
biocLite("EBImage")
install.packages("ggimage")
install.packages("ggforce")
install.packages("hexSticker")
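Before installing anything, it can be useful to check which of the packages are already present. The following sketch uses base R only; the vector pkgs simply lists the CRAN packages named in this post, and the install step is left commented out so that the snippet has no side effects beyond loading namespaces.

```r
# CRAN packages named in this post
pkgs <- c("fftwtools", "ggimage", "ggforce", "hexSticker")

# requireNamespace() returns FALSE (instead of failing) for missing packages
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]
print(missing_pkgs)

# Uncomment to install whatever is missing from CRAN
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

Note that on newer Bioconductor releases the biocLite() script has been replaced by the BiocManager package, so EBImage would be installed there with BiocManager::install("EBImage").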

For plotting an example hexsticker we need some data provided by the streetsofle package which must be installed from GitHub:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("nrkoehler/streetsofle")

Plotting

With the following code chunk we create two hexstickers for the streetsofle package. The colours chosen stem from the city flag of Leipzig. The arguments of the sticker() function are explained within the code's comments.

library(hexSticker)
library(ggplot2)
library(streetsofle)
data(streetsofle)

p.1 <- ggplot(aes(x = lon, y = lat), data = shape.ortsteile) + 
  theme_map_le() + 
  coord_quickmap() + 
  geom_polygon(aes(x = lon, y = lat, group = group), 
               fill = NA, 
               size = 0.2, 
               color = "#FFCB00") + 
  geom_polygon(aes(x = lon, y = lat, group = group),
               color = "#FFCB00", 
               size = 1, 
               fill = NA, 
               data = shape.bezirke) 

p.1 <- sticker(p.1,
               package="streetsofle", 
               s_x = 1, # horizontal position of subplot
               s_y = 1.1, # vertical position of subplot
               s_width = 1.4, # width of subplot
               s_height = 1.4, # height of subplot
               p_x = 1, # horizontal position of font
               p_y = .43, # vertical position of font
               p_size = 6, # font size
               p_color = "#FFCB00", # font colour
               h_size = 3, # hexagon border size
               h_fill = "#004CFF", # hexagon fill colour
               h_color = "#FFCB00") # hexagon border colour

p.2 <- ggplot(aes(x = lon, y = lat), data = shape.ortsteile) + 
  theme_map_le() + 
  coord_quickmap() + 
  geom_polygon(aes(x = lon, y = lat, group = group), 
               fill = NA, 
               size = 0.2, 
               color = "#004CFF") + 
  geom_polygon(aes(x = lon, y = lat, group = group),
               color = "#004CFF", 
               size = 1, 
               fill = NA, 
               data = shape.bezirke) 

p.2 <- sticker(p.2,
               package="streetsofle", 
               s_x = 1, # horizontal position of subplot
               s_y = 1.1, # vertical position of subplot
               s_width = 1.4, # width of subplot
               s_height = 1.4, # height of subplot
               p_x = 1, # horizontal position of font
               p_y = .43, # vertical position of font
               p_size = 6, # font size
               p_color = "#004CFF", # font color
               h_size = 3, # hexagon border size
               h_fill = "#FFCB00", # hexagon fill colour
               h_color = "#004CFF") # hexagon border colour

Finally, both plots are put into a grid layout using the grid.arrange() function of the gridExtra package.

library(gridExtra)
grid.arrange(p.1, p.2, ncol = 2, respect = TRUE)

[Figure: the two hexagon stickers side by side]

I'm not sure which sticker looks better. What do you think?


How to plot a companion planting guide using ggplot2

Intro

Recently my girlfriend asked me whether I could do a data project that is useful for our family rather than just for myself. A couple of years ago, after our first son was born, we decided to rent a small garden (allotment garden) very close to our flat. In our garden we grow some fruit and vegetables, e.g. strawberries, peas, beans and beetroot.

When growing fruit and vegetables, it is important to know that plants compete for resources. While some plants benefit from one another, other plants should not be grown together. The internet provides loads of so-called companion planting guides or charts (see Link). For each plant, they define lists of companions (growing well together) and antagonists (not growing well together).

This blog post is to show how to visualize a companion planting guide using R and the ggplot2 package.

Packages and data

Three R packages are required to reproduce the results of this blog post: readxl for importing data from Excel, dplyr for data wrangling and ggplot2 for data visualization.

library(readxl)
library(dplyr)
library(ggplot2)

The data I'm going to visualize stem from some companion planting guides published in German (see Link). I only selected the plants relevant for our garden and entered the data manually into an Excel worksheet.

In the first step, I saved the imported data into a data frame (mydata.1). The second data frame mydata.2 is a copy of mydata.1 with the first two variables in reverse order. In order to obtain a matrix (with the same number of factor levels in each column), I merged mydata.1 and mydata.2 into mydata using the rbind() function.

mydata.1 <- readxl::read_excel('guide.xlsx', sheet = 1) 
mydata.2 <- mydata.1[, c(2, 1, 3)]
colnames(mydata.2) <- colnames(mydata.1)
mydata <- rbind(mydata.1, mydata.2)
rm(list = c(setdiff(ls(), c('mydata'))))
head(mydata, 10)
## # A tibble: 10 × 3
##          plant_1  plant_2    status
##            <chr>    <chr>     <chr>
## 1       cucumber     peas companion
## 2           corn     peas companion
## 3         radish     peas companion
## 4           peas cucumber companion
## 5           corn cucumber companion
## 6       beetroot cucumber companion
## 7           corn potatoes companion
## 8       potatoes  cabbage companion
## 9         radish  cabbage companion
## 10 strawberries   lettuce companion

Finally, I removed all objects from workspace not required for data visualization and printed the first ten lines of the tibble (slightly modified data frame) mydata.
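The mirroring step can be illustrated on a tiny made-up example. The sketch below uses base R only; the two rows and the object names pairs_1, pairs_2 and pairs_all are hypothetical stand-ins for the Excel data.

```r
# Two made-up companion pairs, each entered in one direction only
pairs_1 <- data.frame(plant_1 = c("cucumber", "corn"),
                      plant_2 = c("peas", "peas"),
                      status  = c("companion", "companion"),
                      stringsAsFactors = FALSE)

# Swap the first two columns and restore the original column names
pairs_2 <- pairs_1[, c(2, 1, 3)]
colnames(pairs_2) <- colnames(pairs_1)

# Stack both directions so that every pair appears as (a, b) and (b, a)
pairs_all <- rbind(pairs_1, pairs_2)
print(pairs_all)

# After mirroring, both columns contain the same set of plants
same_plants <- setequal(pairs_all$plant_1, pairs_all$plant_2)
print(same_plants)
```

Because both columns now share the same set of plants, converting them to factors yields identical levels on the x and y axis, which is what makes the plotted matrix square and symmetric.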

Wrangling

With the following code snippet, all variables of mydata are transformed into factors. Furthermore, a parameter needed for plotting is saved as a vector named labs.n.

mydata <- dplyr::mutate_all(mydata, as.factor)
labs.n <- length(levels(mydata$plant_1)) + .5
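Why the extra 0.5? ggplot2 places the levels of a discrete axis at the integer positions 1, 2, ..., n, so tiles of width 1 have their edges at 0.5, 1.5, ..., n + 0.5. The following sketch with a made-up three-level factor shows the sequence of edge positions that results.

```r
# Made-up factor with three levels
plants <- factor(c("beans", "corn", "peas"))

# Number of levels plus 0.5 marks the outer edge of the last tile
n_levels <- length(levels(plants))
upper    <- n_levels + 0.5

# Positions of the tile boundaries, one step of 1 apart
breaks <- seq(0.5, upper, 1)
print(breaks)
```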

Plotting

Finally, we plot the matrix using the ggplot2 package. In order to adjust the position of the grid lines, we remove the default grid lines with panel.grid.major = element_blank() and panel.grid.minor = element_blank() and draw new grid lines with geom_vline() and geom_hline(). The new grid lines coincide with the boundaries of the tiles.

ggplot(mydata, aes(x = plant_1, y = plant_2, fill = status)) + 
  theme_grey() +
  coord_equal() +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.position = 'bottom') +
  geom_vline(xintercept = seq(0.5, labs.n, 1), color='white') +
  geom_hline(yintercept = seq(0.5, labs.n, 1), color='white') +
  scale_fill_manual('', values = c("red3", "green3")) +
  labs(x='', 
       y='',
       title = "Companion Planting Guide")

[Figure: Companion Planting Guide]

The interpretation of the companion planting guide is very easy: While green tiles signal companion plants that can be grown close to each other, red tiles flag antagonist plants that should not be grown together. The squares are gray when the plants are neither companions nor antagonists.


My first R package: streetsofle

Intro

A couple of days ago I published my first R package on GitHub. I’ve named the package streetsofle, standing for streets of Leipzig, because it includes a data set containing the street directory of the German city of Leipzig. The street directory was published by the Statistical Bureau of Leipzig and may be downloaded as a PDF file from the following website.

Leipzig is divided into 10 greater administrative districts (“Stadtbezirke”) and 63 smaller local districts (“Ortsteile”). The names of both smaller and greater districts can be found under the following link.

Figure 1: Map of Leipzig


Furthermore, the Leipzig area is covered by 34 postal codes (“Postleitzahlen”). That is:

##  [1] "04103" "04105" "04107" "04109" "04129" "04155" "04157" "04158"
##  [9] "04159" "04177" "04178" "04179" "04205" "04207" "04209" "04229"
## [17] "04249" "04275" "04277" "04279" "04288" "04289" "04299" "04315"
## [25] "04316" "04317" "04318" "04319" "04328" "04329" "04347" "04349"
## [33] "04356" "04357"

Installation

if (!require("devtools")) install.packages("devtools")
devtools::install_github("nrkoehler/streetsofle")

The Dataset

The data frame streetsofle contains 3391 observations and 8 variables. That is:

  • plz: postal code,
  • street_key: street identification number,
  • street_name: street name,
  • street_num: a list of street numbers,
  • bz_key: identification number for greater districts (‘Stadtbezirke’),
  • bz_name: names of greater districts (‘Bezirke’),
  • ot_key: identification number for smaller districts (‘Ortsteile’),
  • ot_name: names of smaller districts (‘Ortsteile’).

Since street sections without addresses are usually not covered by postal codes, the variable plz contains some missing values (n=169).

Next steps

Writing functions

Within the next couple of weeks I will write some functions to analyse the street directory data. The next code snippet, for example, shows how to calculate the number of smaller districts (‘Ortsteile’) traversed by the streets.

f <- function(x){length(unique(x))}
df <- data.frame(ot_num = tapply(streetsofle$ot_name, streetsofle$street_name, f))
psych::headTail(df)
##                      ot_num
## Aachener Straße           1
## Abrahamstraße             1
## Abtnaundorfer Straße      1
## Achatstraße               1
## ...                     ...
## Zwergmispelstraße         1
## Zwetschgenweg             1
## Zwickauer Straße          3
## Zwiebelweg                1

The following table shows the number of smaller districts traversed by the streets of Leipzig.

no. of districts    no. of streets
       1                 2721
       2                  207
       3                   49
       4                   11
       5                    4
       6                    2
       7                    2
      10                    1

While 91% of the streets don’t traverse any district border, one street traverses the borders of 10 districts.
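A frequency table like the one above can be derived from the per-street counts with table() and prop.table(). The sketch below uses a small made-up street directory instead of the streetsofle data, so the street and district names as well as the resulting numbers are purely illustrative.

```r
# Made-up street directory: one row per street section
streets <- data.frame(
  street_name = c("A-Weg", "A-Weg", "B-Gasse",
                  "C-Allee", "C-Allee", "C-Allee", "D-Pfad"),
  ot_name     = c("Nord", "Mitte", "Nord",
                  "Nord", "Mitte", "Sued", "Sued"),
  stringsAsFactors = FALSE)

# Number of distinct districts per street
ot_num <- tapply(streets$ot_name, streets$street_name,
                 function(x) length(unique(x)))

# How many streets traverse 1, 2, 3, ... districts
freq <- table(ot_num)
print(freq)

# The same as percentages of all streets
share <- round(100 * prop.table(freq))
print(share)
```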

Shiny Web-App

Based on these functions I’m planning to write a Shiny Web-App providing several search functions.


How to scrape, import and visualize Telegram chats

Intro

Telegram is a cloud-based and cross-platform instant messaging service. Unlike WhatsApp, Telegram clients exist not only for mobile devices but also for desktop operating systems. In addition, there is also a web based client.

In this blog post I show how to import Telegram chats into R and how to plot a chat using the qdap package.

R packages

The following code will load and, if necessary, install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(readr, stringr, qdap, lubridate)

Scraping the data

A very straightforward way to save Telegram chats is to use the Chrome extension Save Telegram Chat History. On Quora, Stacy Wu explains how to use it:

  • Visit https://web.telegram.org.
  • Select a peer you want to get the chat history from.
  • Load the messages.
  • Select all text (Ctrl+A) and copy it (Ctrl+C).
  • Paste the text into a text editor (Ctrl+V) and save to a local file.

I have saved the chat to a local csv-file. Since the first two lines contain non-tabular information, they need to be manually removed. Furthermore, undesired line breaks must be manually removed as well.

Importing the data

The following line of code shows how to import the csv file into R using the read_csv() function of the readr package.

mydata <- readr::read_csv("telegram.csv", col_names=FALSE)
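The non-tabular lines at the top of the file do not necessarily have to be deleted by hand; they can also be skipped during the import. The sketch below demonstrates this with base R's read.csv() on a temporary file. The file content is made up, and the layout (two header lines, then one comma-separated row per message) is an assumption about the export format.

```r
# Write a made-up chat export with two non-tabular lines at the top
tmp <- tempfile(fileext = ".csv")
writeLines(c("Chat history",
             "Exported from the web client",
             "20.01.2017 21:10:14,Me: Hello",
             "20.01.2017 21:11:42,Other: Huhu"), tmp)

# skip = 2 ignores the first two lines before parsing starts
chat <- read.csv(tmp, header = FALSE, skip = 2, stringsAsFactors = FALSE)
print(chat)

# Remove the temporary file again
unlink(tmp)
```

readr::read_csv() has a skip argument as well, so the import line above could be extended accordingly. Base R names the columns V1 and V2, whereas readr uses X1 and X2.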

Data wrangling

After importing the file, our data frame consists of two string variables: X1 containing information about day and time of the conversations and X2 containing the names of the persons involved in the chat as well as the chat text. With the following lines of code we create 4 new variables:

  • day containing the dates of the chats,
  • time containing the times of day of the chats,
  • person containing the names of the persons involved in the chat,
  • txt containing the actual chat text.
mydata$day <- stringr::str_sub(mydata$X1, 1, 10)
mydata$day <- lubridate::dmy(mydata$day)
mydata$time <- stringr::str_sub(mydata$X1, 12, 19)
mydata$time <- lubridate::hms(mydata$time)
mydata$person <- stringr::str_extract(mydata$X2, "[^:]*")
mydata$person <- factor(mydata$person, levels = unique(mydata$person), labels = c('Me', 'Other'))
mydata$txt <- sub("^[^:]*:\\s*", "", mydata$X2)
mydata <- mydata[, c(3:6)]
head(mydata, 2)
## # A tibble: 2 × 4
##          day         time person   txt
##       <date> <S4: Period> <fctr> <chr>
## 1 2017-01-20  21H 10M 14S     Me Hello
## 2 2017-01-20  21H 11M 42S  Other  Huhu
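When splitting names and text, a message that itself contains a colon (a time of day, for instance) is a case worth testing. The following base R sketch runs the same kind of parsing on two made-up lines; the date format dd.mm.yyyy and the sample strings are assumptions for illustration.

```r
# Made-up raw columns as they might look after the import
x1 <- c("20.01.2017 21:10:14", "20.01.2017 21:11:42")
x2 <- c("Me: Hello", "Other: see you at 10:30")

# Date and time are taken from fixed character positions
day  <- as.Date(substr(x1, 1, 10), format = "%d.%m.%Y")
time <- substr(x1, 12, 19)

# Everything before the first colon is the name ...
person <- sub(":.*$", "", x2)
# ... and everything after the first colon (plus whitespace) is the text
txt <- sub("^[^:]*:\\s*", "", x2)

print(data.frame(day, time, person, txt))
```

Anchoring the pattern at the first colon keeps later colons inside the message intact, whereas a greedy pattern such as ".*:" would cut the text at the last colon.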

Gradient word cloud

Since the chat involves only two persons, I decided to plot it as a gradient word cloud, a visualization technique developed by Tyler Rinker. The function gradient_cloud() I use in the next code snippet is part of his qdap package. Gradient word clouds “color words with a gradient based on degree of usage between two individuals”.

gradient_cloud(mydata$txt, mydata$person, title = "Gradient word cloud of Telegram chat")

[Figure: gradient word cloud of the Telegram chat]

The chat I have plotted is very short and, thus, not very telling. I wonder what it will look like in a couple of months.
