Drawing a Fish Curve using R and ggplot2

Intro

Recently, I wondered whether there is a way to draw a fish shape using a mathematical function. Since I did not find a ready-made R function, I tried to write the function by myself. The equations, I've used for writing this function can be found on WolframMathWorld.

The function

The fish_curve() function requires the ggplot2 and the dplyr package. It creates a data frame with two variables (x and y) and 10.000 observations. Finally, the data points are plotted using ggplot2.

fish_curve <- function(colour='black', size = 5){
  library(ggplot2)
  library(dplyr)
  data.frame(
    x = cos(1:10000) - sin(1:10000)^2 / sqrt(2),
    y = cos(1:10000) * sin(1:10000)
  ) %>%
    ggplot(., aes(x, y)) +
    geom_point(colour = colour, size = size) +
    theme_void()
}

Function call with default parameters

With colour and size the fish_curve() function allows the user to specify two parameters; that is colour and size of the plotted points. The default values are black for colour and 5 for size.

(p1 <- fish_curve())

plot of chunk fish-1

Customization

In the following example, we customize colour and size of the fish shape:

(p2 <- fish_curve(colour = 'blue', size = 1))

plot of chunk fish-2

And finally, we place the two plots side by side using the patchwork package:

library(patchwork)
p1 + p2

plot of chunk unnamed-chunk-1

How to order factors by level frequency and level name

Intro

Quite frequently, factor variables are ordered by level frequency. However, factor levels having only a few observations are sometimes collapsed into one level usually named “others”. Since this level is usually not of particular interest, it may be a good idea to put this level in the last position of the plot rather than ordering it by level frequency. In this blog post, I’m going to show how to order a factor variable by level frequency and level name.

To replicate the R code I’m going to use in this post, four R packages must be loaded:

library(dplyr) # for data manipulation
library(ggplot2) # for plotting data
library(gghighlight) # ggplot2 extension for highlighting values

The dataset I’m going to use in this post (mtcars) is part of the datasets package.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

In the first code chunk, we:

  • extract the first word of each car name and write it into a new variable called “brand”,
  • rename all car brands starting with “M” (Mazda, Merc, Maserati) to “Others” and
  • calculate the median miles per gallon (mpg) for each car brand.
df.mtcars %
  mutate(name = str_extract(rownames(.), "^\\w+\\b"),
         brand = str_replace(name, "^M\\w+", 'Others')) %>%
  group_by(brand) %>%
  summarize(mpg = median(mpg))
df.mtcars$brand
##  [1] "AMC"      "Cadillac" "Camaro"   "Chrysler" "Datsun"   "Dodge"   
##  [7] "Duster"   "Ferrari"  "Fiat"     "Ford"     "Honda"    "Hornet"  
## [13] "Lincoln"  "Lotus"    "Others"   "Pontiac"  "Porsche"  "Toyota"  
## [19] "Valiant"  "Volvo"

The following code chunk is to reorder the brand variable by level frequency using the reorder() function.

df.mtcars %
  mutate(brand = as.factor(brand),
         brand = reorder(brand, mpg))
levels(df.mtcars$brand)
##  [1] "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler" "AMC"     
##  [7] "Dodge"    "Ford"     "Valiant"  "Others"   "Pontiac"  "Ferrari" 
## [13] "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"    
## [19] "Honda"    "Lotus"

As we can see, the bar representing the “Others” level is roughly in the middle of the plot.

ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
  coord_flip() +
  geom_col(width = 0.5) +
  gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
  scale_fill_manual(values = c("grey")) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x = NULL, 
       y = 'Miles per Gallon',
       title = "Factor variable ordered by level frequency")

plot of chunk unnamed-chunk-4

To put the bar representing the “Others” level at the bottom of the plot, we have to set “Others” as reference category using the relevel() function.

df.mtcars %
  mutate(brand = relevel(brand, ref = "Others"))
levels(df.mtcars$brand)
##  [1] "Others"   "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler"
##  [7] "AMC"      "Dodge"    "Ford"     "Valiant"  "Pontiac"  "Ferrari" 
## [13] "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"    
## [19] "Honda"    "Lotus"

Finally, the bar representing the “Others” level appears at the desired position.

ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
  coord_flip() +
  geom_col(width = 0.5) +
  gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
  scale_fill_manual(values = c("grey")) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x = NULL, 
       y = 'Miles per Gallon',
       title = "Factor variable ordered by level frequency and level name")

plot of chunk unnamed-chunk-6

PS: In both plots, the gghighlight() function of the gghighlight package was used to highlight the desired factor level.

How to Plot Venn Diagrams Using R, ggplot2 and ggforce

Intro

Venn diagrams – named after the English logician and philosopher John Venn – “illustrate the logical relationships between two or more sets of items” with overlapping circles.

In this tutorial, I'll show how to plot a three set venn diagram using R and the ggplot2 package.

Packages and Data

For the R code to run, we need to install and load three R packages. Unlike tidyverse and ggforce, the limma package must be installed from Bioconductor rather than from CRAN.

Moreover, we create a random data frame using the rbinom() function.

source("http://www.bioconductor.org/biocLite.R")
biocLite("limma")
library(limma)
library(tidyverse)
library(ggforce)
set.seed((123))
mydata <- data.frame(A = rbinom(100, 1, 0.8),
                     B = rbinom(100, 1, 0.7),
                     C = rbinom(100, 1, 0.6)) %>%
                       mutate_all(., as.logical)

Drawing the Circles

Next, we create a data frame defining the x and y coordinates for the three circles we want to draw and a variable defining the labels. For plotting the circles – the basic structure of our venn diagram – we need the geom_circle() function of the ggforce package. We employ the geom_circle()-function of the ggforce package to actually draw the circles. With the parameter r (= 1.5), we define the radius of the circles.

df.venn <- data.frame(x = c(0, 0.866, -0.866),
                      y = c(1, -0.5, -0.5),
                      labels = c('A', 'B', 'C'))
ggplot(df.venn, aes(x0 = x, y0 = y, r = 1.5, fill = labels)) +
    geom_circle(alpha = .3, size = 1, colour = 'grey') +
      coord_fixed() +
        theme_void()

plot of chunk unnamed-chunk-2

Furthermore, we need a data frame with the values we want the plot and the coordinates for plotting these values. The values can be obtained using the vennCounts() function of the limma package. Since ggplot2 requires data frames we need to first transform the vdc object (class VennCounts) into a matrix and then into a data frame. In addition, we need to add the x and y coordinates for plotting the values.

vdc <- vennCounts(mydata)
class(vdc) <- 'matrix'
df.vdc <- as.data.frame(vdc)[-1,] %>%
  mutate(x = c(0, 1.2, 0.8, -1.2, -0.8, 0, 0),
         y = c(1.2, -0.6, 0.5, -0.6, 0.5, -1, 0))

The final Plot

Finally, we'll customize the look of our venn diagram and plot the values.

ggplot(df.venn) +
  geom_circle(aes(x0 = x, y0 = y, r = 1.5, fill = labels), alpha = .3, size = 1, colour = 'grey') +
  coord_fixed() +
  theme_void() +
  theme(legend.position = 'bottom') +
  scale_fill_manual(values = c('cornflowerblue', 'firebrick',  'gold')) +
  scale_colour_manual(values = c('cornflowerblue', 'firebrick', 'gold'), guide = FALSE) +
  labs(fill = NULL) +
  annotate("text", x = df.vdc$x, y = df.vdc$y, label = df.vdc$Counts, size = 5)

plot of chunk unnamed-chunk-4

How to plot GPS data using R, ggplot2 and ggmap

Intro

A couple of months ago, I bought a mobile GPS navigation device. The device makes it possible to record the routes we travel with our car. Since the routes can by exported as .gpx files, it is rather easy to plot them using R.

Packages

To replicate this blog post, three R packages are required: plotKML to import the data into R and ggplot2 and ggmap to plot the data.

library(plotKML)
library(ggplot2)
library(ggmap)

As a prerequisite for the installation of the plotKML package on my Linux Mint 18.3 OS (based on Ubuntu 16.04), I needed to add the ubuntugis PPA.

sudo add-apt-repository ppa:ubuntugis/ppa
sudo apt-get update
sudo apt-get dist-upgrade

Import

All what's required for importing the .gpx file into R is the readGPX() function of the plotKML package. After importing the file into R, we receive a list (lst.rd) conatining some meta information and one data frame. To be able to plot the GPS data using ggplot2, we need to subset this list and save the data frame into a new object (df).

lst.rd <- readGPX('29 Dez. 1413.gpx')
df <- lst.rd$tracks[[1]][[1]] 

Using the head function, we have a short look on the data:

head(df)
##        lon      lat                 time
## 1 12.98376 51.05915 2017-12-29T12:16:28Z
## 2 12.98377 51.05912 2017-12-29T12:16:29Z
## 3 12.98374 51.05912 2017-12-29T12:16:30Z
## 4 12.98372 51.05914 2017-12-29T12:16:31Z
## 5 12.98372 51.05914 2017-12-29T12:16:32Z
## 6 12.98372 51.05914 2017-12-29T12:16:33Z

As we can see, our data frame consists of three variables: longitude, latitude and time.

Plotting

Whithout much efford, we can plot our GPS data using ggplot2. Please note that GPS coordinates must be plotted as points.

ggplot(df, aes(x = lon, y = lat)) +
  coord_quickmap() +
  geom_point()

plot of chunk plot-1

In addition, the ggmap package offers some functionality to plot the data on maps.

The first example shows how to plot the data on a map provided by Google Maps.

mapImageData <- get_googlemap(center = c(lon = mean(df$lon), lat = mean(df$lat)),
                              zoom = 10,
                              color = 'bw',
                              scale = 1,
                              maptype = "terrain")
ggmap(mapImageData, extent = "device") + # removes axes, etc.
  geom_point(aes(x = lon,
                 y = lat),
             data = df,
             colour = "red3",
             alpha = .1,
             size = .1)

plot of chunk plot-2

The second example shows how to plot the data on a Stamen map.

mapImage <- get_map(location = c(lon = mean(df$lon) - 0.05, lat = mean(df$lat)),
                    source = "stamen",
                    maptype = "toner",
                    zoom = 10)

ggmap(mapImage, extent = "device") + 
  geom_point(aes(x = lon,
                 y = lat),
             data = df,
             colour = "red3",
             size = .2) 

plot of chunk plot-3

References

This blog post heavily borrows from the tutorial Mapping GPS Tracks in R. Thanks very much!

Analyse der DFG Fachkollegienwahl 2015 — Teil I

Die Deutsche Forschungsgemeinschaft (DFG) ist die europaweit größte Forschungsförderungsorganisation mit einem Jahresetat von aktuell etwa 3 Milliarden Euro (2016). Die Aufgabe der DFG besteht in der Auswahl und Finanzierung wissenschaftlicher Forschungsvorhaben an Hochschulen und Forschungsinstituten.

Laut DFG werden eingereichte Förderanträge in einem mehrstufigen Entscheidungsverfahren “von ehrenamtlich tätigen Gutachterinnen und Gutachtern nach ausschließlich wissenschaftlichen Kriterien beurteilt.” In einem zweiten Schritt werden diese Fachgutachten von Mitgliedern sogenannter Fachkollegien bewertet. Diese treffen eine Vorauswahl und entscheiden, ob ein Antragsverfahren weitergeführt oder eingestellt wird. Die eigentliche Förderentscheidung liegt jedoch beim DFG-Hauptausschuss.

Für die Wahl der Fachkollegien wurde im Jahr 2014 eine neue Wahlordnung verabschiedet, nach der jene Kandidaten mit den meisten Wahlvorschlägen einen Platz auf den Wahllisten bekommen (stimmt das?). Dies habe, so der Bamberger Historiker Hartwin Brandt in zwei Beiträgen für die Frankfurter Allgemeine Zeitung (FAZ, 17.05.2017, 18.10.2017) zu inoffiziellen Absprachen zwischen Hochschulvertetern geführt – mit dem Ziel, die eigenen Kanditaten auf die Wahllisten zu platzieren.

Brandt, der für den Bereich “Klassische Philologie” einen “Nordverbund” bestehend aus den Universitäten Bremen, Greifswald, Hamburg, Kiel und Rostock identifiziert, vermutet, dass eine “detaillierte Analyse der mehr als 600 Seiten umfassenden DFG-Liste der Kandidierenden etliche weitere Auffälligkeiten und Kompensationsgeschäfte zutage fördern” dürfte und mutmaßt, dass “soziologische Netzwerk- und Freundschaftsforscher ihre Freude an diesem Datensatz haben werden”.

Die von der DFG veröffentlichte “Liste der Kandidierenden für die DFG-Fachkollegienwahl 2015” (Stand 17.09.2015) ist ein 617 A4-Seiten umdassendes PDF-Dokument. Sie kann als PDF-Datei von den Internetseiten der DFG heruntergeladen werden (Link). Dazu kommt eine ebenfalls von der DFG veröffentlichte Liste mit den Wahlergebnissen.

Um die in beiden Dokumenten enthaltenen Angaben statistisch auswerten zu können, müssen diese zunächst in ein maschinenlesbares Format überführt werden. Mit der Funktion exctract_text() aus dem R-Paket tabulizer können die Bezeichnungen und Nummern der einzelnen Fachgebiete ausgelesen werden. Zwar lassen sich mit der exctract_tables()-Funktion prinzipiell auch in PDF-Dateien enthaltene Tabellen auslesen. Meine Versuche, die auf das entsprechende PDF-Dokument anzuwenden, waren jedoch leider nicht von Erfolg gekrönt, sodass ich die Daten im Copy-and-Paste-Verfahren manuell in eine Excel-Datei üoberführt habe. Da Copy-and-Paste ein vergleichsweise fehleranfälliges Verfahren ist, würde ich mich freuen, von den Lesern dieses Beitrags auf eventuelle Fehler hingewiesen zu werden. Der Datensatz kann von meinem Github-Repository heruntergeladen werden (zur Zeit nur im Format .rda).

Der Datensatz df.dfg enthält die folgenden Variablen:

  • wb.nummer: Nummer des Wissenschaftsbereichs;
  • wb.name: Name des Wissenschaftsbereichs;
  • fg.nummer: Nummer des Fachgebiets;
  • fg.name: Name des Fachgebiets;
  • fk.nummer: Nummer des Fachkollegiums;
  • fg.name : Name des Fachkollegiums;
  • fach.nummer: Nummer des Fachs;
  • fach.name: Name des Fachs;
  • person.name: Name und Vorname der kandidierenden Person;
  • person.sex: Geschlecht der kandidierenden Person;
  • person.inst: Institution der kandidierenden Person;
  • inst.von: vorschlagende Institutionen
  • inst.von.kat: Kategorisierung der vorschlagenden Institution in: Hochschule vs. Fachgesellschaft vs. Sonstiges;
  • anzahl.plaetze: Anzahl wählbarer Personen pro Fachkollegium;
  • anzahl.kandidaten: Anzahl kandidierender Personen pro Fachkollegium;
  • person.platz: Platzierung der kandidierende Person (Wahlergebnis)
  • person.stimmen: Anzahl der Stimmen, mit denen der Kandidierende gewählt wurde (Wahlergebnis);
  • person.gewaehlt: Wurde die kandidierende Person gewählt?
## Observations: 9,695
## Variables: 18
## $ wb.nummer         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ wb.name           <fct> Geistes- und Sozialwissenschaften, Geistes- ...
## $ fg.nummer         <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, ...
## $ fg.name           <fct> Geisteswissenschaften, Geisteswissenschaften...
## $ fk.nummer         <chr> "FN_101", "FN_101", "FN_101", "FN_101", "FN_...
## $ fk.name           <chr> "Alte Kulturen", "Alte Kulturen", "Alte Kult...
## $ fach.nummer       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ fach.name         <chr> "Ur- und Frühgeschichte (weltweit)", "Ur- un...
## $ person.name       <chr> "Bemmann, Jan", "Brather, Sebastian", "Brath...
## $ person.sex        <fct> m, m, m, w, m, m, m, m, m, w, w, w, m, m, m,...
## $ person.inst       <chr> "U Bonn", "U Freiburg", "U Freiburg", "RGZM"...
## $ inst.von          <chr> "U Bonn", "U Freiburg", "WGL", "WGL", "Dt. V...
## $ inst.von.kat      <chr> "Hochschule", "Hochschule", "Sonstiges", "So...
## $ anzahl.plaetze    <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ anzahl.kandidaten <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,...
## $ person.platz      <dbl> 7, 5, 5, 6, 2, 2, 9, 8, 8, 1, 1, 1, 3, 3, 4,...
## $ person.stimmen    <dbl> NA, NA, NA, NA, 195, 195, NA, NA, NA, 255, 2...
## $ person.gewaehlt   <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALS...

Im Bereich der medizinischen Fächer gibt es die Besonderheit, dass für Kandidaten, die an einem Universitätsklinikum beschäftigt sind, als Institution das Klinikum, und nicht die Universität genannt wird. Um tatsächlich vorhandene Netzwerkstrukturen rekonstruieren zu können, habe ich immer dann, wenn sich ein Klinikum eindeutig einer Universität zuordnen ließ, in der Variable Institution den Namen der Universität abgespeichert (z.B. “Universitätsklinikum Münster als "Universität Münster”). Falls keine eindeutige Zuordnung möglich war, habe ich den Namen des Klinikums nicht überschrieben (z.B. “Universitätsklinikum Schleswig-Holstein”).

Die Liste der Kandidierenden lässt sich in vier “Wissenschaftsbereiche” untergliedern, denen die Fachkollegien zugeordnet sind. Zuguterletzt ist jedem Fachkollegium eine Reihe von Fächern zugeordnet (z.B. “Literaturwissenschaften” in “Ältere deutsche Literatur”, “Neuere deutsche Literatur” etc.).

  • Geistes- und Sozialwissenschaften
    • Geisteswissenschaften (Fachkollegium 101–108)
    • Sozial- und Verhaltenswissenschaften (Fachkollegium 109–113)
  • Lebenswissenschaften
    • Biologie (Fachkollegium 201–203)
    • Medizin (Fachkollegium 204–206)
    • Agrar-, Forstwissenschaften und Tiermedizin (Fachkollegium 207)
  • Naturwissenschaften
    • Chemie (Fachkollegium 301–307)
    • Physik (Fachkollegium 308–311)
    • Mathematik (Fachkollegium 312)
    • Geowissenschaften (Fachkollegium 313–318)
  • Ingenieurswissenschaften
    • Maschinenbau und Produktionstechnik (Fachkollegium 401–402)
    • Wärmetechnik/Verfahrenstechnik (Fachkollegium 403–404)
    • Materialwissenschaft und Werkstofftechnik (Fachkollegium 405–406)
    • Informatik, System- und Elektrotechnik (Fachkollegium 407–409)
    • Bauwesen und Architektur (Fachkollegium 410)

In den nächsten Monaten werde ich mich in einer Reihe von Blogposts an einer statistischen Analyse der DFG-Kandidierendenliste versuchen.

Aktualisiert am: 17.02.2018

How to draw a CONSORT flow diagram using R and GraphViz

Intro

The acronym CONSORT stands for “Consolidated Standards of Reporting Trials”. The CONSORT statement is published by the CONSORT group and gives recommendations aiming to improve the quality of reports of randomized clinical trials. Besides a 25 item checklist, the current CONSORT statement (2010) includes the CONSORT flow diagram which is to visualize the progress through the phases of a parallel randomised clinical trial of two groups (see Figure 1).

In this blog post, I will do both show how to “draw” the CONSORT flow diagram using R and discuss the pros and cons of this approach.

Figure 1: CONSORT 2010 Flow Diagram

R Packages

The following R packages are required to create, plot and save flow diagrams: DiagrammeR supports the Graphviz software which consists of a graph description language called DOT. DiagrammeRsvg and rsvg are required to save diagrams to a hard drive.

library(DiagrammeR)
library(DiagrammeRsvg)
library(rsvg)

In case we want to include the flow diagram into a .Rmd document and render it as PDF, we need to install the phantomjs JavaScript library. This can be done using the install_phantomjs() function of the webshot package.

webshot::install_phantomjs()

GraphViz is very flexible regarding the definition of line colors, arrow shapes, node shapes, and many other layout features. However, since the actual purpose of GraphViz is the visualization of graphs (networks), it is difficult to control the position of the nodes.

Unlike in a Cartesian plot, the position of the nodes is not defined by x and y coordinates. Instead, the nodes' position can only be determined by edges (relationships between nodes) and their attributes.

Creating a grid

When I looked at the CONSORT flow diagram (see Figure 1), I had the impression that the nodes may be placed using a grid-based system.

Figure 2: Placement of nodes using a grid
plot of chunk structure

The CONSORT flow diagram conceptualized as grid consists of 21 nodes. That is:

  • 4 nodes with plain text (blue, T1-T4) describing the phase of trial (Enrollment, Allocation, Follow-Up, Analysis) and
  • 9 boxed nodes (black, B1-B9) containing text (“Randomized”) and numerical data (“n=200”).

In order to populate each cell of the grid, we need to add:

  • 4 nodes rendered as tiny and, thus, invisible points (green, P1-P4). We use the points to mimic edges brancing-off from other edges.
  • 4 completely invisible nodes (grey, I1_I4).

The numbers in brackets denote the nodes' ID numbers.

Eventually, the grid structure is created using both vertical and horizontal edges.

Creating the CONSORT Flow Diagram

When creating the actual CONSORT diagram we gonna make use of the grid structure. All we have to do now is to change some attributes of the nodes and edges.

Preparation

To make the flow diagram dynamic, we need to save constant elements (text labels) and dynamic elements (numeric values) into separate vectors (text, values). The paste1()-function is a simple wrapper of the base R paste0()-function making it more convenient to concatinate text labels and numeric values.

# Values ------------------------------------------------------------------
values <- c(210, 10, 200, 100, 100, 10, 10, 90, 90)

# Defining Text Labels ----------------------------------------------------
text <- c('Assessment for\neligibility',
          'Excluded',
          'Randomized',
          'Allocated to\nintervention',
          'Allocated to\nintervention',
          'Lost to follow-up',
          'Lost to follow-up',
          'Analysed',
          'Analysed')

# Defining Function -------------------------------------------------------
paste1 <- function(x, y){
  paste0(x, ' (n=', y, ')')
}

# Concatenating Values and Text Labels ------------------------------------
LABS <- paste1(text, values)
LABS
## [1] "Assessment for\neligibility (n=210)"
## [2] "Excluded (n=10)"                    
## [3] "Randomized (n=200)"                 
## [4] "Allocated to\nintervention (n=100)" 
## [5] "Allocated to\nintervention (n=100)" 
## [6] "Lost to follow-up (n=10)"           
## [7] "Lost to follow-up (n=10)"           
## [8] "Analysed (n=90)"                    
## [9] "Analysed (n=90)"

With the next step, we create a node data frame (ndf) using the create_node_df()-function of the DiagrammR package. The attributes of this function are pretty much self-explanatory.

Node Data Frame

ndf <-
  create_node_df(
    n = 21,
    label = c('Enrollment', 'Allocation', 'Follow-Up', 'Analysis',
              LABS, rep("", 8)),
    style = c(rep("solid", 13), rep('invis', 8)),
    shape = c(rep("plaintext", 4), 
              rep("box", 9),
              rep("point", 8)),
    width = c(rep(2, 4), rep(2.5, 9), rep(0.001, 8)),
    hight = c(rep(0.5, 13), rep(0.001, 8)),
    fontsize = c(rep(14, 4), rep(10, 17)),
    fontname = c(rep('Arial Rounded MT Bold', 4), rep('Courier New', 17)),
    penwidth = 2.0,
    fixedsize = "true")

Edge Data Frame

Furthermore, we create an edge data frame (edf) using the create_edge_df()-function. Edges we don't want to be visible are colored white (with alpha = 0) and don't get an arrowhead. To make edges horizontal, the constraint attribute needs to be set to false. from and to are vectors containing node IDs from which edges are outbound and incoming. They determine the relationships between the nodes. To arrange the code more clearly, I've ordered outbound and incoming edges by columns and rows according to Figure 2.

edf <-
  create_edge_df(
    arrowhead = c(rep('none', 3), rep("vee", 3), rep('none', 2), "vee", rep('none', 6),
                  rep("vee", 3), rep("none", 3), "vee", rep("none", 10)),
    color = c(rep('#00000000', 3), rep('black', 6), rep('#00000000', 6),
              rep('black', 3), rep('#00000000', 3), rep('black', 1),
              rep('#00000000', 2), rep('black', 2), 
              rep('#00000000', 6)),
    constraint = c(rep("true", 18), rep('false', 14)),
    from = c(1, 19, 20, 16, 8, 10, # column 1
             5, 14, 7, 15, 2, 3, # column 2
             18, 6, 21, 17, 9, 11, # column 3
             1, 5, # row 1
             19, 14, # row 2
             20, 7, # row 3
             16, 15, # row 4
             8, 2, # row 5
             10, 3, # row 6
             12, 4), # row 7
    to = c(19, 20, 16, 8, 10, 12, # column 1
           14, 7, 15, 2, 3, 4, # column 2
           6, 21, 17, 9, 11, 13, # column 3
           5, 18, # row 1
           14, 6, # row 2
           7, 21, # row 3
           15, 17, # row 4
           2, 9, # row 5
           3, 11, # row 6
           4, 13)) # row 7

Plotting

Before we are going to plot the flow diagram, we need to create a graph using the create_graph()-function. The graph can be plotted using the render_graph()-function.

# Create Graph ------------------------------------------------------------
g <- create_graph(ndf, 
                  edf,
                  attr_theme = NULL)

# Plotting ----------------------------------------------------------------
render_graph(g)

Figure 3: Simplified CONSORT Flow Diagram
plot of chunk unnamed-chunk-1

Unfortunately, it didn't find a way to left align text within the text boxes. Thus, I didn't manage to draw a fully featured CONSORT flow diagram using R.

Export

Saving the flow diagram to a hard drive is quite straight forward:

export_graph(g, file_name = "CONSORT.png")

File types to be exported are PNG, PDF, SVG, PS and GEXF (graph file format).

Discussion

As demonstrated in this blog post, “drawing” a CONSORT flow diagram using R and GraphViz is rather cumbersome. Because of some limitations regarding the alignment of text within boxes, only a simplyfied form of the CONSORT flow diagram can be drawn. However, the use of R seems to be the only way to make flow diagrams dynamic.

How to install and use the hexSticker package

Intro

A couple of days ago, the hexSticker package was published on CRAN. The package provides some functions to plot hexagon stickers that may be used to promote R packages. As described on GitHub, the stickers can be plotted either using base R's plotting function, the lattice package or the ggplot2 package. Moreover, it is also possible to plot image files.

Since I found it quite demanding to install the hexSticker package on a current Linux os (Linux Mint 18.1), I decided to write a short tutorial explaining how to install and use the package on Linux Ubuntu-based operating systems.

Linux packages required

In a first step, we need to open the terminal to install the following software packages. While texinfo is required to build R packages from source, libudunits2-dev, fftw-dev and mffm-fftw1 are needed to install some R packages the hexSticker package depends on (ggforce, fftwtools).

sudo apt-get install texinfo libudunits2-dev fftw-dev mffm-fftw1 libfftw3-dev libtiff5-dev

R packages required

Recently, the fftwtools package was added to CRAN. Thus, it can be installed the usual way:

installed.packages('fftwtools', dep = TRUE)

Finally, the EBImage package must be installed from the Bioconductor repository and the packages ggimage, ggforce and hexSticker must be installed from CRAN.

source("https://bioconductor.org/biocLite.R")
biocLite("EBImage")
install.packages("ggimage")
install.packages("ggforce")
install.packages("hexSticker")

For plotting an example hexsticker we need some data provided by the streetsofle package which must be installed from GitHub:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("nrkoehler/streetsofle")

Plotting

With the following code chunk we create two hexstickers for the streetsofle package. The colours chosen stem from the city flag of Leipzig. The arguments of the sticker() function are explained within the code's comments.

library(hexSticker)
library(ggplot2)
library(streetsofle)
data(streetsofle)

p.1 <- ggplot(aes(x = lon, y = lat), data = shape.ortsteile) + 
  theme_map_le() + 
  coord_quickmap() + 
  geom_polygon(aes(x = lon, y = lat, group = group), 
               fill = NA, 
               size = 0.2, 
               color = "#FFCB00") + 
  geom_polygon(aes(x = lon, y = lat, group = group),
               color = "#FFCB00", 
               size = 1, 
               fill = NA, 
               data = shape.bezirke) 

p.1 <- sticker(p.1,
               package="streetsofle", 
               s_x = 1, # horizontal position of subplot
               s_y = 1.1, # vertical position of subplot
               s_width = 1.4, # width of subplot
               s_height = 1.4, # height of subplot
               p_x = 1, # horizontal position of font
               p_y = .43, # vertical position of font
               p_size = 6, # font size
               p_color = "#FFCB00", # font colour
               h_size = 3, # hexagon border size
               h_fill = "#004CFF", # hexagon fill colour
               h_color = "#FFCB00") # hexagon border colour

p.2 <- ggplot(aes(x = lon, y = lat), data = shape.ortsteile) + 
  theme_map_le() + 
  coord_quickmap() + 
  geom_polygon(aes(x = lon, y = lat, group = group), 
               fill = NA, 
               size = 0.2, 
               color = "#004CFF") + 
  geom_polygon(aes(x = lon, y = lat, group = group),
               color = "#004CFF", 
               size = 1, 
               fill = NA, 
               data = shape.bezirke) 

p.2 <- sticker(p.2,
               package="streetsofle", 
               s_x = 1, # horizontal position of subplot
               s_y = 1.1, # vertical position of subplot
               s_width = 1.4, # width of subplot
               s_height = 1.4, # height of subplot
               p_x = 1, # horizontal position of font
               p_y = .43, # vertical position of font
               p_size = 6, # font size
               p_color = "#004CFF", # font color
               h_size = 3, # hexagon border size
               h_fill = "#FFCB00", # hexagon fill colour
               h_color = "#004CFF") # hexagon border colour

Finally, both plots are put into a grid layout using the grid.arrange() function of the gridExtra package.

library(gridExtra)
grid.arrange(p.1, p.2, ncol = 2, respect = TRUE)

plot of chunk unnamed-chunk-6

I'm not sure which sticker looks better. What do you think?

How to plot a companion planting guide using ggplot2

Intro

Recently my girl-friend asked me whether I could do some data project being useful for our family rather than just for myself. A couple of years ago – after our first son was born – we decided to rent a small garden (allotment garden) very close to our flat. In our garden we grow some fruit and vegetables, e.g. strawberries, peas, beans and beetroot.

When growing fruit and vegetables it is considered important to know that plants compete for resources. While some plants benefit from one another, other plants sould not be grown together. The internet provides loads of so-called companion planting guides or charts (see Link). For each plant, they define lists of companions (growing well together) and antogonists (not growing well together).

This blog post is to show how to visualize a companion planting guide using R and the ggplot2 package.

Packages and data

With readxl for importing data from Excel, dplyr for data wrangling and ggplot2 for data visualization three R packages are required to reproduce the results of this blog post.

library(readxl)
library(dplyr)
library(ggplot2)

The data I'm going to visualize stem from some companion planting guides published in German (see Link). I only selected the plants relevant for our garden and entered the data manually into an Excel worksheet.

In the first step, I saved the imported data into a data frame (mydata.1). The second data frame mydata.2 is a copy of mydata.1 with the first two variables in reverse order. In order to receive a matrix (with the same number of factor levels in each column), I merged mydata.1 and mydata.2 into mydata using the rbind() function.

mydata.1 <- readxl::read_excel('guide.xlsx', sheet = 1) 
mydata.2 <- mydata.1[, c(2, 1, 3)]
colnames(mydata.2) <- colnames(mydata.1)
mydata <- rbind(mydata.1, mydata.2)
rm(list = c(setdiff(ls(), c('mydata'))))
head(mydata, 10)
## # A tibble: 10 × 3
##          plant_1  plant_2    status
##            <chr>    <chr>     <chr>
## 1       cucumber     peas companion
## 2           corn     peas companion
## 3         radish     peas companion
## 4           peas cucumber companion
## 5           corn cucumber companion
## 6       beetroot cucumber companion
## 7           corn potatoes companion
## 8       potatoes  cabbage companion
## 9         radish  cabbage companion
## 10 strawberries   lettuce companion

Finally, I removed all objects from workspace not required for data visualization and printed the first ten lines of the tibble (slightly modified data frame) mydata.

Wrangling

With the following code snippet, all variables of mydata are transformed into factors. Furthermore, a parameter needed for plotting is saved as a vector named labs.n.

mydata <- dplyr::mutate_all(mydata, as.factor)
labs.n <- length(levels(mydata$plant_1)) + .5

Plotting

Finally, we plot the matrix using the ggplot2 package. In order to adjust the position of the grid lines, we remove to default grid lines with panel.grid.major = element_blank() and panel.grid.minor = element_blank() and draw new grid lines with geom_vline() and geom_hline(). The new grid lines are in accordance with the boundaries of the tiles.

ggplot(mydata, aes(x = plant_1, y = plant_2, fill = status)) + 
  theme_grey() +
  coord_equal() +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.position = 'bottom') +
  geom_vline(xintercept = seq(0.5, labs.n, 1), color='white') +
  geom_hline(yintercept = seq(0.5, labs.n, 1), color='white') +
  scale_fill_manual('', values = c("red3", "green3")) +
  labs(x='', 
       y='',
       title = "Companion Planting Guide")

plot of chunk plotting

The interpretation of the companion planting guide is very easy: While green tiles signal companion plants that can be grown close to each other, red tiles flag antagonist plants that should not be grown together. The squares are gray when the plants are neither companions nor antagonists.

How to scrape, import and visualize Telegram chats

Intro

Telegram is a cloud-based and cross-platform instant messaging service. Unlike WhatsApp, Telegram clients exist not only for mobile devices but also for desktop operating systems. In addition, there is also a web based client.

In this blog post I show, how to import Telegram chats into R and how to plot a chat using the qdap package.

R packages

The following code will install load and / or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(readr, qdap, lubridate)

Scraping the data

A very straightforward way to save Telegram chats is to use the Chrome extension Save Telegram Chat History. On Quora, Stacy Wu explains how to use it:

  • Visit https://web.telegram.org.
  • Select a peer you want to get the chat history from.
  • Load the messages.
  • Select all text (Ctrl+A) and copy it (Ctrl+C).
  • Paste the text into a text editor (Ctrl+V) and save to a local file.

I have saved the chat to a local csv-file. Since the first two lines contain non-tabular information, they need to be manually removed. Furthermore, undesired line breaks must be manually removed as well.

Importing the data

The following line of code shows how to import the csv file into R using the read_csv() function of the readr package.

mydata <- readr::read_csv("telegram.csv", col_names=FALSE)

Data wrangling

After importing the file, our data frame consists of two string variables: X1 containing information about day and time of the conversations and X2 containing the names of the persons involved in the chat as well as the chat text. With the following lines of code we create 4 new variables:

  • day containing the dates of the chats,
  • time containing the times of day of the chats,
  • person containing the names of the persons involved in the chat,
  • txt containing the actual chat text.
mydata$day <- stringr::str_sub(mydata$X1, 1, 10)
mydata$day <- lubridate::dmy(mydata$day)
mydata$time <- stringr::str_sub(mydata$X1, 12, 19)
mydata$time <- lubridate::hms(mydata$time)
mydata$person <- stringr::str_extract(mydata$X2, "[^:]*")
mydata$person <- factor(mydata$person, levels = unique(mydata$person), labels = c('Me', 'Other'))
mydata$txt <- gsub(".*:\\s*","", mydata$X2)
mydata <- mydata[, c(3:6)]
head(mydata, 2)
## # A tibble: 2 × 4
##          day         time person   txt
##       <date> <S4: Period> <fctr> <chr>
## 1 2017-01-20  21H 10M 14S     Me Hello
## 2 2017-01-20  21H 11M 42S  Other  Huhu

Gradient word cloud

Since the chat involves only two persons, I decided to plot it as gradient word cloud, a visualization technique developed by Tyler Rinker. The function gradient_cloud() I use in the next code snippet is part of his qdap package. Gradient word clouds “color words with a gradient based on degree of usage between two individuals” (See).

gradient_cloud(mydata$txt, mydata$person, title = "Gradient word cloud of Telegram chat")

plot of chunk gwc

The chat I have ploted is very short and, thus, not very telling. I'm wondering how it looks in a couple of months.

How to plot animated maps with gganimate

Intro

Leipzig is the largest city in the federal state of Saxony, Germany. Since about 2009, Leipzig is one of the fastest growing German cities. Between 2006 and 2016, the population has grown from about 506.000 to 580.000 inhabitants.The largest part of the population growth can be attributed to inward migration.

In this blog post, I show how to create an animated map of Leipzig visualizing the migration balances for each of Leipzig's 63 districts (“Ortsteile”).

The data

The data I will visualize in this blog post are provided by the local Statistical Bureau. They can be downloaded as .csv file from the following website. The shapefile I use, is not publicly available and can be purchased for about 25 Euro from the Statistical Bureau.

R packages

The following code will install load and / or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(readr, rgdal, ggplot2, reshape2)
pacman::p_load_current_gh("dgrtwo/gganimate")

While the packages readr and rgdal are required for importing the .csv and the shapefile, ggplot2 and gganimate are needed to plot and animate the data. The reshape2 package is required for transforming data frames from wide to long format.

After importing the csv file containing the migration data, we add a key variable (id) to identify each of Leipzig's 63 districts. Using the melt-function of the reshape2 package, we reformat the data frame from wide into long form.

# import csv file
df.migrat <- read_csv2("./Data_LE/Bevölkerungsbewegung_Wanderungen_Wanderungssaldo.csv",
                      col_types = cols(`2015` = col_number()),
                      n_max = 63,
                      locale = locale(encoding = "latin1"),
                      trim_ws = TRUE)
# add id variable
df.migrat$id <- c(1:63)
# drop variables not needed
df.migrat <- df.migrat[,c(1,18,2:17)]
# change name of first variable
colnames(df.migrat)[1] <- 'district'
# convert from wide to long
df.migrat <- reshape2::melt(df.migrat, id.vars=c("district","id"))
# change name of third variable
colnames(df.migrat)[3] <- 'year'

With the next code snippet, we import the shapefile using the readOGR-function of the rgdal package. The object we receive (sh.file) is of class SpatialPolygonsDataFrame. After adding a key variable, we transform this object into an ordinary data frame using the fortify-function of the ggplot2 package. Furthermore, we join both data frames by the key variable id. Before we plot the data, we create a new variable (co) categorizing the migration balance variable (value).

# import shapefile
sh.file <- readOGR("ot.shp", layer="ot")
# add id variable
sh.file$id <- c(1:63)
# Create a data frame
sh.file <- fortify(sh.file, region = 'id')
# Merge data frames by id variable
mydata <- merge(sh.file, df.migrat, by='id')
# create a categorized variable
mydata$co <- ifelse(mydata$value >= 1000, 4,
                   ifelse(mydata$value >= 0, 3,
                                ifelse(mydata$value >= -1000, 2, 1)))
# define levels and labels
mydata$co  <- factor(mydata$co, levels = c(4:1),
                     labels = c('> 1000', '0-1000', '-1000-0', '< -1000'))

Moreover, we define a theme suitable for plotting maps with ggpot2. The following theme is a customized version of a theme I found on Stack Overflow.

theme_opts <- list(theme(panel.grid.minor = element_blank(),
                         panel.grid.major = element_blank(),
                         panel.background = element_blank(),
                         plot.background = element_blank(),
                         panel.border = element_blank(),
                         axis.line = element_blank(),
                         axis.text.x = element_blank(),
                         axis.text.y = element_blank(),
                         axis.ticks = element_blank(),
                         axis.title.x = element_blank(),
                         axis.title.y = element_blank(),
                         legend.position="right",
                         plot.title = element_text(size=16)))

Plotting the data

Finally, we can plot the data. The first plot (static) visualizes the migration data for the whole period of time (2000–2015).

ggplot(mydata, aes(x = long, y = lat, group = group, fill=value)) +
  theme_opts + 
  coord_equal() +
  geom_polygon(color = 'grey') +
  labs(title = "Migration balance between 2000 and 2015",
       subtitle = "Static Map",
       caption = 'Data source: Stadt Leipzig, Amt für Statistik und Wahlen, 2015') +
  scale_fill_distiller(name='Migration\nbalance', direction = 1, palette='RdYlGn')

plot of chunk map

For creating an animated plot, we add the frame argument stemming from the gganimate package. With interval = 5 we define the time (in seconds) between the “slides”.

p <- ggplot(mydata, aes(x = long, y = lat, group = group, fill=value, frame=year)) +
  theme_opts + 
  coord_equal() +
  geom_polygon(color = 'grey') +
  labs(title = "Migration balance in year: ",
       subtitle = "Animated Map",
       caption = 'Data source: Stadt Leipzig, Amt für Statistik und Wahlen, 2015') +
  scale_fill_distiller(name='Migration\nbalance', direction = 1, palette='RdYlGn')

gg_animate(p, interval = 5)