## Intro

Recently, I wondered whether there is a way to draw a fish shape using a mathematical function. Since I did not find a ready-made R function, I tried to write the function by myself. The equations, I've used for writing this function can be found on WolframMathWorld.

## The function

The `fish_curve()` function requires the `ggplot2` and the `dplyr` package. It creates a data frame with two variables (x and y) and 10.000 observations. Finally, the data points are plotted using `ggplot2`.

```fish_curve <- function(colour='black', size = 5){
library(ggplot2)
library(dplyr)
data.frame(
x = cos(1:10000) - sin(1:10000)^2 / sqrt(2),
y = cos(1:10000) * sin(1:10000)
) %>%
ggplot(., aes(x, y)) +
geom_point(colour = colour, size = size) +
theme_void()
}
```

## Function call with default parameters

With `colour` and `size` the `fish_curve()` function allows the user to specify two parameters; that is colour and size of the plotted points. The default values are black for `colour` and 5 for `size`.

```(p1 <- fish_curve())
``` ## Customization

In the following example, we customize colour and size of the fish shape:

```(p2 <- fish_curve(colour = 'blue', size = 1))
``` And finally, we place the two plots side by side using the `patchwork` package:

```library(patchwork)
p1 + p2
``` ## Intro

Quite frequently, factor variables are ordered by level frequency. However, factor levels having only a few observations are sometimes collapsed into one level usually named “others”. Since this level is usually not of particular interest, it may be a good idea to put this level in the last position of the plot rather than ordering it by level frequency. In this blog post, I’m going to show how to order a factor variable by level frequency and level name.

To replicate the R code I’m going to use in this post, four R packages must be loaded:

```library(dplyr) # for data manipulation
library(ggplot2) # for plotting data
library(gghighlight) # ggplot2 extension for highlighting values
```

The dataset I’m going to use in this post (`mtcars`) is part of the `datasets` package.

```head(mtcars)
```
```##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

In the first code chunk, we:

• extract the first word of each car name and write it into a new variable called “brand”,
• rename all car brands starting with “M” (Mazda, Merc, Maserati) to “Others” and
• calculate the median miles per gallon (mpg) for each car brand.
```df.mtcars %
mutate(name = str_extract(rownames(.), "^\\w+\\b"),
brand = str_replace(name, "^M\\w+", 'Others')) %>%
group_by(brand) %>%
summarize(mpg = median(mpg))
df.mtcars\$brand
```
```##   "AMC"      "Cadillac" "Camaro"   "Chrysler" "Datsun"   "Dodge"
##   "Duster"   "Ferrari"  "Fiat"     "Ford"     "Honda"    "Hornet"
##  "Lincoln"  "Lotus"    "Others"   "Pontiac"  "Porsche"  "Toyota"
##  "Valiant"  "Volvo"
```

The following code chunk is to reorder the `brand` variable by level frequency using the `reorder()` function.

```df.mtcars %
mutate(brand = as.factor(brand),
brand = reorder(brand, mpg))
levels(df.mtcars\$brand)
```
```##   "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler" "AMC"
##   "Dodge"    "Ford"     "Valiant"  "Others"   "Pontiac"  "Ferrari"
##  "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"
##  "Honda"    "Lotus"
```

As we can see, the bar representing the “Others” level is roughly in the middle of the plot.

```ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
coord_flip() +
geom_col(width = 0.5) +
gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
scale_fill_manual(values = c("grey")) +
theme_bw() +
theme(legend.position = 'none') +
labs(x = NULL,
y = 'Miles per Gallon',
title = "Factor variable ordered by level frequency")
``` To put the bar representing the “Others” level at the bottom of the plot, we have to set “Others” as reference category using the `relevel()` function.

```df.mtcars %
mutate(brand = relevel(brand, ref = "Others"))
levels(df.mtcars\$brand)
```
```##   "Others"   "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler"
##   "AMC"      "Dodge"    "Ford"     "Valiant"  "Pontiac"  "Ferrari"
##  "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"
##  "Honda"    "Lotus"
```

Finally, the bar representing the “Others” level appears at the desired position.

```ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
coord_flip() +
geom_col(width = 0.5) +
gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
scale_fill_manual(values = c("grey")) +
theme_bw() +
theme(legend.position = 'none') +
labs(x = NULL,
y = 'Miles per Gallon',
title = "Factor variable ordered by level frequency and level name")
``` PS: In both plots, the `gghighlight()` function of the `gghighlight` package was used to highlight the desired factor level.

## Intro

Venn diagrams – named after the English logician and philosopher John Venn – “illustrate the logical relationships between two or more sets of items” with overlapping circles.

In this tutorial, I'll show how to plot a three set venn diagram using `R` and the `ggplot2` package.

## Packages and Data

For the R code to run, we need to install and load three R packages. Unlike `tidyverse` and `ggforce`, the `limma` package must be installed from Bioconductor rather than from CRAN.

Moreover, we create a random data frame using the `rbinom()` function.

```source("http://www.bioconductor.org/biocLite.R")
biocLite("limma")
library(limma)
library(tidyverse)
library(ggforce)
set.seed((123))
mydata <- data.frame(A = rbinom(100, 1, 0.8),
B = rbinom(100, 1, 0.7),
C = rbinom(100, 1, 0.6)) %>%
mutate_all(., as.logical)
```

## Drawing the Circles

Next, we create a data frame defining the x and y coordinates for the three circles we want to draw and a variable defining the labels. For plotting the circles – the basic structure of our venn diagram – we need the `geom_circle()` function of the `ggforce` package. We employ the `geom_circle()`-function of the `ggforce` package to actually draw the circles. With the parameter `r` (= 1.5), we define the radius of the circles.

```df.venn <- data.frame(x = c(0, 0.866, -0.866),
y = c(1, -0.5, -0.5),
labels = c('A', 'B', 'C'))
ggplot(df.venn, aes(x0 = x, y0 = y, r = 1.5, fill = labels)) +
geom_circle(alpha = .3, size = 1, colour = 'grey') +
coord_fixed() +
theme_void()
``` Furthermore, we need a data frame with the values we want the plot and the coordinates for plotting these values. The values can be obtained using the `vennCounts()` function of the `limma` package. Since `ggplot2` requires data frames we need to first transform the `vdc` object (class VennCounts) into a matrix and then into a data frame. In addition, we need to add the x and y coordinates for plotting the values.

```vdc <- vennCounts(mydata)
class(vdc) <- 'matrix'
df.vdc <- as.data.frame(vdc)[-1,] %>%
mutate(x = c(0, 1.2, 0.8, -1.2, -0.8, 0, 0),
y = c(1.2, -0.6, 0.5, -0.6, 0.5, -1, 0))
```

## The final Plot

Finally, we'll customize the look of our venn diagram and plot the values.

```ggplot(df.venn) +
geom_circle(aes(x0 = x, y0 = y, r = 1.5, fill = labels), alpha = .3, size = 1, colour = 'grey') +
coord_fixed() +
theme_void() +
theme(legend.position = 'bottom') +
scale_fill_manual(values = c('cornflowerblue', 'firebrick',  'gold')) +
scale_colour_manual(values = c('cornflowerblue', 'firebrick', 'gold'), guide = FALSE) +
labs(fill = NULL) +
annotate("text", x = df.vdc\$x, y = df.vdc\$y, label = df.vdc\$Counts, size = 5)
``` ## Intro

A couple of months ago, I bought a mobile GPS navigation device. The device makes it possible to record the routes we travel with our car. Since the routes can by exported as `.gpx` files, it is rather easy to plot them using R.

## Packages

To replicate this blog post, three R packages are required: `plotKML` to import the data into R and `ggplot2` and `ggmap` to plot the data.

```library(plotKML)
library(ggplot2)
library(ggmap)
```

As a prerequisite for the installation of the `plotKML` package on my Linux Mint 18.3 OS (based on Ubuntu 16.04), I needed to add the `ubuntugis` PPA.

```sudo add-apt-repository ppa:ubuntugis/ppa
sudo apt-get update
```

## Import

All what's required for importing the `.gpx` file into R is the `readGPX()` function of the `plotKML` package. After importing the file into R, we receive a list (`lst.rd`) conatining some meta information and one data frame. To be able to plot the GPS data using `ggplot2`, we need to subset this list and save the data frame into a new object (`df`).

```lst.rd <- readGPX('29 Dez. 1413.gpx')
df <- lst.rd\$tracks[][]
```

Using the head function, we have a short look on the data:

```head(df)
```
```##        lon      lat                 time
## 1 12.98376 51.05915 2017-12-29T12:16:28Z
## 2 12.98377 51.05912 2017-12-29T12:16:29Z
## 3 12.98374 51.05912 2017-12-29T12:16:30Z
## 4 12.98372 51.05914 2017-12-29T12:16:31Z
## 5 12.98372 51.05914 2017-12-29T12:16:32Z
## 6 12.98372 51.05914 2017-12-29T12:16:33Z
```

As we can see, our data frame consists of three variables: longitude, latitude and time.

## Plotting

Whithout much efford, we can plot our GPS data using `ggplot2`. Please note that GPS coordinates must be plotted as points.

```ggplot(df, aes(x = lon, y = lat)) +
coord_quickmap() +
geom_point()
``` In addition, the `ggmap` package offers some functionality to plot the data on maps.

The first example shows how to plot the data on a map provided by Google Maps.

```mapImageData <- get_googlemap(center = c(lon = mean(df\$lon), lat = mean(df\$lat)),
zoom = 10,
color = 'bw',
scale = 1,
maptype = "terrain")
ggmap(mapImageData, extent = "device") + # removes axes, etc.
geom_point(aes(x = lon,
y = lat),
data = df,
colour = "red3",
alpha = .1,
size = .1)
``` The second example shows how to plot the data on a Stamen map.

```mapImage <- get_map(location = c(lon = mean(df\$lon) - 0.05, lat = mean(df\$lat)),
source = "stamen",
maptype = "toner",
zoom = 10)

ggmap(mapImage, extent = "device") +
geom_point(aes(x = lon,
y = lat),
data = df,
colour = "red3",
size = .2)
``` ## References

This blog post heavily borrows from the tutorial Mapping GPS Tracks in R. Thanks very much!

## Analyse der DFG Fachkollegienwahl 2015 — Teil I

Die Deutsche Forschungsgemeinschaft (DFG) ist die europaweit größte Forschungsförderungsorganisation mit einem Jahresetat von aktuell etwa 3 Milliarden Euro (2016). Die Aufgabe der DFG besteht in der Auswahl und Finanzierung wissenschaftlicher Forschungsvorhaben an Hochschulen und Forschungsinstituten.

Laut DFG werden eingereichte Förderanträge in einem mehrstufigen Entscheidungsverfahren “von ehrenamtlich tätigen Gutachterinnen und Gutachtern nach ausschließlich wissenschaftlichen Kriterien beurteilt.” In einem zweiten Schritt werden diese Fachgutachten von Mitgliedern sogenannter Fachkollegien bewertet. Diese treffen eine Vorauswahl und entscheiden, ob ein Antragsverfahren weitergeführt oder eingestellt wird. Die eigentliche Förderentscheidung liegt jedoch beim DFG-Hauptausschuss.

Für die Wahl der Fachkollegien wurde im Jahr 2014 eine neue Wahlordnung verabschiedet, nach der jene Kandidaten mit den meisten Wahlvorschlägen einen Platz auf den Wahllisten bekommen (stimmt das?). Dies habe, so der Bamberger Historiker Hartwin Brandt in zwei Beiträgen für die Frankfurter Allgemeine Zeitung (FAZ, 17.05.2017, 18.10.2017) zu inoffiziellen Absprachen zwischen Hochschulvertetern geführt – mit dem Ziel, die eigenen Kanditaten auf die Wahllisten zu platzieren.

Brandt, der für den Bereich “Klassische Philologie” einen “Nordverbund” bestehend aus den Universitäten Bremen, Greifswald, Hamburg, Kiel und Rostock identifiziert, vermutet, dass eine “detaillierte Analyse der mehr als 600 Seiten umfassenden DFG-Liste der Kandidierenden etliche weitere Auffälligkeiten und Kompensationsgeschäfte zutage fördern” dürfte und mutmaßt, dass “soziologische Netzwerk- und Freundschaftsforscher ihre Freude an diesem Datensatz haben werden”.

Die von der DFG veröffentlichte “Liste der Kandidierenden für die DFG-Fachkollegienwahl 2015” (Stand 17.09.2015) ist ein 617 A4-Seiten umdassendes PDF-Dokument. Sie kann als PDF-Datei von den Internetseiten der DFG heruntergeladen werden (Link). Dazu kommt eine ebenfalls von der DFG veröffentlichte Liste mit den Wahlergebnissen.

Um die in beiden Dokumenten enthaltenen Angaben statistisch auswerten zu können, müssen diese zunächst in ein maschinenlesbares Format überführt werden. Mit der Funktion `exctract_text()` aus dem R-Paket `tabulizer` können die Bezeichnungen und Nummern der einzelnen Fachgebiete ausgelesen werden. Zwar lassen sich mit der `exctract_tables()`-Funktion prinzipiell auch in PDF-Dateien enthaltene Tabellen auslesen. Meine Versuche, die auf das entsprechende PDF-Dokument anzuwenden, waren jedoch leider nicht von Erfolg gekrönt, sodass ich die Daten im Copy-and-Paste-Verfahren manuell in eine Excel-Datei üoberführt habe. Da Copy-and-Paste ein vergleichsweise fehleranfälliges Verfahren ist, würde ich mich freuen, von den Lesern dieses Beitrags auf eventuelle Fehler hingewiesen zu werden. Der Datensatz kann von meinem Github-Repository heruntergeladen werden (zur Zeit nur im Format .rda).

Der Datensatz `df.dfg` enthält die folgenden Variablen:

• `wb.nummer`: Nummer des Wissenschaftsbereichs;
• `wb.name`: Name des Wissenschaftsbereichs;
• `fg.nummer`: Nummer des Fachgebiets;
• `fg.name`: Name des Fachgebiets;
• `fk.nummer`: Nummer des Fachkollegiums;
• `fg.name` : Name des Fachkollegiums;
• `fach.nummer`: Nummer des Fachs;
• `fach.name`: Name des Fachs;
• `person.name`: Name und Vorname der kandidierenden Person;
• `person.sex`: Geschlecht der kandidierenden Person;
• `person.inst`: Institution der kandidierenden Person;
• `inst.von`: vorschlagende Institutionen
• `inst.von.kat`: Kategorisierung der vorschlagenden Institution in: Hochschule vs. Fachgesellschaft vs. Sonstiges;
• `anzahl.plaetze`: Anzahl wählbarer Personen pro Fachkollegium;
• `anzahl.kandidaten`: Anzahl kandidierender Personen pro Fachkollegium;
• `person.platz`: Platzierung der kandidierende Person (Wahlergebnis)
• `person.stimmen`: Anzahl der Stimmen, mit denen der Kandidierende gewählt wurde (Wahlergebnis);
• `person.gewaehlt`: Wurde die kandidierende Person gewählt?
```## Observations: 9,695
## Variables: 18
## \$ wb.nummer         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## \$ wb.name           <fct> Geistes- und Sozialwissenschaften, Geistes- ...
## \$ fg.nummer         <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, ...
## \$ fg.name           <fct> Geisteswissenschaften, Geisteswissenschaften...
## \$ fk.nummer         <chr> "FN_101", "FN_101", "FN_101", "FN_101", "FN_...
## \$ fk.name           <chr> "Alte Kulturen", "Alte Kulturen", "Alte Kult...
## \$ fach.nummer       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## \$ fach.name         <chr> "Ur- und Frühgeschichte (weltweit)", "Ur- un...
## \$ person.name       <chr> "Bemmann, Jan", "Brather, Sebastian", "Brath...
## \$ person.sex        <fct> m, m, m, w, m, m, m, m, m, w, w, w, m, m, m,...
## \$ person.inst       <chr> "U Bonn", "U Freiburg", "U Freiburg", "RGZM"...
## \$ inst.von          <chr> "U Bonn", "U Freiburg", "WGL", "WGL", "Dt. V...
## \$ inst.von.kat      <chr> "Hochschule", "Hochschule", "Sonstiges", "So...
## \$ anzahl.plaetze    <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## \$ anzahl.kandidaten <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,...
## \$ person.platz      <dbl> 7, 5, 5, 6, 2, 2, 9, 8, 8, 1, 1, 1, 3, 3, 4,...
## \$ person.stimmen    <dbl> NA, NA, NA, NA, 195, 195, NA, NA, NA, 255, 2...
## \$ person.gewaehlt   <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALS...
```

Im Bereich der medizinischen Fächer gibt es die Besonderheit, dass für Kandidaten, die an einem Universitätsklinikum beschäftigt sind, als Institution das Klinikum, und nicht die Universität genannt wird. Um tatsächlich vorhandene Netzwerkstrukturen rekonstruieren zu können, habe ich immer dann, wenn sich ein Klinikum eindeutig einer Universität zuordnen ließ, in der Variable `Institution` den Namen der Universität abgespeichert (z.B. “Universitätsklinikum Münster als "Universität Münster”). Falls keine eindeutige Zuordnung möglich war, habe ich den Namen des Klinikums nicht überschrieben (z.B. “Universitätsklinikum Schleswig-Holstein”).

Die Liste der Kandidierenden lässt sich in vier “Wissenschaftsbereiche” untergliedern, denen die Fachkollegien zugeordnet sind. Zuguterletzt ist jedem Fachkollegium eine Reihe von Fächern zugeordnet (z.B. “Literaturwissenschaften” in “Ältere deutsche Literatur”, “Neuere deutsche Literatur” etc.).

• Geistes- und Sozialwissenschaften
• Geisteswissenschaften (Fachkollegium 101–108)
• Sozial- und Verhaltenswissenschaften (Fachkollegium 109–113)
• Lebenswissenschaften
• Biologie (Fachkollegium 201–203)
• Medizin (Fachkollegium 204–206)
• Agrar-, Forstwissenschaften und Tiermedizin (Fachkollegium 207)
• Naturwissenschaften
• Chemie (Fachkollegium 301–307)
• Physik (Fachkollegium 308–311)
• Mathematik (Fachkollegium 312)
• Geowissenschaften (Fachkollegium 313–318)
• Ingenieurswissenschaften
• Maschinenbau und Produktionstechnik (Fachkollegium 401–402)
• Wärmetechnik/Verfahrenstechnik (Fachkollegium 403–404)
• Materialwissenschaft und Werkstofftechnik (Fachkollegium 405–406)
• Informatik, System- und Elektrotechnik (Fachkollegium 407–409)
• Bauwesen und Architektur (Fachkollegium 410)

In den nächsten Monaten werde ich mich in einer Reihe von Blogposts an einer statistischen Analyse der DFG-Kandidierendenliste versuchen.

Aktualisiert am: 17.02.2018

## Intro

The acronym CONSORT stands for “Consolidated Standards of Reporting Trials”. The CONSORT statement is published by the CONSORT group and gives recommendations aiming to improve the quality of reports of randomized clinical trials. Besides a 25 item checklist, the current CONSORT statement (2010) includes the CONSORT flow diagram which is to visualize the progress through the phases of a parallel randomised clinical trial of two groups (see Figure 1).

In this blog post, I will do both show how to “draw” the CONSORT flow diagram using R and discuss the pros and cons of this approach.

Figure 1: CONSORT 2010 Flow Diagram ### R Packages

The following R packages are required to create, plot and save flow diagrams: `DiagrammeR` supports the `Graphviz` software which consists of a graph description language called `DOT`. `DiagrammeRsvg` and `rsvg` are required to save diagrams to a hard drive.

```library(DiagrammeR)
library(DiagrammeRsvg)
library(rsvg)
```

In case we want to include the flow diagram into a `.Rmd` document and render it as PDF, we need to install the `phantomjs` JavaScript library. This can be done using the `install_phantomjs()` function of the `webshot` package.

```webshot::install_phantomjs()
```

`GraphViz` is very flexible regarding the definition of line colors, arrow shapes, node shapes, and many other layout features. However, since the actual purpose of `GraphViz` is the visualization of graphs (networks), it is difficult to control the position of the nodes.

Unlike in a Cartesian plot, the position of the nodes is not defined by x and y coordinates. Instead, the nodes' position can only be determined by edges (relationships between nodes) and their attributes.

### Creating a grid

When I looked at the CONSORT flow diagram (see Figure 1), I had the impression that the nodes may be placed using a grid-based system.

Figure 2: Placement of nodes using a grid The CONSORT flow diagram conceptualized as grid consists of 21 nodes. That is:

• 4 nodes with plain text (blue, T1-T4) describing the phase of trial (Enrollment, Allocation, Follow-Up, Analysis) and
• 9 boxed nodes (black, B1-B9) containing text (“Randomized”) and numerical data (“n=200”).

In order to populate each cell of the grid, we need to add:

• 4 nodes rendered as tiny and, thus, invisible points (green, P1-P4). We use the points to mimic edges brancing-off from other edges.
• 4 completely invisible nodes (grey, I1_I4).

The numbers in brackets denote the nodes' ID numbers.

Eventually, the grid structure is created using both vertical and horizontal edges.

## Creating the CONSORT Flow Diagram

When creating the actual CONSORT diagram we gonna make use of the grid structure. All we have to do now is to change some attributes of the nodes and edges.

### Preparation

To make the flow diagram dynamic, we need to save constant elements (text labels) and dynamic elements (numeric values) into separate vectors (`text`, `values`). The `paste1()`-function is a simple wrapper of the base R `paste0()`-function making it more convenient to concatinate text labels and numeric values.

```# Values ------------------------------------------------------------------
values <- c(210, 10, 200, 100, 100, 10, 10, 90, 90)

# Defining Text Labels ----------------------------------------------------
text <- c('Assessment for\neligibility',
'Excluded',
'Randomized',
'Allocated to\nintervention',
'Allocated to\nintervention',
'Lost to follow-up',
'Lost to follow-up',
'Analysed',
'Analysed')

# Defining Function -------------------------------------------------------
paste1 <- function(x, y){
paste0(x, ' (n=', y, ')')
}

# Concatenating Values and Text Labels ------------------------------------
LABS <- paste1(text, values)
LABS
```
```##  "Assessment for\neligibility (n=210)"
##  "Excluded (n=10)"
##  "Randomized (n=200)"
##  "Allocated to\nintervention (n=100)"
##  "Allocated to\nintervention (n=100)"
##  "Lost to follow-up (n=10)"
##  "Lost to follow-up (n=10)"
##  "Analysed (n=90)"
##  "Analysed (n=90)"
```

With the next step, we create a node data frame (`ndf`) using the `create_node_df()`-function of the `DiagrammR` package. The attributes of this function are pretty much self-explanatory.

### Node Data Frame

```ndf <-
create_node_df(
n = 21,
label = c('Enrollment', 'Allocation', 'Follow-Up', 'Analysis',
LABS, rep("", 8)),
style = c(rep("solid", 13), rep('invis', 8)),
shape = c(rep("plaintext", 4),
rep("box", 9),
rep("point", 8)),
width = c(rep(2, 4), rep(2.5, 9), rep(0.001, 8)),
hight = c(rep(0.5, 13), rep(0.001, 8)),
fontsize = c(rep(14, 4), rep(10, 17)),
fontname = c(rep('Arial Rounded MT Bold', 4), rep('Courier New', 17)),
penwidth = 2.0,
fixedsize = "true")
```

### Edge Data Frame

Furthermore, we create an edge data frame (`edf`) using the `create_edge_df()`-function. Edges we don't want to be visible are colored white (with alpha = 0) and don't get an arrowhead. To make edges horizontal, the `constraint` attribute needs to be set to `false`. `from` and `to` are vectors containing node IDs from which edges are outbound and incoming. They determine the relationships between the nodes. To arrange the code more clearly, I've ordered outbound and incoming edges by columns and rows according to Figure 2.

```edf <-
create_edge_df(
arrowhead = c(rep('none', 3), rep("vee", 3), rep('none', 2), "vee", rep('none', 6),
rep("vee", 3), rep("none", 3), "vee", rep("none", 10)),
color = c(rep('#00000000', 3), rep('black', 6), rep('#00000000', 6),
rep('black', 3), rep('#00000000', 3), rep('black', 1),
rep('#00000000', 2), rep('black', 2),
rep('#00000000', 6)),
constraint = c(rep("true", 18), rep('false', 14)),
from = c(1, 19, 20, 16, 8, 10, # column 1
5, 14, 7, 15, 2, 3, # column 2
18, 6, 21, 17, 9, 11, # column 3
1, 5, # row 1
19, 14, # row 2
20, 7, # row 3
16, 15, # row 4
8, 2, # row 5
10, 3, # row 6
12, 4), # row 7
to = c(19, 20, 16, 8, 10, 12, # column 1
14, 7, 15, 2, 3, 4, # column 2
6, 21, 17, 9, 11, 13, # column 3
5, 18, # row 1
14, 6, # row 2
7, 21, # row 3
15, 17, # row 4
2, 9, # row 5
3, 11, # row 6
4, 13)) # row 7
```

### Plotting

Before we are going to plot the flow diagram, we need to create a graph using the `create_graph()`-function. The graph can be plotted using the `render_graph()`-function.

```# Create Graph ------------------------------------------------------------
g <- create_graph(ndf,
edf,
attr_theme = NULL)

# Plotting ----------------------------------------------------------------
render_graph(g)
```

Figure 3: Simplified CONSORT Flow Diagram Unfortunately, it didn't find a way to left align text within the text boxes. Thus, I didn't manage to draw a fully featured CONSORT flow diagram using R.

### Export

Saving the flow diagram to a hard drive is quite straight forward:

```export_graph(g, file_name = "CONSORT.png")
```

File types to be exported are PNG, PDF, SVG, PS and GEXF (graph file format).

## Discussion

As demonstrated in this blog post, “drawing” a CONSORT flow diagram using R and GraphViz is rather cumbersome. Because of some limitations regarding the alignment of text within boxes, only a simplyfied form of the CONSORT flow diagram can be drawn. However, the use of R seems to be the only way to make flow diagrams dynamic.

## Intro

A couple of days ago, the `hexSticker` package was published on CRAN. The package provides some functions to plot hexagon stickers that may be used to promote `R` packages. As described on GitHub, the stickers can be plotted either using base `R's` plotting function, the `lattice` package or the `ggplot2` package. Moreover, it is also possible to plot image files.    Since I found it quite demanding to install the `hexSticker` package on a current Linux os (Linux Mint 18.1), I decided to write a short tutorial explaining how to install and use the package on Linux Ubuntu-based operating systems.

## Linux packages required

In a first step, we need to open the terminal to install the following software packages. While `texinfo` is required to build `R` packages from source, `libudunits2-dev`, `fftw-dev` and `mffm-fftw1` are needed to install some `R` packages the `hexSticker` package depends on (`ggforce`, `fftwtools`).

``````sudo apt-get install texinfo libudunits2-dev fftw-dev mffm-fftw1 libfftw3-dev libtiff5-dev
``````

## R packages required

Recently, the `fftwtools` package was added to CRAN. Thus, it can be installed the usual way:

``````installed.packages('fftwtools', dep = TRUE)
``````

Finally, the `EBImage` package must be installed from the Bioconductor repository and the packages `ggimage`, `ggforce` and `hexSticker` must be installed from CRAN.

``````source("https://bioconductor.org/biocLite.R")
biocLite("EBImage")
install.packages("ggimage")
install.packages("ggforce")
install.packages("hexSticker")
``````

For plotting an example hexsticker we need some data provided by the `streetsofle` package which must be installed from GitHub:

``````if (!require("devtools")) install.packages("devtools")
devtools::install_github("nrkoehler/streetsofle")
``````

## Plotting

With the following code chunk we create two hexstickers for the `streetsofle` package. The colours chosen stem from the city flag of Leipzig. The arguments of the `sticker()` function are explained within the code's comments.

``````library(hexSticker)
library(ggplot2)
library(streetsofle)
data(streetsofle)

p.1 <- ggplot(aes(x = lon, y = lat), data = shape.ortsteile) +
theme_map_le() +
coord_quickmap() +
geom_polygon(aes(x = lon, y = lat, group = group),
fill = NA,
size = 0.2,
color = "#FFCB00") +
geom_polygon(aes(x = lon, y = lat, group = group),
color = "#FFCB00",
size = 1,
fill = NA,
data = shape.bezirke)

p.1 <- sticker(p.1,
package="streetsofle",
s_x = 1, # horizontal position of subplot
s_y = 1.1, # vertical position of subplot
s_width = 1.4, # width of subplot
s_height = 1.4, # height of subplot
p_x = 1, # horizontal position of font
p_y = .43, # vertical position of font
p_size = 6, # font size
p_color = "#FFCB00", # font colour
h_size = 3, # hexagon border size
h_fill = "#004CFF", # hexagon fill colour
h_color = "#FFCB00") # hexagon border colour

p.2 <- ggplot(aes(x = lon, y = lat), data = shape.ortsteile) +
theme_map_le() +
coord_quickmap() +
geom_polygon(aes(x = lon, y = lat, group = group),
fill = NA,
size = 0.2,
color = "#004CFF") +
geom_polygon(aes(x = lon, y = lat, group = group),
color = "#004CFF",
size = 1,
fill = NA,
data = shape.bezirke)

p.2 <- sticker(p.2,
package="streetsofle",
s_x = 1, # horizontal position of subplot
s_y = 1.1, # vertical position of subplot
s_width = 1.4, # width of subplot
s_height = 1.4, # height of subplot
p_x = 1, # horizontal position of font
p_y = .43, # vertical position of font
p_size = 6, # font size
p_color = "#004CFF", # font color
h_size = 3, # hexagon border size
h_fill = "#FFCB00", # hexagon fill colour
h_color = "#004CFF") # hexagon border colour
``````

Finally, both plots are put into a grid layout using the `grid.arrange()` function of the `gridExtra` package.

``````library(gridExtra)
grid.arrange(p.1, p.2, ncol = 2, respect = TRUE)
`````` I'm not sure which sticker looks better. What do you think?

## Intro

Recently my girl-friend asked me whether I could do some data project being useful for our family rather than just for myself. A couple of years ago – after our first son was born – we decided to rent a small garden (allotment garden) very close to our flat. In our garden we grow some fruit and vegetables, e.g. strawberries, peas, beans and beetroot.

When growing fruit and vegetables it is considered important to know that plants compete for resources. While some plants benefit from one another, other plants sould not be grown together. The internet provides loads of so-called companion planting guides or charts (see Link). For each plant, they define lists of companions (growing well together) and antogonists (not growing well together).

This blog post is to show how to visualize a companion planting guide using `R` and the `ggplot2` package.

## Packages and data

With `readxl` for importing data from Excel, `dplyr` for data wrangling and `ggplot2` for data visualization three `R` packages are required to reproduce the results of this blog post.

```library(readxl)
library(dplyr)
library(ggplot2)
```

The data I'm going to visualize stem from some companion planting guides published in German (see Link). I only selected the plants relevant for our garden and entered the data manually into an Excel worksheet.

In the first step, I saved the imported data into a data frame (`mydata.1`). The second data frame `mydata.2` is a copy of `mydata.1` with the first two variables in reverse order. In order to receive a matrix (with the same number of factor levels in each column), I merged `mydata.1` and `mydata.2` into `mydata` using the `rbind()` function.

```mydata.1 <- readxl::read_excel('guide.xlsx', sheet = 1)
mydata.2 <- mydata.1[, c(2, 1, 3)]
colnames(mydata.2) <- colnames(mydata.1)
mydata <- rbind(mydata.1, mydata.2)
rm(list = c(setdiff(ls(), c('mydata'))))
```
```## # A tibble: 10 × 3
##          plant_1  plant_2    status
##            <chr>    <chr>     <chr>
## 1       cucumber     peas companion
## 2           corn     peas companion
## 4           peas cucumber companion
## 5           corn cucumber companion
## 6       beetroot cucumber companion
## 7           corn potatoes companion
## 8       potatoes  cabbage companion
## 10 strawberries   lettuce companion
```

Finally, I removed all objects from workspace not required for data visualization and printed the first ten lines of the `tibble` (slightly modified data frame) `mydata`.

## Wrangling

With the following code snippet, all variables of `mydata` are transformed into `factors`. Furthermore, a parameter needed for plotting is saved as a vector named `labs.n`.

```mydata <- dplyr::mutate_all(mydata, as.factor)
labs.n <- length(levels(mydata\$plant_1)) + .5
```

## Plotting

Finally, we plot the matrix using the `ggplot2` package. In order to adjust the position of the grid lines, we remove to default grid lines with `panel.grid.major = element_blank()` and `panel.grid.minor = element_blank()` and draw new grid lines with `geom_vline()` and `geom_hline()`. The new grid lines are in accordance with the boundaries of the tiles.

```ggplot(mydata, aes(x = plant_1, y = plant_2, fill = status)) +
theme_grey() +
coord_equal() +
geom_tile() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = 'bottom') +
geom_vline(xintercept = seq(0.5, labs.n, 1), color='white') +
geom_hline(yintercept = seq(0.5, labs.n, 1), color='white') +
scale_fill_manual('', values = c("red3", "green3")) +
labs(x='',
y='',
title = "Companion Planting Guide")
``` The interpretation of the companion planting guide is very easy: While green tiles signal companion plants that can be grown close to each other, red tiles flag antagonist plants that should not be grown together. The squares are gray when the plants are neither companions nor antagonists.

## How to scrape, import and visualize Telegram chats

### Intro

Telegram is a cloud-based and cross-platform instant messaging service. Unlike WhatsApp, Telegram clients exist not only for mobile devices but also for desktop operating systems. In addition, there is also a web based client.

In this blog post I show, how to import Telegram chats into R and how to plot a chat using the `qdap` package.

### R packages

The following code will install load and / or install the R packages required for this blog post.

```if (!require("pacman")) install.packages("pacman")
```

### Scraping the data

A very straightforward way to save Telegram chats is to use the Chrome extension Save Telegram Chat History. On Quora, Stacy Wu explains how to use it:

• Visit https://web.telegram.org.
• Select a peer you want to get the chat history from.
• Select all text (Ctrl+A) and copy it (Ctrl+C).
• Paste the text into a text editor (Ctrl+V) and save to a local file.

I have saved the chat to a local csv-file. Since the first two lines contain non-tabular information, they need to be manually removed. Furthermore, undesired line breaks must be manually removed as well.

### Importing the data

The following line of code shows how to import the csv file into R using the `read_csv()` function of the `readr` package.

```mydata <- readr::read_csv("telegram.csv", col_names=FALSE)
```

### Data wrangling

After importing the file, our data frame consists of two string variables: `X1` containing information about day and time of the conversations and `X2` containing the names of the persons involved in the chat as well as the chat text. With the following lines of code we create 4 new variables:

• `day` containing the dates of the chats,
• `time` containing the times of day of the chats,
• `person` containing the names of the persons involved in the chat,
• `txt` containing the actual chat text.
```mydata\$day <- stringr::str_sub(mydata\$X1, 1, 10)
mydata\$day <- lubridate::dmy(mydata\$day)
mydata\$time <- stringr::str_sub(mydata\$X1, 12, 19)
mydata\$time <- lubridate::hms(mydata\$time)
mydata\$person <- stringr::str_extract(mydata\$X2, "[^:]*")
mydata\$person <- factor(mydata\$person, levels = unique(mydata\$person), labels = c('Me', 'Other'))
mydata\$txt <- gsub(".*:\\s*","", mydata\$X2)
mydata <- mydata[, c(3:6)]
```
```## # A tibble: 2 × 4
##          day         time person   txt
##       <date> <S4: Period> <fctr> <chr>
## 1 2017-01-20  21H 10M 14S     Me Hello
## 2 2017-01-20  21H 11M 42S  Other  Huhu
```

Since the chat involves only two persons, I decided to plot it as gradient word cloud, a visualization technique developed by Tyler Rinker. The function `gradient_cloud()` I use in the next code snippet is part of his `qdap` package. Gradient word clouds “color words with a gradient based on degree of usage between two individuals” (See).

```gradient_cloud(mydata\$txt, mydata\$person, title = "Gradient word cloud of Telegram chat")
``` The chat I have ploted is very short and, thus, not very telling. I'm wondering how it looks in a couple of months.

## How to plot animated maps with gganimate

### Intro

Leipzig is the largest city in the federal state of Saxony, Germany. Since about 2009, Leipzig is one of the fastest growing German cities. Between 2006 and 2016, the population has grown from about 506.000 to 580.000 inhabitants.The largest part of the population growth can be attributed to inward migration.

In this blog post, I show how to create an animated map of Leipzig visualizing the migration balances for each of Leipzig's 63 districts (“Ortsteile”).

### The data

The data I will visualize in this blog post are provided by the local Statistical Bureau. They can be downloaded as .csv file from the following website. The shapefile I use, is not publicly available and can be purchased for about 25 Euro from the Statistical Bureau.

### R packages

The following code will install load and / or install the R packages required for this blog post.

```if (!require("pacman")) install.packages("pacman")
```

While the packages `readr` and `rgdal` are required for importing the .csv and the shapefile, `ggplot2` and `gganimate` are needed to plot and animate the data. The `reshape2` package is required for transforming data frames from wide to long format.

After importing the csv file containing the migration data, we add a key variable (id) to identify each of Leipzig's 63 districts. Using the `melt`-function of the `reshape2` package, we reformat the data frame from wide into long form.

```# import csv file
col_types = cols(`2015` = col_number()),
n_max = 63,
locale = locale(encoding = "latin1"),
trim_ws = TRUE)
df.migrat\$id <- c(1:63)
# drop variables not needed
df.migrat <- df.migrat[,c(1,18,2:17)]
# change name of first variable
colnames(df.migrat) <- 'district'
# convert from wide to long
df.migrat <- reshape2::melt(df.migrat, id.vars=c("district","id"))
# change name of third variable
colnames(df.migrat) <- 'year'
```

With the next code snippet, we import the shapefile using the `readOGR`-function of the `rgdal` package. The object we receive (`sh.file`) is of class SpatialPolygonsDataFrame. After adding a key variable, we transform this object into an ordinary data frame using the `fortify`-function of the `ggplot2` package. Furthermore, we join both data frames by the key variable `id`. Before we plot the data, we create a new variable (`co`) categorizing the migration balance variable (`value`).

```# import shapefile
sh.file\$id <- c(1:63)
# Create a data frame
sh.file <- fortify(sh.file, region = 'id')
# Merge data frames by id variable
mydata <- merge(sh.file, df.migrat, by='id')
# create a categorized variable
mydata\$co <- ifelse(mydata\$value >= 1000, 4,
ifelse(mydata\$value >= 0, 3,
ifelse(mydata\$value >= -1000, 2, 1)))
# define levels and labels
mydata\$co  <- factor(mydata\$co, levels = c(4:1),
labels = c('> 1000', '0-1000', '-1000-0', '< -1000'))
```

Moreover, we define a theme suitable for plotting maps with `ggpot2`. The following theme is a customized version of a theme I found on Stack Overflow.

```theme_opts <- list(theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
panel.background = element_blank(),
plot.background = element_blank(),
panel.border = element_blank(),
axis.line = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position="right",
plot.title = element_text(size=16)))
```

### Plotting the data

Finally, we can plot the data. The first plot (static) visualizes the migration data for the whole period of time (2000–2015).

```ggplot(mydata, aes(x = long, y = lat, group = group, fill=value)) +
theme_opts +
coord_equal() +
geom_polygon(color = 'grey') +
labs(title = "Migration balance between 2000 and 2015",
subtitle = "Static Map",
caption = 'Data source: Stadt Leipzig, Amt für Statistik und Wahlen, 2015') +
scale_fill_distiller(name='Migration\nbalance', direction = 1, palette='RdYlGn')
``` For creating an animated plot, we add the `frame` argument stemming from the `gganimate` package. With `interval = 5` we define the time (in seconds) between the “slides”.

```p <- ggplot(mydata, aes(x = long, y = lat, group = group, fill=value, frame=year)) +
theme_opts +
coord_equal() +
geom_polygon(color = 'grey') +
labs(title = "Migration balance in year: ",
subtitle = "Animated Map",
caption = 'Data source: Stadt Leipzig, Amt für Statistik und Wahlen, 2015') +
scale_fill_distiller(name='Migration\nbalance', direction = 1, palette='RdYlGn')

gg_animate(p, interval = 5)
``` 