Scoring the PHQ-9 Questionnaire Using R

Intro

The PHQ-9 is the nine-item depression module of the Patient Health Questionnaire. Each of the items is scored on a 4-point Likert scale ranging from 0 (not at all) to 3 (nearly every day). The items are summed to obtain a total score ranging from 0 to 27 with higher scores indicating greater severity of depression. Based on the total score, different levels of severity can be evaluated with 0–4, 5–9, 10–14, 15–19 and 20–27 points indicating “minimal”, “mild”, “moderate”, “moderately severe” and “severe” depression.
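The sum-and-categorize logic described above can be sketched in a few lines of base R. The variable names below (total, severity) are illustrative and not part of the PHQ-9 itself:

```r
# Hypothetical total scores for five respondents
total <- c(3, 7, 12, 17, 22)

# Map total scores to the five severity bands (0-4, 5-9, 10-14, 15-19, 20-27)
severity <- cut(total,
                breaks = c(-Inf, 4, 9, 14, 19, 27),
                labels = c("minimal", "mild", "moderate",
                           "moderately severe", "severe"))
```

Since cut() uses right-closed intervals by default, a total of 4 falls into "minimal" and 5 into "mild", exactly matching the bands above.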

The PHQ-9 questionnaire may be found at the following link.

In this blog post, I show how to calculate the PHQ-9 score and the PHQ-9 severity levels.

Packages and data

The dataset we are going to use was published in PLOS ONE. The file has a Digital Object Identifier (DOI) and may be imported into R using the read_delim() function of the readr package.

library(readr)
library(dplyr)
library(ggplot2)

df.phq9 <- readr::read_delim("https://doi.org/10.1371/journal.pone.0156167.s001", 
                             delim = ";", 
                             escape_double = FALSE, 
                             trim_ws = TRUE) %>%
            select(starts_with('phq9'))

glimpse(df.phq9)
## Observations: 1,337
## Variables: 9
## $ phq9_1 <int> 1, 3, 2, 0, 0, 0, 1, 0, 0, 2, 1, 1, 0, 3, 0, 0, 0, 2, 0...
## $ phq9_2 <int> 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ phq9_3 <int> 3, 2, 2, 2, 1, 0, 1, 3, 1, 0, 1, 1, 0, 3, 1, 0, 0, 0, 0...
## $ phq9_4 <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 1, 3, 0, 1, 0, 0, 0, 1, 0...
## $ phq9_5 <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0...
## $ phq9_6 <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ phq9_7 <int> 0, 1, 1, 1, 0, 1, 0, 0, 0, 3, 1, 1, 0, 1, 0, 0, 0, 0, 0...
## $ phq9_8 <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0...
## $ phq9_9 <int> 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

The Scoring Function

The scoring_phq9 function requires a data frame containing the PHQ-9 items (data) and a vector containing the items' names (items.phq9) as input parameters.

scoring_phq9 <- function (data, items.phq9) 
{
    data %>% 
    mutate(nvalid.phq9 = rowSums(!is.na(select(., items.phq9))), 
           mean.temp = rowSums(select(., items.phq9), na.rm = TRUE)/nvalid.phq9) %>% 
    mutate_at(vars(items.phq9), funs(ifelse(is.na(.), as.integer(mean.temp), .))) %>% 
    mutate(score.temp = rowSums(select(., items.phq9), na.rm = TRUE), 
           score.phq9 = ifelse(nvalid.phq9 >= 7, as.integer(score.temp), NA), 
           cutoff.phq9 = case_when(score.phq9 >= 20 ~ "severe", score.phq9 >= 15 ~ "moderately severe", 
                          score.phq9 >= 10 ~ "moderate", score.phq9 >= 5 ~ "mild", 
                          score.phq9 < 5 ~ "minimal"),
           cutoff.phq9 = factor(cutoff.phq9, levels = c('minimal', 'mild', 'moderate', 'moderately severe', 'severe'))) %>% 
    select(-mean.temp, -score.temp)
}
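To see what the dplyr pipeline above does, here is a base-R sketch of the same scoring rule on a made-up two-person example (the respondents and their answers are invented for illustration): missing answers are replaced with the truncated person mean, and the score is only computed when at least 7 of the 9 items are valid.

```r
# Two invented respondents: one complete, one with three missing items
x <- rbind(c(1, 2, 1, 0, 1, 2, 0, 1, 1),
           c(2, NA, NA, 1, 2, NA, 1, 2, 2))

nvalid <- rowSums(!is.na(x))   # number of valid items per person

# replace missing values with the truncated person mean (as in the function)
x.imp <- t(apply(x, 1, function(r) {
  r[is.na(r)] <- as.integer(mean(r, na.rm = TRUE))
  r
}))

# total score, but only if at least 7 items are valid
score <- ifelse(nvalid >= 7, as.integer(rowSums(x.imp)), NA)
```

The first respondent gets a score of 9; the second has only 6 valid items, so the score stays NA even after imputation.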

Example

The function adds three variables to the original data frame:

  • nvalid.phq9: Number of variables with valid values,
  • score.phq9: PHQ-9 score (0 – 27),
  • cutoff.phq9: PHQ-9 severity levels (minimal, mild, moderate, moderately severe, severe)

items.phq9 <- paste0('phq9_', seq(1, 9, 1))
df.phq9 <- df.phq9 %>%
  scoring_phq9(., items.phq9)
glimpse(df.phq9)
## Observations: 1,337
## Variables: 12
## $ phq9_1      <int> 1, 3, 2, 0, 0, 0, 1, 0, 0, 2, 1, 1, 0, 3, 0, 0, 0,...
## $ phq9_2      <int> 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ phq9_3      <int> 3, 2, 2, 2, 1, 0, 1, 3, 1, 0, 1, 1, 0, 3, 1, 0, 0,...
## $ phq9_4      <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 1, 3, 0, 1, 0, 0, 0,...
## $ phq9_5      <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 0,...
## $ phq9_6      <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ phq9_7      <int> 0, 1, 1, 1, 0, 1, 0, 0, 0, 3, 1, 1, 0, 1, 0, 0, 0,...
## $ phq9_8      <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,...
## $ phq9_9      <int> 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nvalid.phq9 <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,...
## $ score.phq9  <int> 7, 10, 7, 9, 3, 2, 3, 4, 5, 7, 7, 8, 0, 11, 1, 0, ...
## $ cutoff.phq9 <fct> mild, moderate, mild, mild, minimal, minimal, mini...

Visualization

PHQ-9 Score

ggplot(df.phq9, aes(score.phq9)) +
  geom_density(fill = 'blue', alpha = 0.2) +
  scale_x_continuous(limits = c(0, 27), breaks = c(0,5,10,15,20,27)) +
  labs(x = 'PHQ-9 Score', y = 'Density') +
  theme_bw()

plot of chunk unnamed-chunk-4

PHQ-9 Severity Levels

ggplot(df.phq9, aes(x = cutoff.phq9, fill = cutoff.phq9)) +
  geom_bar(colour = 'black') +
  scale_fill_brewer(type = 'seq') +
  labs(x = NULL, y = NULL, fill = NULL) +
  theme_bw()

plot of chunk unnamed-chunk-5


Sample Size Calculation Using R

Preamble

“Sample Size Calculation Using R” – the title of this blog post – sounds very comprehensive. However, this post only deals with the sample size calculation for one particular test (the two-sample t-test). A collection of functions for basic sample size calculation can be found in the pwr package.

Introduction

A couple of months ago, I attended a statistics course on sample size calculation and sample size adjustment. Among other things, we spoke about a randomized placebo-controlled trial with patients suffering from depression. The trial's primary endpoint was the difference between placebo and treatment group in the HAM-D total score (a depression rating scale) between baseline and the end of therapy. During the course, the professor asked us to write a “two-liner” returning the sample size per group for a two-sided, two-sample t-test with the following parameters:

  • alpha = 0.05,
  • sigma = 8 ,
  • delta = 4 and
  • power ≤ 0.90

Here is my solution:

As a first step, I wrote a loop returning the power rising from 0.1 to 0.9 and the corresponding sample sizes. The loop may be either a repeat or a while loop.

While Loop

A while loop runs and repeats while a specified condition returns TRUE. The loop terminates only when the condition evaluates to FALSE. In our case, the loop stops evaluating as soon as p > 0.9.

p <- 0.05
while(p <= 0.9){
  p <- p + 0.05
  t <- power.t.test(sd = 8, delta = 4, sig.level=0.05, power = p, type="two.sample", alternative = "two.sided")
  print(paste(t$power, ceiling(t$n), sep = ' - '))
}

Repeat Loop

In a repeat loop, it is required to define a so-called break statement to stop repeating the code. Just like the while loop, our repeat loop stops evaluating as soon as p > 0.9.

Both the while and the repeat loop return the same result.

p <- 0.05
repeat{
  p <- p + 0.05
  t <- power.t.test(sd = 8, delta = 4, sig.level=0.05, power = p, type="two.sample", alternative = "two.sided")
  print(paste(t$power, ceiling(t$n), sep = ' - '))
  if(p > 0.90){
    break
  }
}
## [1] "0.1 - 5"
## [1] "0.15 - 8"
## [1] "0.2 - 12"
## [1] "0.25 - 15"
## [1] "0.3 - 18"
## [1] "0.35 - 21"
## [1] "0.4 - 25"
## [1] "0.45 - 28"
## [1] "0.5 - 32"
## [1] "0.55 - 36"
## [1] "0.6 - 41"
## [1] "0.65 - 45"
## [1] "0.7 - 51"
## [1] "0.75 - 57"
## [1] "0.8 - 64"
## [1] "0.85 - 73"
## [1] "0.9 - 86"

For a very good introduction to loops, see Davies (2016): The Book of R. No Starch Press: San Francisco.

Storing the Results

In case we want to use the results for whatever purpose, we need to store them in vectors. The following example heavily borrows from Stack Overflow. The loop returns a data frame containing two variables: pwr (power) and num (sample size).

p <- 0.05
pwr <- c()
num <- c()
while(p <= 0.9){
  p <- p + 0.05
  t <- power.t.test(sd = 8, delta = 4, sig.level=0.05, power = p, type="two.sample", alternative = "two.sided")
  pwr <- c(pwr, t$power)
  num <- c(num, ceiling(t$n))
  df <- data.frame(pwr = pwr, n = num)
}
rm(list = c(setdiff(ls(), c("df"))))
dplyr::glimpse(df)
## Observations: 17
## Variables: 2
## $ pwr <dbl> 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55...
## $ n   <dbl> 5, 8, 12, 15, 18, 21, 25, 28, 32, 36, 41, 45, 51, 57, 64, ...
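As an aside, the same table can be built without growing vectors inside a loop: sapply() over the sequence of powers returns all sample sizes in one go (power.t.test() ships with base R's stats package). This is a sketch of an alternative, not the course's required "two-liner":

```r
# powers from 0.1 to 0.9 in steps of 0.05
powers <- seq(0.1, 0.9, by = 0.05)

# per-group sample size for each power, rounded up to whole patients
n <- sapply(powers, function(p)
  ceiling(power.t.test(sd = 8, delta = 4, sig.level = 0.05, power = p,
                       type = "two.sample", alternative = "two.sided")$n))

df <- data.frame(pwr = powers, n = n)
```

The last row of df reproduces the result above: a power of 0.9 requires 86 patients per group.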

Plotting Power and Sample Size

Finally, we plot the relation between power and sample size using the ggplot2 package. The theme I use stems from the ggpubr package.

ggplot(df, aes(pwr, n)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.1)) +
  scale_y_continuous(breaks = seq(0, max(df$n)+10, 10)) +
  geom_text(aes(label = n), nudge_y = 4) +
  ggpubr::theme_pubr() +
  labs(x = 'Power', y = 'Sample Size', 
       title = 'Power and Sample Size', 
        subtitle = expression('Two-sided, two sample t-test with' ~ sigma ~ '= 8,' ~ delta ~ '= 4,' ~ alpha ~ '= 0.05'))

plot of chunk unnamed-chunk-4

As we can see, a power of 0.90 requires a sample size of 86 patients per treatment group for a two-sided, two-sample t-test with sd = 8, delta = 4, and alpha = 0.05.


How to Plot Venn Diagrams Using R, ggplot2 and ggforce

Intro

Venn diagrams – named after the English logician and philosopher John Venn – “illustrate the logical relationships between two or more sets of items” with overlapping circles.

In this tutorial, I'll show how to plot a three-set Venn diagram using R and the ggplot2 package.

Packages and Data

For the R code to run, we need to install and load three R packages. Unlike tidyverse and ggforce, the limma package must be installed from Bioconductor rather than from CRAN.

Moreover, we create a random data frame using the rbinom() function.

source("http://www.bioconductor.org/biocLite.R")
biocLite("limma")
library(limma)
library(tidyverse)
library(ggforce)
set.seed(123)
mydata <- data.frame(A = rbinom(100, 1, 0.8),
                     B = rbinom(100, 1, 0.7),
                     C = rbinom(100, 1, 0.6)) %>%
                       mutate_all(., as.logical)

Drawing the Circles

Next, we create a data frame defining the x and y coordinates of the three circles we want to draw and a variable defining the labels. The circles – the basic structure of our Venn diagram – are drawn with the geom_circle() function of the ggforce package. With the parameter r (= 1.5), we define the radius of the circles.

df.venn <- data.frame(x = c(0, 0.866, -0.866),
                      y = c(1, -0.5, -0.5),
                      labels = c('A', 'B', 'C'))
ggplot(df.venn, aes(x0 = x, y0 = y, r = 1.5, fill = labels)) +
    geom_circle(alpha = .3, size = 1, colour = 'grey') +
      coord_fixed() +
        theme_void()

plot of chunk unnamed-chunk-2

Furthermore, we need a data frame with the values we want to plot and the coordinates for plotting these values. The values can be obtained using the vennCounts() function of the limma package. Since ggplot2 requires data frames, we first need to transform the vdc object (class VennCounts) into a matrix and then into a data frame. In addition, we need to add the x and y coordinates for plotting the values.

vdc <- vennCounts(mydata)
class(vdc) <- 'matrix'
df.vdc <- as.data.frame(vdc)[-1,] %>%
  mutate(x = c(0, 1.2, 0.8, -1.2, -0.8, 0, 0),
         y = c(1.2, -0.6, 0.5, -0.6, 0.5, -1, 0))
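If limma is not at hand, the region counts can also be cross-checked with base R logicals: with three logical columns there are 2^3 = 8 regions, and their counts must add up to the number of rows. The sketch below recreates the data with the same seed (using `== 1` instead of mutate_all(as.logical), which is equivalent):

```r
set.seed(123)
mydata <- data.frame(A = rbinom(100, 1, 0.8) == 1,
                     B = rbinom(100, 1, 0.7) == 1,
                     C = rbinom(100, 1, 0.6) == 1)

# count all 8 regions at once via a three-way contingency table
counts <- table(A = mydata$A, B = mydata$B, C = mydata$C)
sum(counts)                    # the 8 regions partition the 100 rows
with(mydata, sum(A & B & C))   # size of the triple overlap
```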

The final Plot

Finally, we'll customize the look of our Venn diagram and plot the values.

ggplot(df.venn) +
  geom_circle(aes(x0 = x, y0 = y, r = 1.5, fill = labels), alpha = .3, size = 1, colour = 'grey') +
  coord_fixed() +
  theme_void() +
  theme(legend.position = 'bottom') +
  scale_fill_manual(values = c('cornflowerblue', 'firebrick',  'gold')) +
  scale_colour_manual(values = c('cornflowerblue', 'firebrick', 'gold'), guide = FALSE) +
  labs(fill = NULL) +
  annotate("text", x = df.vdc$x, y = df.vdc$y, label = df.vdc$Counts, size = 5)

plot of chunk unnamed-chunk-4


How to plot GPS data using R, ggplot2 and ggmap

Intro

A couple of months ago, I bought a mobile GPS navigation device. The device makes it possible to record the routes we travel with our car. Since the routes can be exported as .gpx files, it is rather easy to plot them using R.

Packages

To replicate this blog post, three R packages are required: plotKML to import the data into R and ggplot2 and ggmap to plot the data.

library(plotKML)
library(ggplot2)
library(ggmap)

As a prerequisite for the installation of the plotKML package on my Linux Mint 18.3 OS (based on Ubuntu 16.04), I needed to add the ubuntugis PPA.

sudo add-apt-repository ppa:ubuntugis/ppa
sudo apt-get update
sudo apt-get dist-upgrade

Import

All that's required for importing the .gpx file into R is the readGPX() function of the plotKML package. After importing the file into R, we receive a list (lst.rd) containing some meta information and one data frame. To be able to plot the GPS data using ggplot2, we need to subset this list and save the data frame into a new object (df).

lst.rd <- readGPX('29 Dez. 1413.gpx')
df <- lst.rd$tracks[[1]][[1]] 

Using the head() function, we take a quick look at the data:

head(df)
##        lon      lat                 time
## 1 12.98376 51.05915 2017-12-29T12:16:28Z
## 2 12.98377 51.05912 2017-12-29T12:16:29Z
## 3 12.98374 51.05912 2017-12-29T12:16:30Z
## 4 12.98372 51.05914 2017-12-29T12:16:31Z
## 5 12.98372 51.05914 2017-12-29T12:16:32Z
## 6 12.98372 51.05914 2017-12-29T12:16:33Z

As we can see, our data frame consists of three variables: longitude, latitude and time.
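Note that the time column is imported as character. If we want to work with it – for example, to compute the recording duration – it can be parsed with base R's as.POSIXct(). A tiny sketch on two of the timestamps shown above:

```r
# two timestamps from the track, in the ISO 8601 format the device writes
times <- c("2017-12-29T12:16:28Z", "2017-12-29T12:16:33Z")

# parse as UTC date-times
parsed <- as.POSIXct(times, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# elapsed time between the two fixes
difftime(parsed[2], parsed[1], units = "secs")  # 5 seconds
```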

Plotting

Without much effort, we can plot our GPS data using ggplot2. Please note that GPS coordinates must be plotted as points.

ggplot(df, aes(x = lon, y = lat)) +
  coord_quickmap() +
  geom_point()

plot of chunk plot-1

In addition, the ggmap package offers some functionality to plot the data on maps.

The first example shows how to plot the data on a map provided by Google Maps.

mapImageData <- get_googlemap(center = c(lon = mean(df$lon), lat = mean(df$lat)),
                              zoom = 10,
                              color = 'bw',
                              scale = 1,
                              maptype = "terrain")
ggmap(mapImageData, extent = "device") + # removes axes, etc.
  geom_point(aes(x = lon,
                 y = lat),
             data = df,
             colour = "red3",
             alpha = .1,
             size = .1)

plot of chunk plot-2

The second example shows how to plot the data on a Stamen map.

mapImage <- get_map(location = c(lon = mean(df$lon) - 0.05, lat = mean(df$lat)),
                    source = "stamen",
                    maptype = "toner",
                    zoom = 10)

ggmap(mapImage, extent = "device") + 
  geom_point(aes(x = lon,
                 y = lat),
             data = df,
             colour = "red3",
             size = .2) 

plot of chunk plot-3

References

This blog post heavily borrows from the tutorial Mapping GPS Tracks in R. Thanks very much!


R Markdown: How to place two tables side by side using ‘knitr’ and ‘kableExtra’

Intro

When I was recently writing some report using R Markdown, I wanted to place two rather small tables side by side. Since I usually use the kable()-function of the knitr package and the kableExtra package to print tables, I tried to find a solution for my problem using both packages.

Since my Google search (“two tables side by side with kableExtra” or something similar) did not return a helpful result, I experimented with some of the table formatting options provided by the kableExtra package. Here is my solution.

Packages and data

For printing the tables, we need to install and load two packages: knitr and kableExtra. The dplyr package is required for some data manipulation. The data we want to put into the tables stem from the bundesligR package, which contains the final tables of Germany's highest football (soccer) league. We want to place the final tables of two seasons (1985/86 and 2015/16) side by side.

df <- bundesligR::bundesligR 
table.1985 <- df %>%
  filter(Season == 1985) %>%
    select(Position, Team, Points)
table.2015 <- df %>%
  filter(Season == 2015) %>%
    select(Position, Team, Points)

Now, we place both tables side by side using some functionality of the kableExtra package:

table.1985 %>%
  kable("html", align = 'clc', caption = 'Bundesliga, Season 1985/86') %>%
    kable_styling(full_width = F, position = "float_left")

table.2015 %>%
  kable("html", align = 'clc', caption = 'Bundesliga, Season 2015/16') %>%
    kable_styling(full_width = F, position = "right")
Bundesliga, Season 1985/86
Position Team Points
1 FC Bayern Muenchen 70
2 Werder Bremen 69
3 FC Bayer 05 Uerdingen 64
4 Borussia Moenchengladbach 57
5 VfB Stuttgart 58
6 TSV Bayer 04 Leverkusen 55
7 Hamburger SV 56
8 SV Waldhof Mannheim 44
9 VfL Bochum 46
10 FC Schalke 04 41
11 1. FC Kaiserslautern 40
12 1. FC Nuernberg 41
13 1. FC Koeln 38
14 Fortuna Duesseldorf 40
15 Eintracht Frankfurt 35
16 Borussia Dortmund 38
17 1. FC Saarbruecken 27
18 Hannover 96 23
Bundesliga, Season 2015/16
Position Team Points
1 FC Bayern Muenchen 88
2 Borussia Dortmund 78
3 Bayer 04 Leverkusen 60
4 Borussia Moenchengladbach 55
5 FC Schalke 04 52
6 1. FSV Mainz 05 50
7 Hertha BSC 50
8 VfL Wolfsburg 45
9 1. FC Koeln 43
10 Hamburger SV 41
11 FC Ingolstadt 04 40
12 FC Augsburg 38
13 Werder Bremen 38
14 SV Darmstadt 98 38
15 TSG 1899 Hoffenheim 37
16 Eintracht Frankfurt 36
17 VfB Stuttgart 33
18 Hannover 96 25

The trick is to set the position argument to float_left (left table) and right (right table). Furthermore, the argument full_width must be set to FALSE in both tables.

To Do

Unfortunately, the given example only works for rendering HTML documents. Does anyone know how to place two tables side by side when the output format is PDF/LaTeX?


Analysis of the DFG Review Board Election 2015 — Part I

The Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) is Europe's largest research funding organization, with a current annual budget of around 3 billion euros (2016). The DFG's task is the selection and funding of scientific research projects at universities and research institutes.

According to the DFG, submitted funding proposals are assessed in a multi-stage decision process “by reviewers working in an honorary capacity, based exclusively on scientific criteria.” In a second step, these reviews are evaluated by the members of so-called review boards (Fachkollegien). The boards make a preselection and decide whether a proposal process is continued or discontinued. The actual funding decision, however, rests with the DFG's Joint Committee.

For the election of the review boards, a new electoral procedure was adopted in 2014, under which the candidates with the most nominations receive a place on the ballots (is that correct?). According to the Bamberg historian Hartwin Brandt, writing in two articles for the Frankfurter Allgemeine Zeitung (FAZ, 17 May 2017 and 18 Oct 2017), this has led to unofficial agreements between university representatives – with the aim of placing their own candidates on the ballots.

Brandt, who identifies a “northern alliance” of the universities of Bremen, Greifswald, Hamburg, Kiel and Rostock in the field of Classical Philology, suspects that a “detailed analysis of the more than 600-page DFG list of candidates” would likely bring numerous further irregularities and compensation deals to light, and surmises that “sociological network and friendship researchers will enjoy this dataset.”

The “list of candidates for the 2015 DFG review board election” published by the DFG (as of 17 Sept 2015) is a PDF document of 617 A4 pages. It can be downloaded from the DFG website (link). In addition, the DFG has published a list of the election results.

To analyze the information contained in both documents statistically, it first has to be converted into a machine-readable format. With the extract_text() function from the R package tabulizer, the names and numbers of the individual subject areas can be read out. In principle, the extract_tables() function can also extract tables contained in PDF files. However, my attempts to apply it to the PDF document in question were unsuccessful, so I transferred the data manually into an Excel file via copy and paste. Since copy and paste is a comparatively error-prone procedure, I would be glad if readers of this post pointed out any errors to me. The dataset can be downloaded from my GitHub repository (currently only in .rda format).

The dataset df.dfg contains the following variables:

  • wb.nummer: number of the research area (Wissenschaftsbereich);
  • wb.name: name of the research area;
  • fg.nummer: number of the subject area (Fachgebiet);
  • fg.name: name of the subject area;
  • fk.nummer: number of the review board (Fachkollegium);
  • fk.name: name of the review board;
  • fach.nummer: number of the subject;
  • fach.name: name of the subject;
  • person.name: surname and first name of the candidate;
  • person.sex: sex of the candidate;
  • person.inst: institution of the candidate;
  • inst.von: nominating institutions;
  • inst.von.kat: categorization of the nominating institution: university vs. learned society vs. other;
  • anzahl.plaetze: number of electable persons per review board;
  • anzahl.kandidaten: number of candidates per review board;
  • person.platz: rank of the candidate (election result);
  • person.stimmen: number of votes the candidate received (election result);
  • person.gewaehlt: was the candidate elected?
## Observations: 9,695
## Variables: 18
## $ wb.nummer         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ wb.name           <fct> Geistes- und Sozialwissenschaften, Geistes- ...
## $ fg.nummer         <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, ...
## $ fg.name           <fct> Geisteswissenschaften, Geisteswissenschaften...
## $ fk.nummer         <chr> "FN_101", "FN_101", "FN_101", "FN_101", "FN_...
## $ fk.name           <chr> "Alte Kulturen", "Alte Kulturen", "Alte Kult...
## $ fach.nummer       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ fach.name         <chr> "Ur- und Frühgeschichte (weltweit)", "Ur- un...
## $ person.name       <chr> "Bemmann, Jan", "Brather, Sebastian", "Brath...
## $ person.sex        <fct> m, m, m, w, m, m, m, m, m, w, w, w, m, m, m,...
## $ person.inst       <chr> "U Bonn", "U Freiburg", "U Freiburg", "RGZM"...
## $ inst.von          <chr> "U Bonn", "U Freiburg", "WGL", "WGL", "Dt. V...
## $ inst.von.kat      <chr> "Hochschule", "Hochschule", "Sonstiges", "So...
## $ anzahl.plaetze    <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ anzahl.kandidaten <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,...
## $ person.platz      <dbl> 7, 5, 5, 6, 2, 2, 9, 8, 8, 1, 1, 1, 3, 3, 4,...
## $ person.stimmen    <dbl> NA, NA, NA, NA, 195, 195, NA, NA, NA, 255, 2...
## $ person.gewaehlt   <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALS...

In the medical subjects there is the peculiarity that for candidates employed at a university hospital, the hospital rather than the university is listed as the institution. To be able to reconstruct actually existing network structures, whenever a hospital could be unambiguously assigned to a university, I stored the name of the university in the institution variable (e.g. “Universitätsklinikum Münster” as “Universität Münster”). If no unambiguous assignment was possible, I did not overwrite the name of the hospital (e.g. “Universitätsklinikum Schleswig-Holstein”).

The list of candidates can be divided into four “research areas”, to which the review boards are assigned. Finally, each review board is assigned a set of subjects (e.g. the review board “Literaturwissenschaften” comprises “Ältere deutsche Literatur”, “Neuere deutsche Literatur”, etc.).

  • Humanities and Social Sciences
    • Humanities (review boards 101–108)
    • Social and Behavioural Sciences (review boards 109–113)
  • Life Sciences
    • Biology (review boards 201–203)
    • Medicine (review boards 204–206)
    • Agriculture, Forestry and Veterinary Medicine (review board 207)
  • Natural Sciences
    • Chemistry (review boards 301–307)
    • Physics (review boards 308–311)
    • Mathematics (review board 312)
    • Geosciences (review boards 313–318)
  • Engineering Sciences
    • Mechanical and Industrial Engineering (review boards 401–402)
    • Thermal Engineering/Process Engineering (review boards 403–404)
    • Materials Science and Engineering (review boards 405–406)
    • Computer Science, Systems and Electrical Engineering (review boards 407–409)
    • Construction Engineering and Architecture (review board 410)

In the coming months, I will attempt a statistical analysis of the DFG candidate list in a series of blog posts.

Last updated: 17 Feb 2018


How to draw a CONSORT flow diagram using R and GraphViz

Intro

The acronym CONSORT stands for “Consolidated Standards of Reporting Trials”. The CONSORT statement is published by the CONSORT group and gives recommendations aiming to improve the quality of reports of randomized clinical trials. Besides a 25-item checklist, the current CONSORT statement (2010) includes the CONSORT flow diagram, which visualizes the progress through the phases of a parallel randomized clinical trial of two groups (see Figure 1).

In this blog post, I will both show how to “draw” the CONSORT flow diagram using R and discuss the pros and cons of this approach.

Figure 1: CONSORT 2010 Flow Diagram

R Packages

The following R packages are required to create, plot and save flow diagrams: DiagrammeR provides an interface to the Graphviz software, which includes a graph description language called DOT. DiagrammeRsvg and rsvg are required to save diagrams to a hard drive.

library(DiagrammeR)
library(DiagrammeRsvg)
library(rsvg)

In case we want to include the flow diagram into a .Rmd document and render it as PDF, we need to install the phantomjs JavaScript library. This can be done using the install_phantomjs() function of the webshot package.

webshot::install_phantomjs()

GraphViz is very flexible regarding the definition of line colors, arrow shapes, node shapes, and many other layout features. However, since the actual purpose of GraphViz is the visualization of graphs (networks), it is difficult to control the position of the nodes.

Unlike in a Cartesian plot, the position of the nodes is not defined by x and y coordinates. Instead, the nodes' position can only be determined by edges (relationships between nodes) and their attributes.
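To illustrate the point, here is a minimal DOT sketch (my own, not taken from the diagram code below): nodes can only be pulled onto the same horizontal level with a rank = same statement, and edges – including invisible ones – determine the vertical and horizontal ordering.

```dot
digraph layout_sketch {
  node [shape = box];
  // A and B are forced onto the same horizontal level
  { rank = same; A; B }
  A -> C;                   // a vertical edge places C one row below A
  A -> B [style = invis];   // an invisible edge fixes the left-to-right order
}
```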

Creating a grid

When I looked at the CONSORT flow diagram (see Figure 1), I had the impression that the nodes may be placed using a grid-based system.

Figure 2: Placement of nodes using a grid
plot of chunk structure

The CONSORT flow diagram conceptualized as a grid consists of 21 nodes. That is:

  • 4 nodes with plain text (blue, T1–T4) describing the phase of the trial (Enrollment, Allocation, Follow-Up, Analysis) and
  • 9 boxed nodes (black, B1–B9) containing text (“Randomized”) and numerical data (“n=200”).

In order to populate each cell of the grid, we need to add:

  • 4 nodes rendered as tiny and, thus, invisible points (green, P1–P4). We use these points to mimic edges branching off from other edges.
  • 4 completely invisible nodes (grey, I1–I4).

The numbers in brackets denote the nodes' ID numbers.

Eventually, the grid structure is created using both vertical and horizontal edges.

Creating the CONSORT Flow Diagram

When creating the actual CONSORT diagram, we are going to make use of the grid structure. All we have to do now is change some attributes of the nodes and edges.

Preparation

To make the flow diagram dynamic, we need to save constant elements (text labels) and dynamic elements (numeric values) into separate vectors (text, values). The paste1() function is a simple wrapper around the base R paste0() function, making it more convenient to concatenate text labels and numeric values.

# Values ------------------------------------------------------------------
values <- c(210, 10, 200, 100, 100, 10, 10, 90, 90)

# Defining Text Labels ----------------------------------------------------
text <- c('Assessment for\neligibility',
          'Excluded',
          'Randomized',
          'Allocated to\nintervention',
          'Allocated to\nintervention',
          'Lost to follow-up',
          'Lost to follow-up',
          'Analysed',
          'Analysed')

# Defining Function -------------------------------------------------------
paste1 <- function(x, y){
  paste0(x, ' (n=', y, ')')
}

# Concatenating Values and Text Labels ------------------------------------
LABS <- paste1(text, values)
LABS
## [1] "Assessment for\neligibility (n=210)"
## [2] "Excluded (n=10)"                    
## [3] "Randomized (n=200)"                 
## [4] "Allocated to\nintervention (n=100)" 
## [5] "Allocated to\nintervention (n=100)" 
## [6] "Lost to follow-up (n=10)"           
## [7] "Lost to follow-up (n=10)"           
## [8] "Analysed (n=90)"                    
## [9] "Analysed (n=90)"

With the next step, we create a node data frame (ndf) using the create_node_df() function of the DiagrammeR package. The attributes of this function are pretty much self-explanatory.

Node Data Frame

ndf <-
  create_node_df(
    n = 21,
    label = c('Enrollment', 'Allocation', 'Follow-Up', 'Analysis',
              LABS, rep("", 8)),
    style = c(rep("solid", 13), rep('invis', 8)),
    shape = c(rep("plaintext", 4), 
              rep("box", 9),
              rep("point", 8)),
    width = c(rep(2, 4), rep(2.5, 9), rep(0.001, 8)),
    height = c(rep(0.5, 13), rep(0.001, 8)),
    fontsize = c(rep(14, 4), rep(10, 17)),
    fontname = c(rep('Arial Rounded MT Bold', 4), rep('Courier New', 17)),
    penwidth = 2.0,
    fixedsize = "true")

Edge Data Frame

Furthermore, we create an edge data frame (edf) using the create_edge_df() function. Edges we don't want to be visible are made transparent ('#00000000', i.e. alpha = 0) and don't get an arrowhead. To make edges horizontal, the constraint attribute needs to be set to false. from and to are vectors containing the node IDs from which edges are outbound and incoming; they determine the relationships between the nodes. To arrange the code more clearly, I've ordered outbound and incoming edges by columns and rows according to Figure 2.

edf <-
  create_edge_df(
    arrowhead = c(rep('none', 3), rep("vee", 3), rep('none', 2), "vee", rep('none', 6),
                  rep("vee", 3), rep("none", 3), "vee", rep("none", 10)),
    color = c(rep('#00000000', 3), rep('black', 6), rep('#00000000', 6),
              rep('black', 3), rep('#00000000', 3), rep('black', 1),
              rep('#00000000', 2), rep('black', 2), 
              rep('#00000000', 6)),
    constraint = c(rep("true", 18), rep('false', 14)),
    from = c(1, 19, 20, 16, 8, 10, # column 1
             5, 14, 7, 15, 2, 3, # column 2
             18, 6, 21, 17, 9, 11, # column 3
             1, 5, # row 1
             19, 14, # row 2
             20, 7, # row 3
             16, 15, # row 4
             8, 2, # row 5
             10, 3, # row 6
             12, 4), # row 7
    to = c(19, 20, 16, 8, 10, 12, # column 1
           14, 7, 15, 2, 3, 4, # column 2
           6, 21, 17, 9, 11, 13, # column 3
           5, 18, # row 1
           14, 6, # row 2
           7, 21, # row 3
           15, 17, # row 4
           2, 9, # row 5
           3, 11, # row 6
           4, 13)) # row 7

Plotting

Before plotting the flow diagram, we need to create a graph object using the create_graph() function. The graph can then be plotted using the render_graph() function.

# Create Graph ------------------------------------------------------------
g <- create_graph(ndf, 
                  edf,
                  attr_theme = NULL)

# Plotting ----------------------------------------------------------------
render_graph(g)

Figure 3: Simplified CONSORT Flow Diagram
plot of chunk unnamed-chunk-1

Unfortunately, I didn't find a way to left-align the text within the text boxes. Thus, I didn't manage to draw a fully featured CONSORT flow diagram using R.

Export

Saving the flow diagram to a hard drive is quite straightforward:

export_graph(g, file_name = "CONSORT.png")

File types available for export are PNG, PDF, SVG, PS and GEXF (a graph file format).

Discussion

As demonstrated in this blog post, “drawing” a CONSORT flow diagram using R and GraphViz is rather cumbersome. Because of some limitations regarding the alignment of text within boxes, only a simplified form of the CONSORT flow diagram can be drawn. However, using R seems to be the only way to make flow diagrams dynamic.
