How to draw a CONSORT flow diagram using R and GraphViz


The acronym CONSORT stands for “Consolidated Standards of Reporting Trials”. The CONSORT statement is published by the CONSORT group and gives recommendations aiming to improve the quality of reports of randomized clinical trials. Besides a 25 item checklist, the current CONSORT statement (2010) includes the CONSORT flow diagram which is to visualize the progress through the phases of a parallel randomised clinical trial of two groups (see Figure 1).

In this blog post, I will do both show how to “draw” the CONSORT flow diagram using R and discuss the pros and cons of this approach.

Figure 1: CONSORT 2010 Flow Diagram

R Packages

The following R packages are required to create, plot and save flow diagrams: DiagrammeR supports the Graphviz software which consists of a graph description language called DOT. DiagrammeRsvg and rsvg are required to save diagrams to a hard drive.


In case we want to include the flow diagram into a .Rmd document and render it as PDF, we need to install the phantomjs JavaScript library. This can be done using the install_phantomjs() function of the webshot package.


GraphViz is very flexible regarding the definition of line colors, arrow shapes, node shapes, and many other layout features. However, since the actual purpose of GraphViz is the visualization of graphs (networks), it is difficult to control the position of the nodes.

Unlike in a Cartesian plot, the position of the nodes is not defined by x and y coordinates. Instead, the nodes' position can only be determined by edges (relationships between nodes) and their attributes.

Creating a grid

When I looked at the CONSORT flow diagram (see Figure 1), I had the impression that the nodes may be placed using a grid-based system.

Figure 2: Placement of nodes using a grid
plot of chunk structure

The CONSORT flow diagram conceptualized as grid consists of 21 nodes. That is:

  • 4 nodes with plain text (blue, T1-T4) describing the phase of trial (Enrollment, Allocation, Follow-Up, Analysis) and
  • 9 boxed nodes (black, B1-B9) containing text (“Randomized”) and numerical data (“n=200”).

In order to populate each cell of the grid, we need to add:

  • 4 nodes rendered as tiny and, thus, invisible points (green, P1-P4). We use the points to mimic edges brancing-off from other edges.
  • 4 completely invisible nodes (grey, I1_I4).

The numbers in brackets denote the nodes' ID numbers.

Eventually, the grid structure is created using both vertical and horizontal edges.

Creating the CONSORT Flow Diagram

When creating the actual CONSORT diagram we gonna make use of the grid structure. All we have to do now is to change some attributes of the nodes and edges.


To make the flow diagram dynamic, we need to save constant elements (text labels) and dynamic elements (numeric values) into separate vectors (text, values). The paste1()-function is a simple wrapper of the base R paste0()-function making it more convenient to concatinate text labels and numeric values.

# Values ------------------------------------------------------------------
values <- c(210, 10, 200, 100, 100, 10, 10, 90, 90)

# Defining Text Labels ----------------------------------------------------
text <- c('Assessment for\neligibility',
          'Allocated to\nintervention',
          'Allocated to\nintervention',
          'Lost to follow-up',
          'Lost to follow-up',

# Defining Function -------------------------------------------------------
paste1 <- function(x, y){
  paste0(x, ' (n=', y, ')')

# Concatenating Values and Text Labels ------------------------------------
LABS <- paste1(text, values)
## [1] "Assessment for\neligibility (n=210)"
## [2] "Excluded (n=10)"                    
## [3] "Randomized (n=200)"                 
## [4] "Allocated to\nintervention (n=100)" 
## [5] "Allocated to\nintervention (n=100)" 
## [6] "Lost to follow-up (n=10)"           
## [7] "Lost to follow-up (n=10)"           
## [8] "Analysed (n=90)"                    
## [9] "Analysed (n=90)"

With the next step, we create a node data frame (ndf) using the create_node_df()-function of the DiagrammR package. The attributes of this function are pretty much self-explanatory.

Node Data Frame

ndf <-
    n = 21,
    label = c('Enrollment', 'Allocation', 'Follow-Up', 'Analysis',
              LABS, rep("", 8)),
    style = c(rep("solid", 13), rep('invis', 8)),
    shape = c(rep("plaintext", 4), 
              rep("box", 9),
              rep("point", 8)),
    width = c(rep(2, 4), rep(2.5, 9), rep(0.001, 8)),
    hight = c(rep(0.5, 13), rep(0.001, 8)),
    fontsize = c(rep(14, 4), rep(10, 17)),
    fontname = c(rep('Arial Rounded MT Bold', 4), rep('Courier New', 17)),
    penwidth = 2.0,
    fixedsize = "true")

Edge Data Frame

Furthermore, we create an edge data frame (edf) using the create_edge_df()-function. Edges we don't want to be visible are colored white (with alpha = 0) and don't get an arrowhead. To make edges horizontal, the constraint attribute needs to be set to false. from and to are vectors containing node IDs from which edges are outbound and incoming. They determine the relationships between the nodes. To arrange the code more clearly, I've ordered outbound and incoming edges by columns and rows according to Figure 2.

edf <-
    arrowhead = c(rep('none', 3), rep("vee", 3), rep('none', 2), "vee", rep('none', 6),
                  rep("vee", 3), rep("none", 3), "vee", rep("none", 10)),
    color = c(rep('#00000000', 3), rep('black', 6), rep('#00000000', 6),
              rep('black', 3), rep('#00000000', 3), rep('black', 1),
              rep('#00000000', 2), rep('black', 2), 
              rep('#00000000', 6)),
    constraint = c(rep("true", 18), rep('false', 14)),
    from = c(1, 19, 20, 16, 8, 10, # column 1
             5, 14, 7, 15, 2, 3, # column 2
             18, 6, 21, 17, 9, 11, # column 3
             1, 5, # row 1
             19, 14, # row 2
             20, 7, # row 3
             16, 15, # row 4
             8, 2, # row 5
             10, 3, # row 6
             12, 4), # row 7
    to = c(19, 20, 16, 8, 10, 12, # column 1
           14, 7, 15, 2, 3, 4, # column 2
           6, 21, 17, 9, 11, 13, # column 3
           5, 18, # row 1
           14, 6, # row 2
           7, 21, # row 3
           15, 17, # row 4
           2, 9, # row 5
           3, 11, # row 6
           4, 13)) # row 7


Before we are going to plot the flow diagram, we need to create a graph using the create_graph()-function. The graph can be plotted using the render_graph()-function.

# Create Graph ------------------------------------------------------------
g <- create_graph(ndf, 
                  attr_theme = NULL)

# Plotting ----------------------------------------------------------------

Figure 3: Simplified CONSORT Flow Diagram
plot of chunk unnamed-chunk-1

Unfortunately, it didn't find a way to left align text within the text boxes. Thus, I didn't manage to draw a fully featured CONSORT flow diagram using R.


Saving the flow diagram to a hard drive is quite staight forward:

export_graph(g, file_name = "CONSORT.png")

File types to be exported are PNG, PDF, SVG, PS and GEXF (graph file format).


As demonstrated in this blog post, “drawing” a CONSORT flow diagram using R and GraphViz is rather cumbersome. Because of some limitations regarding the alignment of text within boxes, only a simplyfied form of the CONSORT flow diagram can be drawn. However, the use of R seems to be the only way to make flow diagrams dynamic.

Posted in Visualizing Data | Tagged | Leave a comment

How to Assign Variable Labels in R


Defining variable labels is a useful way to describe and document datasets. Unlike SPSS, which makes it very easy to define variable labels using the data editor, base R doesn't provide any function to define variable labels (as far as I know).

However, Daniel Luedecke's R package sjlablled fills this gap. Let's give an example.

Defining variable labels

First, we load the mtcars data frame and define variable labels for all of the 11 variables:

labs <- c("Miles/(US) gallon", "Number of cylinders", "Displacement (", 
    "Gross horsepower", "Rear axle ratio", "Weight (1000 lbs)", "1/4 mile time", 
    "V/S", "Transmission", "Number of forward gears", "Number of carburetors")

Assigning labels to variables

Second, we assign the variable labels to the variables of the mtcars data frame:

mtcars <- set_label(mtcars, label = labs)

When we have a look at the mtcars data frame using RStudio's data viewer, we find the variable labels placed right underneath the variable names:

Moreover, we may as well save both variable names and labels into a data frame:

library(dplyr) # for data manipulation
library(knitr) # for printing tables
df <- get_label(mtcars) %>%
        data.frame() %>%
          rename_at(vars(1), funs(paste0('var.labs'))) %>%
            mutate(var.names = colnames(mtcars)) 
kable(df, align = 'lc')
var.labs var.names
Miles/(US) gallon mpg
Number of cylinders cyl
Displacement ( disp
Gross horsepower hp
Rear axle ratio drat
Weight (1000 lbs) wt
¼ mile time qsec
V/S vs
Transmission am
Number of forward gears gear
Number of carburetors carb
Posted in Data Management, Fav R Packages, Tips & Tricks | Tagged | Leave a comment

Multiple Imputations of missing values — ein Erfahrungsbericht


In der Forschungspraxis werden Datenpunkte mit fehlenden Werten oft von der Analyse ausgeschlossen oder durch einen zentralen Lageparameter (Mittelwert, Median) ersetzt. Als Alternative dazu hat sich das Verfahren der Multiplen Imputation etabliert, welches in diesem Beitrag erklärt wird.

Die im Folgenden geschilderten Erfahrungen habe ich mit der Statistiksoftware R und dem mice-Paket gesammelt [2].

R Pakete

In dieser Blog-Post habe ich die folgenden R-Pakete verwendet:



Bei der Durchführung statistischer Auswertungen ist man oft mit dem Auftreten fehlender Werte konfrontiert. Schließt man Datenpunkte mit fehlenden Werten von der Analyse aus, führt das zu einer Reduktion der Fallzahl und damit zu einer geringeren Power der statistischen Tests. Da Werte nur selten zufällig fehlen, kann ein Fallausschluss zu einer Verzerrung der Ergebnisse führen. Ersetzt man fehlende Werte hingegen durch den Mittelwert oder Median der entsprechenden Messgröße führt dies zu einer Reduktion der Varianz, womit sich das Risiko für einen alpha-Fehler erhöht.

Als Alternative zu diesen und anderen Methoden wurde in den letzten Jahrzehnten die Methode der Multiplen Imputation (MI) entwickelt [5]. Ausgehend von einem Datensatz mit fehlenden Werten wird eine festgelegte Anzahl von kompletten Datensätzen (m) erzeugt, indem die fehlenden Werte durch sogenannte plausible Werte ersetzt werden. Diese plausiblen Werte werden einer Datenverteilung entnommen, die für die jeden fehlenden Wert modelliert wird. Die m imputierten Datensätze gleichen sich hinsichtlich der beobachteten Datenpunkte, unterscheiden sich jedoch in den imputierten fehlenden Werten, wobei die Unterschiede der Unsicherheit des Imputationsprozesses Rechnung tragen [1]. Das Verfahren der MI lässt sich mittlerweile mit (fast) allen gängigen Softwarepaketen (R, SAS, SPSS, STATA) durchführen.


Entscheidend für die Qualität einer MI ist die Spezifikation des Imputationsmodells, welche in Abhängigkeit vom Variablentyp erfolgen muss. Für die Imputation numerischer Variablen kann z.B. die Methode des “Predictive mean matching” (pmm) verwendet werden. Diese Methode hat den Vorteil, dass die imputierten Werte nur innerhalb der durch die Variable vorgegebenen Spannweite liegen. Imputiert man z.B. die fehlenden Werte der Skala eines EORTC-Fragebogens, liegen die imputierten Werte im Bereich 0 bis 100.

Datensätze, in denen Längsschnittdaten erfasst werden, können in zwei verschiedenen Datenformaten vorliegen. So ist in einem flachen Datensatz für jeden Fall (z.B. Patienten) genau eine Zeile reserviert, während pro Messgröße (z.B. Lebensqualitätsskala) und Befragungszeitpunkt jeweils eine Variable existiert. In einem tiefen Datensatz hingegen entspricht die Anzahl der Zeilen pro Fall der Anzahl der Messzeitpunkte, während pro Messgröße nur eine Variable angelegt wird, welche die Messungen aller Messzeitpunkte enthält. Der Zeitpunkt der Intervention bzw. Befragung wird in einer separaten Variable angegeben. Für die Durchführung einer MI empfiehlt es sich, mit einem tiefen Datensatz zu arbeiten.

Arbeitet man mit Lebensqualitätsfragebögen, so steht man vor der Entscheidung, die Imputationen entweder auf Ebene der Einzelitems oder daraus berechneter Skalen durchzuführen. Da statistische Analysen in der Regel auf der Grundlage von Skalen erfolgen, empfiehlt es sich, die fehlenden Skalenwerte zu imputieren. Dazu kommt, dass die Anzahl der Skalen deutlich geringer ist als die Anzahl der Einzelitems. Mit der Anzahl der zu imputierenden Variablen reduziert sich auch die für die MI benötigte Zeit bzw. Rechnerleistung [4]. Darüber hinaus erscheint es sinnvoll, alle Variablen mit fehlenden Werten in einem “Durchgang” zu imputieren.


Imputiert man fehlende Werte eines Datensatzes in tiefer Form, erhält man ein Objekt der Klasse mids. Dieses enthält neben den ursprünglichen, nicht-imputierten Daten auch die Daten der m imputierten Datensätze. Grundlage für die Datenanalyse ist eine gepoolte Version der imputierten Datensätze. Benötigt man aber für die Datenanalyse den Baselinewert einer Variable als eigene Variable, so steht man vor dem Problem, die Daten umformen zu müssen. Da dies unter Beibehaltung der m imputierten Datensätze geschehen soll, muss dass mids-Objekt zunächst in eine gestaplete Matrix (auch stacked oder tall matrix) umgewandelt werden [3]. Hat man diese Matrix in die gewünschte Form gebracht, so kann sie problemlos in ein mids-Objekt zurückverwandelt werden, auf dessen Grundlage statistische Modelle berechnet werden können.

Eigene Erfahrungen haben außerdem gezeigt, dass eine zu starke Collinearität der imputierten Variablen zu einer Verzerrung der imputierten Werte führen kann. Möchte man beispielsweise fehlende Werte von Sub-Skalen eines Fragebogens sowie der entsprechenden Gesamtskala imputieren, empfielt es sich, die Gesamtskala als informations-gebende Variable aus dem Imputationsprozess auszuschließen.


Um die Güte der MI beurteilen zu können sollte geprüft werden, ob die Verteilung der nicht-imputierten in etwa mit jener der imputierten Werte übereinstimmt. Dies kann z.B. mit Hilfe von Histogrammen oder einer Kombination aus Stripcharts und Violinplots erfolgen* (vgl. Abb. 1).

Abb. 1: Stripchart zur Güteüberprüfung der MI

Abb. 1 zeigt eine Kombination aus Stripchart und Violinplot für eine Variable (n=200) mit 50 fehlenden Werten, die 5 mal imputiert wurden (Methode: pmm). Es wird deutlich, dass die Verteilung der gepoolten imputierten Werte in etwa der Verteilung der nicht-imputierten Werte entspricht.

Fazit für die Praxis

Im Vergleich zum traditionellen Umgang mit fehlenden Werten (Fallausschluss, einfache Imputationen) führen Multiple Imputationen weder zu einem Verlust statistischer Power noch zu einer Unterschätzung der Varianz. Auch die Schätzwerte werden in der Regel nicht zu stark verzerrt. In der Forschungspraxis bieten Multiplen Imputationen eine gute Möglichkeit im Umgang mit fehlenden Werten.


  1. Buuren, Stef van. 2012. Flexible Imputation of Missing Data. CRC Press.

  2. Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. Mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 45 (3).

  3. Errickson, Josh. 2017. Multiple Imputation.

  4. Graham, John W. 2009. Missing Data Analysis: Making It Work in the Real World. Annual Review of Psychology 60. Annual Reviews: 549-76.

  5. Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. Vol. 81. John Wiley & Sons.

*PS: Vielen Dank an Axel Klenk für den Tipp.

Posted in Indroduction | Tagged , | 2 Comments

Scoring the St George’s Respiratory Questionnaire (SGRQ) using R


plot of chunk sri

The St George's Respiratory Questionnaire (SGRQ) is an instrument for the measuring of Health-Related Quality-of-Life in patients with diseases of airways obstruction. The SGRQ contains 50 items covering three domains:

  • Symptoms (8 items),
  • Activity (16 items), and
  • Impacts (26 items).

In addition, a total summary scale may be computed (1, 2).

All scales have a score range between 0 and 100 with higher scores indicating a worse quality of life [2]. The items are either scored on 3-point-, 4-point-, and 5-point Likert scales, or they are binary-choice items that must be answered with either “yes” or “no”. Each item has an empirically derived regression weight.

Scoring the SGRQ

Based on the SGRQ Scoring Manual, I have written the R-package sgrqr for calculating the SGRQ scores.


The package is hosted on GitHub and may be installed using the following code:


Functions and data

The core of sgrqr is the function scoring_sgrq(). It must be applied to a data frame with the 50 SGRQ items and one id variable. Moreover, the package contains two data frames with simulated values. Unlike sgrq.full, has some missing values.

##  [1] "id"       "sgrq.1"   "sgrq.2"   "sgrq.3"   "sgrq.4"   "sgrq.5"  
##  [7] "sgrq.6"   "sgrq.7"   "sgrq.8"   "sgrq.11a" "sgrq.11b" "sgrq.11c"
## [13] "sgrq.11d" "sgrq.11e" "sgrq.11f" "sgrq.11g" "sgrq.15a" "sgrq.15b"
## [19] "sgrq.15c" "sgrq.15d" "sgrq.15e" "sgrq.15f" "sgrq.15g" "sgrq.15h"
## [25] "sgrq.15i" "sgrq.9"   "sgrq.10"  "sgrq.12a" "sgrq.12b" "sgrq.12c"
## [31] "sgrq.12d" "sgrq.12e" "sgrq.12f" "sgrq.13a" "sgrq.13b" "sgrq.13c"
## [37] "sgrq.13d" "sgrq.13e" "sgrq.13f" "sgrq.13g" "sgrq.13h" "sgrq.14a"
## [43] "sgrq.14b" "sgrq.14c" "sgrq.14d" "sgrq.16a" "sgrq.16b" "sgrq.16c"
## [49] "sgrq.16d" "sgrq.16e" "sgrq.17"
##   id sgrq.1 sgrq.2 sgrq.3 sgrq.4 sgrq.5
## 1  1      1      1      1      1      1
## 2  2      1     NA      2      4      3
## 3  3      5     NA      1      4      3
## 4  4      2      5     NA      2      2
## 5  5      4      2      5      3     NA
## 6  6      2      1      3      2      2


When applied to a data frame, the function returns a data frame containing the SGRQ score values and an id variable.

df <- scoring_sgrq(sgrq.full, id = 'id')
##   id sgrq.ts
## 1  1    90.6    56.4    45.0    56.0
## 2  2    67.1    57.3    65.3    63.2
## 3  3    54.8    56.1    49.6    52.4
## 4  4    62.7    56.5    47.2    52.6
## 5  5    52.1    30.7    64.7    52.3
## 6  6    67.7    45.1    61.4    57.5

If no id variable is specified, a data frame containing the score values only is returned.

df <- scoring_sgrq(
## sgrq.ts
## 1    90.6    56.4    45.0    56.0
## 2    74.1    57.3    65.3    64.4
## 3    57.4    63.6    57.3    59.2
## 4    65.1    62.7    54.7    58.9
## 5    56.1      NA    72.4    64.8
## 6    67.7    45.1    65.3    59.5

Difficulties in handling missing values

In the SGRQ scoring manual it says:

The Symptoms component will tolerate a maximum of 2 missed items. The weight for the missed item is subtracted from the total possible weight for the Symptoms component (662.5) and from the Total weight (3989.4).

Since item weights depend on the actual answers given, it remains unclear (at least for me) how to determine the weight of a missing item. The weight of the item “If you have a wheeze, is it worse in the morning?”, for example, is “0.0” vs. “62.0” depending on the answer “no” vs. “yes”. The algorithm implemented in scoring_sgrq() ascribes the missing item the highest weight possible (so 62.0 rather than 0.0). In order to be able to substract the weight of the missing item “from the total possible weight for the Symptoms component and from the Total weight”, it needs to be checked whether no more than 2 items are missing, and if so, which items are missing. Since this is very extensive to implement, I decided to program the algorithm the quick and dirty way.

First, I check whether no more than 2 items are missing:

  # return position of first item 
  a <- which(names(X)=="sgrq.1")   
  # return position of last item
  z <- which(names(X)=="sgrq.8")
  # calculate number of missing items
  Y$ <- rowSums([, c(a:z)]))

Second, I replace all missing values with the corresponding highest item weight:

  # replace missing values with highest weight
  for (i in a:z) {
    for (j in 1:nrow(X)){
      X[j, i] <- ifelse([j, i] == TRUE), repl.val[i-1], X[j, i])

Third, I calculate the score:

 # calculate score
  Y$ <- rowSums(X[, vars]) / 662.5 * 100

And finally, I replace the score value by NA if more than 2 items of the Symptom score are missing:

 Y$ <- ifelse(Y$ > 2, NA, Y$

Rather than substracting the weight of the missing item “form the total possible weight”, I “add” the highest possible item weight to the missing item, but only if no more than 2 items are missing.

I'm looking forward to getting some feedback to this post. I'm sure there is a better solution.


  1. Jones, P. W., F. H. Quirk, and C. M. Baveystock. 1991. The St George Respiratory Questionnaire. Respiratory Medicine 85 (September): 25-31. doi:10.1016/S0954-6111(06)80166-6.

  2. Jones, Paul W, Frances H Quirk, Chlo M Baveystock, and Peter Littlejohns. 1992. A Self-Complete Measure of Health Status for Chronic Airflow Limitation. Am Rev Respir Dis 145 (6): 1321-7.

Posted in Fav R Packages | Tagged , | Leave a comment

Scoring the Severe Respiratory Insufficiency Questionnaire (SRI) using R


The SRI is a multidimensional general health questionnaire “to assess HRQL in patients with chronic respiratory failure due to various underlying diseases” [1]. Based on 49 items, seven sub scales addressing the following domains are calculated:

plot of chunk sri
  • Respiratory Complaints (8 items);
  • Physical Function (6 items);
  • Attendant Symptoms and Sleep (7 items);
  • Social Relationships (6 items);
  • Anxiety (5 items);
  • Psychological Well-Being (9 items);
  • Social Functioning (8 items).

Based on the sub scales, a total summary scale is calculated. All scales have a score range between 0 and 100 with higher scores indicating a better quality of life [2]. All items are scored on a 5-point Likert scale ranging from 1 (completely untrue) to 5 (always true). The majority of items need to be recoded (recoded value = 6 – raw value).

Scoring the SRI

Based on the SRI Scoring Manual, I have written the R-package srir for calculating the SRI scores.


The package is hosted on GitHub and may be installed using the following code:


Functions and data

The core of srir is the function scoring_sri(). It must be applied to a data frame with the 49 SRI items and one id variable. Moreover, the package contains two data frames with simulated values. Unlike df.full, has some missing values.

##  [1] "id"     "sri.1"  "sri.2"  "sri.3"  "sri.4"  "sri.5"  "sri.6" 
##  [8] "sri.7"  "sri.8"  "sri.9"  "sri.10" "sri.11" "sri.12" "sri.13"
## [15] "sri.14" "sri.15" "sri.16" "sri.17" "sri.18" "sri.19" "sri.20"
## [22] "sri.21" "sri.22" "sri.23" "sri.24" "sri.25" "sri.26" "sri.27"
## [29] "sri.28" "sri.29" "sri.30" "sri.31" "sri.32" "sri.33" "sri.34"
## [36] "sri.35" "sri.36" "sri.37" "sri.38" "sri.39" "sri.40" "sri.41"
## [43] "sri.42" "sri.43" "sri.44" "sri.45" "sri.46" "sri.47" "sri.48"
## [50] "sri.49"
##   id sri.1 sri.2 sri.3 sri.4 sri.5
## 1  1     1     2     1     5     4
## 2  2     5     3     5     3     2
## 3  3     2     2     4    NA     5
## 4  4    NA     3     2     4     4
## 5  5     3     4     3     5     2
## 6  6     2     2     2     3     4


When applied to a data frame, the function returns a data frame containing the SRI score values and an id variable.

df <- scoring_sri(df.full, id = 'id')
##   id sri.rc sri.wb sri.sf
## 1  1   37.5   50.0   46.4   95.8     35   52.8   28.1   49.4
## 2  2   56.2   45.8   60.7   41.7     45   66.7   59.4   53.6
## 3  3   50.0   50.0   35.7   45.8     35   58.3   53.1   46.9
## 4  4   46.9   54.2   39.3   75.0     55   38.9   43.8   50.4
## 5  5   71.9   50.0   28.6   58.3     50   33.3   40.6   47.5
## 6  6   53.1   66.7   28.6   45.8     30   47.2   37.5   44.1

If no id variable is specified, a data frame containing the score values only is returned.

df <- scoring_sri(
##   sri.rc sri.wb sri.sf
## 1   37.5   50.0   46.4   95.8   35.0   52.8   28.1   49.4
## 2   56.2   45.8   70.8   25.0   31.2   66.7   67.9   52.0
## 3   50.0   50.0   33.3   50.0   31.2   53.1   46.4   44.9
## 4   46.4     NA   37.5   75.0   68.8   38.9   42.9     NA
## 5   65.0   50.0   12.5   55.0   50.0   33.3   40.6   43.8
## 6   60.7   70.0   25.0   37.5   33.3   40.6   37.5   43.5


  1. Struik, Fransien M., Huib A.M. Kerstjens, Gerrie Bladder, Roy Sprooten, Marianne Zijnen, Jerryll Asin, Thys van der Molen, and Peter J. Wijkstra. 2013. “The Severe Respiratory Insufficiency Questionnaire Scored Best in the Assessment of Health-Related Quality of Life in Chronic Obstructive Pulmonary Disease.” Journal of Clinical Epidemiology 66 (10): 1166–74.

  2. Windisch, Wolfram, Klaus Freidel, Bernd Schucher, Hansjrg Baumann, Matthias Wiebel, Heinrich Matthys, and Franz Petermann. 2003. “The Severe Respiratory Insufficiency (SRI) Questionnaire a Specific Measure of Health-Related Quality of Life in Patients Receiving Home Mechanical Ventilation.” Journal of Clinical Epidemiology 56 (8): 752–59.

Posted in Fav R Packages | Tagged , | Leave a comment

How to fix problems after updating to R 3.4.1 on Linux Mint 18.1 / Ubuntu 16.04

The Problem

After updating R to version 3.4.1 (“Single Candle”) on Linux Mint 18.1, RStudio could neither find my installed R packages nor was it possible to install R packages into some other directory.

The Solution

Fortunately, this problem could be solved very easily. I just needed to tell R where to find the libraries of version 3.4 again. On my 64 bit machine, I had to first open the Renviron file using xed (Linux Mint text editor).

sudo xed /usr/lib/R/etc/Renviron 

Second, I had to enter the following code line and then save and close the file.


That was it. 🙂

Posted in Tips & Tricks | Tagged , | Leave a comment

How to calculate Odds Ratios in Case-Control Studies using R


In June 2017 I've started working at the Clinical Trial Centre Leipzig at Leipzig University. Since my knowledge in statistics is rather poor, my employer offered me to attend some seminars in Medical Biometry at the University of Heidelberg. The first seminar I attended was called “Basics of Epidemiology”. At the first day, we learned how to calculate so called odds ratios in case-control studies using a simple pocket calculator.

In this blog post, I will show, how to calculate a simple odds ratio with 95% CI using R.

Data simulation

The data I'm using in this blog post were simulated using the wakefield package. The following code returns a data frame with 2 binary variables (Exposition and Disease) and 1.000 cases.


mydata <- data.frame(Exposition = group(n = 1000, x = c('yes', 'no'), 
                             prob = c(0.75, 0.25)),
                     Disease = group(n = 1000, x = c('yes', 'no'), 
                             prob = c(0.75, 0.25)))
## [1] 1000    2
##   Exposition Disease
## 1        yes     yes
## 2         no      no
## 3         no     yes
## 4        yes     yes
## 5        yes     yes
## 6        yes     yes

Based on this data frame, we calculate a table showing how many patients with exposition vs. no exposition developed a disease vs. no disease.

tab <-table(mydata$Exposition, mydata$Disease)
##       yes  no
##   yes 569 210
##   no  163  58

Odds Ratio Calculation

In order to get to know whether the risk for developing a disease is significantly higher in patients having a certain exposition, we need to calculate the odds ratio and its 95% CI.

The following function will return a data frame containing these values.

# return odds ratio with 95%ci
f <- function(x) {
  or <- round((x[1] * x[4]) / (x[2] * x[3]), 2)
  cil <- round(exp(log(or) - 1.96 * sqrt(1/x[1] + 1/x[2] + 1/x[3] + 1/x[4])), 2)
  ciu <- round(exp(log(or) + 1.96 * sqrt(1/x[1] + 1/x[2] + 1/x[3] + 1/x[4])), 2)
  df <- data.frame(matrix(ncol = 3, nrow = 1, 
                              dimnames = list(NULL, c('CI_95_lower', 'OR', 'CI_95_upper'))))
  df[1,] <- rbind(c(cil, or, ciu))
  df <-

Now, we can deploy the function on our table tab.

df.or <- f(tab)
knitr::kable(df.or, align = 'c')
CI_95_lower OR CI_95_upper
0.68 0.96 1.35

As the results indicate, patients with a disposition have no higher risk to develop a disease than patients having no disposition.

Posted in Indroduction | Tagged , | Leave a comment