How to Check if a Date is Within a List of Intervals in R

Intro

I'm currently involved in a research project called EFFECT. EFFECT is a multicentre, cluster-randomised, placebo-controlled cross-over trial evaluating antiseptic body wash of patients on intensive care units (ICU). The trial is to test whether daily antiseptic body wash reduces the risk of intensive care unit (ICU)-acquired primary bacteraemia and ICU-acquired multidrug-resistant organisms. EFFECT requires two types of data: (1) The patients' individual ward-movement history and
(2) microbiological test results (see Meissner 2017).

According to the study protocol, positive blood tests do count as infection unless there is a negative blood test within 48 hours after the positive blood test.

In this blog post, I show how to solve this problem on a computational level.

The Problem

The following code chunk provides an hypothetical example of the microbiological data I have to deal with. The data frame df.mibi contains 4 variables:

  • ID: Patient id (only 1 patient in this example);
  • ORGANISM: name of skin commensal organism found in some blood sample,
  • RESULT: laboratory test result (POS vs. NEG);
  • DATE: date of laboratory test
library(tidyverse)
library(lubridate)

df.mibi <- tibble(
  ID = paste0("ID_", rep(1, 11)),
  ORGANISM = c(rep('Propionibacterium acnes', 2), 
               rep('Staphylococcus epidermidis', 2),
               rep('Staphylococcus capitis', 2),
               rep('', 5)),
  RESULT = c(rep('POS', 6), rep('NEG', 5)),
  DATE = ymd(c(
    "2018-02-07", "2018-02-12", "2018-02-13", "2018-02-20",
    "2018-02-21", "2018-03-18", "2018-02-01", "2018-02-06",
    "2018-02-10", "2018-02-21", "2018-04-05")
  )
)

My Idea

In a first step, I separated df.mibi into two data frames:

  • df.POS: containing positive blood tests only
  • df.NEG: containing negative blood tests only
df.POS <- df.mibi %>%
  filter(RESULT == 'POS')
df.NEG <- df.mibi %>%
  filter(RESULT == 'NEG')

In a second step, I removed two variables from df.NEG (RESULT, ORGANISM), grouped the data frame by ID, and put all dates belonging to one ID into the list column data using the nest() function of the tidyr package

df.NEG <- df.NEG %>%
  select(ID, DATE) %>%
    group_by(ID) %>%
      nest()

This is how both data frames look like:

df.POS
## # A tibble: 6 x 4
##   ID    ORGANISM                   RESULT DATE      
##   <chr> <chr>                      <chr>  <date>    
## 1 ID_1  Propionibacterium acnes    POS    2018-02-07
## 2 ID_1  Propionibacterium acnes    POS    2018-02-12
## 3 ID_1  Staphylococcus epidermidis POS    2018-02-13
## 4 ID_1  Staphylococcus epidermidis POS    2018-02-20
## 5 ID_1  Staphylococcus capitis     POS    2018-02-21
## 6 ID_1  Staphylococcus capitis     POS    2018-03-18
df.NEG
## # A tibble: 1 x 2
##   ID    data            
##   <chr> <list>          
## 1 ID_1  <tibble [5 x 1]>

In a third step, I tried to check whether one of the negative test (stored in the list variable data) lies within the time interval positive test + 48 hours (TIME).
I did the mapping using the map2() function of the purrr package:

# merging and mapping
df.TOTAL <- df.POS %>%
  left_join(df.NEG, by = 'ID') %>%
    mutate(TIME = interval(DATE, DATE + days(2)),
           RESULT = map2(data, "DATE", TIME, ~ .x %within% .y)) 

Unfortunaltely, my code did not work. The RESULT variable should be logical and return TRUE in case of a negative test result up to 2 days after the positive test. Instead it is a list and returns NULL.

df.TOTAL
## # A tibble: 6 x 6
##   ID    ORGANISM   RESULT DATE       data   TIME                          
##   <chr> <chr>      <list> <date>     <list> <S4: Interval>                
## 1 ID_1  Propionib~ <NULL> 2018-02-07 <tibb~ 2018-02-07 UTC--2018-02-09 UTC
## 2 ID_1  Propionib~ <NULL> 2018-02-12 <tibb~ 2018-02-12 UTC--2018-02-14 UTC
## 3 ID_1  Staphyloc~ <NULL> 2018-02-13 <tibb~ 2018-02-13 UTC--2018-02-15 UTC
## 4 ID_1  Staphyloc~ <NULL> 2018-02-20 <tibb~ 2018-02-20 UTC--2018-02-22 UTC
## 5 ID_1  Staphyloc~ <NULL> 2018-02-21 <tibb~ 2018-02-21 UTC--2018-02-23 UTC
## 6 ID_1  Staphyloc~ <NULL> 2018-03-18 <tibb~ 2018-03-18 UTC--2018-03-20 UTC

The Solution

Not even one hour after I posted my question to StackOverflow, a user who calles himself “utubun” found the following solution:

df.TOTAL <- df.POS %>%
  left_join(df.NEG, by = 'ID') %>%
    mutate(TIME = interval(DATE, DATE + days(2)),
           RESULT = map2_lgl(data, TIME, ~ any(.x$DATE %within% .y)))
df.TOTAL
## # A tibble: 6 x 6
##   ID    ORGANISM   RESULT DATE       data   TIME                          
##   <chr> <chr>      <lgl>  <date>     <list> <S4: Interval>                
## 1 ID_1  Propionib~ FALSE  2018-02-07 <tibb~ 2018-02-07 UTC--2018-02-09 UTC
## 2 ID_1  Propionib~ FALSE  2018-02-12 <tibb~ 2018-02-12 UTC--2018-02-14 UTC
## 3 ID_1  Staphyloc~ FALSE  2018-02-13 <tibb~ 2018-02-13 UTC--2018-02-15 UTC
## 4 ID_1  Staphyloc~ TRUE   2018-02-20 <tibb~ 2018-02-20 UTC--2018-02-22 UTC
## 5 ID_1  Staphyloc~ TRUE   2018-02-21 <tibb~ 2018-02-21 UTC--2018-02-23 UTC
## 6 ID_1  Staphyloc~ FALSE  2018-03-18 <tibb~ 2018-03-18 UTC--2018-03-20 UTC

It works!!! Thank you very much! 🙂

Advertisements
Posted in Tips & Tricks | Tagged , , | Leave a comment

Drawing a Fish Curve using R and ggplot2

Intro

Recently, I wondered whether there is a way to draw a fish shape using a mathematical function. Since I did not find a ready-made R function, I tried to write the function by myself. The equations, I've used for writing this function can be found on WolframMathWorld.

The function

The fish_curve() function requires the ggplot2 and the dplyr package. It creates a data frame with two variables (x and y) and 10.000 observations. Finally, the data points are plotted using ggplot2.

fish_curve <- function(colour='black', size = 5){
  library(ggplot2)
  library(dplyr)
  data.frame(
    x = cos(1:10000) - sin(1:10000)^2 / sqrt(2),
    y = cos(1:10000) * sin(1:10000)
  ) %>%
    ggplot(., aes(x, y)) +
    geom_point(colour = colour, size = size) +
    theme_void()
}

Function call with default parameters

With colour and size the fish_curve() function allows the user to specify two parameters; that is colour and size of the plotted points. The default values are black for colour and 5 for size.

(p1 <- fish_curve())

plot of chunk fish-1

Customization

In the following example, we customize colour and size of the fish shape:

(p2 <- fish_curve(colour = 'blue', size = 1))

plot of chunk fish-2

And finally, we place the two plots side by side using the patchwork package:

library(patchwork)
p1 + p2

plot of chunk unnamed-chunk-1

Posted in Tips & Tricks, Visualizing Data | Tagged , | Leave a comment

How to order factors by level frequency and level name

Intro

Quite frequently, factor variables are ordered by level frequency. However, factor levels having only a few observations are sometimes collapsed into one level usually named “others”. Since this level is usually not of particular interest, it may be a good idea to put this level in the last position of the plot rather than ordering it by level frequency. In this blog post, I’m going to show how to order a factor variable by level frequency and level name.

To replicate the R code I’m going to use in this post, four R packages must be loaded:

library(dplyr) # for data manipulation
library(ggplot2) # for plotting data
library(gghighlight) # ggplot2 extension for highlighting values

The dataset I’m going to use in this post (mtcars) is part of the datasets package.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

In the first code chunk, we:

  • extract the first word of each car name and write it into a new variable called “brand”,
  • rename all car brands starting with “M” (Mazda, Merc, Maserati) to “Others” and
  • calculate the median miles per gallon (mpg) for each car brand.
df.mtcars %
  mutate(name = str_extract(rownames(.), "^\\w+\\b"),
         brand = str_replace(name, "^M\\w+", 'Others')) %>%
  group_by(brand) %>%
  summarize(mpg = median(mpg))
df.mtcars$brand
##  [1] "AMC"      "Cadillac" "Camaro"   "Chrysler" "Datsun"   "Dodge"   
##  [7] "Duster"   "Ferrari"  "Fiat"     "Ford"     "Honda"    "Hornet"  
## [13] "Lincoln"  "Lotus"    "Others"   "Pontiac"  "Porsche"  "Toyota"  
## [19] "Valiant"  "Volvo"

The following code chunk is to reorder the brand variable by level frequency using the reorder() function.

df.mtcars %
  mutate(brand = as.factor(brand),
         brand = reorder(brand, mpg))
levels(df.mtcars$brand)
##  [1] "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler" "AMC"     
##  [7] "Dodge"    "Ford"     "Valiant"  "Others"   "Pontiac"  "Ferrari" 
## [13] "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"    
## [19] "Honda"    "Lotus"

As we can see, the bar representing the “Others” level is roughly in the middle of the plot.

ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
  coord_flip() +
  geom_col(width = 0.5) +
  gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
  scale_fill_manual(values = c("grey")) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x = NULL, 
       y = 'Miles per Gallon',
       title = "Factor variable ordered by level frequency")

plot of chunk unnamed-chunk-4

To put the bar representing the “Others” level at the bottom of the plot, we have to set “Others” as reference category using the relevel() function.

df.mtcars %
  mutate(brand = relevel(brand, ref = "Others"))
levels(df.mtcars$brand)
##  [1] "Others"   "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler"
##  [7] "AMC"      "Dodge"    "Ford"     "Valiant"  "Pontiac"  "Ferrari" 
## [13] "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"    
## [19] "Honda"    "Lotus"

Finally, the bar representing the “Others” level appears at the desired position.

ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
  coord_flip() +
  geom_col(width = 0.5) +
  gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
  scale_fill_manual(values = c("grey")) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x = NULL, 
       y = 'Miles per Gallon',
       title = "Factor variable ordered by level frequency and level name")

plot of chunk unnamed-chunk-6

PS: In both plots, the gghighlight() function of the gghighlight package was used to highlight the desired factor level.

Posted in Tips & Tricks, Visualizing Data | Tagged | 2 Comments

Postleitzahlen mit fĂĽhrender Null richtig formatieren

Intro

Importiert man Postleitzahlen aus anderen Datenformaten (z.B. Excel, Access) in R, ist es nicht selten, dass Postleitzahlen programmintern tatsächlich auch als Zahlen abgespeichert werden. Wie ein Blick auf einen im Internet frei verfügbaren Datensatz zeigt, kann dies zu folgendem Problem führen:

library(dplyr)
library(readxl)
mydata <- readxl::read_xlsx("Liste-der-PLZ-in-Excel-Karte-Deutschland-Postleitzahlen.xlsx")
head(mydata)
## # A tibble: 6 x 4
##     PLZ Bundesland Kreis   Typ  
##   <dbl> <chr>      <chr>   <chr>
## 1  1067 Sachsen    Dresden Stadt
## 2  1069 Sachsen    Dresden Stadt
## 3  1097 Sachsen    Dresden Stadt
## 4  1099 Sachsen    Dresden Stadt
## 5  1108 Sachsen    Dresden Stadt
## 6  1109 Sachsen    Dresden Stadt

Die Postleitzahlen der Städte und Gemeinden in den Bundesländern Sachsen, Sachsen-Anhalt und Thüringen werden der führenden Null beraubt und als vierstellige Zahlen dargestellt.

Um den entsprechenden PLZ die fĂĽhrende Null zurĂĽckzugeben, habe ich die Funktion plz_repair() geschrieben.

plz_repair <- function(x){
  x = ifelse(nchar(x) == 4, paste0('0', x), as.character(x))
}

Die Funktion prüft zunächst, ob die PLZ vierstellig ist. Wenn diese Bedingung erfüllt ist, wird die PLZ um eine führende Null erweitert, sodass eine fünfstellige PLZ entsteht. Bereits fünfstellige PLZ bleiben unverändert. Die reparierte PLZ-Variable wird als character string abgespeichert.

mydata <- mydata %>%
  mutate(PLZ = plz_repair(PLZ))
head(mydata)
## # A tibble: 6 x 4
##   PLZ   Bundesland Kreis   Typ  
##   <chr> <chr>      <chr>   <chr>
## 1 01067 Sachsen    Dresden Stadt
## 2 01069 Sachsen    Dresden Stadt
## 3 01097 Sachsen    Dresden Stadt
## 4 01099 Sachsen    Dresden Stadt
## 5 01108 Sachsen    Dresden Stadt
## 6 01109 Sachsen    Dresden Stadt
tail(mydata)
## # A tibble: 6 x 4
##   PLZ   Bundesland Kreis                 Typ  
##   <chr> <chr>      <chr>                 <chr>
## 1 99986 ThĂĽringen  Unstrut-Hainich-Kreis Kreis
## 2 99988 ThĂĽringen  Unstrut-Hainich-Kreis Kreis
## 3 99991 ThĂĽringen  Unstrut-Hainich-Kreis Kreis
## 4 99994 ThĂĽringen  Unstrut-Hainich-Kreis Kreis
## 5 99996 ThĂĽringen  Unstrut-Hainich-Kreis Kreis
## 6 99998 ThĂĽringen  Unstrut-Hainich-Kreis Kreis
Posted in Tips & Tricks | Tagged , | 2 Comments

Scoring the PHQ-9 Questionnaire Using R

Intro

The PHQ-9 is the nine-item depression module of the Patient Health Questionnaire. Each of the items is scored on a 4-point Likert scale ranging from 0 (not at all) to 3 (nearly every day). The items are summed to obtain a total score ranging from 0 to 27 with higher scores indicating greater severity of depression. Based on the total score, different levels of severity can be evaluated with 0–4, 5–9, 10–14, 15–19 and 20–27 points indicating “minimal”, “mild”, “moderate”, “moderately severe” and “severe” depression.

The PHQ-9 questionnaire may be found under the following link.

In this blog post, I show how to calculate the PHQ-9 score and the PHQ-9 severety levels.

Packages and data

The dataset we are going to use was published in Plos One. The file has got a Digital Object Identifier (doi) and may be imported into R using the read_delim() function of the readr package.

library(readr)
library(dplyr)
library(ggplot2)

df.phq9 <- readr::read_delim("https://doi.org/10.1371/journal.pone.0156167.s001", 
                             delim = ";", 
                             escape_double = FALSE, 
                             trim_ws = TRUE) %>%
            select(starts_with('phq9'))

glimpse(df.phq9)
## Observations: 1,337
## Variables: 9
## $ phq9_1 <int> 1, 3, 2, 0, 0, 0, 1, 0, 0, 2, 1, 1, 0, 3, 0, 0, 0, 2, 0...
## $ phq9_2 <int> 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ phq9_3 <int> 3, 2, 2, 2, 1, 0, 1, 3, 1, 0, 1, 1, 0, 3, 1, 0, 0, 0, 0...
## $ phq9_4 <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 1, 3, 0, 1, 0, 0, 0, 1, 0...
## $ phq9_5 <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0...
## $ phq9_6 <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ phq9_7 <int> 0, 1, 1, 1, 0, 1, 0, 0, 0, 3, 1, 1, 0, 1, 0, 0, 0, 0, 0...
## $ phq9_8 <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0...
## $ phq9_9 <int> 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

The Scoring Function

The scoring_phq9 function requires a data frame containing the PHQ-9 items (data) and a vector containing the items' names (items.phq9) as input parameters.

scoring_phq9 <- function(data, items.phq9) {
  data %>%
    mutate(nvalid.phq9 = rowSums(!is.na(select(., items.phq9))),
           nvalid.phq9 = as.integer(nvalid.phq9),
           mean.temp = rowSums(select(., items.phq9), na.rm = TRUE)/nvalid.phq9,
           phq.01.temp = as.integer(unlist(data[items.phq9[1]])),
           phq.02.temp = as.integer(unlist(data[items.phq9[2]])),
           phq.03.temp = as.integer(unlist(data[items.phq9[3]])),
           phq.04.temp = as.integer(unlist(data[items.phq9[4]])),
           phq.05.temp = as.integer(unlist(data[items.phq9[5]])),
           phq.06.temp = as.integer(unlist(data[items.phq9[6]])),
           phq.07.temp = as.integer(unlist(data[items.phq9[7]])),
           phq.08.temp = as.integer(unlist(data[items.phq9[8]])),
           phq.09.temp = as.integer(unlist(data[items.phq9[9]]))) %>%
    mutate_at(vars(phq.01.temp:phq.09.temp),
              funs(ifelse(is.na(.), round(mean.temp), .))) %>%
    mutate(score.temp = rowSums(select(., phq.01.temp:phq.09.temp), na.rm = TRUE),
           score.phq9 = ifelse(nvalid.phq9 >= 7, as.integer(round(score.temp)), NA),
           cutoff.phq9 = case_when(
             score.phq9 >= 20 ~ 'severe',
             score.phq9 >= 15 ~ 'moderately severe',
             score.phq9 >= 10 ~ 'moderate',
             score.phq9 >= 5 ~ 'mild',
             score.phq9 < 5 ~ 'minimal'),
             cutoff.phq9 = factor(cutoff.phq9, levels = c('minimal', 'mild',
                                                          'moderate', 'moderately severe',
                                                          'severe'))) %>%
    select(-ends_with("temp"))

}

Example

The function adds three variables to the original data frame:

  • nvalid.phq9: Number of variables with valid values,
  • score.phq9: PHQ-9 score (0 – 27),
  • cutoff.phq9: PHQ-9 severety levels (minimal, mild, moderate, moderately severe, severe)
items.phq9 <- paste0('phq9_', seq(1, 9, 1))
df.phq9 <- df.phq9 %>%
  scoring_phq9(., items.phq9)
glimpse(df.phq9)
## Observations: 1,337
## Variables: 12
## $ phq9_1      <int> 1, 3, 2, 0, 0, 0, 1, 0, 0, 2, 1, 1, 0, 3, 0, 0, 0,...
## $ phq9_2      <int> 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ phq9_3      <int> 3, 2, 2, 2, 1, 0, 1, 3, 1, 0, 1, 1, 0, 3, 1, 0, 0,...
## $ phq9_4      <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 1, 3, 0, 1, 0, 0, 0,...
## $ phq9_5      <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 0,...
## $ phq9_6      <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ phq9_7      <int> 0, 1, 1, 1, 0, 1, 0, 0, 0, 3, 1, 1, 0, 1, 0, 0, 0,...
## $ phq9_8      <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,...
## $ phq9_9      <int> 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nvalid.phq9 <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,...
## $ score.phq9  <int> 7, 10, 7, 9, 3, 2, 3, 4, 5, 7, 7, 8, 0, 11, 1, 0, ...
## $ cutoff.phq9 <fct> mild, moderate, mild, mild, minimal, minimal, mini...

Visualization

PHQ-9 Score

ggplot(df.phq9, aes(score.phq9)) +
  geom_density(fill = 'blue', alpha = 0.2) +
  scale_x_continuous(limits = c(0, 27), breaks = c(0,5,10,15,20,27)) +
  labs(x = 'PHQ-9 Score', y = 'Density') +
  theme_bw()

plot of chunk unnamed-chunk-4

PHQ-9 Severety Levels

ggplot(df.phq9, aes(x = cutoff.phq9, fill = cutoff.phq9)) +
  geom_bar(colour = 'black') +
  scale_fill_brewer(type = 'seq') +
  labs(x = NULL, y = NULL, fill = NULL) +
  theme_bw()

plot of chunk unnamed-chunk-5

Posted in Indroduction | Tagged , | Leave a comment

Sample Size Calculation Using R

Preample

“Sample Size Calculation Using R” – the title of this blog post – sounds very comprehensive. However, this post only deals with the sample size collection for one particular test (two sample t-test). A collection of functions for basic sample size calculation can be found in the pwr package.

Introduction

A couple of months ago, I attended some statistics course about sample size calculation and sample size adjustment. Amongst others, we spoke about a randomized placebo-controlled trial with patients suffering of depression. The trial's primary end point was the difference between placebo and treatment group in the HAM-D total score (some depression score) between baseline and the end of therapy. During the course, the professor asked us to write a “two-liner” returning the sample size per group for a two-sided, two sample t-test with the following parameters:

  • alpha = 0.05,
  • sigma = 8 ,
  • delta = 4 and
  • power ≤ 0.90

Here is my solution:

In a first step, I wrote a loop returning the power rising from 0.1 to 0.9 and the corresponding sample sizes respectively. The loop may either be a repeat or a while loop.

While Loop

A while loop runs and repeats while a specified condition returns TRUE. The loop terminates only when the condition evaluates to FALSE. In our case, the loop stops evaluating as soon as p > 0.9.

p <- 0.05
while(p <= 0.9){
  p <- p + 0.05
  t <- power.t.test(sd = 8, delta = 4, sig.level=0.05, power = p, type="two.sample", alternative = "two.sided")
  print(paste(t$power, ceiling(t$n), sep = ' - '))
}

Repeat Loop

In a repeat loop it is required to define a so-called break declaration to stop repeating the code. Just like the while loop, our repeat loop stops evaluating as soon as p > 0.9.

Both, the while and the repeat loop return the same result.

p <- 0.05
repeat{
  p <- p + 0.05
  t <- power.t.test(sd = 8, delta = 4, sig.level=0.05, power = p, type="two.sample", alternative = "two.sided")
  print(paste(t$power, ceiling(t$n), sep = ' - '))
  if(p > 0.90){
    break
  }
}
## [1] "0.1 - 5"
## [1] "0.15 - 8"
## [1] "0.2 - 12"
## [1] "0.25 - 15"
## [1] "0.3 - 18"
## [1] "0.35 - 21"
## [1] "0.4 - 25"
## [1] "0.45 - 28"
## [1] "0.5 - 32"
## [1] "0.55 - 36"
## [1] "0.6 - 41"
## [1] "0.65 - 45"
## [1] "0.7 - 51"
## [1] "0.75 - 57"
## [1] "0.8 - 64"
## [1] "0.85 - 73"
## [1] "0.9 - 86"

For a very good introduction to loops, see Davies (2016): The Book of R. No Starch Press: San Francisco.

Storing the Results

In case we want to use the results for whatever purpose, we need to store them in a vector. The following example heavily borrows from StackOverflow. The loop returns a data frame containing two variables: pwr (power) and num (sample size).

p <- 0.05
pwr <- c()
num <- c()
while(p <= 0.9){
  p <- p + 0.05
  t <- power.t.test(sd = 8, delta = 4, sig.level=0.05, power = p, type="two.sample", alternative = "two.sided")
  pwr <- c(pwr, t$power)
  num <- c(num, ceiling(t$n))
  df <- data.frame(pwr = pwr, n = num)
}
rm(list = c(setdiff(ls(), c("df"))))
dplyr::glimpse(df)
## Observations: 17
## Variables: 2
## $ pwr <dbl> 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55...
## $ n   <dbl> 5, 8, 12, 15, 18, 21, 25, 28, 32, 36, 41, 45, 51, 57, 64, ...

Plotting Power and Sample Size

Finally, we plot the relation between power and sample size using the ggplot2 package. The theme I use, stems from the ggpubr package.

ggplot(df, aes(pwr, n)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.1)) +
  scale_y_continuous(breaks = seq(0, max(df$n)+10, 10)) +
  geom_text(aes(label = n), nudge_y = 4) +
  ggpubr::theme_pubr() +
  labs(x = 'Power', y = 'Sample Size', 
       title = 'Power and Sample Size', 
        subtitle = expression('Two-sided, two sample t-test with' ~ sigma ~ '= 8,' ~ delta ~ '= 4,' ~ alpha ~ '= 0.05'))

plot of chunk unnamed-chunk-4

As we see, a power of 0.90 requires a sample size of 86 patients per treatment group for a two-sided, two sample t-test with sd = 8, delta = 4, and alpha = 0.05.

Posted in Indroduction | Tagged , , , | Leave a comment

How to Plot Venn Diagrams Using R, ggplot2 and ggforce

Intro

Venn diagrams – named after the English logician and philosopher John Venn – “illustrate the logical relationships between two or more sets of items” with overlapping circles.

In this tutorial, I'll show how to plot a three set venn diagram using R and the ggplot2 package.

Packages and Data

For the R code to run, we need to install and load three R packages. Unlike tidyverse and ggforce, the limma package must be installed from Bioconductor rather than from CRAN.

Moreover, we create a random data frame using the rbinom() function.

source("http://www.bioconductor.org/biocLite.R")
biocLite("limma")
library(limma)
library(tidyverse)
library(ggforce)
set.seed((123))
mydata <- data.frame(A = rbinom(100, 1, 0.8),
                     B = rbinom(100, 1, 0.7),
                     C = rbinom(100, 1, 0.6)) %>%
                       mutate_all(., as.logical)

Drawing the Circles

Next, we create a data frame defining the x and y coordinates for the three circles we want to draw and a variable defining the labels. For plotting the circles – the basic structure of our venn diagram – we need the geom_circle() function of the ggforce package. We employ the geom_circle()-function of the ggforce package to actually draw the circles. With the parameter r (= 1.5), we define the radius of the circles.

df.venn <- data.frame(x = c(0, 0.866, -0.866),
                      y = c(1, -0.5, -0.5),
                      labels = c('A', 'B', 'C'))
ggplot(df.venn, aes(x0 = x, y0 = y, r = 1.5, fill = labels)) +
    geom_circle(alpha = .3, size = 1, colour = 'grey') +
      coord_fixed() +
        theme_void()

plot of chunk unnamed-chunk-2

Furthermore, we need a data frame with the values we want the plot and the coordinates for plotting these values. The values can be obtained using the vennCounts() function of the limma package. Since ggplot2 requires data frames we need to first transform the vdc object (class VennCounts) into a matrix and then into a data frame. In addition, we need to add the x and y coordinates for plotting the values.

vdc <- vennCounts(mydata)
class(vdc) <- 'matrix'
df.vdc <- as.data.frame(vdc)[-1,] %>%
  mutate(x = c(0, 1.2, 0.8, -1.2, -0.8, 0, 0),
         y = c(1.2, -0.6, 0.5, -0.6, 0.5, -1, 0))

The final Plot

Finally, we'll customize the look of our venn diagram and plot the values.

ggplot(df.venn) +
  geom_circle(aes(x0 = x, y0 = y, r = 1.5, fill = labels), alpha = .3, size = 1, colour = 'grey') +
  coord_fixed() +
  theme_void() +
  theme(legend.position = 'bottom') +
  scale_fill_manual(values = c('cornflowerblue', 'firebrick',  'gold')) +
  scale_colour_manual(values = c('cornflowerblue', 'firebrick', 'gold'), guide = FALSE) +
  labs(fill = NULL) +
  annotate("text", x = df.vdc$x, y = df.vdc$y, label = df.vdc$Counts, size = 5)

plot of chunk unnamed-chunk-4

Posted in Visualizing Data | Tagged , | Leave a comment