Scoring Questionnaires: QSCORER version 0.0.10 has been released

A new release of the qscorer package is now available on GitHub.

Scoring procedures

The package (version 0.0.10) provides procedures for scoring the following health-related questionnaires:

  • Atrial Fibrillation Effect on QualiTy-of-Life Questionnaire (AFEQT)
  • Beck Depression Inventory (BDI, BDI-II)
  • Behavior Rating Inventory of Executive Function, adult version (BRIEF-A)
  • Dutch Eating Behavior Questionnaire, German version (DEBQ)
  • Eating Disorder Examination Questionnaire, short form (EDE-Q8)
  • EORTC QLQ-C30 Quality of Life Questionnaire
  • Epworth Sleepiness Scale (ESS)
  • European Quality of Life Five Dimension Three Level Scale Questionnaire (EQ-5D-3L)
  • General Self-Efficacy Scale (GSES)
  • Hospital Anxiety and Depression Scale (HADS)
  • Impact of Weight on Quality of Life-Lite Questionnaire (IWQOL-Lite)
  • International Index of Erectile Function, short form (IIEF-5)
  • International Physical Activity Questionnaire, short form (IPAQ)
  • Patient Health Questionnaire-9 (PHQ-9)
  • Patient Health Questionnaire-15 (PHQ-15)
  • Rosenberg Self-Esteem Scale (RSES)
  • Severe Respiratory Insufficiency Questionnaire (SRI)
  • Skala zur Selbstregulation (REG) (German)
  • Social Support Questionnaire, short form (F-SozU-K-7) (German)
  • Weight Bias Internalization Scale (WBIS)
  • Weight Self-Stigma Questionnaire (WSSQ)
  • Yale Food Addiction Scale, Version 2.0 (YFAS V2.0)

The scoring functions usually have the following arguments:

  • data: A data frame containing the items of the questionnaire in a pre-specified order. The data frame may contain further variables.
  • items: A character vector with the item names in a pre-specified order, or a numeric vector indicating the column numbers of the items in data.
  • keep: Logical, whether to keep the single items and whether to return variables containing the number of non-missing items on each scale for each respondent. The default is TRUE.
  • nvalid: A numeric value indicating the number of non-missing items required for score calculations. A default is pre-specified for each questionnaire.
  • digits: Integer of length one: value to round to. No rounding by default.

If there are items that need to be scored reversely:

  • reverse: Items to be scored reversely. These items can be specified either by name or by index. Default is pre-specified.
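To illustrate how these arguments typically interact, here is a minimal base R sketch. The function score_scale(), its min_val and max_val arguments and the 0 to 3 item range are hypothetical and not part of qscorer; the actual scoring functions may handle these steps differently.

```r
# Hypothetical helper, not part of qscorer: prorated sum score for one scale.
# 'reverse' refers to columns of the item subset (by name or position).
score_scale <- function(data, items, reverse = NULL, nvalid = length(items),
                        min_val = 0, max_val = 3, digits = NULL) {
  x <- data[, items, drop = FALSE]
  # Reverse-score selected items: new value = min + max - old value
  if (!is.null(reverse)) {
    x[, reverse] <- min_val + max_val - x[, reverse]
  }
  # Number of non-missing items per respondent
  n_valid <- rowSums(!is.na(x))
  # Prorated sum: mean of the answered items times the number of items
  score <- rowMeans(x, na.rm = TRUE) * length(items)
  # Respondents with too few valid answers get no score
  score[n_valid < nvalid] <- NA
  if (!is.null(digits)) score <- round(score, digits)
  data.frame(score = score, nvalid = n_valid)
}

df <- data.frame(i1 = c(0, 3, NA), i2 = c(1, 3, NA), i3 = c(3, 0, 2))
res <- score_scale(df, items = c("i1", "i2", "i3"), reverse = "i3", nvalid = 2)
res
```

The third respondent has only one valid answer and therefore receives NA, because nvalid = 2 was requested.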

Data

Furthermore, the package contains real-life data of the following questionnaires:

  • Beck Depression Inventory II (df.test.bdi)
  • Dutch Eating Behavior Questionnaire, German version (df.test.debq)
  • EORTC QLQ-C30 Quality of Life Questionnaire (df.test.eortcc30)
  • European Quality of Life Five Dimension Three Level Scale Questionnaire (df.test.eq5d3l)
  • Hospital Anxiety and Depression Scale (df.test.hads)
  • Patient Health Questionnaire-9 (df.test.phq9)

These real-life data are publicly available; sources can be found in the package documentation. The following output shows the df.test.phq9 data set (data of the PHQ-9 (a.k.a. PHQ-D) questionnaire).

dplyr::glimpse(qscorer::df.test.phq9)
## Observations: 1,337
## Variables: 11
## $ age    <dbl> 79, 62, 71, 65, 63, 68, 52, 88, 71, 77, 69, 55, 71, 65, 69, 65, 71, 53, 86, 58, 69, 70, 69, 69, 74, 74, 60…
## $ gender <fct> f, f, m, f, f, f, m, m, f, f, f, m, f, f, f, f, f, m, f, f, f, m, f, m, m, m, f, f, m, f, m, m, f, f, f, f…
## $ phq9_1 <dbl> 1, 3, 2, 0, 0, 0, 1, 0, 0, 2, 1, 1, 0, 3, 0, 0, 0, 2, 0, 0, 2, 3, 0, 0, 0, 3, 1, 0, 1, 1, 1, 1, 0, 1, 1, 2…
## $ phq9_2 <dbl> 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 1, 0, 1…
## $ phq9_3 <dbl> 3, 2, 2, 2, 1, 0, 1, 3, 1, 0, 1, 1, 0, 3, 1, 0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 3, 0, 0, 1, 1, 1, 1, 3, 3, 1, 1…
## $ phq9_4 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 1, 3, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 3, 0, 0, 1, 1, 1, 2, 2, 2, 1, 2…
## $ phq9_5 <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 1, 3, 0, 1, 0, 2, 0, 0, 1, 0, 0, 1, 3, 1, 1, 0…
## $ phq9_6 <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0…
## $ phq9_7 <dbl> 0, 1, 1, 1, 0, 1, 0, 0, 0, 3, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 3, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1…
## $ phq9_8 <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1…
## $ phq9_9 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Questionnaires without real-life data may be simulated using the simulate_items() function. The following code simulates data for the PHQ-15 questionnaire:

df.phq15 <- qscorer::simulate_items(num_cols = 15, item_name = 'PHQ15', item_range = 0:2)

Two more arguments have default values and don't need to be explicitly specified:

  • num_rows (Number of rows; default = 100)
  • prop_mis (Proportion of missing values to be put into each item; default = 0.05)

The function returns a data frame including the items of the questionnaire and an id variable.

dplyr::glimpse(df.phq15)
## Observations: 100
## Variables: 16
## $ id       <chr> "001", "002", "003", "004", "005", "006", "007", "008", "009", "010", "011", "012", "013", "014", "015",…
## $ PHQ15_1  <int> 2, 2, 2, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1, 2, 0, 2, 2, 0, 0, 0, 0, 2, NA, 2, 1, NA, 1, 2, 1, 0, 2, 2, 0, …
## $ PHQ15_2  <int> 2, 1, 1, 2, 0, 0, 2, NA, 1, 1, 1, NA, 1, 2, 1, 0, 1, 1, 1, 2, 1, 2, 1, 2, 0, 2, 2, 2, 0, 2, 2, 1, 1, 1, …
## $ PHQ15_3  <int> 2, 1, 2, 1, NA, 0, 0, 0, 0, 2, 1, 0, 1, 0, 2, 2, 0, 1, 2, 0, 2, 1, 0, 0, 0, 0, 0, 2, 1, 1, 1, 2, 0, 1, 1…
## $ PHQ15_4  <int> 1, 0, 1, 1, 1, 1, 2, 2, NA, 2, 0, 2, 2, 2, 1, 2, 2, 2, 0, NA, 2, 1, 2, 1, 0, NA, 1, 1, NA, 0, 0, 2, 2, 0…
## $ PHQ15_5  <int> 0, 0, 2, 0, 1, 1, 2, 1, 0, 2, 0, 0, 2, 1, 0, 0, 1, 1, 1, 1, 0, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2,…
## $ PHQ15_6  <int> 1, 1, 2, 1, 2, 2, 1, 2, 2, 0, NA, 0, 1, 0, 2, 2, 1, 1, 0, 1, 2, 2, 2, 1, 0, 1, 2, 1, 2, 2, 0, 1, 1, 1, 1…
## $ PHQ15_7  <int> 1, 2, 0, 0, 2, 0, 1, 1, 2, 0, 2, 0, 2, NA, 0, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 0, 0, 2, 0, 1, 0, 1, 1, 0, 0…
## $ PHQ15_8  <int> 2, 1, 1, 1, 2, 2, 1, 0, 0, NA, 1, 0, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 0, 2, 0, 2, 1, 0, 0, 0, 2, 0, 1…
## $ PHQ15_9  <int> 0, 2, 2, 1, 1, 1, 0, 2, 0, 1, 2, 0, 0, 1, 2, 0, 1, 0, 0, 2, 2, NA, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 0…
## $ PHQ15_10 <int> 0, 2, 2, 0, 2, 0, 0, 1, 0, 0, NA, NA, 0, 0, 2, 2, 0, NA, 2, 1, 1, 2, 2, 1, 2, 2, 0, 1, 1, 1, 1, 0, 0, 1,…
## $ PHQ15_11 <int> 1, 2, NA, 0, 2, 0, 2, 2, 1, 2, 2, 1, 2, 0, 1, 1, 0, 0, 2, 2, 2, 1, 2, 2, 1, 2, 0, 2, 2, 2, 1, 1, 0, 0, 2…
## $ PHQ15_12 <int> 2, 1, 2, 2, 0, 0, 0, 0, 1, 1, 2, 2, 1, 0, 0, 2, 1, 0, 0, 1, 2, 0, 1, 0, 1, 2, 1, 0, 2, NA, 0, 1, 0, 0, 0…
## $ PHQ15_13 <int> 2, 0, 0, 1, 1, 0, 0, 2, 2, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 2, 1, 0, 1, 2, 0, 1, 2, NA, 1, 2, 0, 1, 0, 2…
## $ PHQ15_14 <int> 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 2, 1, 0, 0, 0, 1, 0, 0, 2, 1, 2, 2, 2, 1, 0, 2, 1, 0, 2, 1, 2, 2, 2,…
## $ PHQ15_15 <int> NA, 1, 0, 1, 2, 0, 1, 1, 2, 0, 2, 0, 0, 1, 2, 2, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, NA, 0, 2, NA, 0, NA…
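For readers who want to see what such a simulation involves, the following base R sketch produces a data frame of the same shape. It is a hypothetical re-implementation (simulate_items_sketch() is not part of qscorer) and assumes that items are drawn independently and uniformly and that exactly prop_mis of the values in each item are set to missing.

```r
# Hypothetical re-implementation for illustration; not the code used in qscorer.
simulate_items_sketch <- function(num_rows = 100, num_cols = 15,
                                  item_name = "PHQ15", item_range = 0:2,
                                  prop_mis = 0.05) {
  # Draw every item independently and uniformly from item_range
  # (item_range should contain at least two values)
  items <- replicate(num_cols,
                     sample(item_range, num_rows, replace = TRUE),
                     simplify = FALSE)
  # Set a fixed proportion of the values in each item to NA
  items <- lapply(items, function(x) {
    x[sample(num_rows, round(prop_mis * num_rows))] <- NA
    x
  })
  names(items) <- paste0(item_name, "_", seq_len(num_cols))
  # Zero-padded character id, as in the output above
  id <- formatC(seq_len(num_rows), width = 3, flag = "0")
  cbind(data.frame(id = id, stringsAsFactors = FALSE), as.data.frame(items))
}

set.seed(1)
sim <- simulate_items_sketch()
dim(sim)
```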

Now, the corresponding scoring function (scoring_phq15()) can be applied to this data:

library(dplyr)
df.phq15 <- df.phq15 %>% 
  qscorer::scoring_phq15(.,items = 2:16, keep = FALSE)

The function returns a data frame including the PHQ-15 score and the corresponding cut-off values (severe, moderate, mild, …):

dplyr::glimpse(df.phq15)
## Observations: 100
## Variables: 3
## $ id           <chr> "001", "002", "003", "004", "005", "006", "007", "008", "009", "010", "011", "012", "013", "014", "0…
## $ score.phq15  <dbl> 17.142857, 17.000000, 19.285714, 13.000000, 20.357143, 8.000000, 14.000000, 16.071429, 15.000000, 11…
## $ cutoff.phq15 <fct> Severe, Severe, Severe, Moderate, Severe, Mild, Moderate, Severe, Severe, Moderate, Severe, Mild, Se…
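The classification step can be sketched with base R's cut(). The thresholds 5, 10 and 15 are an assumption based on commonly cited PHQ-15 cut points; they are consistent with the output above (8 is Mild, 13 is Moderate, 15 is Severe), but the label "Minimal" and the helper classify_phq15() are hypothetical and should be checked against the package documentation.

```r
# Assumed thresholds (5, 10, 15); verify against the qscorer documentation.
classify_phq15 <- function(score) {
  cut(score,
      breaks = c(0, 5, 10, 15, 30),
      labels = c("Minimal", "Mild", "Moderate", "Severe"),
      right = FALSE,          # intervals are [0, 5), [5, 10), [10, 15), ...
      include.lowest = TRUE)  # closes the last interval, so 30 is included
}

classify_phq15(c(17.142857, 13, 8, 15, 3))
```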

Documentation

The package documentation is hosted on GitHub Pages.

Scoring questionnaires using the ‘qscorer’ package

Motivation

Health-related questionnaires are used in many psychological studies and clinical trials. I know from my own experience that information about scoring procedures is often scattered among numerous sources and, thus, hard to find. Since I frequently work with questionnaires, I have decided to write qscorer, an R package with scoring procedures for health-related questionnaires.


Scoring procedures

The qscorer package (version 0.3.0) provides procedures for scoring the following health-related questionnaires:

  • Atrial Fibrillation Effect on QualiTy-of-Life Questionnaire (AFEQT)
  • Beck Depression Inventory (BDI, BDI-II)
  • Behavior Rating Inventory of Executive Function, adult version (BRIEF-A)
  • Dutch Eating Behavior Questionnaire, German version (DEBQ)
  • Eating Disorder Examination Questionnaire, short form (EDE-Q8)
  • Epworth Sleepiness Scale (ESS)
  • European Quality of Life Five Dimension Three Level Scale Questionnaire (EQ-5D-3L)
  • General Self-Efficacy Scale (GSES)
  • Hospital Anxiety and Depression Scale (HADS)
  • Impact of Weight on Quality of Life-Lite Questionnaire (IWQOL-Lite)
  • International Physical Activity Questionnaire, short form (IPAQ)
  • Patient Health Questionnaire-9 (PHQ-9)
  • Patient Health Questionnaire-15 (PHQ-15)
  • Rosenberg Self-Esteem Scale (RSES)
  • Severe Respiratory Insufficiency Questionnaire (SRI)
  • Skala zur Selbstregulation (REG) (German)
  • Social Support Questionnaire, short form (F-SozU-K-7) (German)
  • Weight Bias Internalization Scale (WBIS)
  • Weight Self-Stigma Questionnaire (WSSQ)
  • Yale Food Addiction Scale, Version 2.0 (YFAS V2.0)

Documentation

The package documentation is hosted by GitHub Pages.

Installation

You can install the development version of qscorer from GitHub with:

devtools::install_github('nrkoehler/qscorer')

To Do

In the near future, I will add control structures to the scoring functions in order to prevent items with out-of-range values from being scored. qscorer is a growing package; further scoring procedures will be added.

How to Assign Variable Labels in R

Intro

Defining variable labels is a useful way to describe and document datasets. Unlike SPSS, which makes it very easy to define variable labels using the data editor, base R doesn't provide any function to define variable labels (as far as I know).

However, Daniel Luedecke's R package sjlabelled fills this gap. Let's give an example.

Defining variable labels

First, we load the mtcars data frame and define variable labels for all of the 11 variables:

data(mtcars)
labs <- c("Miles/(US) gallon", "Number of cylinders", "Displacement (cu.in.)", 
    "Gross horsepower", "Rear axle ratio", "Weight (1000 lbs)", "1/4 mile time", 
    "V/S", "Transmission", "Number of forward gears", "Number of carburetors")

Assigning labels to variables

Second, we assign the variable labels to the variables of the mtcars data frame:

library(sjlabelled)
mtcars <- set_label(mtcars, label = labs)

When we have a look at the mtcars data frame using RStudio's data viewer, we find the variable labels placed right underneath the variable names:


Moreover, we may as well save both variable names and labels into a data frame:

library(dplyr) # for data manipulation
library(knitr) # for printing tables
df <- get_label(mtcars) %>%
  data.frame() %>%
  rename(var.labs = 1) %>% # name the first (and only) column; replaces the deprecated funs()
  mutate(var.names = colnames(mtcars))
kable(df, align = 'lc')
var.labs                  var.names
Miles/(US) gallon         mpg
Number of cylinders       cyl
Displacement (cu.in.)     disp
Gross horsepower          hp
Rear axle ratio           drat
Weight (1000 lbs)         wt
1/4 mile time             qsec
V/S                       vs
Transmission              am
Number of forward gears   gear
Number of carburetors     carb
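A variable label of this kind is an ordinary "label" attribute attached to each column, so the idea can also be sketched in base R. The helpers set_var_labels() and get_var_labels() below are hypothetical names for illustration and are not functions of sjlabelled.

```r
# Hypothetical helpers built on base R attributes
set_var_labels <- function(data, labels) {
  stopifnot(length(labels) == ncol(data))
  for (i in seq_along(labels)) {
    attr(data[[i]], "label") <- labels[i]
  }
  data
}

get_var_labels <- function(data) {
  # Return NA for columns without a label attribute
  vapply(data, function(x) {
    lab <- attr(x, "label")
    if (is.null(lab)) NA_character_ else lab
  }, character(1))
}

cars_small <- set_var_labels(mtcars[, 1:2],
                             c("Miles/(US) gallon", "Number of cylinders"))
get_var_labels(cars_small)
```

Note that plain attributes can get lost when a column is transformed or subset, which is one reason to prefer a dedicated package.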

Scoring the St George’s Respiratory Questionnaire (SGRQ) using R

Background


The St George's Respiratory Questionnaire (SGRQ) is an instrument for measuring health-related quality of life in patients with obstructive airway diseases. The SGRQ contains 50 items covering three domains:

  • Symptoms (8 items),
  • Activity (16 items), and
  • Impacts (26 items).

In addition, a total summary scale may be computed [1, 2].

All scales have a score range between 0 and 100, with higher scores indicating a worse quality of life [2]. The items are either scored on 3-point, 4-point, or 5-point Likert scales, or they are binary-choice items that must be answered with either “yes” or “no”. Each item response has an empirically derived regression weight.
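The principle of weighted scoring can be sketched as follows: the weights of the answers given are summed and expressed as a percentage of the highest possible sum. The three items and their weights below are invented for illustration (only the 0.0 / 62.0 pair for a yes/no item appears in this post); the real weights must be taken from the SGRQ Scoring Manual.

```r
# Toy component with three items; the weights are invented for illustration.
weights <- list(
  item_a = c(0.0, 62.0),              # binary item: no / yes
  item_b = c(0.0, 30.0, 60.0, 90.0),  # 4-point item
  item_c = c(0.0, 40.0, 80.0)         # 3-point item
)
# Responses are coded 1..k and index into the weight vectors
responses <- data.frame(item_a = c(2, 1), item_b = c(4, 2), item_c = c(3, 1))

# Look up the weight that belongs to each answer
w <- as.data.frame(Map(function(resp, wt) wt[resp], responses, weights))

# Highest possible sum of weights for this toy component (62 + 90 + 80 = 232)
max_weight <- sum(vapply(weights, max, numeric(1)))

# Component score on a 0-100 metric
score <- rowSums(w) / max_weight * 100
score
```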

Scoring the SGRQ

Based on the SGRQ Scoring Manual, I have written the R-package sgrqr for calculating the SGRQ scores.

Installation

The package is hosted on GitHub and may be installed using the following code:

devtools::install_github("nrkoehler/sgrqr")
library(sgrqr)

Functions and data

The core of sgrqr is the function scoring_sgrq(). It must be applied to a data frame with the 50 SGRQ items and one id variable. Moreover, the package contains two data frames with simulated values. Unlike sgrq.full, sgrq.na has some missing values.

names(sgrq.full)
##  [1] "id"       "sgrq.1"   "sgrq.2"   "sgrq.3"   "sgrq.4"   "sgrq.5"  
##  [7] "sgrq.6"   "sgrq.7"   "sgrq.8"   "sgrq.11a" "sgrq.11b" "sgrq.11c"
## [13] "sgrq.11d" "sgrq.11e" "sgrq.11f" "sgrq.11g" "sgrq.15a" "sgrq.15b"
## [19] "sgrq.15c" "sgrq.15d" "sgrq.15e" "sgrq.15f" "sgrq.15g" "sgrq.15h"
## [25] "sgrq.15i" "sgrq.9"   "sgrq.10"  "sgrq.12a" "sgrq.12b" "sgrq.12c"
## [31] "sgrq.12d" "sgrq.12e" "sgrq.12f" "sgrq.13a" "sgrq.13b" "sgrq.13c"
## [37] "sgrq.13d" "sgrq.13e" "sgrq.13f" "sgrq.13g" "sgrq.13h" "sgrq.14a"
## [43] "sgrq.14b" "sgrq.14c" "sgrq.14d" "sgrq.16a" "sgrq.16b" "sgrq.16c"
## [49] "sgrq.16d" "sgrq.16e" "sgrq.17"
head(sgrq.na[1:6])
##   id sgrq.1 sgrq.2 sgrq.3 sgrq.4 sgrq.5
## 1  1      1      1      1      1      1
## 2  2      2      2      2      5      4
## 3  3      2     NA      4      3      5
## 4  4      1      1      1      5      4
## 5  5      3      2      2      4      1
## 6  6      4      2      2      5      4

Usage

When applied to a data frame, the function returns a data frame containing the SGRQ score values and an id variable.

df <- scoring_sgrq(sgrq.full, id = 'id')
head(df)
##   id sgrq.ss sgrq.as sgrq.is sgrq.ts
## 1  1   100.0    63.4    38.9    56.5
## 2  2    36.1    62.9    37.2    44.8
## 3  3    56.0    44.8    45.7    47.1
## 4  4    52.3    35.8    51.1    46.7
## 5  5    65.8    50.0    63.0    59.5
## 6  6    53.4    51.3    37.1    44.1

If no id variable is specified, a data frame containing the score values only is returned.

df <- scoring_sgrq(sgrq.na)
head(df)
##   sgrq.ss sgrq.as sgrq.is sgrq.ts
## 1   100.0    63.4    38.9    56.5
## 2    36.1    68.8    45.1    50.8
## 3    58.5    44.8    49.6    49.6
## 4    52.3    35.8    51.1    46.7
## 5    65.8    56.9    65.3    62.8
## 6    53.4    58.1    40.8    48.2

Difficulties in handling missing values

In the SGRQ scoring manual it says:

The Symptoms component will tolerate a maximum of 2 missed items. The weight for the missed item is subtracted from the total possible weight for the Symptoms component (662.5) and from the Total weight (3989.4).

Since item weights depend on the actual answers given, it remains unclear (at least to me) how to determine the weight of a missing item. The weight of the item “If you have a wheeze, is it worse in the morning?”, for example, is 0.0 for the answer “no” and 62.0 for the answer “yes”. The algorithm implemented in scoring_sgrq() assigns the missing item the highest possible weight (so 62.0 rather than 0.0). In order to subtract the weight of the missing item “from the total possible weight for the Symptoms component and from the Total weight”, one has to check whether no more than 2 items are missing and, if so, which items are missing. Since this is laborious to implement, I decided to program the algorithm the quick and dirty way.

First, I count the missing items of the Symptoms component:

  # return position of first item 
  a <- which(names(X)=="sgrq.1")   
  # return position of last item
  z <- which(names(X)=="sgrq.8")
  # calculate number of missing items
  Y$NMISS.ss <- rowSums(is.na(X[, c(a:z)]))

Second, I replace all missing values with the corresponding highest item weight:

  # replace missing values with the highest weight of the respective item
  # (repl.val is indexed by i - 1, assuming the id occupies column 1)
  for (i in a:z) {
    X[is.na(X[, i]), i] <- repl.val[i - 1]
  }

Third, I calculate the score:

 # calculate score
  Y$sgrq.ss <- rowSums(X[, vars]) / 662.5 * 100

And finally, I replace the score value by NA if more than 2 items of the Symptoms component are missing:

 Y$sgrq.ss <- ifelse(Y$NMISS.ss > 2, NA, Y$sgrq.ss)

Rather than subtracting the weight of the missing item “from the total possible weight”, I assign the highest possible item weight to the missing item, but only if no more than 2 items are missing.
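A small worked example shows how the two strategies can differ. It uses invented toy weights (only the 62.0 is taken from the example above), and variant (a) is just one possible reading of the manual, namely that the highest weight of the missing item is removed from the denominator.

```r
# Toy weights (hypothetical, except the 62.0 mentioned above)
max_w <- c(item_a = 62, item_b = 90, item_c = 80)  # highest weight per item
obs_w <- c(item_a = NA, item_b = 30, item_c = 0)   # weights of the answers given
miss  <- is.na(obs_w)

# (a) One reading of the manual: drop the missing item from the denominator
score_manual <- sum(obs_w, na.rm = TRUE) / (sum(max_w) - sum(max_w[miss])) * 100

# (b) Approach described in this post: impute the highest possible weight
imputed <- ifelse(miss, max_w, obs_w)
score_imputed <- sum(imputed) / sum(max_w) * 100

round(c(manual = score_manual, imputed = score_imputed), 1)
```

In this toy case the scores are about 17.6 under (a) and about 39.7 under (b), so imputing the highest weight pushes the score towards a worse quality of life.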

I'm looking forward to getting some feedback on this post. I'm sure there is a better solution.

References

  1. Jones, P. W., F. H. Quirk, and C. M. Baveystock. 1991. The St George Respiratory Questionnaire. Respiratory Medicine 85 (September): 25-31. doi:10.1016/S0954-6111(06)80166-6.

  2. Jones, Paul W, Frances H Quirk, Chloë M Baveystock, and Peter Littlejohns. 1992. A Self-Complete Measure of Health Status for Chronic Airflow Limitation. Am Rev Respir Dis 145 (6): 1321-7.

Scoring the Severe Respiratory Insufficiency Questionnaire (SRI) using R

Background

The SRI is a multidimensional general health questionnaire “to assess HRQL in patients with chronic respiratory failure due to various underlying diseases” [1]. Based on 49 items, seven subscales addressing the following domains are calculated:

  • Respiratory Complaints (8 items);
  • Physical Function (6 items);
  • Attendant Symptoms and Sleep (7 items);
  • Social Relationships (6 items);
  • Anxiety (5 items);
  • Psychological Well-Being (9 items);
  • Social Functioning (8 items).

Based on the subscales, a total summary scale is calculated. All scales have a score range between 0 and 100, with higher scores indicating a better quality of life [2]. All items are scored on a 5-point Likert scale ranging from 1 (completely untrue) to 5 (always true). The majority of items need to be recoded (recoded value = 6 – raw value).
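The recoding and the 0 to 100 metric can be sketched as follows. The recoding rule (6 minus raw value) is taken from the text above; the linear transformation of the item mean, (mean - 1) / 4 * 100, is an assumption on my part and not quoted from the scoring manual, and score_subscale() is a hypothetical helper, not a function of srir.

```r
# Hypothetical helper: score one SRI-style subscale on a 0-100 metric
score_subscale <- function(x, recode = NULL) {
  x <- as.matrix(x)
  # Recode the selected items as described above: 6 - raw value
  if (!is.null(recode)) x[, recode] <- 6 - x[, recode]
  # Assumed linear transformation of the item mean (range 1..5) to 0..100
  (rowMeans(x, na.rm = TRUE) - 1) / 4 * 100
}

toy <- data.frame(q1 = c(1, 5), q2 = c(5, 1), q3 = c(3, 3))
sri_scores <- score_subscale(toy, recode = c("q1", "q3"))
sri_scores
```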

Scoring the SRI

Based on the SRI Scoring Manual, I have written the R-package srir for calculating the SRI scores.

Installation

The package is hosted on GitHub and may be installed using the following code:

devtools::install_github("nrkoehler/srir")
library(srir)

Functions and data

The core of srir is the function scoring_sri(). It must be applied to a data frame with the 49 SRI items and one id variable. Moreover, the package contains two data frames with simulated values. Unlike df.full, df.na has some missing values.

names(df.full)
##  [1] "id"     "sri.1"  "sri.2"  "sri.3"  "sri.4"  "sri.5"  "sri.6" 
##  [8] "sri.7"  "sri.8"  "sri.9"  "sri.10" "sri.11" "sri.12" "sri.13"
## [15] "sri.14" "sri.15" "sri.16" "sri.17" "sri.18" "sri.19" "sri.20"
## [22] "sri.21" "sri.22" "sri.23" "sri.24" "sri.25" "sri.26" "sri.27"
## [29] "sri.28" "sri.29" "sri.30" "sri.31" "sri.32" "sri.33" "sri.34"
## [36] "sri.35" "sri.36" "sri.37" "sri.38" "sri.39" "sri.40" "sri.41"
## [43] "sri.42" "sri.43" "sri.44" "sri.45" "sri.46" "sri.47" "sri.48"
## [50] "sri.49"
head(df.na[1:6])
##   id sri.1 sri.2 sri.3 sri.4 sri.5
## 1  1     1     2     1     5     4
## 2  2     5     3     5     3     2
## 3  3     2     2     4    NA     5
## 4  4    NA     3     2     4     4
## 5  5     3     4     3     5     2
## 6  6     2     2     2     3     4

Usage

When applied to a data frame, the function returns a data frame containing the SRI score values and an id variable.

df <- scoring_sri(df.full, id = 'id')
head(df)
##   id sri.rc sri.pf sri.as sri.sr sri.ax sri.wb sri.sf sri.ss
## 1  1   37.5   50.0   46.4   95.8     35   52.8   28.1   49.4
## 2  2   56.2   45.8   60.7   41.7     45   66.7   59.4   53.6
## 3  3   50.0   50.0   35.7   45.8     35   58.3   53.1   46.9
## 4  4   46.9   54.2   39.3   75.0     55   38.9   43.8   50.4
## 5  5   71.9   50.0   28.6   58.3     50   33.3   40.6   47.5
## 6  6   53.1   66.7   28.6   45.8     30   47.2   37.5   44.1

If no id variable is specified, a data frame containing the score values only is returned.

df <- scoring_sri(df.na)
head(df)
##   sri.rc sri.pf sri.as sri.sr sri.ax sri.wb sri.sf sri.ss
## 1   37.5   50.0   46.4   95.8   35.0   52.8   28.1   49.4
## 2   56.2   45.8   70.8   25.0   31.2   66.7   67.9   52.0
## 3   50.0   50.0   33.3   50.0   31.2   53.1   46.4   44.9
## 4   46.4     NA   37.5   75.0   68.8   38.9   42.9     NA
## 5   65.0   50.0   12.5   55.0   50.0   33.3   40.6   43.8
## 6   60.7   70.0   25.0   37.5   33.3   40.6   37.5   43.5

References

  1. Struik, Fransien M., Huib A.M. Kerstjens, Gerrie Bladder, Roy Sprooten, Marianne Zijnen, Jerryll Asin, Thys van der Molen, and Peter J. Wijkstra. 2013. “The Severe Respiratory Insufficiency Questionnaire Scored Best in the Assessment of Health-Related Quality of Life in Chronic Obstructive Pulmonary Disease.” Journal of Clinical Epidemiology 66 (10): 1166–74.

  2. Windisch, Wolfram, Klaus Freidel, Bernd Schucher, Hansjörg Baumann, Matthias Wiebel, Heinrich Matthys, and Franz Petermann. 2003. “The Severe Respiratory Insufficiency (SRI) Questionnaire: A Specific Measure of Health-Related Quality of Life in Patients Receiving Home Mechanical Ventilation.” Journal of Clinical Epidemiology 56 (8): 752–59.

How to install and use the hexSticker package

Intro

A couple of days ago, the hexSticker package was published on CRAN. The package provides some functions to plot hexagon stickers that may be used to promote R packages. As described on GitHub, the stickers can be plotted either using base R's plotting function, the lattice package or the ggplot2 package. Moreover, it is also possible to plot image files.

Since I found it quite demanding to install the hexSticker package on a current Linux OS (Linux Mint 18.1), I decided to write a short tutorial explaining how to install and use the package on Ubuntu-based Linux operating systems.

Linux packages required

As a first step, we open the terminal and install the following software packages. While texinfo is required to build R packages from source, the remaining libraries (libudunits2-dev, fftw-dev, mffm-fftw1, libfftw3-dev and libtiff5-dev) are needed to install some R packages that hexSticker depends on (e.g. ggforce, fftwtools).

sudo apt-get install texinfo libudunits2-dev fftw-dev mffm-fftw1 libfftw3-dev libtiff5-dev

R packages required

Recently, the fftwtools package was added to CRAN. Thus, it can be installed the usual way:

install.packages('fftwtools', dependencies = TRUE)

Finally, the EBImage package must be installed from the Bioconductor repository and the packages ggimage, ggforce and hexSticker must be installed from CRAN.

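# biocLite() has been superseded by the BiocManager package
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("EBImage")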
install.packages("ggimage")
install.packages("ggforce")
install.packages("hexSticker")

For plotting an example hexsticker we need some data provided by the streetsofle package which must be installed from GitHub:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("nrkoehler/streetsofle")

Plotting

With the following code chunk we create two hexstickers for the streetsofle package. The colours chosen stem from the city flag of Leipzig. The arguments of the sticker() function are explained within the code's comments.

library(hexSticker)
library(ggplot2)
library(streetsofle)
data(streetsofle)

p.1 <- ggplot(aes(x = lon, y = lat), data = shape.ortsteile) + 
  theme_map_le() + 
  coord_quickmap() + 
  geom_polygon(aes(x = lon, y = lat, group = group), 
               fill = NA, 
               size = 0.2, 
               color = "#FFCB00") + 
  geom_polygon(aes(x = lon, y = lat, group = group),
               color = "#FFCB00", 
               size = 1, 
               fill = NA, 
               data = shape.bezirke) 

p.1 <- sticker(p.1,
               package="streetsofle", 
               s_x = 1, # horizontal position of subplot
               s_y = 1.1, # vertical position of subplot
               s_width = 1.4, # width of subplot
               s_height = 1.4, # height of subplot
               p_x = 1, # horizontal position of font
               p_y = .43, # vertical position of font
               p_size = 6, # font size
               p_color = "#FFCB00", # font colour
               h_size = 3, # hexagon border size
               h_fill = "#004CFF", # hexagon fill colour
               h_color = "#FFCB00") # hexagon border colour

p.2 <- ggplot(aes(x = lon, y = lat), data = shape.ortsteile) + 
  theme_map_le() + 
  coord_quickmap() + 
  geom_polygon(aes(x = lon, y = lat, group = group), 
               fill = NA, 
               size = 0.2, 
               color = "#004CFF") + 
  geom_polygon(aes(x = lon, y = lat, group = group),
               color = "#004CFF", 
               size = 1, 
               fill = NA, 
               data = shape.bezirke) 

p.2 <- sticker(p.2,
               package="streetsofle", 
               s_x = 1, # horizontal position of subplot
               s_y = 1.1, # vertical position of subplot
               s_width = 1.4, # width of subplot
               s_height = 1.4, # height of subplot
               p_x = 1, # horizontal position of font
               p_y = .43, # vertical position of font
               p_size = 6, # font size
               p_color = "#004CFF", # font color
               h_size = 3, # hexagon border size
               h_fill = "#FFCB00", # hexagon fill colour
               h_color = "#004CFF") # hexagon border colour

Finally, both plots are put into a grid layout using the grid.arrange() function of the gridExtra package.

library(gridExtra)
grid.arrange(p.1, p.2, ncol = 2, respect = TRUE)


I'm not sure which sticker looks better. What do you think?

My first R package: streetsofle

Intro

A couple of days ago I published my first R package on GitHub. I’ve named the package streetsofle standing for streets of Leipzig because it includes a data set containing the street directory of the German city of Leipzig. The street directory was published by the Statistical Bureau of Leipzig and may be downloaded as PDF file from the following website.

Leipzig is divided into 10 greater administrative districts (“Stadtbezirke”) and 63 smaller local districts (“Ortsteile”). The names of both smaller and greater districts can be found under the following link.

Figure 1: Map of Leipzig


Furthermore, the Leipzig area is covered by 34 postal codes (“Postleitzahlen”):

##  [1] "04103" "04105" "04107" "04109" "04129" "04155" "04157" "04158"
##  [9] "04159" "04177" "04178" "04179" "04205" "04207" "04209" "04229"
## [17] "04249" "04275" "04277" "04279" "04288" "04289" "04299" "04315"
## [25] "04316" "04317" "04318" "04319" "04328" "04329" "04347" "04349"
## [33] "04356" "04357"

Installation

if (!require("devtools")) install.packages("devtools")
devtools::install_github("nrkoehler/streetsofle")

The Dataset

The data frame streetsofle contains 3391 observations and 8 variables:

  • plz: postal code,
  • street_key: street identification number,
  • street_name: street name,
  • street_num: a list of street numbers,
  • bz_key: identification number for greater districts (‘Stadtbezirke’),
  • bz_name: names of greater districts (‘Stadtbezirke’),
  • ot_key: identification number for smaller districts (‘Ortsteile’),
  • ot_name: names of smaller districts (‘Ortsteile’).

Since street sections without addresses are usually not covered by postal codes, the variable plz contains some missing values (n=169).

Next steps

Writing functions

Within the next couple of weeks I will write some functions to analyse the street directory data. The next code snippet, for example, shows how to calculate the number of smaller districts (‘Ortsteile’) traversed by the streets.

f <- function(x){length(unique(x))}
df <- data.frame(ot_num = tapply(streetsofle$ot_name, streetsofle$street_name, f))
psych::headTail(df)
##                      ot_num
## Aachener Straße           1
## Abrahamstraße             1
## Abtnaundorfer Straße      1
## Achatstraße               1
## ...                     ...
## Zwergmispelstraße         1
## Zwetschgenweg             1
## Zwickauer Straße          3
## Zwiebelweg                1

The following table shows the number of smaller districts traversed by the streets of Leipzig.

no. of districts   no. of streets
1                  2721
2                  207
3                  49
4                  11
5                  4
6                  2
7                  2
10                 1

While 91% of the streets don’t traverse any district border, one street traverses the borders of 10 districts.
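A frequency table of this kind, and the share of streets that stay within a single district, can be derived with table() and mean(). The sketch below uses a small made-up data frame instead of the streetsofle data, so the street and district names are hypothetical.

```r
# Toy stand-in for streetsofle (street and district names are made up)
toy <- data.frame(
  street_name = c("A-Weg", "A-Weg", "B-Weg", "C-Weg", "C-Weg", "C-Weg"),
  ot_name     = c("Nord", "West", "Nord", "Nord", "West", "Ost")
)

# Number of distinct districts per street
ot_num <- tapply(toy$ot_name, toy$street_name, function(x) length(unique(x)))

# Frequency table: how many streets traverse 1, 2, 3, ... districts
freq <- table(ot_num)

# Share of streets that do not cross any district border
share_one <- mean(ot_num == 1)
```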

Shiny Web-App

Based on these functions I’m planning to write a Shiny Web-App providing several search functions.

How to install the ‘RWordPress’ package in R

The RWordPress package is a very convenient tool for publishing blog posts from R to WordPress. In his blog post Publish blog posts from R + knitr to WordPress, Yihui Xie explains how to install and use the package. Furthermore, the blog post How to publish with R Markdown in WordPress gives some additional information on how to use the package.

However, the package repository http://www.omegahat.org/R does not seem to exist anymore (2016-08-30).

Fortunately, the RWordPress package is also available from GitHub and, thus, can easily be installed using the devtools package.

Since RWordPress depends on the packages RCurl, XML, and XMLRPC, these packages need to be installed before we can actually install RWordPress.

Unlike RCurl and XML, the XMLRPC package is not available from the CRAN repository. Instead, it is available from GitHub.

Here is the code to install all required packages:

install.packages("devtools")
install.packages("RCurl")
install.packages("XML")
devtools::install_github("duncantl/XMLRPC")
devtools::install_github("duncantl/RWordPress")
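To avoid reinstalling packages that are already present, one can first check which of them are missing. The following is a minimal sketch; missing_pkgs() is a hypothetical helper, and the actual install.packages() call is commented out so that the snippet has no side effects beyond loading namespaces.

```r
# Hypothetical helper: which of the given packages are not installed yet?
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

cran_pkgs <- c("devtools", "RCurl", "XML")
to_install <- missing_pkgs(cran_pkgs)

# Only call install.packages() when something is actually missing
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)
}
```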