Scoring the PHQ-9 Questionnaire Using R

Intro

The PHQ-9 is the nine-item depression module of the Patient Health Questionnaire. Each of the items is scored on a 4-point Likert scale ranging from 0 (not at all) to 3 (nearly every day). The items are summed to obtain a total score ranging from 0 to 27 with higher scores indicating greater severity of depression. Based on the total score, different levels of severity can be evaluated with 0–4, 5–9, 10–14, 15–19 and 20–27 points indicating “minimal”, “mild”, “moderate”, “moderately severe” and “severe” depression.

The PHQ-9 questionnaire may be found under the following link.

In this blog post, I show how to calculate the PHQ-9 score and the PHQ-9 severety levels.

Packages and data

The dataset we are going to use was published in Plos One. The file has got a Digital Object Identifier (doi) and may be imported into R using the read_delim() function of the readr package.

library(readr)
library(dplyr)
library(ggplot2)

df.phq9 <- readr::read_delim("https://doi.org/10.1371/journal.pone.0156167.s001", 
                             delim = ";", 
                             escape_double = FALSE, 
                             trim_ws = TRUE) %>%
            select(starts_with('phq9'))

glimpse(df.phq9)
## Observations: 1,337
## Variables: 9
## $ phq9_1 <int> 1, 3, 2, 0, 0, 0, 1, 0, 0, 2, 1, 1, 0, 3, 0, 0, 0, 2, 0...
## $ phq9_2 <int> 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ phq9_3 <int> 3, 2, 2, 2, 1, 0, 1, 3, 1, 0, 1, 1, 0, 3, 1, 0, 0, 0, 0...
## $ phq9_4 <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 1, 3, 0, 1, 0, 0, 0, 1, 0...
## $ phq9_5 <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0...
## $ phq9_6 <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ phq9_7 <int> 0, 1, 1, 1, 0, 1, 0, 0, 0, 3, 1, 1, 0, 1, 0, 0, 0, 0, 0...
## $ phq9_8 <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0...
## $ phq9_9 <int> 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

The Scoring Function

The scoring_phq9 function requires a data frame containing the PHQ-9 items (data) and a vector containing the items' names (items.phq9) as input parameters.

scoring_phq9 <- function(data, items.phq9) {
  data %>%
    mutate(nvalid.phq9 = rowSums(!is.na(select(., items.phq9))),
           nvalid.phq9 = as.integer(nvalid.phq9),
           mean.temp = rowSums(select(., items.phq9), na.rm = TRUE)/nvalid.phq9,
           phq.01.temp = as.integer(unlist(data[items.phq9[1]])),
           phq.02.temp = as.integer(unlist(data[items.phq9[2]])),
           phq.03.temp = as.integer(unlist(data[items.phq9[3]])),
           phq.04.temp = as.integer(unlist(data[items.phq9[4]])),
           phq.05.temp = as.integer(unlist(data[items.phq9[5]])),
           phq.06.temp = as.integer(unlist(data[items.phq9[6]])),
           phq.07.temp = as.integer(unlist(data[items.phq9[7]])),
           phq.08.temp = as.integer(unlist(data[items.phq9[8]])),
           phq.09.temp = as.integer(unlist(data[items.phq9[9]]))) %>%
    mutate_at(vars(phq.01.temp:phq.09.temp),
              funs(ifelse(is.na(.), round(mean.temp), .))) %>%
    mutate(score.temp = rowSums(select(., phq.01.temp:phq.09.temp), na.rm = TRUE),
           score.phq9 = ifelse(nvalid.phq9 >= 7, as.integer(round(score.temp)), NA),
           cutoff.phq9 = case_when(
             score.phq9 >= 20 ~ 'severe',
             score.phq9 >= 15 ~ 'moderately severe',
             score.phq9 >= 10 ~ 'moderate',
             score.phq9 >= 5 ~ 'mild',
             score.phq9 < 5 ~ 'minimal'),
             cutoff.phq9 = factor(cutoff.phq9, levels = c('minimal', 'mild',
                                                          'moderate', 'moderately severe',
                                                          'severe'))) %>%
    select(-ends_with("temp"))

}

Example

The function adds three variables to the original data frame:

  • nvalid.phq9: Number of variables with valid values,
  • score.phq9: PHQ-9 score (0 – 27),
  • cutoff.phq9: PHQ-9 severety levels (minimal, mild, moderate, moderately severe, severe)
items.phq9 <- paste0('phq9_', seq(1, 9, 1))
df.phq9 <- df.phq9 %>%
  scoring_phq9(., items.phq9)
glimpse(df.phq9)
## Observations: 1,337
## Variables: 12
## $ phq9_1      <int> 1, 3, 2, 0, 0, 0, 1, 0, 0, 2, 1, 1, 0, 3, 0, 0, 0,...
## $ phq9_2      <int> 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ phq9_3      <int> 3, 2, 2, 2, 1, 0, 1, 3, 1, 0, 1, 1, 0, 3, 1, 0, 0,...
## $ phq9_4      <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 1, 3, 0, 1, 0, 0, 0,...
## $ phq9_5      <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 0,...
## $ phq9_6      <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ phq9_7      <int> 0, 1, 1, 1, 0, 1, 0, 0, 0, 3, 1, 1, 0, 1, 0, 0, 0,...
## $ phq9_8      <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,...
## $ phq9_9      <int> 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nvalid.phq9 <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,...
## $ score.phq9  <int> 7, 10, 7, 9, 3, 2, 3, 4, 5, 7, 7, 8, 0, 11, 1, 0, ...
## $ cutoff.phq9 <fct> mild, moderate, mild, mild, minimal, minimal, mini...

Visualization

PHQ-9 Score

ggplot(df.phq9, aes(score.phq9)) +
  geom_density(fill = 'blue', alpha = 0.2) +
  scale_x_continuous(limits = c(0, 27), breaks = c(0,5,10,15,20,27)) +
  labs(x = 'PHQ-9 Score', y = 'Density') +
  theme_bw()

plot of chunk unnamed-chunk-4

PHQ-9 Severety Levels

ggplot(df.phq9, aes(x = cutoff.phq9, fill = cutoff.phq9)) +
  geom_bar(colour = 'black') +
  scale_fill_brewer(type = 'seq') +
  labs(x = NULL, y = NULL, fill = NULL) +
  theme_bw()

plot of chunk unnamed-chunk-5

Scoring the St George’s Respiratory Questionnaire (SGRQ) using R

Background

plot of chunk sgrq

The St George's Respiratory Questionnaire (SGRQ) is an instrument for the measuring of Health-Related Quality-of-Life in patients with diseases of airways obstruction. The SGRQ contains 50 items covering three domains:

  • Symptoms (8 items),
  • Activity (16 items), and
  • Impacts (26 items).

In addition, a total summary scale may be computed (1, 2).

All scales have a score range between 0 and 100 with higher scores indicating a worse quality of life [2]. The items are either scored on 3-point-, 4-point-, and 5-point Likert scales, or they are binary-choice items that must be answered with either “yes” or “no”. Each item has an empirically derived regression weight.

Scoring the SGRQ

Based on the SGRQ Scoring Manual, I have written the R-package sgrqr for calculating the SGRQ scores.

Installation

The package is hosted on GitHub and may be installed using the following code:

devtools::install_github("nrkoehler/sgrqr")
library(sgrqr)

Functions and data

The core of sgrqr is the function scoring_sgrq(). It must be applied to a data frame with the 50 SGRQ items and one id variable. Moreover, the package contains two data frames with simulated values. Unlike sgrq.full, sgrq.na has some missing values.

names(sgrq.full)
##  [1] "id"       "sgrq.1"   "sgrq.2"   "sgrq.3"   "sgrq.4"   "sgrq.5"  
##  [7] "sgrq.6"   "sgrq.7"   "sgrq.8"   "sgrq.11a" "sgrq.11b" "sgrq.11c"
## [13] "sgrq.11d" "sgrq.11e" "sgrq.11f" "sgrq.11g" "sgrq.15a" "sgrq.15b"
## [19] "sgrq.15c" "sgrq.15d" "sgrq.15e" "sgrq.15f" "sgrq.15g" "sgrq.15h"
## [25] "sgrq.15i" "sgrq.9"   "sgrq.10"  "sgrq.12a" "sgrq.12b" "sgrq.12c"
## [31] "sgrq.12d" "sgrq.12e" "sgrq.12f" "sgrq.13a" "sgrq.13b" "sgrq.13c"
## [37] "sgrq.13d" "sgrq.13e" "sgrq.13f" "sgrq.13g" "sgrq.13h" "sgrq.14a"
## [43] "sgrq.14b" "sgrq.14c" "sgrq.14d" "sgrq.16a" "sgrq.16b" "sgrq.16c"
## [49] "sgrq.16d" "sgrq.16e" "sgrq.17"
head(sgrq.na[1:6])
##   id sgrq.1 sgrq.2 sgrq.3 sgrq.4 sgrq.5
## 1  1      1      1      1      1      1
## 2  2      2      2      2      5      4
## 3  3      2     NA      4      3      5
## 4  4      1      1      1      5      4
## 5  5      3      2      2      4      1
## 6  6      4      2      2      5      4

Usage

When applied to a data frame, the function returns a data frame containing the SGRQ score values and an id variable.

df <- scoring_sgrq(sgrq.full, id = 'id')
head(df)
##   id sgrq.ss sgrq.as sgrq.is sgrq.ts
## 1  1   100.0    63.4    38.9    56.5
## 2  2    36.1    62.9    37.2    44.8
## 3  3    56.0    44.8    45.7    47.1
## 4  4    52.3    35.8    51.1    46.7
## 5  5    65.8    50.0    63.0    59.5
## 6  6    53.4    51.3    37.1    44.1

If no id variable is specified, a data frame containing the score values only is returned.

df <- scoring_sgrq(sgrq.na)
head(df)
##   sgrq.ss sgrq.as sgrq.is sgrq.ts
## 1   100.0    63.4    38.9    56.5
## 2    36.1    68.8    45.1    50.8
## 3    58.5    44.8    49.6    49.6
## 4    52.3    35.8    51.1    46.7
## 5    65.8    56.9    65.3    62.8
## 6    53.4    58.1    40.8    48.2

Difficulties in handling missing values

In the SGRQ scoring manual it says:

The Symptoms component will tolerate a maximum of 2 missed items. The weight for the missed item is subtracted from the total possible weight for the Symptoms component (662.5) and from the Total weight (3989.4).

Since item weights depend on the actual answers given, it remains unclear (at least for me) how to determine the weight of a missing item. The weight of the item “If you have a wheeze, is it worse in the morning?”, for example, is “0.0” vs. “62.0” depending on the answer “no” vs. “yes”. The algorithm implemented in scoring_sgrq() ascribes the missing item the highest weight possible (so 62.0 rather than 0.0). In order to be able to substract the weight of the missing item “from the total possible weight for the Symptoms component and from the Total weight”, it needs to be checked whether no more than 2 items are missing, and if so, which items are missing. Since this is very extensive to implement, I decided to program the algorithm the quick and dirty way.

First, I check whether no more than 2 items are missing:

  # return position of first item 
  a <- which(names(X)=="sgrq.1")   
  # return position of last item
  z <- which(names(X)=="sgrq.8")
  # calculate number of missing items
  Y$NMISS.ss <- rowSums(is.na(X[, c(a:z)]))

Second, I replace all missing values with the corresponding highest item weight:

  # replace missing values with highest weight
  for (i in a:z) {
    for (j in 1:nrow(X)){
      X[j, i] <- ifelse(is.na(X[j, i] == TRUE), repl.val[i-1], X[j, i])
    }}

Third, I calculate the score:

 # calculate score
  Y$sgrq.ss <- rowSums(X[, vars]) / 662.5 * 100

And finally, I replace the score value by NA if more than 2 items of the Symptom score are missing:

 Y$sgrq.ss <- ifelse(Y$NMISS.ss > 2, NA, Y$sgrq.ss)

Rather than substracting the weight of the missing item “form the total possible weight”, I “add” the highest possible item weight to the missing item, but only if no more than 2 items are missing.

I'm looking forward to getting some feedback to this post. I'm sure there is a better solution.

References

  1. Jones, P. W., F. H. Quirk, and C. M. Baveystock. 1991. The St George Respiratory Questionnaire. Respiratory Medicine 85 (September): 25-31. doi:10.1016/S0954-6111(06)80166-6.

  2. Jones, Paul W, Frances H Quirk, Chlo M Baveystock, and Peter Littlejohns. 1992. A Self-Complete Measure of Health Status for Chronic Airflow Limitation. Am Rev Respir Dis 145 (6): 1321-7.

Scoring the Severe Respiratory Insufficiency Questionnaire (SRI) using R

Background

The SRI is a multidimensional general health questionnaire “to assess HRQL in patients with chronic respiratory failure due to various underlying diseases” [1]. Based on 49 items, seven sub scales addressing the following domains are calculated:

plot of chunk sri
  • Respiratory Complaints (8 items);
  • Physical Function (6 items);
  • Attendant Symptoms and Sleep (7 items);
  • Social Relationships (6 items);
  • Anxiety (5 items);
  • Psychological Well-Being (9 items);
  • Social Functioning (8 items).

Based on the sub scales, a total summary scale is calculated. All scales have a score range between 0 and 100 with higher scores indicating a better quality of life [2]. All items are scored on a 5-point Likert scale ranging from 1 (completely untrue) to 5 (always true). The majority of items need to be recoded (recoded value = 6 – raw value).

Scoring the SRI

Based on the SRI Scoring Manual, I have written the R-package srir for calculating the SRI scores.

Installation

The package is hosted on GitHub and may be installed using the following code:

devtools::install_github("nrkoehler/srir")
library(srir)

Functions and data

The core of srir is the function scoring_sri(). It must be applied to a data frame with the 49 SRI items and one id variable. Moreover, the package contains two data frames with simulated values. Unlike df.full, df.na has some missing values.

names(df.full)
##  [1] "id"     "sri.1"  "sri.2"  "sri.3"  "sri.4"  "sri.5"  "sri.6" 
##  [8] "sri.7"  "sri.8"  "sri.9"  "sri.10" "sri.11" "sri.12" "sri.13"
## [15] "sri.14" "sri.15" "sri.16" "sri.17" "sri.18" "sri.19" "sri.20"
## [22] "sri.21" "sri.22" "sri.23" "sri.24" "sri.25" "sri.26" "sri.27"
## [29] "sri.28" "sri.29" "sri.30" "sri.31" "sri.32" "sri.33" "sri.34"
## [36] "sri.35" "sri.36" "sri.37" "sri.38" "sri.39" "sri.40" "sri.41"
## [43] "sri.42" "sri.43" "sri.44" "sri.45" "sri.46" "sri.47" "sri.48"
## [50] "sri.49"
head(df.na[1:6])
##   id sri.1 sri.2 sri.3 sri.4 sri.5
## 1  1     1     2     1     5     4
## 2  2     5     3     5     3     2
## 3  3     2     2     4    NA     5
## 4  4    NA     3     2     4     4
## 5  5     3     4     3     5     2
## 6  6     2     2     2     3     4

Usage

When applied to a data frame, the function returns a data frame containing the SRI score values and an id variable.

df <- scoring_sri(df.full, id = 'id')
head(df)
##   id sri.rc sri.pf sri.as sri.sr sri.ax sri.wb sri.sf sri.ss
## 1  1   37.5   50.0   46.4   95.8     35   52.8   28.1   49.4
## 2  2   56.2   45.8   60.7   41.7     45   66.7   59.4   53.6
## 3  3   50.0   50.0   35.7   45.8     35   58.3   53.1   46.9
## 4  4   46.9   54.2   39.3   75.0     55   38.9   43.8   50.4
## 5  5   71.9   50.0   28.6   58.3     50   33.3   40.6   47.5
## 6  6   53.1   66.7   28.6   45.8     30   47.2   37.5   44.1

If no id variable is specified, a data frame containing the score values only is returned.

df <- scoring_sri(df.na)
head(df)
##   sri.rc sri.pf sri.as sri.sr sri.ax sri.wb sri.sf sri.ss
## 1   37.5   50.0   46.4   95.8   35.0   52.8   28.1   49.4
## 2   56.2   45.8   70.8   25.0   31.2   66.7   67.9   52.0
## 3   50.0   50.0   33.3   50.0   31.2   53.1   46.4   44.9
## 4   46.4     NA   37.5   75.0   68.8   38.9   42.9     NA
## 5   65.0   50.0   12.5   55.0   50.0   33.3   40.6   43.8
## 6   60.7   70.0   25.0   37.5   33.3   40.6   37.5   43.5

References

  1. Struik, Fransien M., Huib A.M. Kerstjens, Gerrie Bladder, Roy Sprooten, Marianne Zijnen, Jerryll Asin, Thys van der Molen, and Peter J. Wijkstra. 2013. “The Severe Respiratory Insufficiency Questionnaire Scored Best in the Assessment of Health-Related Quality of Life in Chronic Obstructive Pulmonary Disease.” Journal of Clinical Epidemiology 66 (10): 1166–74.

  2. Windisch, Wolfram, Klaus Freidel, Bernd Schucher, Hansjrg Baumann, Matthias Wiebel, Heinrich Matthys, and Franz Petermann. 2003. “The Severe Respiratory Insufficiency (SRI) Questionnaire a Specific Measure of Health-Related Quality of Life in Patients Receiving Home Mechanical Ventilation.” Journal of Clinical Epidemiology 56 (8): 752–59.