How to Assign Variable Labels in R

Intro

Defining variable labels is a useful way to describe and document datasets. Unlike SPSS, which makes it very easy to define variable labels using the data editor, base R doesn't provide any function to define variable labels (as far as I know).

However, Daniel Luedecke's R package sjlabelled fills this gap. Let's look at an example.

Defining variable labels

First, we load the mtcars data frame and define variable labels for all of the 11 variables:

data(mtcars)
labs <- c("Miles/(US) gallon", "Number of cylinders", "Displacement (cu.in.)", 
    "Gross horsepower", "Rear axle ratio", "Weight (1000 lbs)", "1/4 mile time", 
    "V/S", "Transmission", "Number of forward gears", "Number of carburetors")

Assigning labels to variables

Second, we assign the variable labels to the variables of the mtcars data frame:

library(sjlabelled)
mtcars <- set_label(mtcars, label = labs)
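
We can quickly check the assignment from the console using get_label(), which is also part of sjlabelled:

get_label(mtcars$mpg) # should return "Miles/(US) gallon"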

When we have a look at the mtcars data frame using RStudio's data viewer, we find the variable labels placed right underneath the variable names.


Moreover, we may as well save both variable names and labels into a data frame:

library(dplyr) # for data manipulation
library(knitr) # for printing tables
df <- get_label(mtcars) %>%
  data.frame() %>%
  rename_at(vars(1), funs(paste0('var.labs'))) %>%
  mutate(var.names = colnames(mtcars))
kable(df, align = 'lc')
var.labs                   var.names
------------------------   ---------
Miles/(US) gallon          mpg
Number of cylinders        cyl
Displacement (cu.in.)      disp
Gross horsepower           hp
Rear axle ratio            drat
Weight (1000 lbs)          wt
1/4 mile time              qsec
V/S                        vs
Transmission               am
Number of forward gears    gear
Number of carburetors      carb

Multiple Imputation of Missing Values — A Field Report

Intro

In research practice, data points with missing values are often excluded from the analysis or replaced by a measure of central tendency (mean, median). Multiple imputation has established itself as an alternative to these approaches; this post explains the procedure.

The experiences described below were gathered with the statistical software R and the mice package [2].

R Packages

The following R packages are used in this blog post:

library(dplyr)
library(ggplot2)
library(mice)

Background

When performing statistical analyses, one is often confronted with missing values. Excluding data points with missing values from the analysis reduces the sample size and hence the power of the statistical tests. Moreover, since values are rarely missing at random, excluding cases can bias the results. Replacing missing values with the mean or median of the respective measure, on the other hand, reduces the variance and thereby increases the risk of a type I error.

As an alternative to these and other methods, multiple imputation (MI) has been developed over the last decades [5]. Starting from a dataset with missing values, a fixed number of complete datasets (m) is generated by replacing the missing values with so-called plausible values. These plausible values are drawn from a distribution that is modelled for each missing value. The m imputed datasets are identical with respect to the observed data points but differ in the imputed values, and these differences reflect the uncertainty of the imputation process [1]. MI can nowadays be carried out with (almost) all common software packages (R, SAS, SPSS, STATA).

Procedure

Crucial for the quality of an MI is the specification of the imputation model, which has to be chosen according to the variable type. For the imputation of numeric variables, for example, predictive mean matching (pmm) can be used. This method has the advantage that the imputed values only lie within the range of the observed variable. If, for example, the missing values of an EORTC questionnaire scale are imputed, the imputed values lie between 0 and 100.
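
A minimal sketch of a pmm run with mice, using the nhanes example data that ships with the package (the values of m and seed are arbitrary choices):

library(mice)
# impute the numeric variables of the nhanes data with predictive mean matching
imp <- mice(nhanes, method = "pmm", m = 5, seed = 123, printFlag = FALSE)
summary(complete(imp, 1)) # inspect the first of the m completed datasets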

Datasets containing longitudinal data can come in two different formats. In a wide dataset, exactly one row is reserved for each case (e.g., patient), and there is one variable per measure (e.g., quality-of-life scale) and survey time point. In a long dataset, by contrast, the number of rows per case equals the number of measurement time points, and only one variable is created per measure, containing the measurements of all time points; the time of the intervention or survey is stored in a separate variable. For running an MI, it is advisable to work with a long dataset.
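
For illustration, a minimal wide-to-long conversion with base R (the variable names are made up):

# wide format: one row per patient, one column per time point
wide <- data.frame(id = 1:2, qol.t0 = c(50, 60), qol.t1 = c(55, NA))
# long format: one row per patient and time point
long <- reshape(wide, direction = "long",
                varying = c("qol.t0", "qol.t1"), v.names = "qol",
                timevar = "time", times = 0:1, idvar = "id")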

When working with quality-of-life questionnaires, one has to decide whether to impute at the level of the single items or of the scales computed from them. Since statistical analyses are usually based on scales, it is advisable to impute the missing scale values. In addition, the number of scales is considerably smaller than the number of single items, and with fewer variables to impute, the time and computing power required for the MI decrease as well [4]. Furthermore, it seems sensible to impute all variables with missing values in a single run.

Problems

If the missing values of a dataset in long format are imputed, an object of class mids is obtained. Besides the original, non-imputed data, it contains the data of the m imputed datasets. Statistical analyses are based on a pooled version of the imputed datasets. If, however, the baseline value of a variable is needed as a separate variable for the analysis, the data have to be reshaped. Since this has to be done while retaining the m imputed datasets, the mids object must first be converted into a stacked (also called tall) matrix [3]. Once this matrix has been brought into the desired shape, it can easily be converted back into a mids object, which can then serve as the basis for statistical models.
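
Continuing with the imp object from the sketch above, the round trip could look like this (the lm formula is only a placeholder):

long <- complete(imp, action = "long", include = TRUE) # stacked matrix incl. original data
# ... reshape here, e.g. add the baseline value as a separate column ...
imp2 <- as.mids(long)            # back to a mids object
fit <- with(imp2, lm(chl ~ bmi)) # fit the model in each imputed dataset
pool(fit)                        # pool the results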

My own experience has also shown that too much collinearity among the imputed variables can bias the imputed values. If, for example, missing values of the sub-scales of a questionnaire as well as of the corresponding total scale are to be imputed, it is advisable to exclude the total scale as a predictor from the imputation process.
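
This can be done via the predictor matrix; a sketch assuming mice >= 3.0, a data frame mydata, and a hypothetical total-scale variable called qol.total:

pred <- make.predictorMatrix(mydata) # default predictor matrix
pred[, "qol.total"] <- 0             # do not use the total scale to predict other variables
imp <- mice(mydata, predictorMatrix = pred, method = "pmm", m = 5, seed = 123)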

Quality checks

To assess the quality of the MI, it should be checked whether the distribution of the non-imputed values roughly matches the distribution of the imputed values. This can be done, for example, with histograms or with a combination of strip charts and violin plots* (see Fig. 1).

Fig. 1: Strip chart for checking the quality of the MI

Fig. 1 shows a combination of strip chart and violin plot for a variable (n = 200) with 50 missing values that were imputed 5 times (method: pmm). It is apparent that the distribution of the pooled imputed values roughly matches the distribution of the non-imputed values.
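
The figure above was built with ggplot2; for a quick check, mice also provides lattice-based diagnostics (again using the imp object from the sketches above, with chl as an example variable):

stripplot(imp, chl ~ .imp, pch = 20) # observed vs. imputed values per imputation
densityplot(imp, ~ chl)              # densities of observed and imputed values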

Practical conclusions

Compared with the traditional ways of handling missing values (case exclusion, single imputation), multiple imputation leads neither to a loss of statistical power nor to an underestimation of the variance. The estimates are usually not strongly biased either. In research practice, multiple imputation therefore offers a good way of dealing with missing values.

References

  1. Buuren, Stef van. 2012. Flexible Imputation of Missing Data. CRC Press.

  2. Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. Mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 45 (3). http://doc.utwente.nl/78938/.

  3. Errickson, Josh. 2017. Multiple Imputation. http://dept.stat.lsa.umich.edu/jerrick/courses/stat701/notes/mi.html.

  4. Graham, John W. 2009. Missing Data Analysis: Making It Work in the Real World. Annual Review of Psychology 60. Annual Reviews: 549-76.

  5. Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. Vol. 81. John Wiley & Sons.

*PS: Many thanks to Axel Klenk for the tip.


Scoring the St George’s Respiratory Questionnaire (SGRQ) using R

Background


The St George's Respiratory Questionnaire (SGRQ) is an instrument for measuring health-related quality of life in patients with diseases of airways obstruction. The SGRQ contains 50 items covering three domains:

  • Symptoms (8 items),
  • Activity (16 items), and
  • Impacts (26 items).

In addition, a total summary scale may be computed [1, 2].

All scales range from 0 to 100, with higher scores indicating a worse quality of life [2]. The items are scored on 3-, 4-, or 5-point Likert scales, or they are binary items that must be answered with either “yes” or “no”. Each item has an empirically derived regression weight.

Scoring the SGRQ

Based on the SGRQ Scoring Manual, I have written the R-package sgrqr for calculating the SGRQ scores.

Installation

The package is hosted on GitHub and may be installed using the following code:

devtools::install_github("nrkoehler/sgrqr")
library(sgrqr)

Functions and data

The core of sgrqr is the function scoring_sgrq(). It must be applied to a data frame with the 50 SGRQ items and one id variable. Moreover, the package contains two data frames with simulated values. Unlike sgrq.full, sgrq.na has some missing values.

names(sgrq.full)
##  [1] "id"       "sgrq.1"   "sgrq.2"   "sgrq.3"   "sgrq.4"   "sgrq.5"  
##  [7] "sgrq.6"   "sgrq.7"   "sgrq.8"   "sgrq.11a" "sgrq.11b" "sgrq.11c"
## [13] "sgrq.11d" "sgrq.11e" "sgrq.11f" "sgrq.11g" "sgrq.15a" "sgrq.15b"
## [19] "sgrq.15c" "sgrq.15d" "sgrq.15e" "sgrq.15f" "sgrq.15g" "sgrq.15h"
## [25] "sgrq.15i" "sgrq.9"   "sgrq.10"  "sgrq.12a" "sgrq.12b" "sgrq.12c"
## [31] "sgrq.12d" "sgrq.12e" "sgrq.12f" "sgrq.13a" "sgrq.13b" "sgrq.13c"
## [37] "sgrq.13d" "sgrq.13e" "sgrq.13f" "sgrq.13g" "sgrq.13h" "sgrq.14a"
## [43] "sgrq.14b" "sgrq.14c" "sgrq.14d" "sgrq.16a" "sgrq.16b" "sgrq.16c"
## [49] "sgrq.16d" "sgrq.16e" "sgrq.17"
head(sgrq.na[1:6])
##   id sgrq.1 sgrq.2 sgrq.3 sgrq.4 sgrq.5
## 1  1      1      1      1      1      1
## 2  2      1     NA      2      4      3
## 3  3      5     NA      1      4      3
## 4  4      2      5     NA      2      2
## 5  5      4      2      5      3     NA
## 6  6      2      1      3      2      2

Usage

When applied to a data frame, the function returns a data frame containing the SGRQ score values and an id variable.

df <- scoring_sgrq(sgrq.full, id = 'id')
head(df)
##   id sgrq.ss sgrq.as sgrq.is sgrq.ts
## 1  1    90.6    56.4    45.0    56.0
## 2  2    67.1    57.3    65.3    63.2
## 3  3    54.8    56.1    49.6    52.4
## 4  4    62.7    56.5    47.2    52.6
## 5  5    52.1    30.7    64.7    52.3
## 6  6    67.7    45.1    61.4    57.5

If no id variable is specified, a data frame containing the score values only is returned.

df <- scoring_sgrq(sgrq.na)
head(df)
##   sgrq.ss sgrq.as sgrq.is sgrq.ts
## 1    90.6    56.4    45.0    56.0
## 2    74.1    57.3    65.3    64.4
## 3    57.4    63.6    57.3    59.2
## 4    65.1    62.7    54.7    58.9
## 5    56.1      NA    72.4    64.8
## 6    67.7    45.1    65.3    59.5

Difficulties in handling missing values

In the SGRQ scoring manual it says:

The Symptoms component will tolerate a maximum of 2 missed items. The weight for the missed item is subtracted from the total possible weight for the Symptoms component (662.5) and from the Total weight (3989.4).

Since item weights depend on the actual answers given, it remains unclear (at least to me) how to determine the weight of a missing item. The weight of the item “If you have a wheeze, is it worse in the morning?”, for example, is 0.0 vs. 62.0, depending on whether the answer is “no” or “yes”. The algorithm implemented in scoring_sgrq() assigns the missing item the highest possible weight (so 62.0 rather than 0.0). To be able to subtract the weight of the missing item “from the total possible weight for the Symptoms component and from the Total weight”, one would have to check whether no more than 2 items are missing, and if so, which items are missing. Since this is laborious to implement, I decided to program the algorithm in a quick-and-dirty way.

First, I check whether no more than 2 items are missing:

  # return position of first item 
  a <- which(names(X)=="sgrq.1")   
  # return position of last item
  z <- which(names(X)=="sgrq.8")
  # calculate number of missing items
  Y$NMISS.ss <- rowSums(is.na(X[, a:z]))

Second, I replace all missing values with the corresponding highest item weight:

  # replace missing values with highest weight
  for (i in a:z) {
    for (j in 1:nrow(X)) {
      X[j, i] <- ifelse(is.na(X[j, i]), repl.val[i - 1], X[j, i])
    }
  }
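
The nested loop can also be written in a vectorized way per column (a sketch assuming the same X, a, z and repl.val objects as above):

  # vectorized alternative: replace missing item values column by column
  for (i in a:z) {
    X[is.na(X[, i]), i] <- repl.val[i - 1]
  }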

Third, I calculate the score:

 # calculate score
  Y$sgrq.ss <- rowSums(X[, vars]) / 662.5 * 100

And finally, I replace the score value by NA if more than 2 items of the Symptom score are missing:

 Y$sgrq.ss <- ifelse(Y$NMISS.ss > 2, NA, Y$sgrq.ss)

Rather than subtracting the weight of the missing item “from the total possible weight”, I “add” the highest possible item weight in place of the missing item, but only if no more than 2 items are missing.

I'm looking forward to getting some feedback on this post. I'm sure there is a better solution.

References

  1. Jones, P. W., F. H. Quirk, and C. M. Baveystock. 1991. The St George Respiratory Questionnaire. Respiratory Medicine 85 (September): 25-31. doi:10.1016/S0954-6111(06)80166-6.

  2. Jones, Paul W, Frances H Quirk, Chloe M Baveystock, and Peter Littlejohns. 1992. A Self-Complete Measure of Health Status for Chronic Airflow Limitation. Am Rev Respir Dis 145 (6): 1321-7.


Scoring the Severe Respiratory Insufficiency Questionnaire (SRI) using R

Background

The SRI is a multidimensional health questionnaire “to assess HRQL in patients with chronic respiratory failure due to various underlying diseases” [1]. Based on 49 items, seven subscales addressing the following domains are calculated:

  • Respiratory Complaints (8 items);
  • Physical Function (6 items);
  • Attendant Symptoms and Sleep (7 items);
  • Social Relationships (6 items);
  • Anxiety (5 items);
  • Psychological Well-Being (9 items);
  • Social Functioning (8 items).

Based on the subscales, a total summary scale is calculated. All scales range from 0 to 100, with higher scores indicating a better quality of life [2]. All items are scored on a 5-point Likert scale ranging from 1 (completely untrue) to 5 (always true). The majority of items need to be recoded (recoded value = 6 - raw value).
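
As an illustration of the recoding rule (plain R, made-up raw values):

sri.raw <- c(1, 3, 5, NA, 2)
sri.recoded <- 6 - sri.raw # 5 3 1 NA 4; missing values stay missing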

Scoring the SRI

Based on the SRI Scoring Manual, I have written the R-package srir for calculating the SRI scores.

Installation

The package is hosted on GitHub and may be installed using the following code:

devtools::install_github("nrkoehler/srir")
library(srir)

Functions and data

The core of srir is the function scoring_sri(). It must be applied to a data frame with the 49 SRI items and one id variable. Moreover, the package contains two data frames with simulated values. Unlike df.full, df.na has some missing values.

names(df.full)
##  [1] "id"     "sri.1"  "sri.2"  "sri.3"  "sri.4"  "sri.5"  "sri.6" 
##  [8] "sri.7"  "sri.8"  "sri.9"  "sri.10" "sri.11" "sri.12" "sri.13"
## [15] "sri.14" "sri.15" "sri.16" "sri.17" "sri.18" "sri.19" "sri.20"
## [22] "sri.21" "sri.22" "sri.23" "sri.24" "sri.25" "sri.26" "sri.27"
## [29] "sri.28" "sri.29" "sri.30" "sri.31" "sri.32" "sri.33" "sri.34"
## [36] "sri.35" "sri.36" "sri.37" "sri.38" "sri.39" "sri.40" "sri.41"
## [43] "sri.42" "sri.43" "sri.44" "sri.45" "sri.46" "sri.47" "sri.48"
## [50] "sri.49"
head(df.na[1:6])
##   id sri.1 sri.2 sri.3 sri.4 sri.5
## 1  1     1     2     1     5     4
## 2  2     5     3     5     3     2
## 3  3     2     2     4    NA     5
## 4  4    NA     3     2     4     4
## 5  5     3     4     3     5     2
## 6  6     2     2     2     3     4

Usage

When applied to a data frame, the function returns a data frame containing the SRI score values and an id variable.

df <- scoring_sri(df.full, id = 'id')
head(df)
##   id sri.rc sri.pf sri.as sri.sr sri.ax sri.wb sri.sf sri.ss
## 1  1   37.5   50.0   46.4   95.8     35   52.8   28.1   49.4
## 2  2   56.2   45.8   60.7   41.7     45   66.7   59.4   53.6
## 3  3   50.0   50.0   35.7   45.8     35   58.3   53.1   46.9
## 4  4   46.9   54.2   39.3   75.0     55   38.9   43.8   50.4
## 5  5   71.9   50.0   28.6   58.3     50   33.3   40.6   47.5
## 6  6   53.1   66.7   28.6   45.8     30   47.2   37.5   44.1

If no id variable is specified, a data frame containing the score values only is returned.

df <- scoring_sri(df.na)
head(df)
##   sri.rc sri.pf sri.as sri.sr sri.ax sri.wb sri.sf sri.ss
## 1   37.5   50.0   46.4   95.8   35.0   52.8   28.1   49.4
## 2   56.2   45.8   70.8   25.0   31.2   66.7   67.9   52.0
## 3   50.0   50.0   33.3   50.0   31.2   53.1   46.4   44.9
## 4   46.4     NA   37.5   75.0   68.8   38.9   42.9     NA
## 5   65.0   50.0   12.5   55.0   50.0   33.3   40.6   43.8
## 6   60.7   70.0   25.0   37.5   33.3   40.6   37.5   43.5

References

  1. Struik, Fransien M., Huib A.M. Kerstjens, Gerrie Bladder, Roy Sprooten, Marianne Zijnen, Jerryll Asin, Thys van der Molen, and Peter J. Wijkstra. 2013. “The Severe Respiratory Insufficiency Questionnaire Scored Best in the Assessment of Health-Related Quality of Life in Chronic Obstructive Pulmonary Disease.” Journal of Clinical Epidemiology 66 (10): 1166–74.

  2. Windisch, Wolfram, Klaus Freidel, Bernd Schucher, Hansjörg Baumann, Matthias Wiebel, Heinrich Matthys, and Franz Petermann. 2003. “The Severe Respiratory Insufficiency (SRI) Questionnaire: A Specific Measure of Health-Related Quality of Life in Patients Receiving Home Mechanical Ventilation.” Journal of Clinical Epidemiology 56 (8): 752–59.


How to fix problems after updating to R 3.4.1 on Linux Mint 18.1 / Ubuntu 16.04

The Problem

After updating R to version 3.4.1 (“Single Candle”) on Linux Mint 18.1, RStudio could neither find my installed R packages, nor was it possible to install packages into another directory.

The Solution

Fortunately, this problem was easy to solve. I just needed to tell R where to find the version 3.4 libraries again. On my 64-bit machine, I first had to open the Renviron file using xed (the Linux Mint text editor).

sudo xed /usr/lib/R/etc/Renviron 

Second, I had to enter the following code line and then save and close the file.

R_LIBS_USER=${R_LIBS_USER-'~/R/x86_64-pc-linux-gnu-library/3.4'}
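
After saving the file and restarting R, the library path can be verified from the R console:

.libPaths() # should now list ~/R/x86_64-pc-linux-gnu-library/3.4 again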

That was it. 🙂


How to calculate Odds Ratios in Case-Control Studies using R

Intro

In June 2017, I started working at the Clinical Trial Centre Leipzig at Leipzig University. Since my knowledge of statistics is rather poor, my employer offered me the opportunity to attend some seminars in Medical Biometry at the University of Heidelberg. The first seminar I attended was called “Basics of Epidemiology”. On the first day, we learned how to calculate so-called odds ratios in case-control studies using a simple pocket calculator.

In this blog post, I will show how to calculate a simple odds ratio with a 95% CI using R.

Data simulation

The data used in this blog post were simulated with the wakefield package. The following code returns a data frame with 2 binary variables (Exposition and Disease) and 1,000 cases.

library(wakefield)

mydata <- data.frame(Exposition = group(n = 1000, x = c('yes', 'no'), 
                             prob = c(0.75, 0.25)),
                     Disease = group(n = 1000, x = c('yes', 'no'), 
                             prob = c(0.75, 0.25)))
dim(mydata)
## [1] 1000    2
head(mydata)
##   Exposition Disease
## 1        yes     yes
## 2         no      no
## 3         no     yes
## 4        yes     yes
## 5        yes     yes
## 6        yes     yes

Based on this data frame, we create a contingency table showing how many patients with and without exposure did or did not develop the disease.

tab <- table(mydata$Exposition, mydata$Disease)
tab
##      
##       yes  no
##   yes 569 210
##   no  163  58

Odds Ratio Calculation

To find out whether the risk of developing the disease is significantly higher in patients with a certain exposure, we need to calculate the odds ratio and its 95% CI.

The following function will return a data frame containing these values.

# return odds ratio with 95% CI for a 2x2 table (exposure in rows, disease in columns)
f <- function(x) {
  or  <- round((x[1] * x[4]) / (x[2] * x[3]), 2)
  cil <- round(exp(log(or) - 1.96 * sqrt(1/x[1] + 1/x[2] + 1/x[3] + 1/x[4])), 2)
  ciu <- round(exp(log(or) + 1.96 * sqrt(1/x[1] + 1/x[2] + 1/x[3] + 1/x[4])), 2)
  df <- data.frame(matrix(ncol = 3, nrow = 1,
                          dimnames = list(NULL, c('CI_95_lower', 'OR', 'CI_95_upper'))))
  df[1, ] <- c(cil, or, ciu)
  df
}

Now, we can deploy the function on our table tab.

df.or <- f(tab)
knitr::kable(df.or, align = 'c')
CI_95_lower    OR    CI_95_upper
-----------   ----   -----------
   0.68       0.96      1.35
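
As a plausibility check, base R's fisher.test() also returns an odds ratio with a 95% CI for a 2x2 table; note that it uses the conditional maximum likelihood estimate, so the values may differ slightly from the hand-calculated ones.

fisher.test(tab)$estimate # conditional MLE odds ratio
fisher.test(tab)$conf.int # 95% confidence interval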

As the results indicate, exposed patients have no higher risk of developing the disease than non-exposed patients, which is what we would expect, since Exposition and Disease were simulated independently of each other.


How to parse Citavi files using R

Intro

In academic writing, the use of reference management software has become essential. A couple of years ago, I started using the free open-source software Zotero. However, most of my workmates at Leipzig University work with Citavi, a commercial program that is widely used at German, Austrian, and Swiss universities. The current version, Citavi 5, was released in April 2015.


In this blog post, I show how to import Citavi files into R.

Packages required

Citavi organizes references in SQLite databases. Although Citavi files end with .ctv5 rather than .sqlite, they may be parsed as SQLite files. To reproduce the code in this blog post, the following R packages are required.

library(RSQLite)
library(DBI)
library(tidyverse)
library(stringr)

Data

Import

A connection to the database myrefs.ctv5 may be established using the dbConnect() function of the DBI package and the SQLite() driver from the RSQLite package.

# connect to Citavi file
con <- DBI::dbConnect(RSQLite::SQLite(), 'myrefs.ctv5')

A list of all tables may be returned using the dbListTables() function of the DBI package. Our database contains 38 tables.

# get a list of all tables
dbListTables(con) 

##  [1] "Annotation"              "Category"
##  [3] "CategoryCategory"        "Changeset"
##  [5] "Collection"              "DBVersion"
##  [7] "EntityLink"              "Group"
##  [9] "ImportGroup"             "ImportGroupReference"
## [11] "Keyword"                 "KnowledgeItem"
## [13] "KnowledgeItemCategory"   "KnowledgeItemCollection"
## [15] "KnowledgeItemGroup"      "KnowledgeItemKeyword"
## [17] "Library"                 "Location"
## [19] "Periodical"              "Person"
## [21] "ProjectSettings"         "ProjectUserSettings"
## [23] "Publisher"               "Reference"
## [25] "ReferenceAuthor"         "ReferenceCategory"
## [27] "ReferenceCollaborator"   "ReferenceCollection"
## [29] "ReferenceEditor"         "ReferenceGroup"
## [31] "ReferenceKeyword"        "ReferenceOrganization"
## [33] "ReferenceOthersInvolved" "ReferencePublisher"
## [35] "ReferenceReference"      "SeriesTitle"
## [37] "SeriesTitleEditor"       "TaskItem"

Creating data frames

To read the tables of the database, we use the dbGetQuery() function of the DBI package. Each table contains a number of variables. If we want to save all of a table's variables into a data frame, we select them with the asterisk (*).

df.author <- dbGetQuery(con,'select * from Person')

If we are interested in only a few of the variables, we specify their names separated by commas.

df.author <- dbGetQuery(con,'select ID, FirstName, LastName, Sex from Person')
df.keyword <- dbGetQuery(con,'select ID, Name from Keyword')
df.refs <- dbGetQuery(con,'select ID, Title, Year, Abstract, CreatedOn, ISBN, PageCount, PlaceOfPublication, ReferenceType   from Reference')

Data wrangling

In the next step, we try to join our data frames using the dplyr package.

mydata <- df.refs %>%
  left_join(df.author, by = 'ID') %>%
    left_join(df.keyword, by = 'ID') 
## Error in eval(expr, envir, enclos): Can't join on 'ID' x 'ID' because of incompatible types (list / list)

However, R returns an error message disclosing that the data frames cannot be joined by their ID variable because of incompatible types.

When we take a closer look at the ID variables, we see that they are stored as lists; df.author$ID, for example, contains 1603 elements of type raw, one per row. Each list element is a raw vector of 16 bytes (displayed as hexadecimal pairs).

typeof(df.author$ID)
## [1] "list"
str(df.author$ID[[1]])
##  raw [1:16] 41 5b 18 51 ...
length(df.author$ID)
## [1] 1603

To be able to join our data frames, we need to convert the ID variables from lists to character vectors. Furthermore, we collapse the 16 bytes of each ID into a single string, separated by hyphens.

# df.author
for (i in 1:nrow(df.author)){
  df.author$ID[[i]] <- as.character(df.author$ID[[i]])
  df.author$ID[[i]] <- str_c(df.author$ID[[i]], sep = "", collapse = "-")
}
df.author$ID <- unlist(df.author$ID)
# df.keyword
for (i in 1:nrow(df.keyword)){
  df.keyword$ID[[i]] <- as.character(df.keyword$ID[[i]])
  df.keyword$ID[[i]] <- str_c(df.keyword$ID[[i]], sep = "", collapse = "-")
}
df.keyword$ID <- unlist(df.keyword$ID)
# df.refs
for (i in 1:nrow(df.refs)){
  df.refs$ID[[i]] <- as.character(df.refs$ID[[i]])
  df.refs$ID[[i]] <- str_c(df.refs$ID[[i]], sep = "", collapse = "-")
}
df.refs$ID <- unlist(df.refs$ID)
typeof(df.refs$ID)
## [1] "character"
head(df.refs$ID)
## [1] "34-26-a7-db-e2-79-ac-4c-ad-21-45-91-37-3f-e9-b0"
## [2] "75-e6-c1-fd-65-0a-82-48-a8-c8-e9-80-2a-79-db-2c"
## [3] "0b-8f-73-86-82-0a-c8-48-84-f7-d2-7a-04-df-12-65"
## [4] "7a-79-2b-1e-4a-ef-ae-40-ae-06-d5-bd-5d-e9-a7-cb"
## [5] "12-3c-0e-70-a2-2a-f9-48-ad-e4-b7-7e-85-63-97-96"
## [6] "bc-ae-06-78-89-48-df-47-88-b6-5f-92-20-9b-4c-7c"
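
The three loops above do the same job; they could be collapsed into a small helper function (a sketch using base R only; raw_id_to_chr is a name I made up):

# convert a list of raw ID vectors into hyphen-separated character strings
raw_id_to_chr <- function(id) {
  vapply(id, function(x) paste(as.character(x), collapse = "-"), character(1))
}
df.author$ID <- raw_id_to_chr(df.author$ID)
df.keyword$ID <- raw_id_to_chr(df.keyword$ID)
df.refs$ID <- raw_id_to_chr(df.refs$ID)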

Joining the data frames

Finally, our data frames may be joined by their ID variables.

mydata <- df.refs %>%
  left_join(df.author, by = 'ID') %>%
    left_join(df.keyword, by = 'ID') 

Results

Our final data frame contains 1066 cases and the following 13 variables:

colnames(mydata)
##  [1] "ID"                 "Title"              "Year"              
##  [4] "Abstract"           "CreatedOn"          "ISBN"              
##  [7] "PageCount"          "PlaceOfPublication" "ReferenceType"     
## [10] "FirstName"          "LastName"           "Sex"               
## [13] "Name"