How to parse Citavi files using R

Intro

In academic writing, the use of reference management software has become essential. A couple of years ago, I started using the free open-source software Zotero. However, most of my workmates at Leipzig University work with Citavi, a commercial software which is widely used at German, Austrian, and Swiss universities. The current version of Citavi – Citavi 5 – was released in April 2015.

plot of chunk pic

In this blog post, I show how to import Citavi files into R.

Packages required

Citavi organizes references in SQLite databases. Although Citavi files end with .ctv5 rather than .sqlite, they may be parsed as sqlite files. For reproducing the code of this blog post, the following R packages are required.

library(RSQLite)
library(DBI)
library(tidyverse)
library(stringr)

Data

Import

A connection to the database myrefs.ctv5 may be established using the dbConnect() function of the DBI and the SQLite() function of the RSQLite package.

# connect to Citavi file
con <- DBI::dbConnect(RSQLite::SQLite(), 'myrefs.ctv5')

A list of all tables may be returned using the dbListTables() function of the DBI package. Out database contains 38 tables.

# get a list of all tables
dbListTables(con)

[1] “Annotation” “Category”
[3] “CategoryCategory” “Changeset”
[5] “Collection” “DBVersion”
[7] “EntityLink” “Group”
[9] “ImportGroup” “ImportGroupReference”
[11] “Keyword” “KnowledgeItem”
[13] “KnowledgeItemCategory” “KnowledgeItemCollection”
[15] “KnowledgeItemGroup” “KnowledgeItemKeyword”
[17] “Library” “Location”
[19] “Periodical” “Person”
[21] “ProjectSettings” “ProjectUserSettings”
[23] “Publisher” “Reference”
[25] “ReferenceAuthor” “ReferenceCategory”
[27] “ReferenceCollaborator” “ReferenceCollection”
[29] “ReferenceEditor” “ReferenceGroup”
[31] “ReferenceKeyword” “ReferenceOrganization”
[33] “ReferenceOthersInvolved” “ReferencePublisher”
[35] “ReferenceReference” “SeriesTitle”
[37] “SeriesTitleEditor” “TaskItem”

Creating data frames

For reading the tables of the database, we need the dbGetQuery() function of the DBI package. Each table contains a number of variables. In case we want to save all variables into a data frame, we select them using the asterisk.

df.author <- dbGetQuery(con,'select * from Person')

In case we are only interested in a couple of variables, we need to specify their names separated by commas.

df.author <- dbGetQuery(con,'select ID, FirstName, LastName, Sex from Person')
df.keyword <- dbGetQuery(con,'select ID, Name from Keyword')
df.refs <- dbGetQuery(con,'select ID, Title, Year, Abstract, CreatedOn, ISBN, PageCount, PlaceOfPublication, ReferenceType   from Reference')

Data wrangling

In the next step, we try to join our data frames using the dplyr package.

mydata <- df.refs %>%
  left_join(df.author, by = 'ID') %>%
    left_join(df.keyword, by = 'ID')

## Error in eval(expr, envir, enclos): Can't join on 'ID' x 'ID' because of incompatible types (list / list)

However, R returns an error message disclosing that the data frames cannot be joined by their ID variable because of incompatible types.

When we take a closer look at the ID variables we see that they are organized as list containing 1603 elements of type raw. Moreover, each of the list elements consists of 16 alphanumerical elements.

typeof(df.author$ID)

## [1] "list"

str(df.author$ID[[1]])

##  raw [1:16] 41 5b 18 51 ...

length(df.author$ID)

## [1] 1603

In order to be able to join our data frames we need to convert the type of the ID variables from list to character. Furthermore, we collapse the 16 list elements into single strings separated by hyphens.

# df.author
for (i in 1:nrow(df.author)){
  df.author$ID[[i]] <- as.character(df.author$ID[[i]])
  df.author$ID[[i]] <- str_c(df.author$ID[[i]], sep = "", collapse = "-")
}
df.author$ID <- unlist(df.author$ID)
# df.keyword
for (i in 1:nrow(df.keyword)){
  df.keyword$ID[[i]] <- as.character(df.keyword$ID[[i]])
  df.keyword$ID[[i]] <- str_c(df.keyword$ID[[i]], sep = "", collapse = "-")
}
df.keyword$ID <- unlist(df.keyword$ID)
# df.refs
for (i in 1:nrow(df.refs)){
  df.refs$ID[[i]] <- as.character(df.refs$ID[[i]])
  df.refs$ID[[i]] <- str_c(df.refs$ID[[i]], sep = "", collapse = "-")
}
df.refs$ID <- unlist(df.refs$ID)

typeof(df.refs$ID)

## [1] "character"

head(df.refs$ID)

## [1] "34-26-a7-db-e2-79-ac-4c-ad-21-45-91-37-3f-e9-b0"
## [2] "75-e6-c1-fd-65-0a-82-48-a8-c8-e9-80-2a-79-db-2c"
## [3] "0b-8f-73-86-82-0a-c8-48-84-f7-d2-7a-04-df-12-65"
## [4] "7a-79-2b-1e-4a-ef-ae-40-ae-06-d5-bd-5d-e9-a7-cb"
## [5] "12-3c-0e-70-a2-2a-f9-48-ad-e4-b7-7e-85-63-97-96"
## [6] "bc-ae-06-78-89-48-df-47-88-b6-5f-92-20-9b-4c-7c"

Joining the data frames

Finally, our data frames may be joined by their ID variables.

mydata <- df.refs %>%
  left_join(df.author, by = 'ID') %>%
    left_join(df.keyword, by = 'ID')

Results

Our final data frame contains 1066 cases and 13 variables, that is:

colnames(mydata)

##  [1] "ID"                 "Title"              "Year"              
##  [4] "Abstract"           "CreatedOn"          "ISBN"              
##  [7] "PageCount"          "PlaceOfPublication" "ReferenceType"     
## [10] "FirstName"          "LastName"           "Sex"               
## [13] "Name"

Author: norbert

Biometrician at Clinical Trial Centre, Leipzig University (GER), with degrees in sociology (MA) and public health (MPH). View all posts by norbert

How to parse Citavi files using R

Intro

Packages required

Data

Import

Creating data frames

Data wrangling

Joining the data frames

Results

Author: norbert

One thought on “How to parse Citavi files using R”

Leave a comment Cancel reply

Intro

Packages required

Data

Import

Creating data frames

Data wrangling

Joining the data frames

Results

Share this:

Related

Author: norbert

One thought on “How to parse Citavi files using R”

Leave a comment Cancel reply