Intro
In academic writing, the use of reference management software has become essential. A couple of years ago, I started using the free open-source software Zotero. However, most of my workmates at Leipzig University work with Citavi, a commercial software which is widely used at German, Austrian, and Swiss universities. The current version of Citavi – Citavi 5 – was released in April 2015.
In this blog post, I show how to import Citavi files into R
.
Packages required
Citavi organizes references in SQLite databases. Although Citavi files end with .ctv5
rather than .sqlite
, they may be parsed as sqlite
files. For reproducing the code of this blog post, the following R
packages are required.
library(RSQLite) library(DBI) library(tidyverse) library(stringr)
Data
Import
A connection to the database myrefs.ctv5
may be established using the dbConnect()
function of the DBI
and the SQLite()
function of the RSQLite
package.
# connect to Citavi file con <- DBI::dbConnect(RSQLite::SQLite(), 'myrefs.ctv5')
A list of all tables may be returned using the dbListTables()
function of the DBI
package. Out database contains 38 tables.
# get a list of all tables dbListTables(con)
[1] “Annotation” “Category”
[3] “CategoryCategory” “Changeset”
[5] “Collection” “DBVersion”
[7] “EntityLink” “Group”
[9] “ImportGroup” “ImportGroupReference”
[11] “Keyword” “KnowledgeItem”
[13] “KnowledgeItemCategory” “KnowledgeItemCollection”
[15] “KnowledgeItemGroup” “KnowledgeItemKeyword”
[17] “Library” “Location”
[19] “Periodical” “Person”
[21] “ProjectSettings” “ProjectUserSettings”
[23] “Publisher” “Reference”
[25] “ReferenceAuthor” “ReferenceCategory”
[27] “ReferenceCollaborator” “ReferenceCollection”
[29] “ReferenceEditor” “ReferenceGroup”
[31] “ReferenceKeyword” “ReferenceOrganization”
[33] “ReferenceOthersInvolved” “ReferencePublisher”
[35] “ReferenceReference” “SeriesTitle”
[37] “SeriesTitleEditor” “TaskItem”
Creating data frames
For reading the tables of the database, we need the dbGetQuery()
function of the DBI
package. Each table contains a number of variables. In case we want to save all variables into a data frame, we select them using the asterisk.
df.author <- dbGetQuery(con,'select * from Person')
In case we are only interested in a couple of variables, we need to specify their names separated by commas.
df.author <- dbGetQuery(con,'select ID, FirstName, LastName, Sex from Person') df.keyword <- dbGetQuery(con,'select ID, Name from Keyword') df.refs <- dbGetQuery(con,'select ID, Title, Year, Abstract, CreatedOn, ISBN, PageCount, PlaceOfPublication, ReferenceType from Reference')
Data wrangling
In the next step, we try to join our data frames using the dplyr
package.
mydata <- df.refs %>% left_join(df.author, by = 'ID') %>% left_join(df.keyword, by = 'ID')
## Error in eval(expr, envir, enclos): Can't join on 'ID' x 'ID' because of incompatible types (list / list)
However, R
returns an error message disclosing that the data frames cannot be joined by their ID
variable because of incompatible types.
When we take a closer look at the ID
variables we see that they are organized as list containing 1603 elements of type raw. Moreover, each of the list elements consists of 16 alphanumerical elements.
typeof(df.author$ID)
## [1] "list"
str(df.author$ID[[1]])
## raw [1:16] 41 5b 18 51 ...
length(df.author$ID)
## [1] 1603
In order to be able to join our data frames we need to convert the type of the ID
variables from list
to character
. Furthermore, we collapse the 16 list elements into single strings separated by hyphens.
# df.author for (i in 1:nrow(df.author)){ df.author$ID[[i]] <- as.character(df.author$ID[[i]]) df.author$ID[[i]] <- str_c(df.author$ID[[i]], sep = "", collapse = "-") } df.author$ID <- unlist(df.author$ID) # df.keyword for (i in 1:nrow(df.keyword)){ df.keyword$ID[[i]] <- as.character(df.keyword$ID[[i]]) df.keyword$ID[[i]] <- str_c(df.keyword$ID[[i]], sep = "", collapse = "-") } df.keyword$ID <- unlist(df.keyword$ID) # df.refs for (i in 1:nrow(df.refs)){ df.refs$ID[[i]] <- as.character(df.refs$ID[[i]]) df.refs$ID[[i]] <- str_c(df.refs$ID[[i]], sep = "", collapse = "-") } df.refs$ID <- unlist(df.refs$ID)
typeof(df.refs$ID)
## [1] "character"
head(df.refs$ID)
## [1] "34-26-a7-db-e2-79-ac-4c-ad-21-45-91-37-3f-e9-b0" ## [2] "75-e6-c1-fd-65-0a-82-48-a8-c8-e9-80-2a-79-db-2c" ## [3] "0b-8f-73-86-82-0a-c8-48-84-f7-d2-7a-04-df-12-65" ## [4] "7a-79-2b-1e-4a-ef-ae-40-ae-06-d5-bd-5d-e9-a7-cb" ## [5] "12-3c-0e-70-a2-2a-f9-48-ad-e4-b7-7e-85-63-97-96" ## [6] "bc-ae-06-78-89-48-df-47-88-b6-5f-92-20-9b-4c-7c"
Joining the data frames
Finally, our data frames may be joined by their ID
variables.
mydata <- df.refs %>% left_join(df.author, by = 'ID') %>% left_join(df.keyword, by = 'ID')
Results
Our final data frame contains 1066 cases and 13 variables, that is:
colnames(mydata)
## [1] "ID" "Title" "Year" ## [4] "Abstract" "CreatedOn" "ISBN" ## [7] "PageCount" "PlaceOfPublication" "ReferenceType" ## [10] "FirstName" "LastName" "Sex" ## [13] "Name"
One thought on “How to parse Citavi files using R”