How to parse Evernote export files (.enex) using R

Evernote is a “cross-platform […] app designed for note taking, organizing, and archiving” (Wikipedia). All notes can be tagged and exported. I'm using Evernote, above all, to save and tag interesting blog posts related to R.

In this blog post, I show how to import and parse an exported Evernote file with R.

Exporting the data from Evernote

As a first step, I exported all of my notes tagged with 'R':

  • Open the Evernote client;
  • Select all notes to be exported;
  • Go to 'File' > 'Export';
  • Select option 'Export as a file in ENEX format (.enex)' from the format options box;
  • Name the file 'Evernote.enex' and save it into your RStudio project folder.

Importing the data into R

Since an '.enex' file is an XML document, the 'Evernote.enex' file can be imported using the XML package. Because of its nested structure, the imported file cannot be converted into a data frame right away. Instead, we first convert it into a list (using the XML::xmlToList function).

library(XML)
xmlfile <- xmlParse("Evernote.enex")
xmllist <- xmlToList(xmlfile, addAttributes = FALSE)
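To get a feeling for the structure of such a list, here is a minimal sketch that parses a tiny, hand-written ENEX-like string instead of a real export. The note content, the export-date value and the object names enex_text and toy_list are made up for illustration; the element names follow what is used later in this post.

```r
library(XML)

# Hypothetical miniature export with a single note (not real data)
enex_text <- paste0(
  '<en-export export-date="20160107T120000Z">',
  '<note>',
  '<title>Web scraping in R</title>',
  '<created>20160105T101500Z</created>',
  '<tag>webscraping</tag>',
  '<tag>R</tag>',
  '<note-attributes><source-url>http://example.com/</source-url></note-attributes>',
  '</note>',
  '</en-export>'
)

# asText = TRUE tells xmlParse that the input is XML content, not a file name
toy_doc  <- xmlParse(enex_text, asText = TRUE)
toy_list <- xmlToList(toy_doc, addAttributes = FALSE)

# One list element per note; repeated <tag> elements become repeated names
names(toy_list)
names(toy_list[[1]])
str(toy_list[[1]])
```

With addAttributes = FALSE the attributes of the root element are not appended to the list, so (as far as I can tell) its length equals the number of notes.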

In the following section, I show how to create a dataframe based on the xmllist object.

Building a data frame

First, we generate an empty data frame. The number of rows (262) is determined by the number of elements in the xmllist object and the number of columns is set to zero.

mydata <- data.frame(matrix(NA, ncol = 0, nrow = length(xmllist)))
dim(mydata)

[1] 262 0

Second, we read the note titles and save them into a variable called title, which becomes part of our data frame mydata.

for (i in 1:length(xmllist)){
  mydata$title[i] <- unlist(xmllist[[i]]['title'])
}

head(mydata$title, 10)

 [1] "Network visualization in R with the igraph package | Rules of Reason"
 [2] "More debate analysis with R"
 [3] "Analyzing networks of characters in 'Love Actually' – Variance Explained"
 [4] "Web scraping in R"
 [5] "Color Quantization in R"
 [6] "Zellingenach: A visual exploration of the spatial patterns in the endings of German town and village names in R | rud.is"
 [7] "Waterfall plots – what and how?"
 [8] "Sentiment Analysis on Donald Trump using R and Tableau | DataScience+"
 [9] "Version 0.9 of timeline on CRAN"
[10] "Date Formats in R"
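The same extraction can also be written without an explicit loop. The following is only a sketch on a small hand-built list that mimics the structure of xmllist; the object name notes and the timestamps are assumptions for illustration.

```r
# Hypothetical stand-in for xmllist: one named list per note
notes <- list(
  note = list(title = "Web scraping in R",
              created = "20160105T101500Z",
              tag = "webscraping", tag = "R"),
  note = list(title = "Color Quantization in R",
              created = "20160104T090000Z",
              tag = "R", tag = "image processing")
)

# vapply returns a plain character vector with one title per note
titles <- vapply(notes, function(n) n[["title"]], character(1),
                 USE.NAMES = FALSE)

# Build the data frame directly from the vector
toy_data <- data.frame(title = titles, stringsAsFactors = FALSE)
toy_data
```

vapply checks that every note yields exactly one character value, so a note without a title would raise an error instead of silently shifting rows.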

In the next step, we obtain the dates on which the notes were created. To get a variable of class Date, the variable 'created' must be reformatted. Using the stringr package, we extract the year, month and day and save the result back into the same variable.

for (i in 1:nrow(mydata)){
  mydata$created[i] <- xmllist[[i]]['created']
}


mydata$created <- as.Date(paste0(stringr::str_sub(mydata$created, 1, 4), 
                                 '-', 
                                 stringr::str_sub(mydata$created, 5, 6), 
                                 '-',
                                 stringr::str_sub(mydata$created, 7, 8)))

head(mydata$created, 5)

[1] "2016-01-06" "2016-01-06" "2016-01-05" "2016-01-05" "2016-01-04"
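As a base R alternative to pasting the pieces together, as.Date can parse the first eight characters directly when given a format string. This is a sketch on made-up timestamps; I assume here that the 'created' values look like 20160106T101500Z, of which only the leading YYYYMMDD part is used, exactly as in the code above.

```r
# Hypothetical 'created' values (assumed layout: YYYYMMDD followed by a time part)
created_raw <- c("20160106T101500Z", "20160105T083000Z", "20160104T190000Z")

# Keep the first eight characters and tell as.Date how to read them
created <- as.Date(substr(created_raw, 1, 8), format = "%Y%m%d")

class(created)
format(created)
```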

Furthermore, the source URLs of the notes, which are stored in the 'note-attributes' element, can be extracted like this:

for (i in 1:nrow(mydata)){
  mydata$www[i] <- xmllist[[i]]['note-attributes']
}

mydata$www <- unlist(qdapRegex::ex_url(mydata$www,
                        trim=TRUE,
                        clean=TRUE,
                        extract=TRUE))

mydata$www <- stringr::str_sub(mydata$www, 1, nchar(mydata$www)-2)

head(mydata$www)

[1] "https://rulesofreason.wordpress.com/2012/11/05/network-visualization-in-r-with-the-igraph-package/"
[2] "http://www.r-bloggers.com/more-debate-analysis-with-r/"
[3] "http://varianceexplained.org/r/love-actually-network/"
[4] "http://cpsievert.github.io/slides/web-scraping/#1"
[5] "http://blog.ryanwalker.us/2016/01/color-quantization-in-r.html"
[6] "http://rud.is/b/2016/01/03/zellingenach-a-visual-exploration-of-the-spatial-patterns-in-the-endings-of-german-town-and-village-names-in-r/"
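If the URL is stored in its own child element, it can be read by name instead of being fished out with a regular expression. The sketch below assumes that 'note-attributes' contains a child called 'source-url'; this name and the helper get_url are assumptions for illustration, so check names(xmllist[[1]][['note-attributes']]) on your own export first.

```r
# Hypothetical stand-in for xmllist; the second note has no URL
notes <- list(
  note = list(title = "More debate analysis with R",
              `note-attributes` = list(
                `source-url` = "http://www.r-bloggers.com/more-debate-analysis-with-r/")),
  note = list(title = "A note typed by hand")
)

# Return the URL of a note, or NA if the note does not have one
get_url <- function(n) {
  url <- n[["note-attributes"]][["source-url"]]
  if (is.null(url)) NA_character_ else url
}

www <- vapply(notes, get_url, character(1), USE.NAMES = FALSE)
www
```

Notes without a URL end up as NA, which keeps the vector aligned with the rows of the data frame.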

Finally, we want to read the tags and save them into a variable. Since the number of tags differs between the notes, we have to assess the number of tags for each note:

# number of tags
for (i in 1:nrow(mydata)){
  mydata$num.tag[i] <- length(which(names(xmllist[[i]])=="tag"))
}

head(mydata$num.tag, 20)

[1] 2 2 3 2 2 3 2 5 2 3 3 2 2 3 3 2 2 3 3 3

Since we want to save each tag into a single variable, we need to know the maximum number of tags.

tag.num <- max(mydata$num.tag)
tag.num

[1] 5
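Counting the tags boils down to counting how often the name "tag" occurs in each note. Here is a loop-free sketch on a hand-built list (the object notes is a hypothetical stand-in for xmllist):

```r
# Hypothetical stand-in for xmllist with two, three and zero tags
notes <- list(
  note = list(title = "Web scraping in R", tag = "webscraping", tag = "R"),
  note = list(title = "Analyzing networks of characters in 'Love Actually'",
              tag = "network analysis", tag = "text mining", tag = "R"),
  note = list(title = "An untagged note")
)

# sum() over a logical vector counts the TRUE values
num.tag <- vapply(notes, function(n) sum(names(n) == "tag"), integer(1),
                  USE.NAMES = FALSE)
num.tag

# Maximum number of tags attached to any single note
tag.num <- max(num.tag)
tag.num
```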

The next code snippet adds three variables to our data frame: the positions of the first and of the last tag (both numeric) and a variable of class list containing the positions of all tags. Note that this assumes the tags of a note are stored next to each other.

# position of first tag
for (i in 1:nrow(mydata)){
  mydata$pos.1[i] <- which(names(xmllist[[i]])=="tag")[1]
}
# position of last tag
mydata$pos.2 <- mydata$pos.1 + mydata$num.tag - 1
# position of tags
for (i in 1:nrow(mydata)){
  mydata$pos.all[i] <- list(c(mydata$pos.1[i]:mydata$pos.2[i]))
}
# remove pos.1 and pos.2
mydata$pos.1 <- NULL
mydata$pos.2 <- NULL

Since we don't need the variables pos.1 and pos.2 for further processing, we remove them from our dataframe.

In the next step, we create 5 empty variables that will later on contain the tag names.

# create 5 new columns
num.col <- ncol(mydata) 
for (i in (ncol(mydata) + 1):(ncol(mydata) + tag.num)){
  mydata[, i] <- NA
  colnames(mydata)[i] <- paste0('tag.', i - num.col)
}
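The empty columns can also be added in a single assignment by handing the data frame a vector of new column names. This is a sketch on a toy data frame; the names toy and tag.cols are made up for illustration.

```r
# Toy data frame standing in for mydata
toy <- data.frame(title = c("Web scraping in R", "Date Formats in R"),
                  stringsAsFactors = FALSE)
tag.num <- 3

# Build the column names tag.1, tag.2, ... and create all columns at once
tag.cols <- paste0("tag.", seq_len(tag.num))
toy[tag.cols] <- NA_character_

names(toy)
```

Using NA_character_ rather than a plain NA makes the new columns character columns from the start.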

The following code snippet is intended to write the tag names into the variables tag.1 to tag.5.

for (j in (num.col + 1):ncol(mydata)){
  for (i in 1:nrow(mydata)){
    mydata[i, j]  <- xmllist[[i]][mydata$pos.all[[i]][j - num.col]][[1]]
  }}

However, evaluating the code returns the following error message:

Error in `[<-.data.frame`(`*tmp*`, i, j, value = NULL) :
  replacement has length zero

Has anybody got an idea how to get the preceding code snippet working? I'd appreciate every piece of advice.

As a workaround, I decided to write one loop for each of the five variables. This is definitely not best practice, but it works.

# 1st tag
for (i in 1:nrow(mydata)){
  mydata$tag.1[i]  <- xmllist[[i]][mydata$pos.all[[i]][1]][1]
}
# 2nd tag
for (i in 1:nrow(mydata)){
  mydata$tag.2[i]  <- xmllist[[i]][mydata$pos.all[[i]][2]][1]
}
# 3rd tag
for (i in 1:nrow(mydata)){
  mydata$tag.3[i]  <- xmllist[[i]][mydata$pos.all[[i]][3]][1]
}
# 4th tag
for (i in 1:nrow(mydata)){
  mydata$tag.4[i]  <- xmllist[[i]][mydata$pos.all[[i]][4]][1]
}
# 5th tag
for (i in 1:nrow(mydata)){
  mydata$tag.5[i]  <- xmllist[[i]][mydata$pos.all[[i]][5]][1]
}
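A likely cause of the error message above, offered as a sketch rather than a definitive diagnosis: for a note with fewer than five tags, the requested position is NA, indexing the note with NA and then taking [[1]] yields NULL, and NULL cannot be written into a single cell of a data frame. The five loops above use [1] instead of [[1]], which returns a one-element list, and that is presumably why they run. The helper get_tag and the toy list below are assumptions for illustration.

```r
# Hypothetical stand-in for two elements of xmllist
notes <- list(
  note = list(title = "More debate analysis with R",
              tag = "text mining", tag = "R"),
  note = list(title = "Analyzing networks of characters in 'Love Actually'",
              tag = "network analysis", tag = "text mining", tag = "R")
)

# The first note has two tags, so there is no third position
pos <- which(names(notes[[1]]) == "tag")
missing_tag <- notes[[1]][pos[3]][[1]]  # pos[3] is NA, so the result is NULL
is.null(missing_tag)

# Return the k-th tag of a note, or NA if the note has fewer than k tags
get_tag <- function(note, k) {
  pos <- which(names(note) == "tag")
  if (k > length(pos)) NA_character_ else note[[pos[k]]]
}

# One row per note, one column per tag slot
tag.num <- 3
tag_matrix <- t(vapply(notes, function(n) {
  vapply(seq_len(tag.num), function(k) get_tag(n, k), character(1))
}, character(tag.num), USE.NAMES = FALSE))
colnames(tag_matrix) <- paste0("tag.", seq_len(tag.num))
tag_matrix
```

Because get_tag already returns NA for missing tags, no NULL values end up in the result.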

In the following step, we define a function (source) replacing NULL by NA and apply this function to each of the five tag variables:

# define function
nullToNA <- function(x) {
  x[sapply(x, is.null)] <- NA
  return(x)
}

# apply function
for (i in (num.col + 1):ncol(mydata)){
  for (j in 1:nrow(mydata)){
    mydata[j, i] <- nullToNA(mydata[j, i])
  }
}

Finally, we paste the values of the five tag variables into a single variable named tags. To do this, we use the paste2 function of the qdap package. Since we don't need the variables tag.1 to tag.5 for further processing, we remove them from the dataframe using the select function of the dplyr package.

mydata$tags <- qdap::paste2(mydata[(num.col+1):ncol(mydata)], 
                            sep = ", ", 
                            handle.na = TRUE, 
                            trim = TRUE)

mydata <- dplyr::select(mydata, -starts_with('tag.'))
mydata$pos.all <- NULL
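If the intermediate variables tag.1 to tag.5 are not needed, the tag names of each note can be collapsed straight from the list, which also avoids NA entries inside the tags string. This is a base R sketch on a hand-built list; the helper collapse_tags and the object notes are made up for illustration.

```r
# Hypothetical stand-in for xmllist
notes <- list(
  note = list(title = "Network visualization in R with the igraph package",
              tag = "network analysis", tag = "R"),
  note = list(title = "Analyzing networks of characters in 'Love Actually'",
              tag = "network analysis", tag = "text mining", tag = "R"),
  note = list(title = "An untagged note")
)

# Paste all tags of one note into a single string; NA if there are none
collapse_tags <- function(n) {
  tags <- unlist(n[names(n) == "tag"], use.names = FALSE)
  if (length(tags) == 0) NA_character_ else paste(tags, collapse = ", ")
}

tags <- vapply(notes, collapse_tags, character(1), USE.NAMES = FALSE)
tags
```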

The final dataframe consists of the following variables:

  • title containing the titles of the notes;
  • created containing the dates the notes were created;
  • www containing the notes' http addresses;
  • num.tag containing the number of tags for each note;
  • tags containing the tag names.

The following table shows what our final data frame looks like.

knitr::kable(head(mydata), align = c('l', 'c', 'l', 'c', 'c'))

|title | created |www | num.tag | tags |
|:-----|:-------:|:---|:-------:|:----:|
|Network visualization in R with the igraph package \| Rules of Reason | 2016-01-06 |https://rulesofreason.wordpress.com/2012/11/05/network-visualization-in-r-with-the-igraph-package/ | 2 | network analysis, R, NA, NA, NA |
|More debate analysis with R | 2016-01-06 |http://www.r-bloggers.com/more-debate-analysis-with-r/ | 2 | text mining, R, NA, NA, NA |
|Analyzing networks of characters in 'Love Actually' – Variance Explained | 2016-01-05 |http://varianceexplained.org/r/love-actually-network/ | 3 | network analysis, text mining, R, NA, NA |
|Web scraping in R | 2016-01-05 |http://cpsievert.github.io/slides/web-scraping/#1 | 2 | webscraping, R, NA, NA, NA |
|Color Quantization in R | 2016-01-04 |http://blog.ryanwalker.us/2016/01/color-quantization-in-r.html | 2 | R, image processing, NA, NA, NA |
|Zellingenach: A visual exploration of the spatial patterns in the endings of German town and village names in R \| rud.is | 2016-01-04 |http://rud.is/b/2016/01/03/zellingenach-a-visual-exploration-of-the-spatial-patterns-in-the-endings-of-german-town-and-village-names-in-r/ | 3 | text mining, geo, R, NA, NA |

The packages used in this blog post can be loaded/installed using the following code:

pacman::p_load(XML, knitr, dplyr, qdap, stringr)

The xmllist object may be downloaded as an .RData file under the following link.

In one of my next blog posts, I will show how to analyse the tags.

About norbert

I am a postdoc at the Department of Medical Psychology and Sociology, Leipzig University (GER), with degrees in sociology (MA) and public health (MPH).