Evernote is a “cross-platform […] app designed for note taking, organizing, and archiving” (Wikipedia). All notes can be tagged and exported. I use Evernote mainly to save and tag interesting blog posts related to R.
In this blog post, I show how to import and parse an exported Evernote file with R.
Exporting the data from Evernote
As a first step, I exported all of my notes tagged with 'R':
- Open the Evernote client;
- Select all notes to be exported;
- Go to 'File' > 'Export';
- Select option 'Export as a file in ENEX format (.enex)' from the format options box;
- Name the file 'Evernote.enex' and save it into your RStudio project folder.
Importing the data into R
Since an '.enex' file is an XML file, 'Evernote.enex' can be imported using the XML package. Because of its structure, the imported file cannot be transformed into a dataframe right away. Instead, we first need to transform it into a list (using the XML::xmlToList function).
library(XML)
xmlfile <- xmlParse("Evernote.enex")
xmllist <- xmlToList(xmlfile, addAttributes = FALSE)
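To get a feeling for what xmlToList returns, here is a minimal sketch on a tiny, made-up export. The XML string below is a hypothetical stand-in for a real '.enex' file (a real export carries more elements and attributes), and the names toy_enex, toy_doc and toy_list are mine, not part of the original workflow.

```r
library(XML)

# Hypothetical two-note export, written without whitespace between tags
toy_enex <- paste0(
  "<en-export>",
  "<note><title>First note</title><created>20160106T101500Z</created>",
  "<tag>R</tag><tag>text mining</tag></note>",
  "<note><title>Second note</title><created>20160105T090000Z</created>",
  "<tag>R</tag></note>",
  "</en-export>"
)

# Parse the string instead of a file, then convert the tree to a nested list
toy_doc  <- xmlParse(toy_enex, asText = TRUE)
toy_list <- xmlToList(toy_doc, addAttributes = FALSE)

length(toy_list)       # one list element per note
names(toy_list[[1]])   # child element names; "tag" appears once per tag
```

The repeated "tag" name inside each note is what makes the tag handling further down more involved than the title or the date.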
In the following section, I show how to create a dataframe based on the xmllist object.
Building a data frame
First, we generate an empty data frame. The number of rows (262) is determined by the number of elements in the xmllist object and the number of columns is set to zero.
mydata <- data.frame(matrix(NA, ncol = 0, nrow = length(xmllist)))
dim(mydata)
[1] 262 0
Second, we read the note titles and save them into a variable called title, which is part of our data frame mydata.
for (i in 1:length(xmllist)){
  mydata$title[i] <- unlist(xmllist[[i]]['title'])
}
head(mydata$title, 10)
[1] “Network visualization in R with the igraph package | Rules of Reason”
[2] “More debate analysis with R”
[3] “Analyzing networks of characters in 'Love Actually' – Variance Explained”
[4] “Web scraping in R”
[5] “Color Quantization in R”
[6] “Zellingenach: A visual exploration of the spatial patterns in the endings of German town and village names in R | rud.is”
[7] “Waterfall plots – what and how?”
[8] “Sentiment Analysis on Donald Trump using R and Tableau | DataScience+”
[9] “Version 0.9 of timeline on CRAN”
[10] “Date Formats in R”
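The same extraction can also be written without an explicit loop. The following is only a sketch on a hypothetical stand-in for xmllist (toy_notes is my own name and the two notes are made up), using vapply from base R.

```r
# Hypothetical stand-in for xmllist: one inner list per note
toy_notes <- list(
  note = list(title = "First note",  created = "20160106T101500Z"),
  note = list(title = "Second note", created = "20160105T090000Z")
)

# vapply() checks that every note yields exactly one character string
titles <- vapply(toy_notes, function(n) n[["title"]],
                 character(1), USE.NAMES = FALSE)
titles
```

Compared with the loop, vapply fails loudly if a note has no title, which may or may not be what you want for your own export.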
In the next step, we obtain the dates on which the notes were created. To obtain a variable of class Date, the variable created must be reformatted. Using the stringr package, we extract year, month and day and save the result into the same variable.
for (i in 1:nrow(mydata)){
  mydata$created[i] <- xmllist[[i]]['created']
}
mydata$created <- as.Date(paste0(stringr::str_sub(mydata$created, 1, 4), '-',
                                 stringr::str_sub(mydata$created, 5, 6), '-',
                                 stringr::str_sub(mydata$created, 7, 8)))
head(mydata$created, 5)
[1] “2016-01-06” “2016-01-06” “2016-01-05” “2016-01-05” “2016-01-04”
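As an alternative sketch, the date can be parsed in base R by passing a format string to as.Date. This assumes that the first eight characters of the created field are year, month and day (as the str_sub positions above imply); the two timestamps below are hypothetical examples, not values from my export.

```r
# Hypothetical raw timestamps; only the first eight characters are used
created_raw <- c("20160106T101500Z", "20160105T090000Z")

# Cut off the time part and tell as.Date() how the date is laid out
created <- as.Date(substr(created_raw, 1, 8), format = "%Y%m%d")
created
```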
Furthermore, the http addresses of the notes can be read like this:
for (i in 1:nrow(mydata)){
  mydata$www[i] <- xmllist[[i]]['note-attributes']
}
mydata$www <- unlist(qdapRegex::ex_url(mydata$www, trim=TRUE, clean=TRUE, extract=TRUE))
mydata$www <- stringr::str_sub(mydata$www, 1, nchar(mydata$www)-2)
head(mydata$www)
[1] “https://rulesofreason.wordpress.com/2012/11/05/network-visualization-in-r-with-the-igraph-package/”
[2] “http://www.r-bloggers.com/more-debate-analysis-with-r/”
[3] “http://varianceexplained.org/r/love-actually-network/”
[4] “http://cpsievert.github.io/slides/web-scraping/#1”
[5] “http://blog.ryanwalker.us/2016/01/color-quantization-in-r.html”
[6] “http://rud.is/b/2016/01/03/zellingenach-a-visual-exploration-of-the-spatial-patterns-in-the-endings-of-german-town-and-village-names-in-r/”
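If the address is stored in its own element inside note-attributes, it can be read by name instead of with a regular expression. The sketch below assumes that this element is called source-url, which I have not verified for every Evernote version; toy_notes and get_url are hypothetical names and the data is made up.

```r
# Hypothetical stand-in for xmllist; the second note has no address
toy_notes <- list(
  note = list(title = "First note",
              `note-attributes` = list(author = "someone",
                                       `source-url` = "http://example.com/first")),
  note = list(title = "Second note",
              `note-attributes` = list(author = "someone"))
)

# Return the address, or NA if the note does not carry one
get_url <- function(n) {
  url <- n[["note-attributes"]][["source-url"]]
  if (is.null(url)) NA_character_ else url
}

urls <- vapply(toy_notes, get_url, character(1), USE.NAMES = FALSE)
urls
```

Notes without an address end up as NA, so the result always has one entry per note.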
Finally, we want to read the tags and save them into a variable. Since the number of tags differs between the notes, we have to assess the number of tags for each note:
# number of tags
for (i in 1:nrow(mydata)){
  mydata$num.tag[i] <- length(which(names(xmllist[[i]])=="tag"))
}
head(mydata$num.tag, 20)
[1] 2 2 3 2 2 3 2 5 2 3 3 2 2 3 3 2 2 3 3 3
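Counting the tags boils down to counting how often the name "tag" occurs in each note. A loop-free sketch on hypothetical data (toy_notes and num_tag are my own names) could look like this:

```r
# Hypothetical stand-in for xmllist: notes with two, one and zero tags
toy_notes <- list(
  note = list(title = "First note",  tag = "R", tag = "text mining"),
  note = list(title = "Second note", tag = "R"),
  note = list(title = "Third note")
)

# Summing a logical vector counts its TRUE values
num_tag <- vapply(toy_notes, function(n) sum(names(n) == "tag"),
                  integer(1), USE.NAMES = FALSE)
num_tag
max(num_tag)
```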
Since we want to save each tag into a single variable, we need to know the maximum number of tags.
tag.num <- max(mydata$num.tag)
tag.num
[1] 5
With the next code snippet we add three variables to our dataframe: the positions of the first and the last tag as numeric variables, and a variable (of class list) containing the positions of all tags.
# position of first tag
for (i in 1:nrow(mydata)){
  mydata$pos.1[i] <- which(names(xmllist[[i]])=="tag")[1]
}
# position of last tag
mydata$pos.2 <- mydata$pos.1 + mydata$num.tag - 1
# position of tags
for (i in 1:nrow(mydata)){
  mydata$pos.all[i] <- list(c(mydata$pos.1[i]:mydata$pos.2[i]))
}
# remove pos.1 and pos.2
mydata$pos.1 <- NULL
mydata$pos.2 <- NULL
Since we don't need the variables pos.1 and pos.2 for further processing, we remove them from our dataframe.
In the next step, we create 5 empty variables that will later on contain the tag names.
# create 5 new columns
num.col <- ncol(mydata)
for (i in (ncol(mydata) + 1):(ncol(mydata) + tag.num)){
  mydata[, i] <- NA
  colnames(mydata)[i] <- paste0('tag.', i - num.col)
}
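The empty columns can also be created in a single assignment. This is a sketch on a made-up data frame; toy_df, tag_num and new_cols are hypothetical names, and tag_num is set by hand here instead of being computed from the data.

```r
# Hypothetical data frame standing in for mydata
toy_df  <- data.frame(title = c("First note", "Second note"),
                      stringsAsFactors = FALSE)
tag_num <- 3  # hypothetical maximum number of tags

# Build the column names first, then fill all new columns with NA at once
new_cols <- paste0("tag.", seq_len(tag_num))
toy_df[new_cols] <- NA_character_
names(toy_df)
```

Using NA_character_ instead of NA gives the new columns the character type from the start.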
The following code snippet is intended to write the tag names into the variables tag.1 to tag.5.
for (j in (num.col + 1):ncol(mydata)){
  for (i in 1:nrow(mydata)){
    mydata[i, j] <- xmllist[[i]][mydata$pos.all[[i]][j - num.col]][[1]]
  }
}
However, evaluating the code returns the following error message:
Error in '[<-.data.frame'('*tmp*', i, j, value = NULL) :
replacement has length zero
Has anybody got an idea how to get the preceding code snippet working? I'd appreciate every piece of advice.
As a workaround, I decided to write one loop for each of the five variables. This is definitely not best practice, but it works.
# 1st tag
for (i in 1:nrow(mydata)){
  mydata$tag.1[i] <- xmllist[[i]][mydata$pos.all[[i]][1]][1]
}
# 2nd tag
for (i in 1:nrow(mydata)){
  mydata$tag.2[i] <- xmllist[[i]][mydata$pos.all[[i]][2]][1]
}
# 3rd tag
for (i in 1:nrow(mydata)){
  mydata$tag.3[i] <- xmllist[[i]][mydata$pos.all[[i]][3]][1]
}
# 4th tag
for (i in 1:nrow(mydata)){
  mydata$tag.4[i] <- xmllist[[i]][mydata$pos.all[[i]][4]][1]
}
# 5th tag
for (i in 1:nrow(mydata)){
  mydata$tag.5[i] <- xmllist[[i]][mydata$pos.all[[i]][5]][1]
}
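A possible explanation for the error message in the nested loop, offered as a sketch rather than a confirmed diagnosis: when a note has fewer tags than the maximum, the requested position is NA, indexing the note with NA yields a list holding NULL, and taking [[1]] of that gives NULL, which cannot be assigned to a data frame cell. The example below reproduces this on hypothetical data (toy_notes, tag_num and tag_mat are my own names) and shows a guarded loop that only visits the tag positions that actually exist.

```r
# Hypothetical stand-in for xmllist: the second note has only one tag
toy_notes <- list(
  note = list(title = "First note",  tag = "R", tag = "text mining"),
  note = list(title = "Second note", tag = "R")
)
tag_num <- 2

# Reproduce the likely cause: there is no second tag position in note 2
pos <- which(names(toy_notes[[2]]) == "tag")
is.null(toy_notes[[2]][pos[2]][[1]])   # pos[2] is NA, so the result is NULL

# Guarded version: start from NA and fill in only the tags that exist
tag_mat <- matrix(NA_character_, nrow = length(toy_notes), ncol = tag_num)
for (i in seq_along(toy_notes)) {
  pos <- which(names(toy_notes[[i]]) == "tag")
  for (k in seq_along(pos)) {
    tag_mat[i, k] <- toy_notes[[i]][[pos[k]]]
  }
}
tag_mat
```

Because the matrix is pre-filled with NA, no separate step for replacing NULL is needed in this variant.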
In the following step, we define a function (source) replacing NULL by NA and apply this function to each of the five tag variables:
# define function
nullToNA <- function(x) {
  x[sapply(x, is.null)] <- NA
  return(x)
}
# apply function
for (i in (num.col+1):ncol(mydata)){
  for (j in 1:nrow(mydata)){
    mydata[j, i] <- nullToNA(mydata[j, i])
  }
}
Finally, we paste the values of the five tag variables into a single variable named tags. To do this, we use the paste2 function of the qdap package. Since we don't need the variables tag.1 to tag.5 for further processing, we remove them from the dataframe using the select function of the dplyr package.
mydata$tags <- qdap::paste2(mydata[(num.col+1):ncol(mydata)], sep = ", ", handle.na = TRUE, trim = TRUE)
mydata <- dplyr::select(mydata, -starts_with('tag.'))
mydata$pos.all <- NULL
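If only the combined tags variable is needed, the intermediate tag columns can be skipped altogether. The following base R sketch works on hypothetical data (toy_notes and tags are my own names) and collapses the tags of each note directly into one string.

```r
# Hypothetical stand-in for xmllist: notes with two, one and zero tags
toy_notes <- list(
  note = list(title = "First note",  tag = "R", tag = "text mining"),
  note = list(title = "Second note", tag = "R"),
  note = list(title = "Third note")
)

# Keep only the elements named "tag" and join them with a comma
tags <- vapply(
  toy_notes,
  function(n) paste(unlist(n[names(n) == "tag"]), collapse = ", "),
  character(1), USE.NAMES = FALSE
)
tags
```

In this variant a note without tags gets an empty string, and no NA placeholders are appended for missing tags.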
The final dataframe consists of the following variables:
- title containing the titles of the notes;
- created containing the dates the notes were created;
- www containing the notes' http addresses;
- num.tag containing the number of tags for each note;
- tags containing the tag names.
The following table gives an impression of what our final dataframe looks like.
knitr::kable(head(mydata), align = c('l', 'c', 'l', 'c', 'c'))
title | created | www | num.tag | tags |
---|---|---|---|---|
Network visualization in R with the igraph package \| Rules of Reason | 2016-01-06 | https://rulesofreason.wordpress.com/2012/11/05/network-visualization-in-r-with-the-igraph-package/ | 2 | network analysis, R, NA, NA, NA |
More debate analysis with R | 2016-01-06 | http://www.r-bloggers.com/more-debate-analysis-with-r/ | 2 | text mining, R, NA, NA, NA |
Analyzing networks of characters in 'Love Actually' – Variance Explained | 2016-01-05 | http://varianceexplained.org/r/love-actually-network/ | 3 | network analysis, text mining, R, NA, NA |
Web scraping in R | 2016-01-05 | http://cpsievert.github.io/slides/web-scraping/#1 | 2 | webscraping, R, NA, NA, NA |
Color Quantization in R | 2016-01-04 | http://blog.ryanwalker.us/2016/01/color-quantization-in-r.html | 2 | R, image processing, NA, NA, NA |
Zellingenach: A visual exploration of the spatial patterns in the endings of German town and village names in R \| rud.is | 2016-01-04 | http://rud.is/b/2016/01/03/zellingenach-a-visual-exploration-of-the-spatial-patterns-in-the-endings-of-german-town-and-village-names-in-r/ | 3 | text mining, geo, R, NA, NA |
The packages used in this blog post can be loaded/installed using the following code:
pacman::p_load(XML, knitr, dplyr, qdap, stringr)
The xmllist object may be downloaded as an .RData file under the following link.
In one of my next blog posts, I will show how to analyse the tags.