Text mining PubMed using R – a case study

Intro

For a couple of years, I've been doing research in the field of psychooncology, which basically is “concerned both with the effects of cancer on a person's psychological health as well as the social and behavioral factors that may affect the disease process of cancer and/or the remission of it.” The most influential psycho-oncological journal is named “Psycho-Oncology”. The journal was founded in 1992 and is published monthly by Wiley-Blackwell.

Data

In this blog post, I'll play arround with some bibliographic data from the PubMed data base. With the following query, we receive bibliographical information for every paper published in Psycho-Oncology from 1997 up today.

("Psycho-oncology"[Jour])

As the following screen shot shows, our search query returned 2785 entries. The search results can be downloaded and saved into a .csv file by clicking on the arrow right beside Send to, choosing destination (File) and format (CSV).

Fig. 1: Screenshot from PubMed

R packages

The following code will install load and / or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(pacman::p_load(readr, stringr, ggplot2, qdap, qdapRegex))

Import

The data can be easily imported into R using the read_csv() function from the readr package.

# import .csv file
mydata <- read_csv("pubmed_result.csv")

Data manipulation

With the next code snippet, we do some data manipulation in order to make our data frame tidy. We start with removing the last row of the data frame containing data about a paper published in 1995, remove superfluous headlines, entries without author and select the variables we need for analysis. The str_detect() function stems from the stringr package.

# remove only article from 1995
mydata <- mydata[1:nrow(mydata)-1,]
# remove headlines
mydata <- subset(mydata, str_detect(URL, 'URL')==FALSE)
# remove entries without author
mydata <- subset(mydata, str_detect(Description, 'Author A.')==FALSE)
mydata <- subset(mydata, str_detect(Description, '\\[No authors listed\\]')==FALSE)
# Select variables
mydata <- mydata[, c('Title', 'Description', 'ShortDetails')]

In the next step, we extract the year of publication from the ShortDetails variable. Based on the year variable, we create a variable named time merging the years to four periods of time (five years sub-periods between 1998 and 2016). Furthermore, we calculate the number of authors for each paper using the str_count function of the stringr package. Since each author except for the last is followed by a comma, we just need to count the number of commas and add 1.

# Return Year of publication
mydata$Year <- as.numeric(str_sub(mydata$ShortDetails, 
                                  nchar(mydata$ShortDetails)-3, 
                                  nchar(mydata$ShortDetails)))
mydata$ShortDetails <- NULL
colnames(mydata) <- c('title', 'authors', 'year')
mydata$time <- ifelse(mydata$year <= 2001, 1,
                      ifelse(mydata$year >= 2002 & mydata$year <=2006, 2,
                             ifelse(mydata$year >= 2007 & mydata$year <=2011, 3, 4)))
mydata$year <- as.factor(mydata$year)
# Categorize years
mydata$time <- factor(mydata$time, levels = c(1:4),
                      labels = c('1997-2001', '2002-2006', '2007-2011', '2012-2016'))
# Calculate number of authors
mydata$auth.n <- str_count(mydata$authors, '\\,') + 1

Before we can plot these data, we need to create 4 more data frames:

  • df.pup containing the number of papers published each year between 1997 and 2006,
  • df.auth.1 containing the mean number of authors per paper and year,
  • df.auth.2 containing the mean number of authors per paper and time period,
  • df.auth.3 containing the maximum number of authors per paper and year,
df.pup <- data.frame(table(mydata$year))
df.auth.1 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$year), mean))
df.auth.2 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$time), mean))
df.auth.3 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$year), max))

Now, we can plot the data using the ggplot2 package.

ggplot(df.pup, aes(x=Var1, y=Freq, fill=0-Freq)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=Freq), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.pup$Freq)+10), breaks = seq(0, max(df.pup$Freq)+10, by=20)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 2: Number of papers published each year",
       caption = "Data source: PubMed")

plot of chunk ggplot1

Except for a few years, there has been a continuous rise in the number of publications. Since January 2013, Psycho-Oncology has been available exclusively online. Thus, the number of annual papers published is not restricted by requirements of the printed version anymore. Consequently, there was an enormous rise in the number of publications in 2013. However, in the following year (2014) the number of publications declined quite remarkably. The reason for this decline may be a lack of submissions.

ggplot(df.auth.1, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, 10), breaks = seq(0,10, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 3: Mean number of authors per paper and year",
       caption = "Data source: PubMed")

plot of chunk ggplot2

According to figure 3 and 4, the number of authorships per article has been rising over the whole time period. This is in accordance with the general trend towards a rise in the number of personal authors per citation.

ggplot(df.auth.2, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, 10), breaks = seq(0,10, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 4: Mean number of authors per paper and time period",
       caption = "Data source: PubMed")

plot of chunk ggplot3

Between 1997 and 2016, the mean number of authors per publication has increased from about four to six. The enormous rise in the maximum number of authors (see Fig. 5) supports the idea of an inflation in the number of authors per paper.

ggplot(df.auth.3, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.auth.3$x)+5), breaks = seq(0, max(df.auth.3$x)+5, by=5)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 5: Max number of authors per paper and year",
      caption = "Data source: PubMed")

plot of chunk ggplot4

Text mining

Before we analyze the data, some janitorial work needs to be done.
We start with (1) defining a list of stopwords, (2, 3) remove non-word characters, (4, 5) replace spaces in two word groups we want to be grouped together, and (6) remove the defined stopwords from the variable containing the citation titles.

sw <- c(Top200Words, 'among') # qdap
mydata$txt <- rm_non_words(mydata$title) # qdapRegex
mydata$txt <- str_replace_all(mydata$txt, "'", "") # stringr
keeps <- c('quality of life', 'health related')
mydata$txt <- space_fill(mydata$txt, keeps, sep = "~~") # qdap
mydata$txt <- rm_stop(mydata$txt, sw,
                      separate = FALSE) # qdap

In the next step, we generate a distribution table for all words of the citation titles. The resulting data frame (df.words) consits of three variables: interval (citation title terms), freq (term frequency), and percent (percentage of the term).

df.words <- dist_tab(breaker(mydata$txt)) # qdap
df.words <- df.words[,c(1, 2, 4)]
df.words <- subset(df.words, percent > 0.5)
df.words <- df.words[order(df.words$percent),] 
df.words$interval <- factor(df.words$interval, 
                            levels=df.words$interval,
                            labels=df.words$interval, 
                            ordered=TRUE) 
kable(head(df.words))
interval freq percent
666 coping 141 0.51
2407 prostate 140 0.51
789 depression 143 0.52
1335 health 147 0.53
2678 review 150 0.54
2190 patient 154 0.56

With the next R code we create a list l.txt_time containing four sub lists.

# group titles by time period
l.txt_time <- by(mydata$txt, mydata$time, breaker)
str(l.txt_time)
## List of 4
##  $ 1997-2001: chr [1:2160] "relationship" "between" "trait" "anxiety" ...
##  $ 2002-2006: chr [1:3608] "measurement" "coping" "stress" "responses" ...
##  $ 2007-2011: chr [1:7310] "novel" "stop" "multidisciplinary" "clinic" ...
##  $ 2012-2016: chr [1:14606] "relationship" "between" "objectively" "measured" ...
##  - attr(*, "dim")= int 4
##  - attr(*, "dimnames")=List of 1
##   ..$ mydata$time: chr [1:4] "1997-2001" "2002-2006" "2007-2011" "2012-2016"
##  - attr(*, "call")= language by.default(data = mydata$txt, INDICES = mydata$time, FUN = breaker)
##  - attr(*, "class")= chr "by"

The following code snippet is to convert these four lists into separate data frames, each containing the most frequent citation title terms of the different periods of time.

# time span 1997-2001
df.t1 <- dist_tab(l.txt_time[[1]])
df.t1 <- df.t1[,c(1, 2, 4)]
df.t1 <- subset(df.t1, percent > 0.2)
df.t1 <- df.t1[order(df.t1$percent),] 
df.t1$interval <- factor(df.t1$interval, 
                         levels=df.t1$interval,
                         labels=df.t1$interval, 
                         ordered=TRUE) 
df.t1$year <- levels(mydata$time)[1]
# time span 2002-2006
df.t2 <- dist_tab(l.txt_time[[2]])
df.t2 <- df.t2[,c(1, 2, 4)]
df.t2 <- subset(df.t2, percent > 0.2)
df.t2 <- df.t2[order(df.t2$percent),] 
df.t2$interval <- factor(df.t2$interval, levels=df.t2$interval,
                         labels=df.t2$interval, ordered=TRUE) 
df.t2$year <- levels(mydata$time)[2]
# time span 2007-2011
df.t3 <- dist_tab(l.txt_time[[3]])
df.t3 <- df.t3[,c(1, 2, 4)]
df.t3 <- subset(df.t3, percent > 0.3)
df.t3 <- df.t3[order(df.t3$percent),] 
df.t3$interval <- factor(df.t3$interval, levels=df.t3$interval,
                         labels=df.t3$interval, ordered=TRUE) 
df.t3$year <- levels(mydata$time)[3]
# time span 2002-2016
df.t4 <- dist_tab(l.txt_time[[4]])
df.t4 <- df.t4[,c(1, 2, 4)]
df.t4 <- subset(df.t4, percent > 0.3)
df.t4 <- df.t4[order(df.t4$percent),] 
df.t4$interval <- factor(df.t4$interval, levels=df.t4$interval,
                         labels=df.t4$interval, ordered=TRUE) 
df.t4$year <- levels(mydata$time)[4]

Eventually, we create a data frame (df.time) containing the same variables as the data frame df.words and a variable (year) defining the period of time the paper was published.

# merging
df.time <- rbind(df.t1, df.t2, df.t3, df.t4)
df.time <- merge(df.time, df.words, by='interval')
df.time <- df.time[,1:4]
colnames(df.time) <- c('term', 'freq', 'percent', 'year')
df.time$year <- as.factor(df.time$year)

Finally, we plot the most frequent citation title terms over four periods of time.

ggplot(df.time, aes(x=term, y=percent, fill=year)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(percent,1)), size = 4, 
            fontface=2, hjust =  -0.3, vjust = 0.4) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', 
                     limits=c(0, max(df.time$percent)+1), 
                     breaks = seq(0, max(df.time$percent)+1, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  coord_flip() +
  facet_grid(. ~ year) +
  labs(title = "Fig. 6: Most frequent terms in time",
       caption = "Data source: PubMed")

plot of chunk ggplot5

Above all, the terms survivors and distress were increasingly used over time which reflects a growing interest in so-called psychological distress (for a critical discussion see) and cancer survivorship.