This vignette discusses some advanced functions for analyzing PubMed records with easyPubMed, as well as new functions that were introduced in the latest version of the library. If you are looking for a tutorial on how to get started with easyPubMed, please start by reading the "Retrieving and Processing PubMed Records using easyPubMed" vignette. More information is available at the following URL: https://www.data-pulse.com/dev_site/easypubmed/.

In this vignette, we are making use of some pre-processed PubMed records that are included in the easyPubMed package. You can access them using the utils::data() function.

library(easyPubMed)
library(dplyr)
library(kableExtra)

Before starting - prepare some data

The following code downloads and pre-processes a short list of PubMed records in small batches. In real-world applications, you may want to download a large number of records in batches of about 1000 records or more, depending on your specific needs.

# Query pubmed and fetch many results
my_query <- 'Damiano Fantini[AU] AND '
my_query <- get_pubmed_ids(my_query)

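# Download in batches of up to 1000 records.
# retstart is a zero-based offset in the NCBI E-utilities, so batches start at 0
n_records <- as.numeric(my_query$Count)
my_batches <- seq(from = 0, to = n_records - 1, by = 1000)
my_abstracts_xml <- lapply(my_batches, function(i) {
  fetch_pubmed_data(my_query, retmax = 1000, retstart = i)
})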

# Store Pubmed Records as elements of a list
all_xml <- list()
for(x in my_abstracts_xml) {
  xx <- articles_to_list(x)
  for(y in xx) {
    all_xml[[(1 + length(all_xml))]] <- y
  }  
}
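
The batch offsets used when paging through a large result set can be computed with a small helper. The following is a minimal sketch in base R, assuming zero-based offsets (the E-utilities convention for retstart) and a record count that may arrive as a character string; batch_starts() is a hypothetical name and is not part of easyPubMed.

```r
# Hypothetical helper: zero-based start offsets for paging through
# `total` records in batches of `batch_size`
batch_starts <- function(total, batch_size) {
  total <- as.numeric(total)          # the count may be stored as a string
  if (is.na(total) || total < 1) {
    return(numeric(0))                # nothing to download
  }
  seq(from = 0, to = total - 1, by = batch_size)
}

# 2500 records in batches of 1000 need three requests
offsets_large <- batch_starts("2500", 1000)

# A short list (18 records) fits in a single batch
offsets_small <- batch_starts(18, 1000)
```

Each offset could then be passed as retstart, with batch_size as retmax, in a loop like the one shown above.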

Demo 1: fast extraction of PMID, Title, and Abstract

The following code illustrates the use of article_to_df(, getAuthors = FALSE), for fast extraction of PubMed record titles and abstracts. This function can process PubMed records quickly, and will return all record data, without information about authors. Here, 18 records were processed in less than 1 sec.

# Starting time: record
t.start <- Sys.time()

# Perform operation (use lapply here, no further parameters)
final_df <- do.call(rbind, lapply(all_xml, article_to_df, 
                                  max_chars = -1, getAuthors = FALSE))

# Final time: record
t.stop <- Sys.time()

# How long did it take?
print(t.stop - t.start)
## Time difference of 0.6767623 secs
# Show an excerpt of the results
final_df[,c("pmid", "year", "abstract")]  %>%
  head() %>% kable() %>% kable_styling(bootstrap_options = 'striped')
pmid      year  abstract
30421072  2018  Bladder cancer is the fou…
30035181  2018  NA…
29785026  2019  The lysine methyltransfer…
29435122  2018  APOBEC enzymes are respon…
29367767  2018  The N-butyl-N-(4-hydroxyb…
29021137  2018  Deregulation of the Wnt/β…
# If interested in specific information,
# you can subset the dataframe and save the
# desired columns/features
id_abst_df <- final_df[,c("pmid", "abstract")]
id_abst_df %>%
    head(n=4) %>% kable() %>% kable_styling(bootstrap_options = 'striped')
pmid      abstract
30421072  Bladder cancer is the fou…
30035181  NA…
29785026  The lysine methyltransfer…
29435122  APOBEC enzymes are respon…
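
Some records come without an abstract, as the excerpt above shows. The following sketch separates such records from the rest. It uses a toy data frame that mirrors the excerpt (abstracts truncated as displayed) and assumes that a missing abstract is stored as NA; toy_df and the other object names are for illustration only.

```r
# Toy data frame mirroring the excerpt above
toy_df <- data.frame(
  pmid = c("30421072", "30035181", "29785026", "29435122"),
  abstract = c("Bladder cancer is the fou", NA,
               "The lysine methyltransfer", "APOBEC enzymes are respon"),
  stringsAsFactors = FALSE
)

# Flag the records that do have an abstract
has_abstract <- !is.na(toy_df$abstract)

# Keep only records with an abstract, e.g. for downstream text mining
with_abstract <- toy_df[has_abstract, ]

# PMIDs of the records lacking an abstract
missing_pmids <- toy_df$pmid[!has_abstract]
```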

Demo 2: full info extraction, including keywords

The following code illustrates the use of article_to_df(, getKeywords = TRUE), for recursive extraction of PubMed record info, including keywords. Author info extraction is a time-consuming process, but easyPubMed can handle this task in an efficient fashion. Here, we are extracting info from ~1000 PubMed records included in the attached IL_PubMed_data. The processing time was less than 3 min.

# Starting time: record
t.start <- Sys.time()

# Perform operation (use lapply here, no further parameters)
IL_records <- easyPubMed::IL_PubMed_data$IL_records
keyword_df <- do.call(rbind, lapply(IL_records, 
                                    article_to_df, autofill = TRUE, 
                                    max_chars = 100, getKeywords = TRUE))

# Final time: record
t.stop <- Sys.time()

# How long did it take?
print(t.stop - t.start)
## Time difference of 2.654878 mins
# Visualize Keywords extracted from PubMed records
# Keyword and MeSH Concepts are separated by semicolons
print(keyword_df$keywords[seq(1, 150, by = 15)])
##  [1] NA                                                                                                                                   
##  [2] "Bifidobacterium bifidum; IL-10; heterologous expression system; low-grade intestinal inflammation; microbiota; recombinant bacteria"
##  [3] "GRP78; GRP94; IL-22; IL-22BP; UPR; dendritic cells; exonization; isoform"                                                           
##  [4] "IL-1β; cytokine; follicular dendritic cell-secreted protein; inflammation; periodontal ligament; signaling pathway"                 
##  [5] NA                                                                                                                                   
##  [6] "B cell; BAFF; IL-33; autoantibodies; germinal center; immune tolerance; radiation resistant"                                        
##  [7] NA                                                                                                                                   
##  [8] NA                                                                                                                                   
##  [9] "Fatty liver; interleukin-1 beta; liver functions tests; morbid; obesity"                                                            
## [10] NA
# Show an excerpt of the results
keyword_df[seq(1, 100, by = 10), c("lastname", "firstname", "keywords")] %>%
    kable() %>% kable_styling(bootstrap_options = 'striped')
    lastname    firstname    keywords
1   Hong        Seoung-Jin   NA
11  Mauras      Aurélie      Bifidobacterium bifidum; IL-10…
21  Shen        Jia-Xin      IL-33; cancer; cytokine; immun…
31  Alloza      Iraide       GRP78; GRP94; IL-22; IL-22BP; …
41  Beshkar     Pezhman      IL-6; TGF-β1; regulatory T cel…
51  Takata      Takashi      IL-1β; cytokine; follicular de…
61  Ghahartars  Mehdi        NA
71  Heesemann   Jürgen       interleukin-10; intestinal mot…
81  Kikly       Kristine     B cell; BAFF; IL-33; autoantib…
91  Nakano      Rei          NA
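
Since keywords are returned as a single semicolon-separated string per record, they usually need to be split before being counted or filtered. The following is a minimal sketch in base R that uses three of the keyword strings printed above; split_keywords() is a hypothetical helper and is not part of easyPubMed.

```r
# Hypothetical helper: split semicolon-separated keyword strings
# into a list of trimmed keywords, skipping missing entries
split_keywords <- function(x) {
  x <- x[!is.na(x)]
  parts <- strsplit(x, ";", fixed = TRUE)
  lapply(parts, trimws)
}

# Keyword strings copied from the output above (plus one missing value)
kw <- c(
  NA,
  "GRP78; GRP94; IL-22; IL-22BP; UPR; dendritic cells; exonization; isoform",
  "B cell; BAFF; IL-33; autoantibodies; germinal center; immune tolerance; radiation resistant",
  "Fatty liver; interleukin-1 beta; liver functions tests; morbid; obesity"
)

kw_list <- split_keywords(kw)

# Frequency of each keyword across records
kw_counts <- sort(table(unlist(kw_list)), decreasing = TRUE)
```

When author extraction is enabled, the resulting data frame appears to hold one row per author, so the same keyword string can occur on several rows. Deduplicating by pmid first, for example with !duplicated(keyword_df$pmid), should avoid inflated counts.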

Demo 3: full info extraction using parallelization

The following code illustrates the use of article_to_df() in conjunction with parallelization. If multiple cores are available, splitting the job into multiple tasks can speed up info extraction from a large number of records. Here, ~1000 records (as before) were processed in ~1.7 min using 2 cores.

# Load required packages (available from CRAN).
# This will work on UNIX/LINUX systems. 
# Windows systems may not support the following code.
library(parallel)
library(foreach)
library(doParallel)
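
The excerpt that follows relies on a fullDF object holding the records processed in parallel. The following is a minimal sketch of one way to run such a job. It uses parLapply() from the base parallel package with a socket cluster rather than foreach/doParallel, so it is not necessarily the code that produced the timing reported above. extract_in_parallel() and toy_extract() are hypothetical names introduced for illustration, and the arguments in the commented call are assumed to match those of Demo 2.

```r
library(parallel)

# Hypothetical helper: apply `fun` to each record on `n_cores` workers
# and stack the resulting data frames by row
extract_in_parallel <- function(records, fun, n_cores = 2, ...) {
  cl <- makeCluster(n_cores)
  on.exit(stopCluster(cl), add = TRUE)  # always release the workers
  rows <- parLapply(cl, records, fun, ...)
  do.call(rbind, rows)
}

# Toy stand-in for article_to_df(): returns one row per record
toy_extract <- function(rec, prefix = "id_") {
  data.frame(id = paste0(prefix, rec), n_chars = nchar(rec),
             stringsAsFactors = FALSE)
}

toy_result <- extract_in_parallel(c("a", "bb", "ccc"), toy_extract,
                                  n_cores = 2, prefix = "rec_")

# With easyPubMed attached, the same helper could be applied to the
# records used in Demo 2 (not run here; arguments are assumptions):
# fullDF <- extract_in_parallel(IL_records, article_to_df, n_cores = 2,
#                               autofill = TRUE, max_chars = 100,
#                               getKeywords = TRUE)
```

Socket clusters start fresh R sessions, so any package needed by the worker function must be installed on the machine running the job.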
# Show an excerpt of the results
fullDF[seq(1, 100, by = 10), c("lastname", "keywords", "abstract")] %>%
    kable() %>% kable_styling(bootstrap_options = 'striped')
    lastname    keywords          abstract
1   Hong        NA                Interleukin-1β (IL-1…
11  Mauras      Bifidobacterium…  In the last years th…
21  Shen        IL-33; cancer; …  The human Interleuki…
31  Alloza      GRP78; GRP94; I…  The human IL22RA2 ge…
41  Beshkar     IL-6; TGF-β1; r…  Proinflammatory cyto…
51  Takata      IL-1β; cytokine…  Follicular dendritic…
61  Ghahartars  NA                IL-27 has been shown…
71  Heesemann   interleukin-10;…  Objective: Postopera…
81  Kikly       B cell; BAFF; I…  Breaking tolerance i…
91  Nakano      NA                Inflammatory and mic…

Demo 4: Faster queries using API key

The following code illustrates the use of the api_key argument, which was introduced in version 2.11. E-utilities users are limited to 3 requests/second if an API key is not provided. However, users can obtain an NCBI/Entrez API key to raise the limit to 10 requests/second. For more information, visit: https://www.ncbi.nlm.nih.gov/account/settings/. Two easyPubMed functions accept an api_key argument: get_pubmed_ids() and batch_pubmed_download(). Requests submitted by the latter function are automatically paced; therefore, using a key may speed up queries when records are retrieved in small batches. Please use your own API key, as the one shown in this vignette has been replaced and is no longer valid.

# define a PubMed Query: this should return 40 results
my_query <- '"immune checkpoint" AND 2010[DP]:2012[DP]'

# Monitor time, and proceed with record download -- USING API_KEY!
t_key1 <- Sys.time()
set_01 <- batch_pubmed_download(my_query, 
                                api_key = "NNNNNNNNNNe9108aee96ace507af23a4eb09", 
                                batch_size = 2, dest_file_prefix = "TMP_api_")
t_key2 <- Sys.time()

# Monitor time, and proceed with record download -- DO NOT USE API_KEY!
t_nok1 <- Sys.time()
set_02 <- batch_pubmed_download(my_query, 
                                batch_size = 2, dest_file_prefix = "TMP_no_")
t_nok2 <- Sys.time()
# Compute time differences
# The use of a key makes the process faster
print(paste("With key:", t_key2 - t_key1))
## [1] "With key: 20.7291417121887"
print(paste("W/o key:", t_nok2 - t_nok1))
## [1] "W/o key: 24.6694004535675"
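
To see why a key matters most when batches are small, it can help to compute the lower bound that the request limits alone place on the download time. The following is a back-of-the-envelope sketch, assuming one request per batch (the function may well issue additional requests) and ignoring network latency and processing; min_download_seconds() is a hypothetical helper.

```r
# Hypothetical helper: minimum time (seconds) imposed by the request
# limit alone, assuming one request per batch
min_download_seconds <- function(n_records, batch_size, requests_per_second) {
  n_requests <- ceiling(n_records / batch_size)
  n_requests / requests_per_second
}

# 40 records in batches of 2 means 20 requests
no_key   <- min_download_seconds(40, 2, requests_per_second = 3)
with_key <- min_download_seconds(40, 2, requests_per_second = 10)
```

Under these assumptions the bound drops from roughly 6.7 to 2 seconds. Both timings measured above are well above these bounds, which suggests that latency and processing, rather than the rate limit, account for most of the elapsed time in this small example.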

Demo 5: Searching for Exact Matches in PubMed using Full-length Publication Titles

Here, we demo get_pubmed_ids_by_fulltitle(), a new function included in version 2.11 of easyPubMed, and we compare its results with get_pubmed_ids(). Querying PubMed using full-length titles may be troublesome due to stopwords included in the title. To circumvent this problem, the get_pubmed_ids_by_fulltitle() function attempts a PubMed query after stopword removal if no results were returned by the original query.

# Define the query string and the query filter to apply
my_query <- "Body mass index and cancer risk among Chinese patients with type 2 diabetes mellitus"
my_field <- "[Title]"

# Standard query
res_01 <- get_pubmed_ids(paste("\"", my_query, "\"", my_field, sep = ""))
# Improved query (designed to query titles)
res_02 <- get_pubmed_ids_by_fulltitle(my_query, field = my_field)

## Display and compare the results
# Num results standard query
print(as.numeric(res_01$Count))
## [1] 0
# Num results title-specific query
print(as.numeric(res_02$Count))
## [1] 1
# Pubmed Record ID returned
print(as.numeric(res_02$IdList$Id[1]))
## [1] 30081866
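
The stopword-removal idea can be illustrated in a few lines of base R. The following sketch is not the implementation used by get_pubmed_ids_by_fulltitle(). The stopword list is a short illustrative subset chosen for this example, and strip_stopwords() is a hypothetical helper.

```r
# Hypothetical helper: drop stopwords from a title (case-insensitive)
strip_stopwords <- function(title, stopwords) {
  words <- strsplit(title, "[[:space:]]+")[[1]]
  words[!(tolower(words) %in% stopwords)]
}

# Illustrative subset of stopwords (an assumption, not an official list)
my_stopwords <- c("a", "an", "and", "among", "for", "in",
                  "of", "on", "the", "to", "with")

my_title <- "Body mass index and cancer risk among Chinese patients with type 2 diabetes mellitus"
kept_words <- strip_stopwords(my_title, my_stopwords)

# Combine the remaining words into a title-restricted query string
fallback_query <- paste(paste0(kept_words, "[Title]"), collapse = " AND ")
```

A string built this way could be passed to get_pubmed_ids() as a fallback when the exact-title query returns no results. It is less strict than an exact match, so the hits should be checked against the original title.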

Feedback and Citation

Thank you very much for using easyPubMed and/or reading this vignette. Please feel free to contact me (author/maintainer) with feedback, questions, and suggestions: my email is <damiano.fantini(at)gmail(dot)com>. More info about easyPubMed is available at the following URL: www.data-pulse.com.

easyPubMed Copyright (C) 2017-2019 Damiano Fantini. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

!!Note!! If you are using easyPubMed for a scientific publication, please name the package in the Materials and Methods section of the paper. Thanks! Also, I am always open to collaborations. If you have an idea you would like to discuss or develop based on what you read in this Vignette, feel free to contact me via email. Thank you.

SessionInfo

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] doParallel_1.0.14 iterators_1.0.10  foreach_1.4.4     kableExtra_0.9.0 
## [5] dplyr_0.7.8       easyPubMed_2.11   XML_3.98-1.16    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        pillar_1.3.0      compiler_3.4.4   
##  [4] highr_0.7         bindr_0.1.1       tools_3.4.4      
##  [7] digest_0.6.15     evaluate_0.10.1   tibble_1.4.2     
## [10] viridisLite_0.3.0 pkgconfig_2.0.2   rlang_0.3.0.1    
## [13] rstudioapi_0.8    yaml_2.2.0        bindrcpp_0.2.2   
## [16] stringr_1.3.1     httr_1.3.1        knitr_1.20       
## [19] xml2_1.2.0        hms_0.4.2         rprojroot_1.3-2  
## [22] tidyselect_0.2.5  glue_1.3.0        R6_2.3.0         
## [25] rmarkdown_1.10    purrr_0.2.5       readr_1.2.1      
## [28] magrittr_1.5      codetools_0.2-15  backports_1.1.2  
## [31] scales_1.0.0      htmltools_0.3.6   assertthat_0.2.0 
## [34] rvest_0.3.2       colorspace_1.3-2  stringi_1.2.4    
## [37] munsell_0.5.0     crayon_1.3.4