easyPubMed: working with retstart and retmax

Precise fetching of PubMed records using retmax and retstart

This vignette illustrates how to retrieve precise subsets of PubMed records using the retmax and retstart arguments.

Getting started

Here, I submit a simple easyPubMed query that is used throughout the vignette to illustrate retmax and retstart. The query string is: “parkinson[TI] AND 2019[PDAT]”. This returned n=539 records.

library(easyPubMed)

# Query PubMed
qr1 <- get_pubmed_ids("parkinson[TI] AND 2019[PDAT]")

# How many records are there?
print(qr1$Count)

## [1] "539"

Fetch the first 5 PubMed records

Let’s retrieve the first 5 records returned by the query. This can be done using the fetch_pubmed_data() function. The index of PubMed records returned by any query starts from 0. Therefore, here we want to fetch records with index 0, 1, 2, 3, and 4. To do so, we specify:

retstart = 0
retmax = 5

btch1 <- fetch_pubmed_data(pubmed_id_list = qr1, retstart = 0, retmax = 5)
btch1 <- table_articles_byAuth(btch1, included_authors = 'last', max_chars = 0)
btch1[, c("pmid", "lastname", "jabbrv")]

##       pmid   lastname                             jabbrv
## 1 32231772 Taghizadeh                Basic Clin Neurosci
## 2 32190422   Holloway                  Neurol Clin Pract
## 3 32185101    Barbano                  Neurol Clin Pract
## 4 32104723     Ciucci Perspect ASHA Spec Interest Groups
## 5 32095766     Katabi                      Digit Biomark
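To make the zero-based indexing concrete, the following base-R sketch lists the record indices covered by a given retstart/retmax pair. The helper ret_window() is hypothetical (it is not part of easyPubMed) and makes no network calls. It only mirrors the arithmetic described above, and it assumes that a window reaching past the last record is simply cut short.

```r
# Hypothetical helper, not an easyPubMed function.
# Returns the zero-based record indices covered by a retstart/retmax pair,
# given the total number of records returned by the query.
ret_window <- function(retstart, retmax, count) {
  if (retmax < 1 || retstart >= count) {
    return(integer(0))
  }
  # Assumption: a window that reaches past the last record is truncated
  last <- min(retstart + retmax, count) - 1
  seq(from = retstart, to = last)
}

# retstart = 0 and retmax = 5 cover indices 0, 1, 2, 3, and 4
ret_window(retstart = 0, retmax = 5, count = 539)
```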

Note: if you set retstart = 1, you will retrieve records starting from the second (and not the first) record returned by the query. See the following example, and compare the results with the previous output.

# Here, we fetch 5 records, but we skip the first record
btch2 <- fetch_pubmed_data(pubmed_id_list = qr1, retstart = 1, retmax = 5)
btch2 <- table_articles_byAuth(btch2, included_authors = 'last', max_chars = 0)
btch2[, c("pmid", "lastname", "jabbrv")]

##       pmid lastname                             jabbrv
## 1 32190422 Holloway                  Neurol Clin Pract
## 2 32185101  Barbano                  Neurol Clin Pract
## 3 32104723   Ciucci Perspect ASHA Spec Interest Groups
## 4 32095766   Katabi                      Digit Biomark
## 5 32035572 Robinson               Med. Clin. North Am.

Fetch the next 5 PubMed records

Let’s retrieve the following 5 records. Since we have already downloaded the first 5 records (index = 0, 1, 2, 3, and 4), here we want to fetch records with index = 5, 6, 7, 8, and 9. To do so, we specify:

retstart = 5
retmax = 5

btch3 <- fetch_pubmed_data(pubmed_id_list = qr1, retstart = 5, retmax = 5)
btch3 <- table_articles_byAuth(btch3, included_authors = 'last', max_chars = 0)
btch3[, c("pmid", "lastname", "jabbrv")]

##       pmid lastname               jabbrv
## 1 32035572 Robinson Med. Clin. North Am.
## 2 32025977  Shigemi          JA Clin Rep
## 3 32002361    Galal       Adv Pharm Bull
## 4 31998221 Esposito         Front Neurol
## 5 31998219     Liou         Front Neurol
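As a small sketch of the arithmetic behind consecutive batches, the k-th batch of size retmax starts at index (k - 1) * retmax. The helper batch_retstart() below is hypothetical and not part of easyPubMed. The second half of the sketch shows why PMID 32035572 appears both in the retstart = 1 example and in the retstart = 5 example above: the two index windows share index 5.

```r
# Hypothetical helper: retstart value for the k-th batch (k = 1, 2, ...)
batch_retstart <- function(k, retmax) {
  (k - 1) * retmax
}

# retstart values for the first three batches of 5 records
batch_retstart(k = 1:3, retmax = 5)  # 0 5 10

# Index windows used in the examples above
retmax <- 5
idx_btch2 <- seq(from = 1, length.out = retmax)  # retstart = 1: indices 1 to 5
idx_btch3 <- seq(from = 5, length.out = retmax)  # retstart = 5: indices 5 to 9

# Index 5 belongs to both windows, so that record is fetched twice
intersect(idx_btch2, idx_btch3)
```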

Fetch the last PubMed record

Likewise, the index of the last record is (Record Count - 1). Here we have a total of n=539 records, therefore the last record is fetched using retstart=538.

# Fetch the last record
btch4 <- fetch_pubmed_data(pubmed_id_list = qr1, retstart = 538)
btch4 <- table_articles_byAuth(btch4, included_authors = 'last', max_chars = 0)
btch4[, c("pmid", "lastname", "jabbrv")]

##       pmid lastname       jabbrv
## 1 25914079     <NA> Ann. Neurol.
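Rather than hard-coding 538, the last index can be derived from the record count. The sketch below uses a mock list in place of the object returned by get_pubmed_ids(); only the Count element is mocked, and it is stored as a character string, as in the printed output at the top of this vignette.

```r
# Stand-in for the result of get_pubmed_ids(); only Count is mocked here
qr_mock <- list(Count = "539")

# Count is a character string, so convert it before doing arithmetic
n_records <- as.numeric(qr_mock$Count)

# Indices are zero-based, so the last record sits at Count - 1
last_index <- n_records - 1
last_index
```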

Record Count Mismatches

Sometimes, the Count of PubMed records returned by a query does not match the final number of records extracted/processed using easyPubMed (that is, after the table_articles_byAuth() function has been run). All records are actually downloaded/fetched by fetch_pubmed_data(), but some records are skipped by downstream easyPubMed functions, such as table_articles_byAuth(). For example, PubMed records NOT including an Abstract or a Title are skipped by table_articles_byAuth(). This can cause a mismatch between the expected and final number of records obtained using easyPubMed. An example is shown below.

qr2 <- get_pubmed_ids(pubmed_query_string = "31534023[PMID]")
qr2 <- fetch_pubmed_data(pubmed_id_list = qr2)

# Visualize an excerpt
substr(qr2, 1, 300)

## [1] "<?xml version=\"1.0\" ?><!DOCTYPE PubmedArticleSet PUBLIC \"-//NLM//DTD PubMedArticle, 1st January 2019//EN\" \"https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd\"><PubmedArticleSet><PubmedArticle>    <MedlineCitation Status=\"In-Process\" Owner=\"NLM\">        <PMID Version=\"1\">31534023</PMID>       "

# Extract info - no data (0 rows)
ex <- table_articles_byAuth(qr2, included_authors = 'last', max_chars = 0)
ex[, c("pmid", "lastname", "jabbrv")]

## [1] pmid     lastname jabbrv  
## <0 rows> (or 0-length row.names)

# Navigate to pubmed
httr::BROWSE(url = "https://www.ncbi.nlm.nih.gov/pubmed/?term=31534023[PMID]")

Figure: screenshot from https://www.ncbi.nlm.nih.gov/pubmed/ showing PMID 31534023, which is a PubMed record without an abstract (timestamp: Apr 19th, 2020).
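One way to see which records were skipped is to compare the PMIDs you expected with the PMIDs present in the processed table. The sketch below is an offline illustration with toy data: expected_pmids and processed are hypothetical objects built from PMIDs that appear in this vignette, not the result of a real query. How the list of expected PMIDs is obtained in a real session is left open here.

```r
# Toy data for illustration only (PMIDs taken from the examples in this vignette)
expected_pmids <- c("32231772", "32190422", "31534023")

# Stand-in for a data.frame returned by table_articles_byAuth()
processed <- data.frame(
  pmid = c("32231772", "32190422"),
  stringsAsFactors = FALSE
)

# PMIDs that were expected but are missing from the processed table
skipped <- setdiff(expected_pmids, processed$pmid)
skipped
```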

Loop and Fetch

Here, I show how to loop through the results of a PubMed query and fetch all records by setting appropriate retstart and retmax values. Briefly, I want batches of 50 records, starting from the first record (index = 0). At each iteration, retstart is updated, while retmax is kept at 50. The loop should take about 1-2 min to run.

# Let's write down the loop and ret params
first.i <- 0
last.i <- as.numeric(qr1$Count) - 1 
batch_size <- 50

# Given these params, what are the retstart for each iteration?
my.rs <- seq(from = first.i, 
             to = last.i, 
             by = batch_size)

# Show all retstart values for the loop
print(my.rs)

##  [1]   0  50 100 150 200 250 300 350 400 450 500
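As a quick sanity check on these values, the sketch below recomputes the number of batches and the size of each batch for 539 records and a batch size of 50. It is plain base R with the numbers hard-coded from this vignette, and it assumes that the final request returns only the records that remain.

```r
n_records <- 539
batch_size <- 50

# Same retstart values as above: 0, 50, ..., 500
my.rs <- seq(from = 0, to = n_records - 1, by = batch_size)

# Number of batches; equals ceiling(n_records / batch_size)
n_batches <- length(my.rs)

# Expected size of each batch (assumption: the last one holds the remainder)
batch_sizes <- pmin(batch_size, n_records - my.rs)

n_batches             # 11 batches
tail(batch_sizes, 1)  # the last batch has 39 records
sum(batch_sizes)      # 539, so every record is covered exactly once
```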

# Initialize a collector list
# This is where we are storing all results
y <- list()

# Now, loop through my.rs, process records, and 
# save the resulting data.frame to y
# This should take less than 2 min
for (i in my.rs) {
  tmp <- fetch_pubmed_data(pubmed_id_list = qr1, retstart = i, retmax = batch_size)
  tmp <- table_articles_byAuth(tmp, included_authors = 'last', max_chars = 0)

  # Save to collector list
  y[[length(y) + 1]] <- tmp
}

# Results are included in a list. Each element is a data.frame
class(y)

## [1] "list"

sapply(y, class)

##  [1] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
##  [6] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
## [11] "data.frame"

# Aggregate results
y <- do.call(rbind, y)

# Check again the class of y
class(y)

## [1] "data.frame"

# Total number of records
# (can be lower than qr1$Count if some records were skipped; see Record Count Mismatches)
nrow(y)

## [1] 535

# Show a random excerpt
ii <- sort(sample(seq_len(nrow(y)), size = 10))
y[ii, c("pmid", "lastname", "jabbrv")]

##          pmid   lastname                    jabbrv
## 1    32231772 Taghizadeh       Basic Clin Neurosci
## 29   31845760   Antonini              Mov. Disord.
## 32   31841588  Armstrong               JAMA Neurol
## 294  31147178      Anlar     Clin Neurol Neurosurg
## 514  31066807  Silvinato Rev Assoc Med Bras (1992)
## 325  31009038  Baranchuk           JAMA Intern Med
## 420  30949561       Lees     Mov Disord Clin Pract
## 815  30944240   Scarmeas                 Neurology
## 3110 30862597  Kocabicak           World Neurosurg
## 2111 30737338    Deuschl                 Neurology
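The loop above can also be wrapped in a small reusable function. The sketch below is one possible way to do it, not an easyPubMed feature: fetch_in_batches() and mock_fetch() are hypothetical names. The fetching step is passed in as a function so that the sketch runs offline with a mock, and the optional pause argument is an assumption added for spacing out requests.

```r
# Hypothetical wrapper, not part of easyPubMed.
# fetch_fun(retstart, retmax) must return a data.frame for one batch.
fetch_in_batches <- function(count, batch_size, fetch_fun, pause = 0) {
  if (count < 1) {
    return(NULL)
  }
  starts <- seq(from = 0, to = count - 1, by = batch_size)
  batches <- lapply(starts, function(rs) {
    out <- fetch_fun(retstart = rs, retmax = batch_size)
    if (pause > 0) Sys.sleep(pause)  # optional delay between requests
    out
  })
  # Stack all per-batch data.frames into one
  do.call(rbind, batches)
}

# Offline stand-in for the fetch + table step (no network access)
mock_fetch <- function(retstart, retmax) {
  idx <- seq(from = retstart, to = min(retstart + retmax, 539) - 1)
  data.frame(index = idx)
}

res <- fetch_in_batches(count = 539, batch_size = 50, fetch_fun = mock_fetch)
nrow(res)
```

In a real session, fetch_fun could wrap the same two calls used in the loop above, that is, fetch_pubmed_data() followed by table_articles_byAuth().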

Built-in alternative: batch_pubmed_download()

If you want to download a large number of records, batch_pubmed_download() is the function you may want to use. As described in the help page:

batch_pubmed_download() performs a PubMed Query (via the get_pubmed_ids() function), downloads the resulting data (via multiple fetch_pubmed_data() calls) and then saves data in a series of xml or txt files on the local drive. The function is suitable for downloading a very large number of records.

d.fls <- batch_pubmed_download(pubmed_query_string = "parkinson[TI] AND 2019[PDAT]", 
                               batch_size = 50)

## [1] "PubMed data batch 1 / 11 downloaded..."
## [1] "PubMed data batch 2 / 11 downloaded..."
## [1] "PubMed data batch 3 / 11 downloaded..."
## [1] "PubMed data batch 4 / 11 downloaded..."
## [1] "PubMed data batch 5 / 11 downloaded..."
## [1] "PubMed data batch 6 / 11 downloaded..."
## [1] "PubMed data batch 7 / 11 downloaded..."
## [1] "PubMed data batch 8 / 11 downloaded..."
## [1] "PubMed data batch 9 / 11 downloaded..."
## [1] "PubMed data batch 10 / 11 downloaded..."
## [1] "PubMed data batch 11 / 11 downloaded..."

# Files saved
head(d.fls)

## [1] "easyPubMed_data_001.txt" "easyPubMed_data_002.txt"
## [3] "easyPubMed_data_003.txt" "easyPubMed_data_004.txt"
## [5] "easyPubMed_data_005.txt" "easyPubMed_data_006.txt"
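A simple way to check what each saved file contains is to count the <PubmedArticle> opening tags in it. The sketch below is a rough base-R check, not an easyPubMed function: count_articles() is a hypothetical helper, it treats the file as plain text rather than parsing the XML, and it is demonstrated on a small mock file instead of real PubMed data.

```r
# Hypothetical helper: count <PubmedArticle> opening tags in one file.
# The file is read as plain text; no XML parsing is attempted.
count_articles <- function(path) {
  txt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  hits <- gregexpr("<PubmedArticle>", txt, fixed = TRUE)[[1]]
  sum(hits > 0)  # gregexpr() returns -1 when there is no match
}

# Offline demo on a small mock file (not real PubMed data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("<PubmedArticleSet>",
             "<PubmedArticle>", "</PubmedArticle>",
             "<PubmedArticle>", "</PubmedArticle>",
             "</PubmedArticleSet>"), tmp)

count_articles(tmp)
```

With the files listed above, the same helper could be applied to every path, for example with sapply(d.fls, count_articles), assuming the files are in the current working directory.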

# An excerpt
cat(readLines(d.fls[1])[1:32], sep = "\n")

## <?xml version="1.0" ?>
## <!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
## <PubmedArticleSet>
## <PubmedArticle>
##     <MedlineCitation Status="PubMed-not-MEDLINE" Owner="NLM">
##         <PMID Version="1">32231772</PMID>
##         <DateRevised>
##             <Year>2020</Year>
##             <Month>04</Month>
##             <Day>03</Day>
##         </DateRevised>
##         <Article PubModel="Print-Electronic">
##             <Journal>
##                 <ISSN IssnType="Print">2008-126X</ISSN>
##                 <JournalIssue CitedMedium="Print">
##                     <Volume>10</Volume>
##                     <Issue>4</Issue>
##                     <PubDate>
##                         <MedlineDate>2019 Jul-Aug</MedlineDate>
##                     </PubDate>
##                 </JournalIssue>
##                 <Title>Basic and clinical neuroscience</Title>
##                 <ISOAbbreviation>Basic Clin Neurosci</ISOAbbreviation>
##             </Journal>
##             <ArticleTitle>The Association of Balance, Fear of Falling, and Daily Activities With Drug Phases and Severity of Disease in Patients With Parkinson.</ArticleTitle>
##             <Pagination>
##                 <MedlinePgn>355-362</MedlinePgn>
##             </Pagination>
##             <ELocationID EIdType="doi" ValidYN="Y">10.32598/bcn.9.10.295</ELocationID>
##             <Abstract>
##                 <AbstractText Label="Introduction" NlmCategory="UNASSIGNED">In the elderly, functional balance, fear of falling, and independence in daily living activities are interrelated; however, this relationship may change under the influence of drug phase and the severity of disease in individuals with idiopathic Parkinson disease. This study aimed to investigate the association of functional balance, fear of falling, and independence in the Activities of Daily Living (ADL) with the drug on- and drug off-phases.</AbstractText>
##                 <AbstractText Label="Methods" NlmCategory="UNASSIGNED">A total of 140 patients with Parkinson disease (age: Mean±SD; 60.51±12.32 y) were evaluated in terms of their functional balance, fear of falling, and independence in their daily activities by the Berg Balance Scale (BBS), Fall Efficacy Scale-International (FES-I), and Unified Parkinson Disease Rating Scale-ADL (UPDRS-ADL), respectively, in drug on- and drug off-phases. The Hoehn and Yahr scale recorded global disease rating. The Spearman coefficient, Kruskal-Wallis, and Mann-Whitney tests were used to find out whether the distribution of scale scores differs with regard to functional balance or disease severity.</AbstractText>