This vignette illustrates how to retrieve precise subsets of PubMed records using the retmax and retstart arguments.
Here, I submit a simple easyPubMed query that is used throughout the vignette to illustrate the use of retmax and retstart. The query string is: “parkinson[TI] AND 2019[PDAT]”. This query returned n=539 records.
library(easyPubMed)
# Query PubMed
qr1 <- get_pubmed_ids("parkinson[TI] AND 2019[PDAT]")
# How many records are there?
print(qr1$Count)
## [1] "539"
Let’s retrieve the first 5 records returned by the query. This can be done using the fetch_pubmed_data() function. The index of PubMed records returned by any query starts from 0. Therefore, here we want to fetch records with index 0, 1, 2, 3, and 4. To do so, we specify:
retstart = 0
retmax = 5
btch1 <- fetch_pubmed_data(pubmed_id_list = qr1, retstart = 0, retmax = 5)
btch1 <- table_articles_byAuth(btch1, included_authors = 'last', max_chars = 0)
btch1[, c("pmid", "lastname", "jabbrv")]
## pmid lastname jabbrv
## 1 32231772 Taghizadeh Basic Clin Neurosci
## 2 32190422 Holloway Neurol Clin Pract
## 3 32185101 Barbano Neurol Clin Pract
## 4 32104723 Ciucci Perspect ASHA Spec Interest Groups
## 5 32095766 Katabi Digit Biomark
Note: if you set retstart = 1, you will retrieve records starting from the second (and not the first) record returned by the query. See the following example, and compare the results with the previous output.
# Here, we fetch 5 records, but we skip the first record
btch2 <- fetch_pubmed_data(pubmed_id_list = qr1, retstart = 1, retmax = 5)
btch2 <- table_articles_byAuth(btch2, included_authors = 'last', max_chars = 0)
btch2[, c("pmid", "lastname", "jabbrv")]
## pmid lastname jabbrv
## 1 32190422 Holloway Neurol Clin Pract
## 2 32185101 Barbano Neurol Clin Pract
## 3 32104723 Ciucci Perspect ASHA Spec Interest Groups
## 4 32095766 Katabi Digit Biomark
## 5 32035572 Robinson Med. Clin. North Am.
Let’s retrieve the next 5 records. Since we have already downloaded the first 5 records (index = 0, 1, 2, 3, and 4), here we want to fetch records with index = 5, 6, 7, 8, and 9. To do so, we specify:
retstart = 5
retmax = 5
btch3 <- fetch_pubmed_data(pubmed_id_list = qr1, retstart = 5, retmax = 5)
btch3 <- table_articles_byAuth(btch3, included_authors = 'last', max_chars = 0)
btch3[, c("pmid", "lastname", "jabbrv")]
## pmid lastname jabbrv
## 1 32035572 Robinson Med. Clin. North Am.
## 2 32025977 Shigemi JA Clin Rep
## 3 32002361 Galal Adv Pharm Bull
## 4 31998221 Esposito Front Neurol
## 5 31998219 Liou Front Neurol
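The index arithmetic used in the examples above can be wrapped in a small helper. The following is a minimal base-R sketch: rank_to_ret() is a hypothetical helper (it is not part of easyPubMed) that converts 1-based record positions, as a person would count them, into the 0-based retstart value and the matching retmax value.

```r
# Hypothetical helper (not part of easyPubMed): convert 1-based record
# positions (first_rank..last_rank) into 0-based retstart/retmax values.
rank_to_ret <- function(first_rank, last_rank) {
  stopifnot(first_rank >= 1, last_rank >= first_rank)
  list(
    retstart = first_rank - 1,              # PubMed record indexes start at 0
    retmax   = last_rank - first_rank + 1   # number of records to fetch
  )
}

# Records 1 to 5 (index 0..4) and records 6 to 10 (index 5..9)
p1 <- rank_to_ret(1, 5)
p2 <- rank_to_ret(6, 10)
print(unlist(p1))
print(unlist(p2))
```

The resulting values can then be passed on, for example as fetch_pubmed_data(pubmed_id_list = qr1, retstart = p2$retstart, retmax = p2$retmax).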
Likewise, the index of the last record is (Record Count - 1). Here we have a total of n=539 records, therefore the last record is fetched using retstart=538.
# Fetch the last record
btch4 <- fetch_pubmed_data(pubmed_id_list = qr1, retstart = 538)
btch4 <- table_articles_byAuth(btch4, included_authors = 'last', max_chars = 0)
btch4[, c("pmid", "lastname", "jabbrv")]
## pmid lastname jabbrv
## 1 25914079 <NA> Ann. Neurol.
Sometimes, you may see a mismatch between the count of PubMed records returned by a query and the final number of records extracted/processed using easyPubMed (after the table_articles_byAuth() function has been run). All records are actually downloaded by fetch_pubmed_data(), but some are skipped by downstream easyPubMed functions, such as table_articles_byAuth(). For example, PubMed records that do NOT include an Abstract or a Title are skipped by table_articles_byAuth(). This explains the mismatch between the expected and final number of records. An example is shown below.
qr2 <- get_pubmed_ids(pubmed_query_string = "31534023[PMID]")
qr2 <- fetch_pubmed_data(pubmed_id_list = qr2)
# Visualize an excerpt
substr(qr2, 1, 300)
## [1] "<?xml version=\"1.0\" ?><!DOCTYPE PubmedArticleSet PUBLIC \"-//NLM//DTD PubMedArticle, 1st January 2019//EN\" \"https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd\"><PubmedArticleSet><PubmedArticle> <MedlineCitation Status=\"In-Process\" Owner=\"NLM\"> <PMID Version=\"1\">31534023</PMID> "
# Extract info - no data (0 rows)
ex <- table_articles_byAuth(qr2, included_authors = 'last', max_chars = 0)
ex[, c("pmid", "lastname", "jabbrv")]
## [1] pmid lastname jabbrv
## <0 rows> (or 0-length row.names)
# Navigate to pubmed
httr::BROWSE(url = "https://www.ncbi.nlm.nih.gov/pubmed/?term=31534023[PMID]")
Figure: screenshot from https://www.ncbi.nlm.nih.gov/pubmed/ showing PMID 31534023, which is a PubMed record without an abstract (timestamp: Apr 19th, 2020).
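One way to spot such a mismatch is to count the records in the fetched XML before processing it. The following is a minimal base-R sketch under stated assumptions: count_fetched_records() is a hypothetical helper (not part of easyPubMed), it assumes the fetched data is a single character string (as in the substr() call above), and it only counts PubmedArticle elements. A toy XML string is used here so that the example runs without a network connection.

```r
# Hypothetical helper (not part of easyPubMed): count the <PubmedArticle>
# elements in an XML string. The trailing [ >] keeps the pattern from
# matching the enclosing <PubmedArticleSet> tag.
count_fetched_records <- function(xml_string) {
  hits <- gregexpr("<PubmedArticle[ >]", xml_string)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# Toy XML string standing in for the output of fetch_pubmed_data()
toy_xml <- paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle><PMID>1</PMID></PubmedArticle>",
  "<PubmedArticle><PMID>2</PMID></PubmedArticle>",
  "</PubmedArticleSet>"
)
n_fetched <- count_fetched_records(toy_xml)
print(n_fetched)
```

With real data, comparing count_fetched_records(qr2) against nrow(ex) would show how many fetched records were dropped during processing.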
Here, I show how to loop through the results of a PubMed query and fetch all records by setting appropriate retstart and retmax values. Briefly, I want batches of 50 records, starting from the first record (index = 0). At each iteration, I update the retstart argument, while retmax is kept at 50. This loop should take about 1-2 min to run.
# Let's write down the loop and ret params
first.i <- 0
last.i <- as.numeric(qr1$Count) - 1
batch_size <- 50
# Given these params, what are the retstart for each iteration?
my.rs <- seq(from = first.i,
to = last.i,
by = batch_size)
# Show all ret.start values for the loop
print(my.rs)
## [1] 0 50 100 150 200 250 300 350 400 450 500
# Initialize a collector list
# This is where we are storing all results
y <- list()
# Now, loop through my.rs, process records, and
# save the resulting data.frame to y
# This should take less than 2 min
for (i in my.rs) {
tmp <- fetch_pubmed_data(pubmed_id_list = qr1, retstart = i, retmax = batch_size)
tmp <- table_articles_byAuth(tmp, included_authors = 'last', max_chars = 0)
# Save to collector list
y[[length(y) + 1]] <- tmp
}
# Results are included in a list. Each element is a data.frame
class(y)
## [1] "list"
sapply(y, class)
## [1] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
## [6] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
## [11] "data.frame"
# Aggregate results
y <- do.call(rbind, y)
# Check again the class of y
class(y)
## [1] "data.frame"
# Total number of records
nrow(y)
## [1] 535
# Show a random excerpt
ii <- sort(sample(1:nrow(y), size = 10))
y[ii, c("pmid", "lastname", "jabbrv")]
## pmid lastname jabbrv
## 1 32231772 Taghizadeh Basic Clin Neurosci
## 29 31845760 Antonini Mov. Disord.
## 32 31841588 Armstrong JAMA Neurol
## 294 31147178 Anlar Clin Neurol Neurosurg
## 514 31066807 Silvinato Rev Assoc Med Bras (1992)
## 325 31009038 Baranchuk JAMA Intern Med
## 420 30949561 Lees Mov Disord Clin Pract
## 815 30944240 Scarmeas Neurology
## 3110 30862597 Kocabicak World Neurosurg
## 2111 30737338 Deuschl Neurology
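The loop above returned 535 rows, although the query count was 539. A quick way to see what each batch should contain is to tabulate the retstart values together with the expected batch sizes. This is a minimal base-R sketch: the numbers 539, 50, and 535 are taken from this vignette, while the object names batch_plan, n_expected, and n_skipped are introduced here for illustration only.

```r
total_records <- 539   # value of as.numeric(qr1$Count) in this vignette
batch_size    <- 50

retstart <- seq(from = 0, to = total_records - 1, by = batch_size)
# The last batch holds whatever is left after the full batches
expected <- pmin(batch_size, total_records - retstart)

batch_plan <- data.frame(retstart = retstart, expected = expected)
print(batch_plan)

# Records expected overall, and records missing from the final table (535 rows)
n_expected <- sum(batch_plan$expected)
n_skipped  <- n_expected - 535
print(c(expected = n_expected, skipped = n_skipped))
```

The difference between the expected and the final number of records is consistent with the skipping behavior described earlier, although this sketch does not check which records were dropped.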
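If you want to download a large number of records, batch_pubmed_download() is the function you may want to use; see its help page for details. The example below downloads the records for the same query in batches of 50, saving each batch to a local file.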
d.fls <- batch_pubmed_download(pubmed_query_string = "parkinson[TI] AND 2019[PDAT]",
batch_size = 50)
## [1] "PubMed data batch 1 / 11 downloaded..."
## [1] "PubMed data batch 2 / 11 downloaded..."
## [1] "PubMed data batch 3 / 11 downloaded..."
## [1] "PubMed data batch 4 / 11 downloaded..."
## [1] "PubMed data batch 5 / 11 downloaded..."
## [1] "PubMed data batch 6 / 11 downloaded..."
## [1] "PubMed data batch 7 / 11 downloaded..."
## [1] "PubMed data batch 8 / 11 downloaded..."
## [1] "PubMed data batch 9 / 11 downloaded..."
## [1] "PubMed data batch 10 / 11 downloaded..."
## [1] "PubMed data batch 11 / 11 downloaded..."
# Files saved
head(d.fls)
## [1] "easyPubMed_data_001.txt" "easyPubMed_data_002.txt"
## [3] "easyPubMed_data_003.txt" "easyPubMed_data_004.txt"
## [5] "easyPubMed_data_005.txt" "easyPubMed_data_006.txt"
# An excerpt
cat(readLines(d.fls[1])[1:32], sep = "\n")
## <?xml version="1.0" ?>
## <!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
## <PubmedArticleSet>
## <PubmedArticle>
## <MedlineCitation Status="PubMed-not-MEDLINE" Owner="NLM">
## <PMID Version="1">32231772</PMID>
## <DateRevised>
## <Year>2020</Year>
## <Month>04</Month>
## <Day>03</Day>
## </DateRevised>
## <Article PubModel="Print-Electronic">
## <Journal>
## <ISSN IssnType="Print">2008-126X</ISSN>
## <JournalIssue CitedMedium="Print">
## <Volume>10</Volume>
## <Issue>4</Issue>
## <PubDate>
## <MedlineDate>2019 Jul-Aug</MedlineDate>
## </PubDate>
## </JournalIssue>
## <Title>Basic and clinical neuroscience</Title>
## <ISOAbbreviation>Basic Clin Neurosci</ISOAbbreviation>
## </Journal>
## <ArticleTitle>The Association of Balance, Fear of Falling, and Daily Activities With Drug Phases and Severity of Disease in Patients With Parkinson.</ArticleTitle>
## <Pagination>
## <MedlinePgn>355-362</MedlinePgn>
## </Pagination>
## <ELocationID EIdType="doi" ValidYN="Y">10.32598/bcn.9.10.295</ELocationID>
## <Abstract>
## <AbstractText Label="Introduction" NlmCategory="UNASSIGNED">In the elderly, functional balance, fear of falling, and independence in daily living activities are interrelated; however, this relationship may change under the influence of drug phase and the severity of disease in individuals with idiopathic Parkinson disease. This study aimed to investigate the association of functional balance, fear of falling, and independence in the Activities of Daily Living (ADL) with the drug on- and drug off-phases.</AbstractText>
## <AbstractText Label="Methods" NlmCategory="UNASSIGNED">A total of 140 patients with Parkinson disease (age: Mean±SD; 60.51±12.32 y) were evaluated in terms of their functional balance, fear of falling, and independence in their daily activities by the Berg Balance Scale (BBS), Fall Efficacy Scale-International (FES-I), and Unified Parkinson Disease Rating Scale-ADL (UPDRS-ADL), respectively, in drug on- and drug off-phases. The Hoehn and Yahr scale recorded global disease rating. The Spearman coefficient, Kruskal-Wallis, and Mann-Whitney tests were used to find out whether the distribution of scale scores differs with regard to functional balance or disease severity.</AbstractText>
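The downloaded files still need to be parsed and combined. The following is a minimal base-R sketch of that step: combine_batch_files() is a hypothetical helper (not part of easyPubMed) that applies a parser function to each file and row-binds the results. Toy CSV files and read.csv() are used here so that the example runs without a network connection.

```r
# Hypothetical helper (not part of easyPubMed): parse each file with the
# supplied function, then stack the resulting data.frames.
combine_batch_files <- function(files, parser) {
  tables <- lapply(files, parser)
  do.call(rbind, tables)
}

# Toy demonstration with two temporary CSV files
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write.csv(data.frame(pmid = c("1", "2")), f1, row.names = FALSE)
write.csv(data.frame(pmid = "3"), f2, row.names = FALSE)

toy <- combine_batch_files(c(f1, f2), parser = read.csv)
print(nrow(toy))
```

For the files listed in d.fls, the parser could be a wrapper such as function(f) table_articles_byAuth(pubmed_data = f, included_authors = 'last', max_chars = 0). This assumes that table_articles_byAuth() accepts a file path as pubmed_data, which should be checked against the function's help page.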
Thanks for using easyPubMed. Damiano Fantini damiano.fantini@gmail.com. Copyright 2015-2020.