This package refactors and extends the scripts in WebScrapper-for-Academic-Databases, a side project started years ago.
This toolkit is intended to include
- A web scraper for academic databases in R/Python, in the form of a (set of) function(s). This was initially built to accelerate meta-analysis, where one intends to try various combinations of different keywords in a handful of databases. Manually typing the keywords into each database and eyeballing the resulting articles can be extremely inefficient and painful, especially when an article already seen in a previous database comes up again in a new one, and
- A set of useful functions to summarize, analyze, and visualize the retrieved search results.
The web scraper automatically
- generates a query based on the provided keywords
- parses the result page (e.g., XML for PubMed)
- extracts a group of information, including
  - title
  - author(s)
  - published year
  - abstract
  - link to the article
  - availability (may not be available in all databases)
  - search term
  - database
- returns the results as a data frame
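For instance, a single search might look like the following; this is a minimal sketch, and the exact column names of the returned data frame are assumptions based on the list above.

```r
# Minimal sketch: search one keyword in PubMed and inspect the columns.
# The column names shown in the comment are assumptions and may differ
# from the package's actual output.
df <- search_database("decision-making",
                      database_name = "pubmed",
                      limit_per_search = 50)
names(df)
#> e.g., "title" "author" "year" "abstract" "link" "availability"
#>       "search_term" "database"
```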
Supported databases as of 2023-04-03, for ONE or MULTIPLE sets of keywords:
- PubMed (`pubmed`)
- Sage Journals (`sage_journal`): you may want to use the Article Type filter to include only research articles (there are other types, such as review articles) via `search_database(..., additional_args = list(ContentItemType = 'research-article'))`, as sketched below
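A sketch of the Sage Journals case above (the `ContentItemType` value is passed through as an additional query argument; other article types exist on Sage's side):

```r
# Restrict Sage Journals results to research articles only.
df_sage <- search_database("decision-making",
                           database_name = 'sage_journal',
                           additional_args = list(ContentItemType = 'research-article'),
                           limit_per_search = 300)
```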
Supported databases as of 2023-04-04, for ONE set of keywords:
- ProQuest (`proquest`): you need to provide a string or a vector of sub-database names to the parameter `subdb_proquest` for the sub-databases to search; if you do not know which ones are available, use the function `get_proquest_subdb()` (see the sketch below)
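A sketch of the ProQuest workflow described above; the available sub-databases depend on your ProQuest access, so the indexing below is only for illustration:

```r
# List the available ProQuest sub-databases, then search two of them.
# Replace the indexing with the sub-databases you actually want.
subdbs <- get_proquest_subdb()
df_proquest <- search_database("decision-making",
                               database_name = 'proquest',
                               subdb_proquest = subdbs[1:2],
                               limit_per_search = 300)
```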
Databases that need a code update as of 2023-04-03:
- Science Direct (`science_direct`)
Current development work focuses on implementing the automated search of MULTIPLE sets of keywords. This is the scenario where each concept can be described with slightly or largely different words/phrases, and the need is to search for articles that involve at least one keyword from each concept (`relationship = 'or'`).
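As an illustration of the intended boolean logic (this is not the package's internal query builder, just a sketch; the exact query syntax depends on the database): keywords within a concept are combined with OR, and an article should match every concept.

```r
# Illustration only: compose a boolean query from multiple keyword sets,
# OR within each set and AND across sets.
keyword_sets <- list(c("decisions", "decision-making"),
                     c("consumer behavior"))
query <- paste(
  vapply(keyword_sets,
         function(ks) paste0("(", paste(ks, collapse = " OR "), ")"),
         character(1)),
  collapse = " AND "
)
query
#> [1] "(decisions OR decision-making) AND (consumer behavior)"
```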
Install the package from GitHub and load it:
```r
devtools::install_github('wendywangwwt/AcademicDatabaseToolkit')
library(AcademicDatabaseToolkit)
```
For ONE set of keywords, use a single keyword string or a vector of multiple keywords:
```r
keywords <- c("decisions","decision-making")
```
For MULTIPLE sets of keywords, use a list where each element is a vector of one or multiple keywords that describe the same concept:
```r
keywords <- list(c("decisions","decision-making"),
                 c("consumer behavior"))
```
Provide a database name as a string:
```r
database_name <- 'pubmed'
```
or a vector of multiple database names:
```r
database_name <- c('pubmed','sage_journal')
```
Set a limit (`limit_per_search`) to avoid collecting tens of thousands of results, unless that is intended.
```r
df_data <- search_database(keywords,database_name=database_name,limit_per_search=300)
```
The parameter `relationship` controls the relationship between keywords, if multiple are provided. It defaults to `or`, so the above example is equivalent to:
```r
df_data <- search_database(keywords,relationship='or',database_name=database_name,limit_per_search=300)
```
If you want to concatenate your keywords with an AND relationship for the search, change the value to `and`:
```r
df_data <- search_database(keywords,relationship='and',database_name=database_name,limit_per_search=300)
```
The parameter `field` controls which field to search. It defaults to `abstract` (depending on the database, this usually includes the article title & keywords as well). Optionally, you can switch to `all` to search the full article:
```r
df_data <- search_database(keywords,relationship='and',field='all',database_name=database_name,limit_per_search=300)
```
The parameter `no_duplicate` controls whether to drop duplicated results. It defaults to `TRUE`. Duplicated results come from searches across keywords (`relationship = 'or'`) and/or searches across databases. You may want to turn it off to better understand which database + search term combinations produce a duplicate; a higher number of duplicates could indicate higher relevance of the article to the topic you intend to look into (see the sketch after the example below).
```r
df_data <- search_database(keywords,relationship='or',database_name=database_name,no_duplicate=FALSE,limit_per_search=300)
```
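With `no_duplicate = FALSE`, one possible follow-up is to count how often each article shows up; this sketch uses dplyr and assumes the returned data frame has a `title` column:

```r
library(dplyr)

# Articles retrieved more than once, i.e., by multiple
# search term / database combinations.
df_data %>%
  count(title, sort = TRUE) %>%
  filter(n > 1)
```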
2023-04-04:
- added support for one set of keywords in proquest
- added tests for proquest
2023-04-03:
- re-factored the code into an R package that can be installed from GitHub, for pubmed & sage journals
- added tests for pubmed and sage journals
- updated readme
2022-02-20:
- re-factored the code (not completely) and updated the readme file
- only pubmed passed the tests, need to check and update the interaction with other databases later