This is an extensible codebase for posting queries to academic databases programmatically. Results can be stored in a JSON or BibTeX file for import into reference managers such as Zotero. Where articles are available for download, the URLs collected during a search can also be used to download the files in bulk, saving scholars and their research assistants considerable time on heavy-duty bibliographic work.
Each database has its own scraper file, named accordingly. We currently have scrapers for the following databases, which are among the most commonly used in Chinese paleography research:
| Database (中文) | English Name | Filename |
|---|---|---|
| 中國期刊網 | CNKI | `cnki.py` |
| 武漢大學簡帛網 | Center of Bamboo Silk Manuscripts, Wuhan University | `wuhan.py` |
| 清華大學出土文獻研究與保護中心 | Research and Conservation Center for Unearthed Texts, Tsinghua University | `qinghua.py` |
| 復旦大學出土文獻與古文字研究中心 | Fudan University Unearthed and Ancient Characters Research Center | `fudan.py` |
Each file exposes a `search` function, which can be called collectively by `main.py` to post multiple queries to multiple databases in bulk. `main.py` provides a `search` function that accepts multiple keyword and database arguments to serve the above functionality. Finally, a `save_articles` function allows the user to save the search results as `json` or `bibtex` files for viewing and further processing.
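As a rough illustration only (the actual signatures are defined in `main.py` and the scraper modules, and may differ), the fan-out from `main.py` to the per-database scrapers could look something like this:

```python
# Illustrative sketch, not the project's actual implementation.
# Assumes each scraper module (cnki.py, wuhan.py, ...) exposes a search() function.
import importlib

def search(keywords, *databases):
    """Query each named database module for every keyword and pool the results."""
    results = []
    for db in databases:
        scraper = importlib.import_module(db)        # e.g. 'wuhan' -> wuhan.py
        for keyword in keywords:
            results.extend(scraper.search(keyword))  # hypothetical per-scraper call
    return results
```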
- Clone this repo to a local directory with:
  ```
  git clone https://github.com/sati-bodhi/Academic-Bibliography-Scraper.git
  ```
- Open `main.py`, scroll to the end of the file, and change the arguments of the `search` and `save_articles` functions accordingly:
  ```python
  if __name__ == '__main__':
      rslt = search(['尹至'], 'cnki', 'wuhan', 'qinghua')
      save_articles(rslt, 'search_result', 'bib')
  ```
- Multiple queries can be posted to a single database as follows:
  ```python
  if __name__ == '__main__':
      rslt = search(['尹至', '郭店'], 'wuhan')
      save_articles(rslt, 'search_result', 'bib')
  ```
- Search results can be saved as `json` instead of `bib` by changing the 3rd argument of the `save_articles` function:

  ```python
  if __name__ == '__main__':
      rslt = search(['尹至', '郭店'], 'wuhan')
      save_articles(rslt, 'search_result', 'json')
  ```
- The 2nd argument gives the base name of the output file, which will be `search_result.json` in the example above.
Developers are welcome to extend or amend the current codebase by submitting pull requests.
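To add support for a new database, a new scraper file would follow the same pattern as the existing ones: a module exposing a `search` function that returns article records. The skeleton below is only a hypothetical sketch; the module name, endpoint, query parameters, parsing logic, and record fields are placeholders and should be modelled on the existing scraper files.

```python
# Hypothetical skeleton for a new scraper module (e.g. newdb.py).
# The URL, parameters, and record fields are placeholders, not a real database API.
import requests

def search(keyword):
    """Post a query to the target database and return a list of article records."""
    resp = requests.get('https://example.org/search', params={'q': keyword})
    resp.raise_for_status()
    articles = []
    # ...parse resp.text (e.g. with BeautifulSoup) into dicts carrying the
    # title, author, year, and URL fields that save_articles() serializes...
    return articles
```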
I would like to thank Reinderien for helping out with the code and Dr. Pham Lee-Moi for partially funding this project.