🌏 English | Chinese
This is a web scraping program I wrote to practice extracting data from Google Scholar Search Pages.
1. It records the titles, publication years, and URLs ('title', 'year', 'url') of papers on the Google Scholar search results page and saves them to a CSV file.
2. It downloads the links on the search results page that carry a tag (such as [PDF] or [HTML]) and saves them as PDF files.
This program is designed for macOS and runs on Python 2.7.10. It uses the requests, BeautifulSoup, and pdfkit libraries, all of which can be installed via pip.
Before installing pdfkit, you need to download and install wkhtmltopdf.
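For example, the three libraries can be installed with pip roughly as follows (the beautifulsoup4 package name assumes the code imports bs4; if the project uses the older BeautifulSoup 3, install the BeautifulSoup package instead):

$ pip install requests
$ pip install beautifulsoup4
$ pip install pdfkit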
Clone the code in the terminal
$ git clone https://github.com/linhung0319/google-scholar-crawler.git
1. Go to Google Scholar Search
Enter the keywords you want to search for, navigate to the first page of the search results, and copy the URL of this page.
Paste the copied URL into the start_url variable like this: start_url = 'URL'.
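For example, assuming start_url is set in google_crawler.py (the search keyword in the URL below is only illustrative):

# Paste the URL of the first page of your Google Scholar search results.
start_url = 'https://scholar.google.com/scholar?hl=en&q=acoustic+metamaterial'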
On the search results page, if the title and content of an entry contain words from p_key, the crawler will be more inclined to scrape that paper.
If the title and content of an entry contain words from n_key, the crawler will be more inclined to ignore that paper.
For example, because I wanted to find papers related to sound but not optics, I put sound-related words in p_key and optics-related words in n_key (see the configuration sketch below).
Set how many search result pages to crawl; setting too many pages may trigger Google's robot check.
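A minimal sketch of this part of the configuration; the keyword lists follow the sound-versus-optics example above and are purely illustrative, and the page-count variable name (num_page) is a guess rather than the project's actual name:

# Words that make the crawler more inclined to scrape an entry (illustrative).
p_key = ['sound', 'acoustic', 'vibration']

# Words that make the crawler more inclined to ignore an entry (illustrative).
n_key = ['optic', 'optical', 'photonic']

# How many search result pages to crawl; the variable name is hypothetical.
num_page = 5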
Run google_crawler.py in the Terminal.
$ python google_crawler.py
The scraped data ('title', 'year', 'url', etc.) will be saved in result.pickle.
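If you want to inspect the intermediate data, result.pickle can be loaded with the standard pickle module; the snippet below only assumes the file contains the scraped records and is a sketch, not part of the project:

import pickle

# Load the intermediate results written by google_crawler.py.
with open('result.pickle', 'rb') as f:
    records = pickle.load(f)

# The exact structure of the stored object is an assumption.
print(records)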
Run csvNdownload.py in the Terminal.
$ python csvNdownload.py
This will convert the data to a CSV file, saved in the CSV folder.
It will also download the links tagged [PDF] or [HTML] as PDF files and save them in the PDF and HTML folders, respectively.
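To work with the exported data afterwards, the CSV file can be read with the standard csv module; the file name below is hypothetical, so use whatever file appears in the CSV folder:

import csv

# Read the exported data; the file name is hypothetical.
# On Python 2.7 open the file in 'rb' mode; on Python 3 use 'r' with newline=''.
with open('CSV/result.csv', 'rb') as f:
    for row in csv.DictReader(f):
        print('%s (%s): %s' % (row['title'], row['year'], row['url']))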
Google's search pages have anti-scraping detection. If you download or view pages too quickly, Google may suspect you are a bot and block your IP address.
The simplest solution I can think of is to use a VPN. If you are flagged as a bot, you can quickly change your IP address using the VPN.
There are many free VPNs available for download online.
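Besides switching IPs with a VPN, the warning above also suggests simply slowing down. Below is a minimal sketch of pacing requests with the requests library; the URLs, User-Agent string, and delay are illustrative and not taken from this project:

import time
import requests

# A browser-like User-Agent and a pause between pages make the crawler
# less likely to be flagged; both values are illustrative.
headers = {'User-Agent': 'Mozilla/5.0'}
urls = ['https://scholar.google.com/scholar?q=example&start=0',
        'https://scholar.google.com/scholar?q=example&start=10']

for url in urls:
    response = requests.get(url, headers=headers)
    print(response.status_code)
    time.sleep(10)  # wait before fetching the next page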
If you encounter the following message while running the crawler, the solution is to use a VPN to change your IP:
__findPages - Can not find the pages link in the start URL!!
__findPages - 1.Google robot check might ban you from crawling!!
__findPages - 2.You might not crawl the page of google scholar
If you have any questions or suggestions about this project, feel free to contact me:
- Email: linhung0319@gmail.com
- Portfolio: My Portfolio
- Linkedin: My Linkedin