This repository is intended solely to populate a list of pages / categories that can be entered into Wikipedia's Special:Export page to request XML files.
I was building an AI assistant for Physics and needed to download the requisite information from Wikipedia. Wikipedia allows you to:
- Download all articles for a single category, say Physics, from its Special:Export page.
- Download an arbitrary bunch of unrelated pages as a single XML file from the Special:Export page (a request sketch follows this list).
- Download the XML file for the current revision of a single article.
- Download the entire Wikipedia database dump and parse it yourself.
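The second option can also be scripted. Below is a minimal sketch using the requests library; the form fields `pages` (newline-separated titles) and `curonly` (current revision only) reflect the Special:Export form but are worth verifying against the page itself, and the titles used here are purely illustrative.

```python
import requests

# Page titles to export in a single XML file (illustrative examples)
titles = ["Physics", "Quantum mechanics", "Thermodynamics"]

# Special:Export accepts a POST whose 'pages' field is a newline-separated
# list of titles; 'curonly=1' asks for only the current revision of each page.
response = requests.post(
    "https://en.wikipedia.org/wiki/Special:Export",
    data={"pages": "\n".join(titles), "curonly": "1"},
)
response.raise_for_status()

with open("export.xml", "w", encoding="utf-8") as outfile:
    outfile.write(response.text)
```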
git clone https://github.com/SwamiKannan/Scraping_Wikipedia_categories.git
From a command prompt, navigate to the cloned folder and run:
pip install -r requirements.txt
1. Get the URL from which you want to scrape the subcategories and pages. This URL must be a category page in Wikipedia, i.e. a URL of the format: https://en.wikipedia.org/wiki/Category:
2. Decide on the maximum number of sub-categories you would like to scrape (optional).
3. Decide on the maximum number of page names you would like to extract (optional).
4. Decide on the depth of the category tree for which you would like to extract page names (depth is explained in the cover image above).
Note 3: If you do not provide (2), (3) or (4) above, the script will keep running until all subcategories are exhausted. This is not recommended: within 7 levels of depth, you can go from Physics to Selena Gomez's We Own the Night Tour page, as shown below:
First navigate to the 'src' directory. Then run the code below:
python get_pages.py "<source category page>" -o <output_directory> (optional) -pl <max number of pages to be downloaded> -cl <max number of categories to be downloaded> -d <depth of scraping>
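For example (the category URL and the limits below are purely illustrative):

python get_pages.py "https://en.wikipedia.org/wiki/Category:Physics" -o output -pl 5000 -cl 200 -d 3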
A folder "data" in the chosen output directory (or in the root directory of the repository if no output directory provided)
- category_names.txt - A text file containing the list of categories / sub-categories that have been identified
- category_links.txt - A text file containing the list of **URLs** of the categories / sub-categories that have been identified
- page_names.txt - A text file containing the list of pages that have been collected
- page_links.txt - A text file containing the list of **URLs** of the pages that have been collected
- done_links.txt - A text file containing the list of categories that have been identified **and traversed**. This is only needed as a reference if you want to restart the session with the same parent category (see the sketch below).
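One way to restart a session is to diff the identified categories against the traversed ones. A minimal sketch, assuming both files live in the data folder and store one category URL per line:

```python
# Categories identified but not yet traversed (useful when resuming a run).
with open("data/category_links.txt", encoding="utf-8") as f:
    identified = {line.strip() for line in f if line.strip()}

with open("data/done_links.txt", encoding="utf-8") as f:
    traversed = {line.strip() for line in f if line.strip()}

pending = identified - traversed
print(f"{len(pending)} categories still to traverse")
```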
1a. Go to Wikipedia's Special:Export page
1b. Enter the details from category_names.txt or page_names.txt as below:
OR
2a. Install the requests library:
pip install requests
2b. Inside a Python console, run the following code:
import requests

page_name = "<insert any page name from page_names.txt>"
url = 'https://en.wikipedia.org/wiki/Special:Export/' + page_name
response = requests.get(url)
if response.status_code == 200:
    content = response.text  # XML export of the page's current revision
    if content:
        with open('<choose a filename ending with .xml>', 'w', encoding='utf-8') as outfile:
            outfile.write(content)
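To fetch every page listed in page_names.txt this way, the same request can be run in a loop. A minimal sketch, assuming page_names.txt sits in the data folder and holds one page name per line (the one-second pause is simply a courtesy to Wikipedia's servers):

```python
import time
import requests

# Read the page names produced by get_pages.py (one per line, assumed).
with open("data/page_names.txt", encoding="utf-8") as f:
    page_names = [line.strip() for line in f if line.strip()]

for page_name in page_names:
    url = "https://en.wikipedia.org/wiki/Special:Export/" + page_name
    response = requests.get(url)
    if response.status_code == 200 and response.text:
        # One XML file per page; replace '/' so titles remain valid filenames.
        filename = page_name.replace("/", "_") + ".xml"
        with open(filename, "w", encoding="utf-8") as outfile:
            outfile.write(response.text)
    time.sleep(1)  # be polite between requests
```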