
Processing Wikipedia data - I: Scraping the page names from Wikipedia category hierarchy

This repository is specifically meant to populate a list of pages / categories that can be entered into Wikipedia's Special:Export page to request XML files.

I was in the process of creating an AI assistant for Physics and was trying to download the requisite information from Wikipedia for this purpose. Wikipedia allows us to:

  1. Download all articles for a single category, say Physics, from their Special:Export page.
  2. Download an arbitrary set of unrelated pages as a single XML file from their Special:Export page.
  3. Download the XML file for the current revision of a single article.
  4. Download the entire Wikipedia database dump and parse it ourselves.
However, I needed not just the articles that come under the "Physics" category, but also the articles under its subcategories, e.g. the pages under Astrophysics or Physicists by nationality. Parsing the whole database and then filtering out the categories I wanted was troublesome. Hence, this repository.

How to run the library

1. Clone the repository

git clone https://github.com/SwamiKannan/Scraping_Wikipedia_categories.git

2. Pip install the requirements

From the command prompt, navigate to the repository folder and run:

pip install -r requirements.txt

Note 1: This assumes that you already have Python installed, along with pip and git.

3. Decide your parameters

  1. Get the URL of the category from which you want to scrape the subcategories and pages. This URL must be a Wikipedia category page, i.e. a URL of the format: https://en.wikipedia.org/wiki/Category:
  2. Decide on the maximum number of sub-categories you would like to scrape (optional)
  3. Decide on the maximum number of page names you would like to extract (optional)
  4. Decide on the depth of the category tree that you would like to extract the page names for (depth is explained in the cover image above)

Note 2: If you provide (2), (3) and (4), whichever criterion is met first will halt the scraping.

Note 3: If you do not provide (2), (3) or (4) above, the script will keep running until all subcategories are exhausted. This is not recommended since, within 7 levels of depth, you can go from Physics to the page for Selena Gomez's We Own the Night Tour.
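To make Notes 2 and 3 concrete, here is a minimal sketch of how such a limit-bounded traversal can work. This is an illustration only, not the repository's actual implementation (that lives in src/get_pages.py), and it assumes the standard "mw-subcategories" / "mw-pages" sections of a Wikipedia category page:

# Sketch only: breadth-first traversal of a category page, halting as soon as
# the category limit, page limit or depth limit is hit (whichever comes first).
from collections import deque

import requests
from bs4 import BeautifulSoup

def scrape(start_url, max_categories=None, max_pages=None, max_depth=None):
    categories, pages = [], []
    queue = deque([(start_url, 0)])          # (category URL, depth)
    seen = {start_url}
    while queue:
        url, depth = queue.popleft()
        if max_categories is not None and len(categories) >= max_categories:
            break                             # category limit reached
        if max_pages is not None and len(pages) >= max_pages:
            break                             # page limit reached
        categories.append(url)
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        pages_div = soup.find('div', id='mw-pages')
        if pages_div:                         # article names listed under this category
            pages.extend(a.get_text() for a in pages_div.find_all('a'))
        if max_depth is None or depth < max_depth:
            subcats_div = soup.find('div', id='mw-subcategories')
            if subcats_div:                   # queue subcategories one level deeper
                for a in subcats_div.find_all('a'):
                    link = 'https://en.wikipedia.org' + a.get('href', '')
                    if link not in seen:
                        seen.add(link)
                        queue.append((link, depth + 1))
    return categories, pages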

4. Run the code below:

First navigate to the 'src' directory. Then run the code below:

python get_pages.py "<source category page>" -o <output directory> -pl <max number of pages to be downloaded> -cl <max number of categories to be downloaded> -d <depth of scraping>

All arguments other than the source category page are optional.
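For example, an illustrative run (the category URL and the limit values below are only placeholders for this example):

python get_pages.py "https://en.wikipedia.org/wiki/Category:Physics" -o ../output -pl 5000 -cl 200 -d 3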

Outputs:

A folder "data" in the chosen output directory (or in the root directory of the repository if no output directory provided)

  1. category_names.txt - A text file containing the list of categories / sub-categories that have been identified
  2. category_links.txt - A text file containing the list of **urls** of the categories / sub-categories that have been identified
  3. page_names.txt - A text file containing the list of pages that have been populated
  4. page_links.txt - A text file containing the list of **urls** of the pages that have been populated
  5. done_links.txt - A text file containing the list of categories that have been identified **and traversed**. This is useful as a reference if we want to restart the session with the same parent category.

Usage:

Option 1: Through the browser

1a. Go to Wikipedia's Special:Export page
1b. Enter the details from category_names.txt or page_names.txt into the export form, for example:
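Page titles from page_names.txt can be pasted one per line into the Special:Export form (the titles below are illustrative; use the ones from your own output files):

Physics
Astrophysics
Quantum mechanics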

OR

Option 2: Through Python

2a. Install the requests library:

  pip install requests

2b. Inside a Python console, run the following code:

import requests

page_name = "<insert any page name from page_names.txt>"

url = 'https://en.wikipedia.org/wiki/Special:Export/' + page_name

response = requests.get(url)
if response.status_code == 200:
    content = response.text  # XML export of the page's current revision
    with open('<choose a filename ending with .xml>', 'w', encoding='utf-8') as outfile:
        outfile.write(content)
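If you want to download everything listed in page_names.txt in one go, here is a minimal sketch (the 'xml_out' folder name is just an example):

# Sketch: fetch the XML export for every page listed in page_names.txt.
import os

import requests

os.makedirs('xml_out', exist_ok=True)
with open('page_names.txt', encoding='utf-8') as f:
    page_names = [line.strip() for line in f if line.strip()]

for name in page_names:
    response = requests.get('https://en.wikipedia.org/wiki/Special:Export/' + name)
    if response.status_code == 200:
        # replace characters that are awkward in filenames
        filename = os.path.join('xml_out', name.replace('/', '_').replace(' ', '_') + '.xml')
        with open(filename, 'w', encoding='utf-8') as outfile:
            outfile.write(response.text)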
