
Create-Wikipedia-pages-network-using-BFS-crawler

Gets a Wikipedia page URL and creates a network of all pages that link to it within a certain distance.

Overview

A Wikipedia page contains many links to other pages on the site. These links can be viewed as a graph: the pages are the nodes and the links are the edges.

When we want to examine the graph up to a certain depth, we can use an adapted BFS (breadth-first search) algorithm designed to traverse it. Using this algorithm, we can reach any node within a certain distance of the original page. The distance is determined by the "depth" parameter, which sets how far from the original page to search for pages.
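As a rough sketch (not the repository's exact implementation), the depth-limited BFS over Wikipedia pages could look like the following, assuming a hypothetical helper get_links(url) that returns the outgoing Wikipedia links of a page (see the crawling step below):

# a minimal sketch of a depth-limited BFS over Wikipedia pages;
# get_links(url) is a hypothetical helper returning the Wikipedia
# links found on a page
from queue import Queue

def bfs_network(start_url, depth, get_links):
    visited = {start_url}
    edges = []                      # (source, target) pairs
    q = Queue()
    q.put((start_url, 0))           # each item: (page url, distance from start)
    while not q.empty():
        url, dist = q.get()
        if dist >= depth:           # stop expanding beyond the requested depth
            continue
        for link in get_links(url):
            edges.append((url, link))
            if link not in visited:
                visited.add(link)
                q.put((link, dist + 1))
    return edges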

In order to identify the links of each page, we must perform a crawling process on each of them. In this process we locate the links the page contains to other Wikipedia pages. To get the most relevant links, the algorithm keeps only the introduction section of each page; this section is parsed using BeautifulSoup to extract only the correct links. To avoid being blocked by the site, the crawler waits one second between requests.
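A possible sketch of this per-page crawling step is shown below. The selectors and the introduction filtering (collecting links only from the paragraphs before the first section heading) are assumptions for illustration, not necessarily the repository's exact logic:

# fetch a page, keep only the introduction paragraphs, and collect
# internal Wikipedia links; waits one second between requests
import time
import requests
from bs4 import BeautifulSoup

def get_links(url):
    """Return the Wikipedia links found in the introduction of a page."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    body = soup.find('div', class_='mw-parser-output')
    if body is None:
        return links
    for element in body.find_all(['p', 'h2']):
        if element.name == 'h2':        # first section heading ends the introduction
            break
        for a in element.find_all('a', href=True):
            href = a['href']
            # keep only internal article links, skip files/categories/anchors
            if href.startswith('/wiki/') and ':' not in href:
                links.append('https://en.wikipedia.org' + href)
    time.sleep(1)                       # wait between requests to avoid being blocked
    return links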

Each link is saved as a Python object (the "Link" class), so the information about it is easy to retrieve and organize. The object also handles UTF-8 conversion errors: for example, the page "Bayes's law" is displayed as "Bayes%27s law". The conversion dictionary is sourced from utf8-chartable. The code exports all of these links to a single CSV file, so they are convenient to explore and use afterwards.
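A simplified sketch of such a link record is shown below. The repository uses its own conversion dictionary from utf8-chartable; here the standard-library urllib.parse.unquote is used purely for illustration, and the class and function names are assumptions:

# a simplified Link record with percent-decoding and CSV export
import csv
from urllib.parse import unquote

class Link:
    def __init__(self, source, target):
        self.source = unquote(source)   # e.g. 'Bayes%27s law' -> "Bayes's law"
        self.target = unquote(target)

def export_links(links, path='links.csv'):
    """Write all collected links to a single CSV file."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['source', 'target'])
        for link in links:
            writer.writerow([link.source, link.target])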

The code also allows the creation of a networkx graph based on the links found. The graph attaches its name to each central node so that the dominant pages in the network can be discerned. To find the main nodes, we use the networkx HITS algorithm; a node's name is attached only if the node is relatively central in the network (its hub score is larger than the mean plus one standard deviation). Output example:

[example output plot]
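A sketch of this labeling step with networkx HITS is given below; the function name, layout, and plotting details are assumptions rather than the repository's exact code:

# build a directed graph, compute hub scores with HITS, and label only
# nodes whose hub score exceeds mean + standard deviation
import networkx as nx
import matplotlib.pyplot as plt

def plot_network(edges, filename='wiki_network.png'):
    g = nx.DiGraph()
    g.add_edges_from(edges)
    hubs, _ = nx.hits(g)                          # hub score per node
    scores = list(hubs.values())
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    labels = {n: n for n, s in hubs.items() if s > mean + std}
    pos = nx.spring_layout(g)
    nx.draw(g, pos, node_size=20, width=0.3)
    nx.draw_networkx_labels(g, pos, labels=labels, font_size=8)
    plt.savefig(filename)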

Note: Due to the built-in delay in the crawling process, for depths greater than 3 the runtime of the code may be extremely long.

Libraries

The code uses the following Python libraries:

BeautifulSoup (bs4)

requests

networkx

matplotlib

pandas

as well as the standard-library time and queue modules.

Application

An example application of the code is attached to this repository as "implementation.py".

The example outputs are also attached here.

Example for using the code

To use this code, you just need to import it as follows:

# import
from wiki_crawler import wikipedia_network

# define variables
url = r'https://en.wikipedia.org/wiki/something'  # original page
depth = 2                                         # search distance from the original page
plot = True                                       # bool (default: False)

# application
wikipedia_network(url, depth, plot)

Where the variables are:

url: URL string of the base Wikipedia page

depth: search distance, in pages, from the original page

plot: whether to plot the resulting network graph (default: False)

License

MIT © Etzion Harari
