Web Scraping
Web scraping in Wikipedia Graph Mapper is handled by the script process_link.py, written in Python and powered by libraries such as Requests and BeautifulSoup. Upon receiving a user request, the function generate_lists() is first called to handle and process the user input.
def generate_lists(self_url, max_level, isMobileBrowser, search_mode):
    # patches requests as it has compatibility issues with Google App Engine; comment this out to test on a development server
    requests_toolbelt.adapters.appengine.monkeypatch()
    global MAX_LEVEL
    MAX_LEVEL = int(max_level)
    global SEARCH_MODE
    SEARCH_MODE = int(search_mode)
    # clears list for each new request made
    del entryList[:]
    nodeList, linkList = proc_data(scrape(self_url, currentLevel = 0), isMobileBrowser)
    return nodeList, linkList
The line containing the call to monkeypatch() should only be used when deploying the application on Google App Engine. Comment it out when testing on a development server, as it may otherwise cause errors at runtime.
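As an illustrative alternative to editing the file by hand (not what process_link.py currently does), the patch could be applied only when the App Engine production environment is detected via its SERVER_SOFTWARE environment variable:
import os
import requests_toolbelt.adapters.appengine

# apply the patch only in production; the local development server reports
# a SERVER_SOFTWARE value starting with 'Development/' instead
if os.environ.get('SERVER_SOFTWARE', '').startswith('Google App Engine'):
    requests_toolbelt.adapters.appengine.monkeypatch()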
Besides calling the dependent functions scrape() and proc_data(), which carry out the web scraping and the data processing respectively, generate_lists() also stores the user-supplied maximum search depth max_level as a global variable, forcing the recursive function scrape() to return early once the maximum search depth is reached. Similarly, search_mode, which indicates one of the three search settings*, is stored as a global variable so that scrape() applies the same hyperlink selection criteria to every page it scrapes.
- Refer to this repository's README for a more elaborate explanation of the search settings offered, and to the following section, Scraping Data off Wikipedia, for more details on how these three options are implemented.
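To give a feel for how generate_lists() is invoked, here is a rough sketch of a request handler calling it; the Flask wiring and route name are hypothetical and only illustrate the call signature:
from flask import Flask, request, jsonify
from process_link import generate_lists

app = Flask(__name__)

@app.route('/graph')
def graph():
    # url, max_level and search_mode come from the user request;
    # isMobileBrowser would normally be detected from the User-Agent header
    nodeList, linkList = generate_lists(
        self_url=request.args['url'],
        max_level=request.args.get('max_level', 1),
        isMobileBrowser=False,
        search_mode=request.args.get('search_mode', 2),
    )
    return jsonify({"nodes": nodeList, "links": linkList})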
The recursive function scrape() takes in a URL and performs an HTTP request to that page. It then parses the returned HTML document and extracts the relevant URLs, saving them into the list entryList. Finally, it calls itself on each of the URLs found and repeats the same process, terminating only when the maximum search level has been reached:
if currentLevel + 1 > MAX_LEVEL: # stops sending requests if spider is going to reach max crawl level
    return
The page at the URL passed to the function is fetched using Requests, and BeautifulSoup then parses it into a BeautifulSoup object:
page = requests.get(self_url)
soup = BeautifulSoup(page.content, 'lxml')
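The snippet above assumes the request always succeeds. A slightly more defensive variant (illustrative only, not what process_link.py currently does) could set a timeout and surface HTTP errors before parsing:
import requests
from bs4 import BeautifulSoup

page = requests.get(self_url, timeout=10)  # avoid hanging indefinitely on a slow page
page.raise_for_status()                    # raise an exception on HTTP errors such as 404
soup = BeautifulSoup(page.content, 'lxml')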
This BeautifulSoup object is then used to extract the page title, as well as the other URLs contained within the page:
self_title = soup.title.contents[0] # stores title of page
listTags = [] # list to store all hyperlinks found

if SEARCH_MODE == 1: # OPTION 1: Use this to search only the FIRST paragraph of each page for hyperlinks
    tag = soup.find('p') # search for first paragraph
    while (tag.find('a') != None and 'Coordinates' in tag.find('a').contents) or (tag.get('class') != None): # if first search result is not a pure <p> tag nor a coordinate link
        tag = tag.findNext('p')
    listTags.extend(tag.findAll('a'))
elif SEARCH_MODE == 2: # OPTION 2: Use this to search the introduction of each page only for hyperlinks
    stop_at = soup.find('h2') # finds the first h2 element i.e. where the first subsection header is found
    class_extr = stop_at.find_all_previous('p') # extracts all elements before this element
    for paragraph in class_extr:
        listTags.extend(paragraph.findAll('a'))
elif SEARCH_MODE == 3: # OPTION 3: Use this to search the entire page for hyperlinks
    for paragraph in soup.findAll('p'): # for each paragraph found
        listTags.extend(paragraph.findAll('a')) # stores all hyperlinks found
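As a small standalone illustration of the OPTION 2 logic (the HTML below is made up and not part of process_link.py), everything before the first h2 element counts as the introduction:
from bs4 import BeautifulSoup

html = """
<p>Intro text with an <a href="/wiki/Example" title="Example">example</a> link.</p>
<h2>First section</h2>
<p>Body text with a <a href="/wiki/Other" title="Other">later</a> link.</p>
"""
soup = BeautifulSoup(html, 'lxml')
stop_at = soup.find('h2')                          # first subsection header
intro_paragraphs = stop_at.find_all_previous('p')  # paragraphs appearing before it
links = [a['title'] for p in intro_paragraphs for a in p.findAll('a')]
print(links)  # ['Example'] -- only the link before the first <h2> is picked up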
Once listTags has been populated, some additional filtering of the Tag objects found is required. A typical Wikipedia paragraph can contain various kinds of irrelevant links, such as citation notes and links to Wikimedia files. To filter these out, each tag is converted to a string and checked against a list of special keywords; tags containing any of these keywords, as well as duplicates, are removed. The remaining tags are saved as dictionaries and stored in the list object entryList. In addition, scrape() is then called recursively on each legitimate link found:
# cleans up list of hyperlinks; retains only relevant links
listLinks = [] # stores the name and url of each hyperlink found
# stores list of keywords that indicates links to be filtered out
listOfFilterKeywords = ['cite_note', 'File', 'wikimedia', 'Help', ':Verifiability', 'Wikipedia:', 'wiktionary.org']
for tag in listTags:
    if not any(keyword in str(tag) for keyword in listOfFilterKeywords): # checks if keyword is found; if so, skip this tag
        if 'title' in tag.attrs and 'href' in tag.attrs: # checks if title and link elements exist in the tag
            # stores a dictionary of the information regarding each hyperlink i.e. which page it is found on
            entry = {"self_title": self_title, "self_url": self_url, "ext_title": tag['title'], "ext_url": HOME_URL + tag['href'], "current_level": currentLevel}
            if entry not in entryList: # filters out entries already present
                entryList.append(entry)
                scrape(entry["ext_url"], currentLevel) # depth-search via recursion
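To see the keyword filter in isolation (the tags below are made up for illustration):
from bs4 import BeautifulSoup

listOfFilterKeywords = ['cite_note', 'File', 'wikimedia', 'Help', ':Verifiability', 'Wikipedia:', 'wiktionary.org']
sample = BeautifulSoup('<p><a href="/wiki/Graph" title="Graph">graph</a>'
                       '<a href="#cite_note-1">[1]</a></p>', 'lxml')
kept = [tag for tag in sample.findAll('a')
        if not any(keyword in str(tag) for keyword in listOfFilterKeywords)]
print([tag['title'] for tag in kept])  # ['Graph'] -- the citation link is filtered out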
Finally, once all of its recursion branches have terminated, scrape() returns entryList:
return entryList
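For reference, a single entry in entryList has the following shape; the titles, URLs, and level below are made up, and HOME_URL is assumed here to be the Wikipedia base URL:
entry = {
    "self_title": "Graph theory",                                      # page the link was found on
    "self_url": "https://en.wikipedia.org/wiki/Graph_theory",
    "ext_title": "Vertex (graph theory)",                              # page the link points to
    "ext_url": "https://en.wikipedia.org/wiki/Vertex_(graph_theory)",
    "current_level": 1                                                 # crawl depth at which the link was recorded
}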
The returned entryList is then passed into the function proc_data(), which processes the data it contains into two lists, nodeList and linkList, representing respectively the nodes and the links (edges) of the graph to be built.
To create a list of nodes representing all the Wikipedia pages found, all unique URLs found in entryList are added to an intermediate list urls:
# creates a list to store unique urls found
urls = list(set([ data['self_url'] for data in entryList ])) # removes URL duplicates from self_urls
urls.extend(list(set([ data['ext_url'] for data in entryList ]))) # adds other URLs branches to list
Each URL in urls is then saved in nodeList as a dictionary containing the URL as a unique identifier id, the page title as label, and the search depth at which the page was first found as level (0 being the source page, 1 being its neighbours, and so on):
nodeList = [] # to store nodes
for url in urls:
    for data in entryList:
        if url == data["self_url"]:
            entry = {"id": url, "label": data["self_title"], "level": data["current_level"] - 1}
            if entry not in nodeList:
                nodeList.append(entry)
            break
        elif url == data["ext_url"]:
            entry = {"id": url, "label": data["ext_title"], "level": data["current_level"]}
            if entry not in nodeList:
                nodeList.append(entry)
            break
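Continuing the made-up entry from earlier, the source page and its neighbour would end up in nodeList roughly as follows, with level 0 for the source and level 1 for the page it links to:
nodeList = [
    {"id": "https://en.wikipedia.org/wiki/Graph_theory", "label": "Graph theory", "level": 0},
    {"id": "https://en.wikipedia.org/wiki/Vertex_(graph_theory)", "label": "Vertex (graph theory)", "level": 1},
]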
The next step involves filling linkList with the data needed to define the edges of the graph. This is simpler than the previous step, as the data only needs to be extracted from entryList without any further filtering. There is, however, one additional task: defining the strength of each edge, which roughly represents the attractive force between the target and the source of a link. In other words, the higher the value, the closer the two nodes are likely to sit. Two different ways of defining this strength value have been implemented: one used when the user accesses the application via a mobile browser, and another when it is accessed via a regular web browser:
linkList = [] # to store links
for data in entryList:
    if isMobileBrowser:
        strength = 0.6
    else:
        # strength formula
        strength = 0.8 - 0.4*data["current_level"]/MAX_LEVEL
    linkList.append({"target": data["self_url"], "source": data["ext_url"], "strength": strength})
As can be seen, a uniform strength is used to pack nodes closer together on a mobile browser, whereas on a regular browser strength is a decreasing function of node level.
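As a quick worked example (assuming MAX_LEVEL is 2), the desktop formula yields progressively weaker links the deeper the level at which they were recorded:
MAX_LEVEL = 2
strengths = [0.8 - 0.4 * level / MAX_LEVEL for level in range(MAX_LEVEL + 1)]
# strengths is approximately [0.8, 0.6, 0.4]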
Last but not least, the function returns both nodeList and linkList, which are then used to generate the graph on the client side:
return nodeList, linkList