Crawler

The crawler is implemented using the Breadth-First Search (BFS) algorithm. The code can be found at https://github.com/venukarnati92/crawler/blob/master/crawler/crawler.py
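A minimal sketch of the BFS traversal idea, assuming urllib and BeautifulSoup are available; the function name, seed URL, and page limit below are illustrative, not the repository's exact code:

    from collections import deque
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    def bfs_crawl(seed_url, max_pages=50):  # max_pages is an illustrative limit
        visited = {seed_url}
        queue = deque([seed_url])            # FIFO queue drives the BFS order
        while queue and len(visited) <= max_pages:
            url = queue.popleft()
            try:
                soup = BeautifulSoup(urlopen(url), "html.parser")
            except Exception:
                continue                     # skip pages that fail to load
            for a in soup.find_all('a', href=True):
                link = a['href']
                if link.startswith('http') and link not in visited:
                    visited.add(link)
                    queue.append(link)       # enqueue children: breadth-first
        return visited

The deque gives O(1) pops from the front, so pages are visited level by level, starting from the seed URL.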

Database:

For dealing with hierarchical data in MySQL, I have used the Adjacency List Model. In the adjacency list model, each item in the table contains a pointer to its parent. The topmost URL (the root node) has a NULL value for its parent. Data will be stored in the table as follows:

CREATE TABLE links(node_id INT AUTO_INCREMENT PRIMARY KEY, URL VARCHAR(2083) NOT NULL, Parent INT DEFAULT NULL);

The implementation can be found at https://github.com/venukarnati92/crawler/blob/master/crawler/crawler_DB.py

[Screenshot: snapshot of the database schema]

Before running crawler_DB.py, make sure of the following points:

a. You have created a database named crawlerdb.
b. A user ID and password are set to access crawlerdb (see the connection sketch below).
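A minimal connection sketch, assuming the mysql-connector-python package; the host and credentials below are hypothetical and should be replaced with the ones configured for crawlerdb:

    import mysql.connector  # assumes mysql-connector-python is installed

    # Hypothetical credentials; replace with your own
    conn = mysql.connector.connect(
        host="localhost", user="crawler_user", password="secret",
        database="crawlerdb",
    )
    cur = conn.cursor()

    # The root URL has a NULL parent; children point at the parent's node_id
    cur.execute("INSERT INTO links (URL, Parent) VALUES (%s, NULL)",
                ("https://example.com",))
    root_id = cur.lastrowid
    cur.execute("INSERT INTO links (URL, Parent) VALUES (%s, %s)",
                ("https://example.com/about", root_id))
    conn.commit()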

Multithreading:

In order to implement multithreading, I have used the threading module. The code can be found at https://github.com/venukarnati92/crawler/blob/master/crawler/crawler_multithreading.py
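A minimal sketch of the threading pattern, using a thread-safe queue.Queue to feed worker threads; the worker count, page limit, and seed URL here are illustrative, not the repository's exact code:

    import threading
    import queue
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    url_queue = queue.Queue()            # thread-safe work queue
    visited, lock = set(), threading.Lock()

    def worker():
        while True:
            url = url_queue.get()
            try:
                soup = BeautifulSoup(urlopen(url), "html.parser")
                for a in soup.find_all('a', href=True):
                    link = a['href']
                    with lock:           # guard shared visited set
                        if (len(visited) < 100 and link.startswith('http')
                                and link not in visited):
                            visited.add(link)
                            url_queue.put(link)
            except Exception:
                pass                     # skip pages that fail to load
            finally:
                url_queue.task_done()

    url_queue.put("https://example.com")  # hypothetical seed
    for _ in range(4):                    # illustrative worker count
        threading.Thread(target=worker, daemon=True).start()
    url_queue.join()                      # block until all queued work is done

Because threads share one interpreter, the visited set must be protected by a lock; queue.Queue handles its own locking internally.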

Multiprocessing:

For the multiprocessing implementation, I have used the multiprocessing module. The code can be found at https://github.com/venukarnati92/crawler/blob/master/crawler/crawler_multiprocessing.py
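A minimal sketch of the multiprocessing pattern, fetching one BFS level at a time in parallel with a Pool; the function name, seed URL, depth limit, and process count are illustrative, not the repository's exact code:

    from multiprocessing import Pool
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    def get_links(url):
        # Each call runs in a separate worker process
        try:
            soup = BeautifulSoup(urlopen(url), "html.parser")
            return [a['href'] for a in soup.find_all('a', href=True)
                    if a['href'].startswith('http')]
        except Exception:
            return []

    if __name__ == "__main__":               # required on platforms that spawn
        frontier = ["https://example.com"]   # hypothetical seed
        visited = set(frontier)
        for _ in range(2):                   # illustrative depth limit
            with Pool(processes=4) as pool:
                results = pool.map(get_links, frontier)  # one level in parallel
            frontier = [u for links in results
                        for u in links if u not in visited]
            visited.update(frontier)
        print(len(visited), "URLs discovered")

Unlike threads, worker processes do not share memory, so the visited set lives in the parent process and deduplication happens between levels.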

How can it be extended to a distributed setup?

To extend the code to a distributed setup, we can use Ray, an open-source library for writing parallel and distributed Python.

To turn the Python function getLinks() into a "remote function", declare it with the @ray.remote decorator. Invocations via getLinks.remote() then immediately return futures (a future is a reference to the eventual output), and the actual function execution takes place in the background (this execution is referred to as a task).

import ray
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Get all the hrefs from the URL
@ray.remote
def getLinks(url):
    html_page = urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    return [a['href'] for a in soup.find_all('a', href=True)]
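A minimal sketch of consuming the futures, assuming Ray is installed and initialized; the seed URL here is hypothetical:

    ray.init()

    futures = [getLinks.remote(u) for u in ["https://example.com"]]  # returns immediately
    links = ray.get(futures)  # blocks until the background tasks complete

On a Ray cluster, the same getLinks.remote() calls are scheduled across machines without changing the calling code.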
