Crawler

The crawler is implemented using the Breadth-First Search (BFS) algorithm. The code can be found at https://github.com/venukarnati92/crawler/blob/master/crawler/crawler.py
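A minimal sketch of the BFS traversal idea, assuming urllib and BeautifulSoup are available; the function name, seed URL, and page limit below are illustrative, not the repository's exact code:

    from collections import deque
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    def bfs_crawl(seed_url, max_pages=50):  # max_pages is an illustrative limit
        visited = {seed_url}
        queue = deque([seed_url])            # FIFO queue drives the BFS order
        while queue and len(visited) <= max_pages:
            url = queue.popleft()
            try:
                soup = BeautifulSoup(urlopen(url), "html.parser")
            except Exception:
                continue                     # skip pages that fail to load
            for a in soup.find_all('a', href=True):
                link = a['href']
                if link.startswith('http') and link not in visited:
                    visited.add(link)
                    queue.append(link)       # enqueue children: breadth-first
        return visited

The deque gives O(1) pops from the front, so pages are visited level by level, starting from the seed URL.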

Database:

For dealing with hierarchical data in MySQL, I have used the Adjacency List Model. In the adjacency list model, each item in the table contains a pointer to its parent. The topmost URL (the root node) has a NULL value for its parent. Data will be stored in the table as follows:

CREATE TABLE links(node_id INT AUTO_INCREMENT PRIMARY KEY, URL VARCHAR(2083) NOT NULL, Parent INT DEFAULT NULL);

The implementation can be found at https://github.com/venukarnati92/crawler/blob/master/crawler/crawler_DB.py

[Screenshot: snapshot of the database schema]

Before running crawler_DB.py, make sure of the following points:

a. You have created a database named crawlerdb.
b. A user ID and password are set to access crawlerdb (see the connection sketch below).
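A minimal connection sketch, assuming the mysql-connector-python package; the host and credentials below are hypothetical and should be replaced with the ones configured for crawlerdb:

    import mysql.connector  # assumes mysql-connector-python is installed

    # Hypothetical credentials; replace with your own
    conn = mysql.connector.connect(
        host="localhost", user="crawler_user", password="secret",
        database="crawlerdb",
    )
    cur = conn.cursor()

    # The root URL has a NULL parent; children point at the parent's node_id
    cur.execute("INSERT INTO links (URL, Parent) VALUES (%s, NULL)",
                ("https://example.com",))
    root_id = cur.lastrowid
    cur.execute("INSERT INTO links (URL, Parent) VALUES (%s, %s)",
                ("https://example.com/about", root_id))
    conn.commit()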

Multithreading:

In order to implement multithreading, I have used the threading module. The code can be found at https://github.com/venukarnati92/crawler/blob/master/crawler/crawler_multithreading.py
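A minimal sketch of the threading pattern, using a thread-safe queue.Queue to feed worker threads; the worker count, page limit, and seed URL here are illustrative, not the repository's exact code:

    import threading
    import queue
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    url_queue = queue.Queue()            # thread-safe work queue
    visited, lock = set(), threading.Lock()

    def worker():
        while True:
            url = url_queue.get()
            try:
                soup = BeautifulSoup(urlopen(url), "html.parser")
                for a in soup.find_all('a', href=True):
                    link = a['href']
                    with lock:           # guard shared visited set
                        if (len(visited) < 100 and link.startswith('http')
                                and link not in visited):
                            visited.add(link)
                            url_queue.put(link)
            except Exception:
                pass                     # skip pages that fail to load
            finally:
                url_queue.task_done()

    url_queue.put("https://example.com")  # hypothetical seed
    for _ in range(4):                    # illustrative worker count
        threading.Thread(target=worker, daemon=True).start()
    url_queue.join()                      # block until all queued work is done

Because threads share one interpreter, the visited set must be protected by a lock; queue.Queue handles its own locking internally.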

Multiprocessing:

For the multiprocessing implementation, I have used the multiprocessing module. The code can be found at https://github.com/venukarnati92/crawler/blob/master/crawler/crawler_multiprocessing.py
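A minimal sketch of the multiprocessing pattern, fetching one BFS level at a time in parallel with a Pool; the function name, seed URL, depth limit, and process count are illustrative, not the repository's exact code:

    from multiprocessing import Pool
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    def get_links(url):
        # Each call runs in a separate worker process
        try:
            soup = BeautifulSoup(urlopen(url), "html.parser")
            return [a['href'] for a in soup.find_all('a', href=True)
                    if a['href'].startswith('http')]
        except Exception:
            return []

    if __name__ == "__main__":               # required on platforms that spawn
        frontier = ["https://example.com"]   # hypothetical seed
        visited = set(frontier)
        for _ in range(2):                   # illustrative depth limit
            with Pool(processes=4) as pool:
                results = pool.map(get_links, frontier)  # one level in parallel
            frontier = [u for links in results
                        for u in links if u not in visited]
            visited.update(frontier)
        print(len(visited), "URLs discovered")

Unlike threads, worker processes do not share memory, so the visited set lives in the parent process and deduplication happens between levels.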

How can it be extended to a distributed setup?

To extend the code to a distributed setup, we can use Ray, an open-source library for writing parallel and distributed Python.

To turn the Python function getLinks() into a "remote function", declare it with the @ray.remote decorator. Invocations via getLinks.remote() then immediately return futures (a future is a reference to the eventual output), and the actual function execution takes place in the background (this execution is referred to as a task).

import ray
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Get all the hrefs from the URL
@ray.remote
def getLinks(url):
    html_page = urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    return [a['href'] for a in soup.find_all('a', href=True)]
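A minimal sketch of consuming the futures, assuming Ray is installed and initialized; the seed URL here is hypothetical:

    ray.init()

    futures = [getLinks.remote(u) for u in ["https://example.com"]]  # returns immediately
    links = ray.get(futures)  # blocks until the background tasks complete

On a Ray cluster, the same getLinks.remote() calls are scheduled across machines without changing the calling code.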
