Multi-thread-web-crawler

A multi-threaded web crawler. It's in the name.
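For readers new to the idea, here is a minimal sketch of the general technique: worker threads pull URLs from a shared queue and push newly discovered links back onto it. It is not the code in crawl.py or urlqueue.py. The names `FAKE_WEB`, `fetch_links`, and `crawl` are hypothetical, and an in-memory link table stands in for real HTTP requests so the example runs offline.

```python
import threading
from queue import Queue

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])


def crawl(seed, num_workers=4):
    """Crawl from `seed` with a pool of worker threads; return the set of URLs seen."""
    frontier = Queue()            # thread-safe queue of URLs still to visit
    seen = {seed}                 # every URL that has ever been queued
    seen_lock = threading.Lock()  # guards `seen` across threads
    frontier.put(seed)

    def worker():
        while True:
            url = frontier.get()
            if url is None:       # sentinel: time to shut down
                frontier.task_done()
                return
            try:
                for link in fetch_links(url):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    frontier.put(link)
            finally:
                frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    frontier.join()               # blocks until every queued URL is processed
    for _ in threads:
        frontier.put(None)        # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

Calling `crawl("http://example.com/")` visits all four pages in the table exactly once, whatever order the threads happen to run in, because a link is only queued the first time it is added to `seen`.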

How do I run it?

Starting with a Google Compute Engine Ubuntu 18.04 LTS image, run these commands:

sudo apt-get update
sudo apt-get install -y mysql-server mysql-client python3 python3-pip
pip3 install beautifulsoup4 flask request mysql-connector
sudo mysql << EOF
CREATE USER 'crawl'@'localhost' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON *.* TO 'crawl'@'localhost';
EOF

Note: this setup is insecure (a weak, hard-coded password and full privileges on every database) and shouldn't be used in a production setting.

To run a crawl:

sudo mysql << EOF
DROP SCHEMA IF EXISTS WEBCRAWL;
EOF
sudo mysql < script.sql
python3 crawl.py
sudo mysql < create-idf.sql
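If you rerun crawls often, the steps above can be wrapped in a small script. The sketch below is not part of the repository. `CRAWL_STEPS` and `run_crawl` are hypothetical names, the script assumes it is run from the repository root, and it uses `DROP SCHEMA IF EXISTS` so that the very first run does not fail on a missing schema.

```python
import subprocess

# The four crawl steps as (command, optional SQL file to feed on stdin).
# Assumption: run from the repository root, where the .sql and .py files live.
CRAWL_STEPS = [
    (["sudo", "mysql", "-e", "DROP SCHEMA IF EXISTS WEBCRAWL;"], None),
    (["sudo", "mysql"], "script.sql"),
    (["python3", "crawl.py"], None),
    (["sudo", "mysql"], "create-idf.sql"),
]


def run_crawl(steps=CRAWL_STEPS, runner=subprocess.run):
    """Run each step in order; check=True stops at the first failing command."""
    for command, sql_file in steps:
        if sql_file is None:
            runner(command, check=True)
        else:
            # Equivalent to `sudo mysql < file.sql` in the shell.
            with open(sql_file, "rb") as sql:
                runner(command, stdin=sql, check=True)
```

Calling `run_crawl()` with no arguments performs a full reset and crawl. The `runner` parameter exists only so the sequence can be exercised with a stand-in function instead of launching real processes.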

To run the server:

python3 server.py

