Web Crawler

This project implements a crawler that visits the site epocacosmeticos.com.br and saves a .csv file with the product name, page title, and URL of each product page found.

Rules:

  • The .csv file must not contain duplicate entries (see the sketch after this list);
  • Using the sitemap to collect all of the site's URLs is not allowed; the site must actually be visited and parsed to obtain the information;
  • Except for Scrapy, you may use any frameworks and libraries you want, as long as the main language is Python.
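
The no-duplicates rule is usually enforced with a set of visited URLs, so each page is parsed once and each link is reported once. A minimal sketch using Requests and Beautiful Soup; it assumes nothing about this project's actual code:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def extract_links(url, visited):
    # Parse the page itself (no sitemap) and return only unseen links.
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    links = {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}
    return links - visited    # set difference drops already-visited URLs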

Required

  • Python 3

How to install

  1. Clone the repository.
  2. Create a virtualenv with Python 3 (https://virtualenv.pypa.io/en/stable/).
  3. Activate the virtualenv.
  4. Install the dependencies (pip install -r requirements.txt).
  5. Run the project:
git clone https://github.com/asafepy/crawler-challenge.git
cd crawler-challenge
virtualenv -p python3 .virtualenv
source .virtualenv/bin/activate
pip install -r requirements.txt
make install
make run

How to Run:

  1. Install the requirements and the application;

    • pip install -r requirements.txt
  2. Create the application database (a schema sketch follows this list);

    • python core/db/database.py
  3. Run the crawler;

    • python core/modules/crawler.py
  4. Run the processor;

    • python core/modules/processor.py
  5. Run the indexer;

    • python core/modules/indexer.py
  6. Or use the Makefile:

make run
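
The README doesn't show what database.py actually creates. Purely to make the pipeline concrete, the sketch below assumes a SQLite table with a status column that moves from WAIT (crawled) to DONE (processed); the table and column names are hypothetical:

import sqlite3

def create_database(path="crawler.db"):
    # Hypothetical schema: one row per discovered URL; the processor
    # flips status from WAIT to DONE once the page has been parsed.
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            url    TEXT PRIMARY KEY,    -- PRIMARY KEY rules out duplicates
            name   TEXT,
            title  TEXT,
            status TEXT DEFAULT 'WAIT'  -- WAIT = crawled, not yet processed
        )
    """)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    create_database()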

Modules (core/modules)

crawler.py (MultiThread)

  • Responsible for capturing links starting from the given URL.
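
The threading details aren't documented here; one common shape for a multithreaded link-capture stage uses concurrent.futures from the standard library. A sketch under that assumption (fetch_links and the frontier handling are illustrative, not the project's code):

from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def fetch_links(url):
    # Download one page and return every link found on it.
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

def crawl_threaded(seed, workers=8, max_pages=200):
    visited, frontier = set(), {seed}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(visited) < max_pages:
            batch = list(frontier - visited)[:workers]  # next unseen URLs
            if not batch:
                break
            visited.update(batch)
            # Fetch the batch in parallel; keep only same-site links.
            found = pool.map(fetch_links, batch)
            frontier = (frontier - set(batch)) | {
                link for links in found for link in links
                if link.startswith(seed)}
    return visited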

processor.py (Multiprocess)

  • Responsible for reading the raw records in the database (status WAIT) and updating them.
  • This is a multiprocess application: you can keep it running in the background, launch as many instances as you want, and add more machines and/or more processes to increase processing speed.
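
A sketch of what a WAIT-consuming worker could look like against the hypothetical SQLite schema above (the real parsing and update logic lives in core/modules/processor.py; the CSS selector here is an assumption):

import sqlite3
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup

def process_one(url):
    # Fetch the page and extract the fields the indexer will export.
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    name = soup.select_one("div.productName")  # hypothetical selector
    return (name.get_text(strip=True) if name else "",
            soup.title.string if soup.title else "", url)

def run_processor(db_path="crawler.db", workers=4):
    conn = sqlite3.connect(db_path)
    waiting = [row[0] for row in
               conn.execute("SELECT url FROM pages WHERE status = 'WAIT'")]
    with Pool(workers) as pool:
        for name, title, url in pool.map(process_one, waiting):
            conn.execute(
                "UPDATE pages SET name = ?, title = ?, status = 'DONE' "
                "WHERE url = ?", (name, title, url))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    run_processor()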

indexer.py (SingleProcess)

  • Responsible for generating the csv file: it queries the database for all processed records and indexes them into a csv spreadsheet.
  • The indexer can be run whenever you want, refreshing the spreadsheet data. If no new data has been processed, nothing is indexed; but if the processor has consumed new messages and updated records, the spreadsheet is rewritten with the new values.
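
The export step reduces to one query plus csv.writer. A minimal version, again assuming the hypothetical pages table sketched earlier:

import csv
import sqlite3

def run_indexer(db_path="crawler.db", out="products.csv"):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT name, title, url FROM pages WHERE status = 'DONE'")
    with open(out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "title", "url"])  # header row
        writer.writerows(rows)                     # one line per product
    conn.close()

if __name__ == "__main__":
    run_indexer()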

About

Web crawler with Beautiful Soup and Requests.
