Book crawler

A book crawler built with Scrapy, using proxies for its spiders.
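Scrapy's built-in HttpProxyMiddleware honours a per-request proxy set in request.meta["proxy"]. The sketch below only illustrates that general pattern; the spider name, URLs, and proxy list are placeholders, not this repo's actual code.

# Illustrative sketch: per-request proxies in Scrapy via request.meta["proxy"].
# Spider name, URLs, and proxies are placeholders, not this repo's code.
import random
import scrapy

PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

class BookSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        for url in ["https://www.goodreads.com/book/show/1"]:
            # Scrapy's built-in HttpProxyMiddleware picks the proxy up from meta
            yield scrapy.Request(url, meta={"proxy": random.choice(PROXIES)})

    def parse(self, response):
        yield {"title": response.css("h1::text").get()}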

Dataset (from Goodreads)

Installation

Clone the repo

git clone https://github.com/vmtuan12/scrapy-book-proxy.git

Download prerequisites

Install all the prerequisite packages

pip install -r requirement.txt

Download the BERT model here. Put these 3 files in fast_api/model/. If the model directory does not exist, create it.

Download the TF-IDF file here. Put it in fast_api/.
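The snippet below is only a sketch of how these downloaded artifacts might be loaded, assuming a Hugging Face style BERT checkpoint in fast_api/model/ and a pickled scikit-learn TF-IDF vectorizer (the file name is hypothetical); the actual loading code in fast_api/ may differ.

# Sketch only; assumes a Hugging Face style checkpoint and a pickled TF-IDF vectorizer.
import pickle
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fast_api/model/")
bert = AutoModel.from_pretrained("fast_api/model/")

with open("fast_api/tfidf.pkl", "rb") as f:  # hypothetical file name
    tfidf = pickle.load(f)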

Setup

Start Docker, then bring up the containers

cd flow
docker-compose up setup
docker compose up -d

Wait for all the containers to start up, then check:

localhost:9123 - Kafdrop (Kafka UI)
localhost:5601 - Kibana (Elasticsearch UI)

If both are accessible, the stack is up and running.

Finally, push all the data into Elasticsearch

cd data
python3 prod_to_es.py
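prod_to_es.py pushes the dataset into the book index. Conceptually this is a bulk-indexing job; below is a simplified sketch, assuming the elasticsearch Python client, Elasticsearch on the default localhost:9200, and a JSON-lines data file (the file name is hypothetical). The real script may connect and read the data differently.

# Simplified illustration of bulk indexing; the real prod_to_es.py may differ.
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumes the default ES port

def actions(path="books.jsonl"):  # hypothetical data file name
    with open(path) as f:
        for line in f:
            yield {"_index": "book", "_source": json.loads(line)}

helpers.bulk(es, actions())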

Wait for it to finish, then go to localhost:5601 to check the result. Navigate to Overview as in the image below, then go to Indices in the left sidebar and scroll down to inspect the indices. If the book index contains 149110 documents, the import succeeded.

[Screenshot: Kibana Indices view showing the book index]
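The count can also be checked without Kibana, assuming Elasticsearch is exposed on the default localhost:9200:

# Quick document-count check; assumes Elasticsearch on the default port 9200.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
print(es.count(index="book")["count"])  # expected: 149110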

Run

Web UI

cd dashboard_template
yarn install
yarn dev

Server

cd fast_api
fastapi dev main.py
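By default, fastapi dev serves the app at http://127.0.0.1:8000, and the interactive API docs are available at /docs.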
