Book crawler

A book crawler built with Scrapy, using proxies for its spiders.
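Scrapy's built-in HttpProxyMiddleware honours a per-request proxy set in request.meta["proxy"]. The sketch below only illustrates that general pattern; the spider name, URLs, and proxy list are placeholders, not this repo's actual code.

# Illustrative sketch: per-request proxies in Scrapy via request.meta["proxy"].
# Spider name, URLs, and proxies are placeholders, not this repo's code.
import random
import scrapy

PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

class BookSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        for url in ["https://www.goodreads.com/book/show/1"]:
            # Scrapy's built-in HttpProxyMiddleware picks the proxy up from meta
            yield scrapy.Request(url, meta={"proxy": random.choice(PROXIES)})

    def parse(self, response):
        yield {"title": response.css("h1::text").get()}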

Dataset (from Goodreads)

Installation

Clone the repo

git clone https://github.com/vmtuan12/scrapy-book-proxy.git

Download prerequisites

Install all the prerequisite packages

pip install -r requirement.txt

Download the BERT model here. Put these 3 files in fast_api/model/. If the model directory does not exist, create it.

Download the TF-IDF file here. Put it in fast_api/.
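The snippet below is only a sketch of how these downloaded artifacts might be loaded, assuming a Hugging Face style BERT checkpoint in fast_api/model/ and a pickled scikit-learn TF-IDF vectorizer (the file name is hypothetical); the actual loading code in fast_api/ may differ.

# Sketch only; assumes a Hugging Face style checkpoint and a pickled TF-IDF vectorizer.
import pickle
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fast_api/model/")
bert = AutoModel.from_pretrained("fast_api/model/")

with open("fast_api/tfidf.pkl", "rb") as f:  # hypothetical file name
    tfidf = pickle.load(f)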

Setup

Start Docker, then bring up the containers

cd flow
docker-compose up setup
docker compose up -d

Wait for all the containers to start up, then check:

localhost:9123 - Kafdrop (Kafka UI)
localhost:5601 - Kibana (Elasticsearch UI)

If both are accessible, the stack is up and running.

Finally, push all the data into Elasticsearch

cd data
python3 prod_to_es.py
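prod_to_es.py pushes the dataset into the book index. Conceptually this is a bulk-indexing job; below is a simplified sketch, assuming the elasticsearch Python client, Elasticsearch on the default localhost:9200, and a JSON-lines data file (the file name is hypothetical). The real script may connect and read the data differently.

# Simplified illustration of bulk indexing; the real prod_to_es.py may differ.
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumes the default ES port

def actions(path="books.jsonl"):  # hypothetical data file name
    with open(path) as f:
        for line in f:
            yield {"_index": "book", "_source": json.loads(line)}

helpers.bulk(es, actions())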

Wait for it to finish, then go to localhost:5601 to check the result. Navigate to Overview as in the image below, then go to Indices in the left sidebar and scroll down to inspect the indices. If the book index contains 149110 documents, the import succeeded.

[Screenshot: Kibana Indices view showing the book index]
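The count can also be checked without Kibana, assuming Elasticsearch is exposed on the default localhost:9200:

# Quick document-count check; assumes Elasticsearch on the default port 9200.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
print(es.count(index="book")["count"])  # expected: 149110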

Run

Web UI

cd dashboard_template
yarn install
yarn dev

Server

cd fast_api
fastapi dev main.py
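By default, fastapi dev serves the app at http://127.0.0.1:8000, and the interactive API docs are available at /docs.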
