Using Scrapy with proxies for spiders
Dataset (from Goodreads)
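For context, the crawler routes its requests through proxies via a Scrapy downloader middleware. A minimal sketch of the idea (the PROXIES list and middleware name are illustrative; the repo's own middleware may differ in details):

```python
# A minimal rotating-proxy downloader middleware for Scrapy.
# The proxy URLs below are placeholders, not real endpoints.
import random

PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

class RandomProxyMiddleware:
    """Assigns a random proxy to each outgoing request."""

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors the 'proxy' key in request.meta.
        request.meta["proxy"] = random.choice(PROXIES)
```

A middleware like this is enabled through DOWNLOADER_MIDDLEWARES in settings.py; Scrapy then sends each request through the proxy set in request.meta.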
Clone the repo
git clone https://github.com/vmtuan12/scrapy-book-proxy.git
Install all the prerequisite packages
pip install -r requirement.txt
Download the BERT model here and put these 3 files in fast_api/model/. If the model directory does not exist, create it.
Download the Tf-Idf here and put it in fast_api/.
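Optionally, a quick sanity check that the downloaded artifacts are in place (a small sketch; run it from the repo root):

```python
# Verify the model directory exists (creating it if needed) and list its contents.
from pathlib import Path

model_dir = Path("fast_api/model")
model_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing
print([p.name for p in model_dir.iterdir()])  # expect the 3 BERT model files
```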
Start Docker, then run the containers
cd flow
docker-compose up setup
docker compose up -d
Wait for all the containers to start up, then check:
localhost:9123 - Kafdrop (Kafka UI)
localhost:5601 - Kibana (Elasticsearch UI)
If both are accessible, everything is working.
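If you prefer to poll from a script instead of the browser, here is a small readiness check, assuming the default ports above and plain HTTP:

```python
# Poll Kafdrop and Kibana until they respond (or give up after ~1 minute each).
import time
import urllib.request

ENDPOINTS = {
    "Kafdrop": "http://localhost:9123",
    "Kibana": "http://localhost:5601",
}

for name, url in ENDPOINTS.items():
    for _ in range(30):
        try:
            urllib.request.urlopen(url, timeout=2)
            print(f"{name} is up at {url}")
            break
        except OSError:
            time.sleep(2)  # not ready yet; retry
    else:
        print(f"{name} did not respond at {url}")
```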
Finally, push all the data into Elasticsearch
cd data
python3 prod_to_es.py
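For reference, the core of such a loader is bulk indexing into the book index. A minimal sketch with the official Python client follows; the data file name, JSON-lines format, and port are assumptions, and the actual prod_to_es.py may instead produce to Kafka and let the pipeline index into Elasticsearch:

```python
# Bulk-index documents into the 'book' index, assuming Elasticsearch
# is exposed on localhost:9200 and the data is one JSON object per line.
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed ES endpoint

def actions(path="books.jsonl"):  # hypothetical data file
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield {"_index": "book", "_source": json.loads(line)}

helpers.bulk(es, actions())  # stream all documents into Elasticsearch
```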
Wait for it to finish, then go to localhost:5601 to check the result. Navigate to Overview as shown in the image below, then go to Indices in the left sidebar and scroll down to check the indices. If the book index contains 149110 documents, everything is working.
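You can also verify the count programmatically, assuming Elasticsearch itself is reachable on localhost:9200 (add credentials if your stack has security enabled):

```python
# Query the document count of the 'book' index directly.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
print(es.count(index="book")["count"])  # expect 149110
```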
Web UI
cd dashboard_template
yarn install
yarn dev
Server
cd fast_api
fastapi dev main.py
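For orientation, the kind of endpoint main.py might expose looks like the sketch below; the route, field names, and query shape are illustrative, not the repo's actual API:

```python
# A minimal FastAPI endpoint that searches the 'book' index.
from elasticsearch import Elasticsearch
from fastapi import FastAPI

app = FastAPI()
es = Elasticsearch("http://localhost:9200")  # assumed ES endpoint

@app.get("/search")
def search(q: str, size: int = 10):
    """Full-text search over the book index (illustrative query)."""
    resp = es.search(index="book", query={"match": {"title": q}}, size=size)
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```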