We implement a cron-job website scraper with continuous streaming to a backend product-analytics pipeline for G2, processing 1000+ products in under 10 seconds.
The focus is on fast web scraping and rapid product analysis, built on the Qdrant vector database, vector similarity indexing, and fast API interaction.
- Web Scraping with Selenium: Utilizes Selenium to scrape data from websites periodically as a cron job, handling dynamic web pages and recurring updates.
- Lightning-Fast Communication using Redis Pub/Sub: Integrates Redis Pub/Sub for efficient communication between the scraper and the processing backend (see the sketch after this list).
- Data Analysis and Indexing: Employs similarity indexing to quickly check if scraped data already exists in the database.
- Vector Database: Uses Qdrant for efficient data storage and retrieval, enhanced by custom queries and a neural search engine for faster lookups.
- Streamlit Frontend: Offers a user-friendly interface for visualizing products not present in the database, enhancing user interaction and data exploration.
- Speed: Processes 1000+ products in under 10 seconds, showcasing the framework's high-speed capabilities.
- Efficiency: Utilizes similarity indexing to quickly identify existing data, reducing unnecessary processing.
- Scalability: Designed with scalability in mind, allowing for easy expansion and integration of new features.
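The scraper and processor talk over a Redis channel. Here is a minimal sketch of that wiring in a single script; the channel name `scraped_products` and the message shape are illustrative assumptions, and in the project the two sides run as separate processes:

```python
import json
import redis

# Both sides talk to the Redis Stack container from the setup below.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Processor side: subscribe to the channel before messages start flowing.
pubsub = r.pubsub()
pubsub.subscribe("scraped_products")

# Scraper side: publish each scraped product as a JSON message.
r.publish("scraped_products", json.dumps({"name": "ExampleApp"}))

# Consume messages as they arrive (the first item is the subscribe confirmation).
for message in pubsub.listen():
    if message["type"] == "message":
        product = json.loads(message["data"])
        print("received:", product["name"])
        break  # The real processor would keep this loop running
```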
- Python 3.6+
- Redis Stack (local or cloud-based)
- Selenium WebDriver (local or containerised)
- Qdrant vector database
- Drive to have fun!
To start a Redis Stack container using the `redis-stack` image, follow these steps:
- Run the following command in your terminal:

```bash
docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
```

This command launches a Redis Stack container and exposes RedisInsight on port 8001. You can access RedisInsight by opening your browser and navigating to `localhost:8001`.
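To confirm the container is reachable from Python before moving on, here is an optional sanity check using the `redis` package (not one of the project scripts):

```python
import redis

# Connect to the Redis Stack container started above.
r = redis.Redis(host="localhost", port=6379)
print(r.ping())  # Prints True if Redis is reachable
```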
To set up Qdrant, follow these steps:
- Download the Qdrant image from Docker Hub:

```bash
docker pull qdrant/qdrant
```

- Start Qdrant inside Docker with the following command:

```bash
docker run -p 6333:6333 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant
```

Once Qdrant is running, you can access its Web UI by navigating to `localhost:6333/dashboard`.
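Optionally, verify the instance from Python as well. This assumes the `qdrant-client` package is installed and is just a sanity check, not a project script:

```python
from qdrant_client import QdrantClient

# Connect to the Qdrant container started above.
client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # Empty list of collections on a fresh install
```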
Before proceeding further, create a `.env` file in your project directory and add the following line:

```
BEARER_TOKEN=your_bearer_token_here
```

Replace `your_bearer_token_here` with your actual bearer token.
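The scripts can then read the token at runtime. A minimal sketch, assuming `python-dotenv` is used for loading (the repository's actual loading code may differ):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # Reads key=value pairs from .env into the process environment
BEARER_TOKEN = os.getenv("BEARER_TOKEN")
assert BEARER_TOKEN, "BEARER_TOKEN is missing from .env"
```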
Now, generate the required data by following these steps:
- Run the `ProductsCollector.py` script to create `G2_Products.json`.
- Pre-process `G2_Products.json` to produce `G2_Cleaned.json` (a cleaning sketch follows this list).
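For reference, the pre-processing pass might look like the sketch below. The field names (`name`, `description`) are illustrative assumptions, not taken from the actual JSON schema:

```python
import json

# Hypothetical cleaning pass; the real pre-processing step may differ.
with open("G2_Products.json") as f:
    products = json.load(f)

cleaned = [
    {"name": p["name"].strip(), "description": p.get("description", "").strip()}
    for p in products
    if p.get("name")  # Drop entries without a product name
]

with open("G2_Cleaned.json", "w") as f:
    json.dump(cleaned, f, indent=2)
```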
To build the neural search engine, follow these steps:
- Open the `Qdrant_store.ipynb` notebook.
- Sequentially run the cells in the notebook to vectorize the G2 Products data and prepare the neural search engine; the sketch after this list shows the general shape of that step.
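In outline, the vectorization step amounts to encoding each product and upserting it into Qdrant. A sketch assuming `sentence-transformers` with the 384-dimensional `all-MiniLM-L6-v2` model and a collection named `g2_products` (the notebook's actual model and names may differ):

```python
import json

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
client = QdrantClient(url="http://localhost:6333")

with open("G2_Cleaned.json") as f:
    products = json.load(f)

# Create (or reset) the collection with a vector size matching the model.
client.recreate_collection(
    collection_name="g2_products",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Encode every product description and store it with the product as payload.
vectors = model.encode([p["description"] for p in products])
client.upsert(
    collection_name="g2_products",
    points=[
        PointStruct(id=i, vector=vector.tolist(), payload=product)
        for i, (vector, product) in enumerate(zip(vectors, products))
    ],
)
```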
Once everything is set up, run the `processor.py` script to perform high-speed processing:

```bash
python processor.py
```

This script handles the heavy lifting.
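The similarity-index check at the heart of this step is a nearest-neighbour query against the vectorized catalogue. A rough sketch, where the threshold, model, and collection name are illustrative rather than the project's actual values:

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")

def is_known_product(description: str, threshold: float = 0.9) -> bool:
    """Return True if a sufficiently similar product already exists in Qdrant."""
    hits = client.search(
        collection_name="g2_products",
        query_vector=model.encode(description).tolist(),
        limit=1,  # Only the single closest match matters for the dedup check
    )
    return bool(hits) and hits[0].score >= threshold
```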
To set up and run the Selenium web scraper, first start the container:

```bash
docker run -d -p 4000:4444 -v /dev/shm:/dev/shm selenium/standalone-chrome
```

- Install the Python libraries:

```bash
pip install -r requirements.txt
```

- Run the Selenium web scraper (a connection sketch follows this list):

```bash
python3 scraper/fetchSourceForge.py
```

- Visit `http://localhost:4000/` to see the scraper in action.
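The scraper connects to this container as a remote WebDriver. A minimal connection sketch (the actual `fetchSourceForge.py` likely configures more options and targets specific pages):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Connect to the standalone-chrome container mapped to port 4000 above.
driver = webdriver.Remote(
    command_executor="http://localhost:4000/wd/hub",
    options=Options(),
)
driver.get("https://sourceforge.net/")
print(driver.title)
driver.quit()
```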
Now your environment should be set up and ready to go! If you encounter any issues, feel free to reach out for assistance.