Welcome to the mole! mole is a web scraping and Q&A application that leverages large language models (LLMs).
- Overview
- Key Features
- Run from Streamlit Community Environment
- Run from Local
- Usage
- Python Packages
- Contributing
- License
mole is a Streamlit app for web scraping, data storage, and question answering with LLMs. It follows a Retrieval-Augmented Generation (RAG) architecture and uses llama-index for vector indexing and retrieval.
- Easy-to-Use Interface: Provides a simple and intuitive interface for inputting URLs and asking questions.
- Automated Web Scraping: Uses Playwright to scrape text content from specified web pages, adhering to site rules by checking robots.txt.
- Intelligent Q&A: Employs a RAG architecture, integrating scraping with question-answering capabilities using the llama-index package.
- Secure and Private: Ensures that all data handling and storage adhere to best practices for security and privacy.
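As a rough illustration of the robots.txt check mentioned above, the sketch below uses Python's standard `urllib.robotparser`. The helper name `is_allowed` and the sample rules are assumptions made for this example; mole's own implementation may differ, and a live check would first download the site's robots.txt rather than parse a hard-coded string.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit fetching the URL.

    Hypothetical helper for illustration; not taken from mole's source.
    """
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules (an assumption for this example): block /private/ for everyone.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))     # True
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))  # False
```

A scraper would typically run a check like this before loading a page and skip any URL for which it returns False.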
The application is hosted at https://scraper-mole.streamlit.app/.
To run the mole locally, follow these steps:
- Create a virtual Python environment:
    python -m venv env
    source env/bin/activate
- Clone the repository:
    git clone https://github.com/yourusername/scraper-project.git
    cd scraper-project
- Set up the environment. Install the necessary libraries and tools by running:
    pip install -r requirements.txt
- Run the project:
    streamlit run main.py
To use the mole, follow these steps:
- Run the Streamlit application:
    streamlit run main.py
- Input URL:
  - Enter the URL of the website you want to scrape in the provided input field.
  - Click the button to start the scraping process.
- Select the Website Language:
  - Choose the appropriate language to ensure the correct embedding model is used.
- Ask Questions:
  - After scraping, enter your question in the input field.
  - The application retrieves relevant information, creates a context, sends the question and context to the LLM, and provides an answer.
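The question-answering step above (retrieve, build a context, send it to the LLM) can be sketched in plain Python. This is a toy stand-in and not mole's code: in mole, llama-index performs indexing and retrieval with a real embedding model, whereas this sketch ranks chunks by cosine similarity over simple word counts. All function names and the sample chunks are assumptions, and the final call to the LLM is left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q_vec = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q_vec, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Sample scraped chunks (made up for this example).
chunks = [
    "The library opens at 9 am on weekdays.",
    "Parking is free for visitors.",
    "The cafe serves coffee and sandwiches.",
]
question = "When does the library open?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
print(prompt)  # this string is what would be sent to the LLM
```

Word-count similarity is used only to keep the example free of dependencies. An embedding model, chosen according to the website language, would also match related wording such as "open" and "opens", which this toy version treats as different words.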
This project primarily uses the following Python packages:
- Streamlit: For creating the web application.
- Playwright: For automated web scraping.
- llama-index: For the RAG implementation, handling vector indexing and retrieval.
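A quick way to confirm that these packages are installed in the active environment is sketched below. The helper name `missing_packages` is hypothetical, and the import names (`streamlit`, `playwright`, `llama_index`) are assumed from the usual naming of these packages rather than taken from mole's requirements.txt.

```python
from importlib.util import find_spec


def missing_packages(module_names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Assumed import names for the packages listed above.
REQUIRED_MODULES = ["streamlit", "playwright", "llama_index"]

missing = missing_packages(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported missing, rerunning `pip install -r requirements.txt` inside the virtual environment is the likely fix.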
We welcome contributions to the mole! To contribute, follow these steps:
- Fork the repository.
- Create a new branch:
    git checkout -b feature/your-feature-name
- Make your changes and commit them:
    git commit -m "Add your feature description"
- Push to the branch:
    git push origin feature/your-feature-name
- Create a pull request.
This project is licensed under the MIT License. See the LICENSE file for more details.