mole

Welcome to mole! mole is a web scraping and Q&A application that leverages large language models (LLMs).

Overview

mole is a Streamlit app for web scraping, data storage, and question answering. It uses LLMs for natural language processing and follows a Retrieval-Augmented Generation (RAG) approach, with llama-index handling vector indexing and retrieval.

Key Features

Easy-to-Use Interface: Provides a simple and intuitive interface for inputting URLs and asking questions.
Automated Web Scraping: Uses Playwright to scrape text content from specified web pages, adhering to site rules by checking robots.txt.
Intelligent Q&A: Employs a RAG architecture, integrating scraping with question-answering capabilities using the llama-index package.
Secure and Private: Ensures that all data handling and storage adhere to best practices for security and privacy.
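
As a rough illustration of the robots.txt check mentioned above, the sketch below uses Python's standard `urllib.robotparser` module. The helper name `is_allowed`, the user-agent string `mole-bot`, and the sample rules are assumptions made for this example; they are not taken from the mole source code.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "mole-bot") -> bool:
    """Return True if the given robots.txt content permits fetching url.

    Hypothetical helper: the name and the default user agent are
    illustrative only and are not part of mole.
    """
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines,
    # so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules, invented for this example.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post")
blocked = is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data")
```

In a real scraper the robots.txt content would first be downloaded from the target site (for example with `RobotFileParser.set_url` and `read`), and the page would only be opened in Playwright when the check passes.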

Run from Streamlit Community Environment

The application is hosted at https://scraper-mole.streamlit.app/.

Run from Local

To run mole locally, follow these steps:

Create a virtual Python environment:
```
python -m venv env
source env/bin/activate
```

Clone the repository:
```
git clone https://github.com/yourusername/scraper-project.git
cd scraper-project
```

Set up the environment: Install the necessary libraries and tools by running:
```
pip install -r requirements.txt
```
Run the project:
```
streamlit run main.py
```

Usage

To use mole, follow these steps:

Run the Streamlit application:
```
streamlit run main.py
```
Input URL:
- Enter the URL of the website you want to scrape in the provided input field.
- Click the button to start the scraping process.
Select the Website Language:
- Choose the appropriate language to ensure the correct embedding model is used.
Ask Questions:
- After scraping, enter your question in the input field.
- The application retrieves relevant information, creates a context, sends the question and context to the LLM, and provides an answer.
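
The retrieve-then-generate flow described in the last step can be sketched in plain Python. This is a toy illustration and not mole's implementation: it swaps llama-index's embedding-based vector index for a simple bag-of-words cosine similarity, and the names `retrieve` and `build_prompt`, the prompt wording, and the sample text chunks are all assumptions. The final call to an LLM is left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Combine the retrieved chunks and the question into one LLM prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


# Invented chunks standing in for text scraped from a web page.
chunks = [
    "The museum opens at 9 am and closes at 5 pm.",
    "Tickets cost 12 euros for adults.",
    "The cafe serves coffee and pastries.",
]
question = "When does the museum open?"
prompt = build_prompt(question, retrieve(question, chunks, top_k=1))
```

The resulting `prompt` string is what would be sent to the LLM. A real RAG setup replaces the word-count vectors with learned embeddings, which is why choosing the website language (and therefore a suitable embedding model) matters for retrieval quality.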

Python Packages

This project primarily uses the following Python packages:

Streamlit: For creating the web application.
Playwright: For automated web scraping.
llama-index: For the RAG implementation, handling vector indexing and retrieval.

Contributing

We welcome contributions to mole! To contribute, follow these steps:

Fork the repository.

Create a new branch:
```
git checkout -b feature/your-feature-name
```
Make your changes and commit them:
```
git commit -m "Add your feature description"
```
Push to the branch:
```
git push origin feature/your-feature-name
```

Create a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

