LocalDocRAG

Local document Retrieval-Augmented Generation (RAG) that uses sentence-embedding context for cited question answering with small language models (LMs).

Current Status

The project concept has been demonstrated and the Reader-Answerer model is under way; visualization of semantic search comes next. The application currently runs in the terminal again, and the GUI will be added back soon.

Overview

Reader: Load Files -> Parse Sentences -> Generate Sentence Embeddings -> Tag and Store in CSV
Answerer: Load Sentences CSV -> Request User Query -> (Optional: Preprocess Query with LM; slower, but helps when the query may be malformed or poorly articulated) -> Generate Query Embedding -> Find ALL Sentences with Cosine Similarity above a threshold (we used 0.50) -> Sort Context Sentences by Similarity (including neighboring sentences improves the sort but is slower) -> If the context exceeds the LM's maximum input length, Summarize it with the LM (assumed to always happen when the corpus holds enough relevant knowledge) -> Build Prompt -> Generate Answer with LM -> List Context Sources
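
The retrieval step of the Answerer can be illustrated with a minimal sketch. It is not the project's actual code: the `retrieve` function, the `(sentence, source, embedding)` row layout, and the toy 2-dimensional vectors are assumptions made for illustration, and plain Python stands in for whatever vector library the project uses. Only the 0.50 threshold and the sort by similarity come from the overview above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, rows, threshold=0.50):
    """Return every (score, sentence, source) above the threshold, best first.

    `rows` is a hypothetical list of (sentence, source, embedding) tuples.
    """
    hits = []
    for sentence, source, embedding in rows:
        score = cosine_similarity(query_embedding, embedding)
        if score > threshold:
            hits.append((score, sentence, source))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits

# Toy 2-dimensional vectors stand in for real sentence embeddings.
rows = [
    ("Cats purr when content.", "pets.pdf", [1.0, 0.0]),
    ("Stock prices fell.", "news.pdf", [0.0, 1.0]),
    ("Kittens purr too.", "pets.pdf", [0.8, 0.6]),
]
hits = retrieve([1.0, 0.0], rows)
```

Keeping the source next to each sentence is what makes the final "List Context Sources" step possible: every hit already carries the file it came from.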

Installation

Clone the repository:

```
git clone https://github.com/your-username/EmbedQA.git
cd EmbedQA
```

Install the required packages:
```
pip install -r requirements.txt
```

Usage

Running the Application

Scrape PDF Content:
- Run the script to scrape PDF content.
```
python src/scrape.py
```
Clean the Extracted Text:
- Run the script to clean the extracted text.
```
python src/extract_clean_text.py
```
Generate Embeddings:
- Run the script to generate embeddings for the cleaned text.
```
python src/create_embeddings.py
```

Visualize the Embeddings:
- Run the scripts to visualize the embeddings, either with dimensionality reduction or with parallel coordinates.
```
python src/visualize_embeddings_in_DR.py  # For dimensionality reduction visualization
python src/visualize_embeddings_in_PC.py  # For parallel coordinates visualization
```

Chat Interface:
- The chat interface is currently unimplemented in the new version.

Explanation of the Process

PDF Extraction:
- Load PDF files from the specified directory and extract text from each page using PyMuPDF.
Text Cleaning and Splitting:
- Clean the extracted text by removing HTML tags, non-ASCII characters, and multiple spaces.
- Split the cleaned text into sentences and filter out sentences that are not semantically useful (e.g., too short, contain mostly numbers/symbols).
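
A rough sketch of the cleaning and filtering steps follows. It deliberately swaps in regular expressions from the standard library for the beautifulsoup4 and nltk/spacy tooling listed under Dependencies, so it only approximates the project's behavior. The function names and the `min_words` and `min_alpha_ratio` cutoffs are hypothetical values chosen for illustration.

```python
import re

def clean_text(text):
    """Strip HTML tags, non-ASCII characters, and repeated whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)                   # crude tag removal
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def is_useful(sentence, min_words=4, min_alpha_ratio=0.6):
    """Reject sentences that are too short or mostly digits/symbols."""
    if len(sentence.split()) < min_words:
        return False
    letters = sum(ch.isalpha() for ch in sentence)
    non_space = sum(not ch.isspace() for ch in sentence)
    return non_space > 0 and letters / non_space >= min_alpha_ratio

raw = (
    "<p>Embeddings map   sentences to vectors.</p> 12 34 56. "
    "Caf\u00e9 menus are not relevant here."
)
sentences = [s for s in split_sentences(clean_text(raw)) if is_useful(s)]
```

In this example the numeric fragment is filtered out, and the accented character is dropped rather than transliterated, which is the cost of removing non-ASCII characters outright.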
Embedding Generation:
- Generate sentence embeddings using the sentence-transformers model nli-roberta-large.
Saving and Loading Embeddings:
- Save the generated embeddings and their corresponding sentences to a CSV file.
- Load the embeddings from the CSV file for similarity search.
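
One way to round-trip embeddings through a CSV file is sketched below, using only the standard library. The column names and the choice to store each vector as a JSON string in a single cell are assumptions for illustration; the project's actual CSV layout is not documented here.

```python
import csv
import json
import os
import tempfile

def save_embeddings(path, rows):
    """Write (sentence, source, embedding) rows; the vector is stored as JSON text."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence", "source", "embedding"])
        for sentence, source, embedding in rows:
            writer.writerow([sentence, source, json.dumps(embedding)])

def load_embeddings(path):
    """Read the rows back, parsing each JSON-encoded vector into a list of floats."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            (row["sentence"], row["source"], json.loads(row["embedding"]))
            for row in csv.DictReader(f)
        ]

# Round-trip one row through a temporary file.
rows = [("Cats purr, usually.", "pets.pdf", [0.25, -0.5, 1.0])]
path = os.path.join(tempfile.mkdtemp(), "sentences.csv")
save_embeddings(path, rows)
loaded = load_embeddings(path)
```

Letting the `csv` module handle quoting matters because sentences routinely contain commas and quotation marks.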
Similarity Search:
- Use the sentence-transformers model to find the top k most similar sentences to the user query based on cosine similarity.
Contextual Response Generation:
- Extract the neighboring sentences for each of the top k similar sentences.
- Combine these sentences to form a context.
- Use the transformers model deepset/roberta-large-squad2 to generate a response based on the combined context.
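
The neighbor-expansion step can be sketched as follows. The `build_context` helper and its `window` parameter are hypothetical; the sketch assumes the sentences are kept in document order, so that a sentence's neighbors sit at adjacent indices.

```python
def build_context(sentences, hit_indices, window=1):
    """Join each hit sentence with `window` neighbors on either side.

    Overlapping windows are merged so no sentence is repeated, and the
    original document order is preserved.
    """
    selected = set()
    for i in hit_indices:
        start = max(0, i - window)
        end = min(len(sentences), i + window + 1)
        selected.update(range(start, end))
    return " ".join(sentences[i] for i in sorted(selected))

sentences = ["A.", "B.", "C.", "D.", "E.", "F."]
context = build_context(sentences, hit_indices=[1, 2])
```

The resulting string is what would be handed to the question-answering model as its context, together with the user's question.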
Graphical User Interface:
- Provide a Tkinter-based GUI for users to input their questions and receive responses.
- Display the top k most similar sentences and the generated response.

Embeddings Generation for Non-NVIDIA Systems

To support systems without NVIDIA GPUs, we convert the PyTorch model to ONNX format and use ONNX Runtime with DirectML. This enables the use of hardware-accelerated inference on a broader range of devices, including those with AMD or Intel GPUs.
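
As a hedged illustration of how the execution provider might be selected, the helper below prefers DirectML and falls back to the CPU. `choose_providers` is a hypothetical name and `model.onnx` is a placeholder path; the provider strings are the names ONNX Runtime uses for its DirectML and CPU backends. The ONNX Runtime calls are shown only in comments so the snippet runs without the library installed.

```python
def choose_providers(available):
    """Prefer DirectML when present, keeping CPU as a fallback when listed."""
    preferred = ["DmlExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]

# With ONNX Runtime installed, the result would typically be used like this
# (the model path is a placeholder):
#   import onnxruntime as ort
#   providers = choose_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)

providers = choose_providers(["DmlExecutionProvider", "CPUExecutionProvider"])
cpu_only = choose_providers(["CUDAExecutionProvider", "CPUExecutionProvider"])
```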

Subprograms for Clustering and Visualization

Clustering and Dimensionality Reduction

We use several techniques to explore clusters of semantic similarity within the embeddings:

PCA (Principal Component Analysis): Reduces the dimensionality of embeddings to three principal components for 3D visualization.
t-SNE (t-Distributed Stochastic Neighbor Embedding): Reduces the dimensionality to three components, focusing on preserving local similarities.
UMAP (Uniform Manifold Approximation and Projection): Another dimensionality reduction technique that preserves more of the global structure compared to t-SNE.

Visualization

3D Plots: We create 3D scatter plots of the reduced embeddings using Plotly. These plots allow interactive exploration of the clusters, with additional features like hovering over points to see the associated sentences.
JavaScript Integration: Custom JavaScript is added to enable copying the text of a sentence to the clipboard when clicked.

Tuning Hyperparameters with Dash

We provide a Dash application for tuning UMAP hyperparameters. This application allows users to adjust parameters such as the number of neighbors, minimum distance, and metric used for computing the UMAP embedding. The resulting embeddings are visualized in a 3D scatter plot, facilitating the exploration of different clustering behaviors.

Screenshots

UMAP Visualization 1

UMAP Visualization 2: Subcluster of Sentences with German Words

Dependencies

PyMuPDF: For extracting text from PDF files.
sentence-transformers: For generating sentence embeddings.
transformers: For question answering.
nltk: For natural language processing tasks such as tokenization and stopword removal.
spacy: For advanced natural language processing tasks.
scikit-learn: For dimensionality reduction (PCA, t-SNE).
umap-learn: For UMAP dimensionality reduction.
matplotlib: For plotting embeddings.
plotly: For interactive plotting.
tkinter: For creating the graphical user interface.
beautifulsoup4: For cleaning HTML tags from text.
dash: For building the hyperparameter tuning web app.

License

This project is licensed under the MIT License.

Repository: AvaAvarai/LocalDocRAG