Local Document Retrieval Augmented Generation (RAG) with sentence embedding context for cited question answering with small language models (LM).
Status: the project concept has been demonstrated and the Reader-Answerer model has been started; visualization of semantic search is next. The project currently runs in the terminal again, and the GUI will be added back soon.
Reader: Load Files -> Parse Sentences -> Generate Sentence Embeddings -> Tag and Store in CSV
Answerer: Load Sentences CSV -> Request User Query -> (Preprocess Query with LM; slower, but helps if the query could be malformed or poorly articulated) -> Generate Embedding of Query -> Find ALL Sentences with Cosine Similarity above threshold (we used 0.50) -> Sort Context Sentences by Similarity Metric (looking at neighboring sentences improves the sort but is also slower) -> If the context length exceeds the max input of the LM, generate a summary with the LM (assumed to always happen if the corpus contains enough knowledge) -> Build Prompt -> Generate Answer with LM -> List Context Sources
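
The threshold-and-sort step of the Answerer can be sketched in plain Python. This is a minimal illustration, not the project's code: the function names, the (sentence, source, vector) row layout, and the toy 2-dimensional vectors are assumptions made for the example. Only the 0.50 threshold comes from the description above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def select_context(query_vec, rows, threshold=0.50):
    """Keep every row scoring above the threshold, sorted best first.

    rows is a list of (sentence, source, vector) tuples (hypothetical layout).
    """
    scored = []
    for sentence, source, vec in rows:
        score = cosine_similarity(query_vec, vec)
        if score > threshold:
            scored.append((score, sentence, source))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

# Toy data: real sentence embeddings have hundreds of dimensions.
rows = [
    ("Cats are mammals.", "a.pdf", [1.0, 0.0]),
    ("Dogs bark.", "b.pdf", [0.0, 1.0]),
    ("Kittens are young cats.", "a.pdf", [0.8, 0.6]),
]
context = select_context([1.0, 0.0], rows)
for score, sentence, source in context:
    print(f"{score:.2f}  {sentence}  [{source}]")
```

Keeping the source next to each sentence is what makes the final "List Context Sources" step possible without a second lookup.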
Clone the repository:
git clone https://github.com/your-username/EmbedQA.git
cd EmbedQA
Install the required packages:
pip install -r requirements.txt
Scrape PDF Content:
- Run the script to scrape PDF content.
python src/scrape.py
Clean the Extracted Text:
- Run the script to clean the extracted text.
python src/extract_clean_text.py
Generate Embeddings:
- Run the script to generate embeddings for the cleaned text.
python src/create_embeddings.py
Visualize the Embeddings:
- Run the scripts to visualize embeddings using various dimensionality reduction techniques.
python src/visualize_embeddings_in_DR.py # For dimensionality reduction visualization
python src/visualize_embeddings_in_PC.py # For parallel coordinates visualization
Chat Interface:
- The chat interface is currently unimplemented in the new version.
PDF Extraction:
- Load PDF files from the specified directory and extract text from each page using PyMuPDF.
Text Cleaning and Splitting:
- Clean the extracted text by removing HTML tags, non-ASCII characters, and multiple spaces.
- Split the cleaned text into sentences and filter out sentences that are not semantically useful (e.g., too short, contain mostly numbers/symbols).
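
The cleaning and filtering rules above can be sketched with the standard library alone. This is a simplified stand-in, not the project's implementation: the dependency list names beautifulsoup4, nltk, and spacy for these tasks, whereas this sketch uses regular expressions. The function names and the thresholds (min_words, min_alpha_ratio) are assumptions chosen for the example.

```python
import re

def clean_text(raw):
    """Remove HTML tags, non-ASCII characters, and repeated whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)                    # strip HTML tags
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    text = re.sub(r"\s+", " ", text)                       # collapse runs of whitespace
    return text.strip()

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def is_useful(sentence, min_words=4, min_alpha_ratio=0.6):
    """Reject sentences that are too short or mostly numbers/symbols."""
    if len(sentence.split()) < min_words:
        return False
    non_space = [c for c in sentence if not c.isspace()]
    alpha = sum(c.isalpha() for c in non_space)
    return alpha / len(non_space) >= min_alpha_ratio

raw = (
    "<p>Embeddings map sentences to vectors.</p>  "
    "Table 3: 12.5 44.1 9.8 77.0. OK. "
    "\u2022 Retrieval uses cosine similarity over sentences."
)
sentences = [s for s in split_sentences(clean_text(raw)) if is_useful(s)]
print(sentences)
```

The regex splitter will mis-split on abbreviations such as "e.g.", which is one reason a tokenizer from nltk or spacy is the better choice on real PDF text.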
Embedding Generation:
- Generate sentence embeddings using the sentence-transformers model nli-roberta-large.
Saving and Loading Embeddings:
- Save the generated embeddings and their corresponding sentences to a CSV file.
- Load the embeddings from the CSV file for similarity search.
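
One way to round-trip sentences and embeddings through a CSV file is to serialize each vector as a JSON string in its own column. The sketch below assumes that layout; the column names (sentence, source, embedding) and the file name are hypothetical and may differ from what the project's scripts write.

```python
import csv
import json
import os
import tempfile

FIELDS = ["sentence", "source", "embedding"]  # hypothetical column names

def save_embeddings(path, rows):
    """Write (sentence, source, vector) tuples, storing each vector as JSON text."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for sentence, source, vector in rows:
            writer.writerow(
                {"sentence": sentence, "source": source, "embedding": json.dumps(vector)}
            )

def load_embeddings(path):
    """Read the CSV back into (sentence, source, vector) tuples."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            (r["sentence"], r["source"], json.loads(r["embedding"]))
            for r in csv.DictReader(f)
        ]

rows = [
    ("Cats are mammals.", "a.pdf", [0.25, -0.5, 1.0]),
    ('He said "hi", then left.', "b.pdf", [0.0, 0.125, 0.75]),
]
path = os.path.join(tempfile.mkdtemp(), "sentences.csv")
save_embeddings(path, rows)
loaded = load_embeddings(path)
```

Using the csv module rather than manual string joins keeps sentences that contain commas or quotes intact, as the second row shows.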
Similarity Search:
- Use the sentence-transformers model to find the top k most similar sentences to the user query based on cosine similarity.
Contextual Response Generation:
- Extract the neighboring sentences for each of the top k similar sentences.
- Combine these sentences to form a context.
- Use the transformers model deepset/roberta-large-squad2 to generate a response based on the combined context.
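
The neighbor-extraction and combination steps can be sketched as follows. The function name, the window parameter, and the choice to merge overlapping windows and keep document order are assumptions for illustration; the project may combine the sentences differently.

```python
def build_context(sentences, hit_indices, window=1):
    """Join each hit sentence with its neighbors, without duplicates.

    sentences   : all stored sentences in their original order
    hit_indices : positions of the top k similar sentences
    window      : how many neighbors to take on each side (assumed default)
    """
    keep = set()
    for i in hit_indices:
        start = max(0, i - window)                    # clamp at the first sentence
        stop = min(len(sentences), i + window + 1)    # clamp at the last sentence
        keep.update(range(start, stop))
    # Sorting restores reading order and merges overlapping windows.
    return " ".join(sentences[i] for i in sorted(keep))

sentences = ["S0.", "S1.", "S2.", "S3.", "S4.", "S5.", "S6."]
context = build_context(sentences, hit_indices=[4, 0], window=1)
print(context)
```

The resulting string is what would be handed to the question-answering model as its context. If sentences from several documents share one list, a neighbor may belong to a different file, so a real implementation would likely also check that the source matches.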
Graphical User Interface:
- Provide a Tkinter-based GUI for users to input their questions and receive responses.
- Display the top k most similar sentences and the generated response.
To support systems without NVIDIA GPUs, we convert the PyTorch model to ONNX format and use ONNX Runtime with DirectML. This enables the use of hardware-accelerated inference on a broader range of devices, including those with AMD or Intel GPUs.
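
A minimal sketch of selecting the DirectML backend in ONNX Runtime, with a CPU fallback, might look like this. The helper names and the model path argument are hypothetical, and the sketch assumes the onnxruntime-directml package is installed; the import sits inside the function so the provider-selection logic can be read and exercised without it.

```python
PREFERRED = ["DmlExecutionProvider", "CPUExecutionProvider"]

def pick_providers(available, preferred=PREFERRED):
    """Return the preferred providers that are actually available, in order."""
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]  # always fall back to CPU

def create_session(model_path):
    """Open an exported ONNX model, using DirectML when the runtime offers it."""
    import onnxruntime as ort  # assumption: installed via onnxruntime-directml

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)

print(pick_providers(["DmlExecutionProvider", "CPUExecutionProvider"]))
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

Listing the CPU provider after DirectML lets the same script run on machines where no supported GPU is present.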
We use several techniques to explore clusters of semantic similarity within the embeddings:
- PCA (Principal Component Analysis): Reduces the dimensionality of embeddings to three principal components for 3D visualization.
- t-SNE (t-Distributed Stochastic Neighbor Embedding): Reduces the dimensionality to three components, focusing on preserving local similarities.
- UMAP (Uniform Manifold Approximation and Projection): Another dimensionality reduction technique that preserves more of the global structure compared to t-SNE.
- 3D Plots: We create 3D scatter plots of the reduced embeddings using Plotly. These plots allow interactive exploration of the clusters, with additional features like hovering over points to see the associated sentences.
- JavaScript Integration: Custom JavaScript is added to enable copying the text of a sentence to the clipboard when clicked.
We provide a Dash application for tuning UMAP hyperparameters. This application allows users to adjust parameters such as the number of neighbors, minimum distance, and metric used for computing the UMAP embedding. The resulting embeddings are visualized in a 3D scatter plot, facilitating the exploration of different clustering behaviors.
- PyMuPDF: For extracting text from PDF files.
- sentence-transformers: For generating sentence embeddings.
- transformers: For question answering.
- nltk: For natural language processing tasks such as tokenization and stopword removal.
- spacy: For advanced natural language processing tasks.
- scikit-learn: For dimensionality reduction (PCA, t-SNE).
- umap-learn: For UMAP dimensionality reduction.
- matplotlib: For plotting embeddings.
- plotly: For interactive plotting.
- tkinter: For creating the graphical user interface.
- beautifulsoup4: For cleaning HTML tags from text.
- dash: For building the hyperparameter tuning web app.
This project is licensed under the MIT License.