Reading Time: ~10 minutes
Building Art Deco RAG ChatBot using PulseJet Github Repo: https://github.com/Jet-Engine/art-deco-chatbot
This blogpost can be read from the following links:
Large Language Models (LLMs) have significantly advanced, improving their ability to answer a broad array of questions. However, they still encounter challenges, particularly with specific or recent information, often resulting in inaccuracies or "hallucinations." To address these issues, the Retrieval Augmented Generation (RAG) approach integrates a document retrieval step into the response generation process. This approach uses a corpus of documents and employs vector databases for efficient retrieval, enhancing the accuracy and reliability of LLM responses through three key steps:
- Segmenting documents into manageable chunks for context window of used LLM.
- Generating embeddings for both the query and document chunks to measure their relevance through similarity scores.
- Retrieving the most relevant chunks and using them as context to generate well-informed answers.
Vector databases facilitate quick similarity searches and efficient data management, making RAG a powerful solution for enhancing LLM capabilities.
The Art Deco era, spanning the roaring 1920s to the 1940s, left a dazzling legacy in architecture. Despite the capabilities of models like Meta's Llama3.1, their responses can be unreliable, especially for nuanced or detailed queries specific to Art Deco. Our goal with the Art Deco ChatBot is to use RAG to improve the quality of responses about Art Deco architecture, comparing these with those generated by traditional LLMs in both quality and time efficiency.
By designing the Art Deco ChatBot, we also aim to show how a complex RAG system can be built. You can access the complete code at the Art Deco ChatBot GitHub repository. By examining the code and reading this README, you will learn:
- How to scrape documents from Wikipedia and store them in a structured format.
- How to index these documents in a vector database for efficient retrieval.
- How to use JetEngine's performant vector database: Pulsejet
- How to use LiteLLM to query different LLMs easily.
- How to integrate a RAG system with Ollama
- How to write a RAG system that would chunk documents, generate embeddings, and retrieve relevant chunks.
- How to evaluate RAG output with LLM output
Ollama is a program that facilitates running LLM models easily on local machines.
- Install Ollama on your local machine by following instructions on the Ollama website.
- Download the required models for the Art Deco ChatBot project:
ollama pull llama3.1
(LLM that will be used for RAG)ollama pull nomic-embed-text
(embedding model that will be used for RAG)
- You can run these models in your terminal after they are downloaded, but this is not a prerequisite for this project.
In this project, we not only aim to write code to show how RAG can be done but also to compare and benchmark results of RAG with queries to different LLMs. Some of these LLMs cannot be run locally (like GPT-4o
), while others are compute-heavy and are run on cloud services (like Llama3.1:70b
on Groq).
LiteLLM provides a unified interface to query different LLMs, making our code cleaner and more readable. Checking out the LiteLLM Python library is recommended but not required for this project.
Get your API keys from OpenAI and Groq to use them in the project. Be aware that you may be billed for using these services. While the Groq API
can be used for free at the time of writing, the OpenAI API
is not free.
PulseJet is a high-performance vector database that enables efficient storage and retrieval of document embeddings. To set up PulseJet:
- Install PulseJet by running:
pip install pulsejet
- Create a Docker container with PulsejetDB:
docker run --name pulsejet_container -p 47044-47045:47044-47045 jetngine/pulsejet
Note: You can skip the first step since pulsejet is already included in the requirements.txt
file.
Check PulseJet Docs for details about running Pulsejet Docker images and using the pulsejet Python library for vector database operations.
Install all necessary dependencies by running:
pip install -r requirements.txt
This project was developed using a
conda
environment withPython 3.11
.
As we have not tested the project in different environments, we recommend adhering to this configuration for optimal performance and compatibility.
The Art Deco ChatBot uses two YAML files for configuration: config.template.yaml
and secrets.yaml
. Here's a detailed breakdown of each section:
Create a secrets.yaml
file with your API keys:
#api_keys:
openai_key: "your_openai_key_here"
groq_key: "your_groq_key_here"
- openai_key: Your API key for OpenAI services, used primarily for interfacing with OpenAI's models.
- groq_key: Your API key to access Groq's computational resources.
#models:
main_model: "llama3.1"
embed_model: "nomic-embed-text"
#vector_db:
vector_db: "pulsejet"
#pulsejet:
pulsejet_location: "remote"
pulsejet_collection_name: "art-deco"
#paths:
rag_files_path: "rag_files/"
questions_file_path: "evaluation/questions.csv"
evaluation_path: "evaluation/"
rag_prompt_path: "evaluation/rag_prompt.txt"
metrics_file_path: "evaluation/metrics.json"
#embeddings:
embeddings_file_path: "embeddings_data/all_embeddings_HSNW.h5"
use_precalculated_embeddings: true
#llm_models:
all_models:
gpt-4o: "gpt-4o"
groq-llama3.1-8b: "groq/llama-3.1-8b-instant"
groq-llama3.1-70b: "groq/llama-3.1-70b-versatile"
ollama-llama3.1: "ollama/llama3.1"
ollama-llama3.1-70b: "ollama/llama3.1:70b"
selected_models:
- "gpt-4o"
- "groq-llama3.1-70b"
- "ollama-llama3.1"
#rag_parameters:
sentences_per_chunk: 10
chunk_overlap: 2
file_extension: ".txt"
Here's a detailed explanation of each section:
- main_model: Specifies the primary LLM used for retrieval-augmented tasks. In this case, it's set to "llama3.1".
- embed_model: Indicates the model used for generating embeddings. Here, it's set to "nomic-embed-text".
- vector_db: Specifies the vector database to be used. In this project, we're using "pulsejet". In future work we may integrate our RAG systems to different vector databases so that one could run our RAG systems with different databases and see high performance of Pulsejet in benchmarks.
- pulsejet_location: The location where PulseJet is running. Set to "remote" for a Docker container instance.
- pulsejet_collection_name: The name of the collection within PulseJet where document embeddings are stored.
- rag_files_path: The directory path where articles fetched by the wiki-bot are stored.
- questions_file_path: Location of the CSV file with questions used to evaluate the model's performance.
- evaluation_path: Specifies the directory for storing output files from the evaluation scripts.
- rag_prompt_path: Path to the RAG prompt template file.
- metrics_file_path: Path to save performance metrics.
- embeddings_file_path: The full path to the H5 file where embeddings are stored or will be saved.
- use_precalculated_embeddings: When set to
true
, the system will load embeddings from the specified file. Whenfalse
, it will generate new embeddings and save them to this file.
- all_models: A dictionary where the keys are names used to identify the models in the project, and the values are how these models are known to LiteLLM. You need to check https://docs.litellm.ai/docs/providers if you are going to modify this parameter.
- selected_models: A list of model names (keys from all_models) that will be used in the project.
- sentences_per_chunk: Specifies the number of sentences to include in each chunk when splitting the documents. This parameter affects the granularity of the information retrieved during the RAG process.
- chunk_overlap: Determines the number of sentences that overlap between adjacent chunks. This overlap helps maintain context across chunk boundaries.
- file_extension: Specifies the file type to be processed.
Ensure you update these configuration files with your specific settings before running the project. Adjusting the RAG parameters can significantly impact the performance and accuracy of the RAG system. Experimentation with different values may be necessary to find the optimal configuration for your specific use case and document set.
This step is optional since the content files of all scraped articles from Wikipedia are available in the https://huggingface.co/datasets/JetEngine/Art_Deco_USA_DS.
You can download this dataset and copy all text files from it into the rag_files directory. If you plan to use pre-calculated embeddings, which will be explained in the next section, you don't actually need to download this dataset.
There is no need to repeat the scraping process. You could skip reading rest of this section if you are not interested in data scraping process.
Our initial step involves gathering knowledge about Art-Deco architecture. We focus on U.S. structures, given their prominence in the Art-Deco movement. The wiki-bot.py script automates the collection of relevant Wikipedia articles, organizing them into a structured directory for ease of access.
Run the bot using:
python wiki-bot.py
When you run wiki-bot.py with an empty rag_files
directory, it saves the contents of the scraped Wikipedia articles in a sub-folder named text
under rag_files. The bot also creates various sub-folders to organize different types of data such as article URLs, references, etc. Since our current focus is only on the contents of the Wikipedia articles, to reduce clutter, we only transferred the contents from the text
sub-folder to our HG dataset and removed all other sub-folders.
Thus, if you want to run the bot yourself which is optional since the scraped documents are already available in Hugging Face, you would need to either copy all files from the text sub-folder to the rag_files
directory and then delete all sub-folders within rag_files
, or simply change the rag_files_path
in config.yaml
to rag_files/text
.
Index the documents by running:
python indexing.py
This script processes the documents, generates embeddings, and stores them in PulseJet.
If you don't want to lose time for generating embeddings, you can download pre-calculated embeddings
from https://huggingface.co/JetEngine/rag_art_deco_embeddings
and set use_precalculated_embeddings: true
in the configuration.
In our setup generation of embeddings takes around 15 minutes to complete and insertion of vectors to Pulsejet takes around 4 seconds.
The script outputs timing information for:
- Embedding generation or loading
- Insertion of vectors into Pulsejet
- Sample vector search operations
Ensure your configuration is correct, then run:
python chat.py
This script queries different LLMs and the RAG system, outputting results in HTML, JSON, and CSV formats for comparison.
Pulsejet is used in this project for efficient vector storage and retrieval. Here's a detailed overview of how Pulsejet is integrated into our Art Deco ChatBot project:
-
Initializing the Pulsejet Client:
client = pj.PulsejetClient(location=config['pulsejet_location'])
This creates a Pulsejet client. In our project, we're using a remote Pulsejet instance, so the
location
is set to "remote". This connects to a Pulsejet server running in a Docker container. -
Creating a Collection:
client.create_collection(collection_name, vector_config)
This creates a new collection in Pulsejet to store our document embeddings. The
vector_config
parameter specifies the configuration for the vector storage, such as the vector size and index type (e.g., HNSW for efficient similarity search). -
Inserting Vectors: In our project, we use the following pattern for inserting vectors:
collection[0].insert_single(collection[1], embed, meta)
This might look confusing at first, but here's what it means:
collection[0]
is actually our Pulsejet client instance.collection[1]
is the name of the collection we're inserting into.embed
is the vector we're inserting.meta
is additional metadata associated with the vector.
This is equivalent to calling:
client.insert_single(collection_name, vector, meta)
For bulk insertions, we use:
client.insert_multi(collection_name, embeds)
This inserts multiple embeddings at once, which is more efficient for large datasets.
-
Searching Vectors:
results = client['db'].search_single(collection, query_embed, limit=5, filter=None)
This performs a similarity search in the specified Pulsejet collection to find the most relevant documents for a given query vector. The
limit
parameter specifies the maximum number of results to return.In our project,
client['db']
is used to access the database methods of the Pulsejet client. This is equivalent to using the client directly:results = client.search_single(collection_name, query_vector, limit=5, filter=None)
-
Closing the Connection:
client.close()
This closes the connection to the Pulsejet database when it's no longer needed.
The PulsejetRagClient
class is defined in pulsejet_rag_client.py
and provides a high-level interface for interacting with PulseJet in the context of our RAG system. Here's a breakdown of its key components:
-
Initialization:
class PulsejetRagClient: def __init__(self, config): self.config = config self.collection_name = config['pulsejet_collection_name'] self.main_model = config['main_model'] self.embed_model = config['embed_model'] self.client = pj.PulsejetClient(location=config['pulsejet_location'])
The client is initialized with configuration parameters, setting up the PulseJet client and storing relevant config values.
-
Creating a Collection:
def create_collection(self): vector_size = get_vector_size(self.config['embed_model']) vector_params = pj.VectorParams(size=vector_size, index_type=pj.IndexType.HNSW) try: self.client.create_collection(self.collection_name, vector_params) logger.info(f"Created new collection: {self.collection_name}") except Exception as e: logger.info(f"Collection '{self.collection_name}' already exists or error occurred: {str(e)}")
This method creates a new collection in PulseJet with the specified parameters. It uses the
get_vector_size
function to determine the appropriate vector size for the embeddings. -
Inserting Vectors:
def insert_vector(self, vector, metadata=None): try: self.client.insert_single(self.collection_name, vector, metadata) logger.debug(f"Inserted vector with metadata: {metadata}") except Exception as e: logger.error(f"Error inserting vector: {str(e)}") def insert_vectors(self, vectors, metadatas=None): try: self.client.insert_multi(self.collection_name, vectors, metadatas) logger.debug(f"Inserted {len(vectors)} vectors") except Exception as e: logger.error(f"Error inserting multiple vectors: {str(e)}")
These methods handle the insertion of single and multiple vectors into the PulseJet collection, along with their associated metadata.
-
Searching Vectors:
def search_similar_vectors(self, query_vector, limit=5): try: results = self.client.search_single(self.collection_name, query_vector, limit=limit, filter=None) return results except Exception as e: logger.error(f"Error searching for similar vectors: {str(e)}") return []
This method performs a similarity search in the PulseJet collection to find the most relevant documents for a given query vector.
-
Closing the Connection:
def close(self): try: self.client.close() logger.info("Closed Pulsejet client connection") except Exception as e: logger.error(f"Error closing Pulsejet client connection: {str(e)}")
This method closes the connection to the PulseJet database when it's no longer needed.
The PulsejetRagClient
is used throughout the project to interact with PulseJet. Here's how it's typically instantiated and used:
-
Creation:
from pulsejet_rag_client import create_pulsejet_rag_client config = get_config() rag_client = create_pulsejet_rag_client(config)
-
Indexing Documents:
In indexing.py
, we use the client to create the collection and insert vectors:
rag_client.create_collection()
for file_name, file_embeddings in embeddings_data.items():
for chunk_id, content, embed in file_embeddings:
metadata = {"filename": file_name, "chunk_id": chunk_id, "content": content}
rag_client.insert_vector(embed, metadata)
- Searching Similar Vectors:
In rag.py
, we use the client to search for similar vectors during the RAG process:
results = rag_client.search_similar_vectors(query_embed, limit=5)
- Closing the Connection:
After operations are complete, we close the connection:
rag_client.close()
This implementation provides a clean, encapsulated interface for all PulseJet operations in our RAG system.
- RAG tasks with
LLama3.1
take longer than simple question answering due to the increased query length. - Embedding extraction is time-consuming but can be pre-calculated and reused.
- LLM inference and embedding extraction takes considerable time.
- Pulsejet operations (insertion and search) are very fast.
The Art Deco ChatBot demonstrates how LLMs could be better utilized with RAG. Our project offers a comprehensive exploration of RAG implementation, covering every step from data scraping and document chunking to embedding creation and the integration of vector databases.
As the document base for a RAG system grows larger, the performance of insertion and search operations becomes increasingly critical. By learning how to integrate the Pulsejet vector database into a full-fledged RAG system, one can significantly benefit from its capabilities, particularly when dealing with RAG applications on large document bases.
Our RAG responses could have been more accurate. To enhance our Art Deco ChatBot's performance, we are considering several experimental approaches:
- Evaluating different Large Language Models (LLMs)
- Testing various embedding models
- Exploring alternative chunking techniques
We plan to expand this project through the following initiatives:
- Benchmarking different vector databases
- Exploring various chunking and embedding techniques
- Experimenting with different index types supported by Pulsejet
- Testing the asynchronous Pulsejet client and batch insertions
- Expanding our RAG system to different domains
- Implementing a Graphical User Interface (GUI) for improved accessibility
We encourage you to experiment with the Art Deco ChatBot, modify its parameters, and adapt it to your own domains of interest.
Author: Güvenç USANMAZ