This repository contains a curated Awesome List and general information on Retrieval-Augmented Generation (RAG) applications in Generative AI.
Retrieval-Augmented Generation (RAG) is a technique in Generative AI where additional context is retrieved from external sources to enrich the generative process of Large Language Models (LLMs). This approach allows LLMs to incorporate up-to-date, specific, or sensitive information that they may lack from their pre-training data alone.
- ℹ️ General Information on RAG
- 🎯 Approaches
- 🧰 Frameworks that Facilitate RAG
- 🛠️ Techniques
- 📊 Metrics
- 💾 Databases
In traditional RAG approaches, a basic framework is employed to retrieve documents that enrich the context of an LLM prompt. For instance, when querying about materials for renovating a house, the LLM may possess general knowledge about renovation but lack specific details about the particular house. Implementing a RAG architecture allows for quick searching and retrieval of relevant documents, such as blueprints, to offer more customized responses. This ensures that the LLM incorporates information specific to the renovation needs, thereby enhancing the accuracy of its responses.
A typical RAG implementation follows these key steps (a minimal end-to-end sketch follows the list):
- Divide the knowledge base: Break the document corpus into smaller, manageable chunks.
- Create embeddings: Apply an embedding model to transform these text chunks into vector embeddings, capturing their semantic meaning.
- Store in a vector database: Save the embeddings in a vector database, enabling fast retrieval based on semantic similarity.
- Handle user queries: Convert the user's query into an embedding using the same model that was applied to the text chunks.
- Retrieve relevant data: Search the vector database for embeddings that closely match the query's embedding based on semantic similarity.
- Enhance the prompt: Incorporate the most relevant text chunks into the LLM's prompt to provide valuable context for generating a response.
- Generate a response: The LLM leverages the augmented prompt to deliver a response that is accurate and tailored to the user's query.
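A minimal end-to-end sketch of these steps, assuming hypothetical `embed()` and `llm()` helpers (stand-ins for a real embedding model and a real LLM client) and using an in-memory NumPy array instead of a real vector database:

```python
import numpy as np

# Hypothetical helpers -- swap in a real embedding model and LLM client.
def embed(text: str) -> np.ndarray:
    """Return a vector for `text` (random stub, for illustration only)."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(384)

def llm(prompt: str) -> str:
    """Call a generative model (stub, for illustration only)."""
    return f"[response generated from a {len(prompt)}-character prompt]"

# 1. Divide the knowledge base into chunks (naive fixed-size split).
corpus = "Blueprint: the roof uses slate tiles. Deed: the house was built in 1931."
chunks = [corpus[i:i + 40] for i in range(0, len(corpus), 40)]

# 2-3. Create embeddings and store them (here: a plain in-memory array).
store = np.array([embed(c) for c in chunks])

# 4. Convert the user's query into an embedding with the same model.
query = "What material is the roof made of?"
q = embed(query)

# 5. Retrieve the chunks whose embeddings are most similar to the query.
sims = store @ q / (np.linalg.norm(store, axis=1) * np.linalg.norm(q))
top_chunks = [chunks[i] for i in np.argsort(sims)[::-1][:3]]

# 6-7. Augment the prompt with the retrieved context and generate a response.
prompt = "Context:\n" + "\n---\n".join(top_chunks) + f"\n\nQuestion: {query}"
print(llm(prompt))
```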
RAG implementations vary in complexity, from simple document retrieval to advanced techniques integrating iterative feedback loops and domain-specific enhancements. Approaches may include:
- Data cleaning techniques: Pre-processing steps to refine input data and improve model performance.
- Corrective RAG (CRAG): Methods to correct or refine the retrieved information before integration into LLM responses.
- Retrieval-Augmented Fine-Tuning (RAFT): Techniques to fine-tune LLMs specifically for enhanced retrieval and generation tasks.
- Self Reflective RAG: Models that dynamically adjust retrieval strategies based on model performance feedback.
- RAG Fusion: Techniques combining multiple retrieval methods for improved context integration.
- Temporal Augmented Retrieval (TAR): Considering time-sensitive data in retrieval processes.
- Plan-then-RAG (PlanRAG): Strategies involving planning stages before executing RAG for complex tasks.
- GraphRAG: A structured approach using knowledge graphs for enhanced context integration and reasoning.
- FLARE: An approach that incorporates active retrieval-augmented generation to improve response quality.
- Contextual Retrieval: Improves retrieval by prepending chunk-specific explanatory context to each document chunk before indexing, enhancing the relevance of information retrieved from large knowledge bases.
- Haystack - LLM orchestration framework to build customizable, production-ready LLM applications.
- LangChain - An all-purpose framework for working with LLMs.
- Semantic Kernel - An SDK from Microsoft for developing Generative AI applications.
- LlamaIndex - Framework for connecting custom data sources to LLMs.
- Cognita - Open-source RAG framework for building modular and production-ready applications.
- Verba - Open-source application for RAG out of the box.
- Mastra - TypeScript framework for building AI applications.
- Strategies
- Tagging and Labeling: Adding semantic tags or labels to retrieved data to enhance relevance.
- Reason and Action (ReAct): Integration of reasoning capabilities to guide LLM responses based on retrieved context.
- Chain of Thought (CoT): Encouraging the model to think through problems step by step before providing an answer.
- Chain of Verification (CoVe): Prompting the model to verify each step of its reasoning for accuracy.
- Self-Consistency: Generating multiple reasoning paths and selecting the most consistent answer.
- Zero-Shot Prompting: Designing prompts that guide the model without any examples.
- Few-Shot Prompting: Providing a few examples in the prompt to demonstrate the desired response format.
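To make the zero-shot vs. few-shot distinction above concrete, a small sketch that builds both prompt styles as plain strings (the classification task and examples are made up for illustration):

```python
# Zero-shot: instructions only, no demonstrations.
zero_shot_prompt = (
    "Classify the sentiment of the review as positive or negative.\n"
    "Review: The retrieval latency was awful.\n"
    "Sentiment:"
)

# Few-shot: a handful of labeled examples demonstrate the desired format.
examples = [
    ("The answers were accurate and well sourced.", "positive"),
    ("The chatbot kept inventing citations.", "negative"),
]
few_shot_prompt = "Classify the sentiment of the review as positive or negative.\n"
for review, label in examples:
    few_shot_prompt += f"Review: {review}\nSentiment: {label}\n"
few_shot_prompt += "Review: The retrieval latency was awful.\nSentiment:"
```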
- Caching
- Prompt Caching: Optimizes LLM inference by storing and reusing precomputed attention states for repeated prompt prefixes.
- Fixed-size chunking
- Dividing text into consistent-sized segments for efficient processing.
- Splits texts into chunks based on size and overlap.
- Example: Split by character (LangChain).
- Example: SentenceSplitter (LlamaIndex).
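A minimal fixed-size chunking sketch based on the LangChain example above, assuming the `langchain-text-splitters` package is installed; the size and overlap values are arbitrary:

```python
from langchain_text_splitters import CharacterTextSplitter

document_text = "First paragraph of the blueprint...\n\nSecond paragraph of the blueprint..."

splitter = CharacterTextSplitter(
    separator="\n\n",  # split on paragraph boundaries
    chunk_size=500,    # target chunk size in characters
    chunk_overlap=50,  # overlap between consecutive chunks
)
chunks = splitter.split_text(document_text)
```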
- Recursive chunking
- Hierarchical segmentation using recursive algorithms for complex document structures.
- Example: Recursively split by character (LangChain).
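A sketch of recursive chunking with LangChain's `RecursiveCharacterTextSplitter` (again assuming `langchain-text-splitters`); it falls back through progressively finer separators until each chunk fits the size limit:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = "Section 1\n\nRoofing details...\n\nSection 2\n\nPlumbing details..."

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],  # paragraphs, then lines, then words, then characters
)
chunks = splitter.split_text(document_text)
```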
- Document-based chunking
- Segmenting documents based on metadata or formatting cues for targeted analysis.
- Example: MarkdownHeaderTextSplitter (LangChain).
- Example: Handle image and text embeddings with models like OpenCLIP.
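A sketch of document-based chunking using the `MarkdownHeaderTextSplitter` mentioned above (assuming `langchain-text-splitters`); the matched headers become metadata on each resulting chunk:

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = "# Renovation\n## Roof\nUse slate tiles.\n## Walls\nUse lime plaster."

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "section")],
)
# Each returned document carries the matched headers in its metadata.
sections = splitter.split_text(markdown_text)
```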
- Semantic chunking
- Extracting meaningful sections based on semantic relevance rather than arbitrary boundaries.
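One way to sketch semantic chunking without any particular library, assuming a hypothetical `embed()` helper backed by a sentence-embedding model: start a new chunk wherever the similarity between consecutive sentence embeddings drops below a threshold.

```python
import numpy as np

def embed(sentence: str) -> np.ndarray:
    """Hypothetical helper; replace with a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(sentence)) % 2**32)
    return rng.standard_normal(384)

def semantic_chunks(sentences: list[str], threshold: float = 0.7) -> list[list[str]]:
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        sim = float(prev @ cur / (np.linalg.norm(prev) * np.linalg.norm(cur)))
        if sim < threshold:      # likely topic shift: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks
```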
- Agentic chunking
- Interactive chunking methods where LLMs guide segmentation.
- Select embedding model
- MTEB Leaderboard: Explore Hugging Face's Massive Text Embedding Benchmark (MTEB) leaderboard for comparing embedding models.
- Custom Embeddings: Develop tailored embeddings for specific domains or tasks to enhance model performance. Custom embeddings can capture domain-specific terminology and nuances. Techniques include fine-tuning pre-trained models on your own dataset or training embeddings from scratch using frameworks like TensorFlow or PyTorch.
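A small sketch of embedding chunks with an off-the-shelf model, assuming the `sentence-transformers` package is installed; the model name below is just one common choice from the MTEB leaderboard:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # pick any model from the MTEB leaderboard
chunks = ["Blueprint: the roof uses slate tiles.", "Invoice: lime plaster, 20 bags."]
embeddings = model.encode(chunks, normalize_embeddings=True)  # shape: (len(chunks), dim)
```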
- Search Methods
- Vector Store Flat Index
- Simple and efficient form of retrieval.
- Content is vectorized and stored as flat content vectors.
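A minimal flat-index sketch using FAISS (listed under Databases below); vectors are stored as-is and searched exhaustively, which is exact but scales linearly with collection size:

```python
import faiss
import numpy as np

dim = 384
vectors = np.random.default_rng(0).standard_normal((1000, dim)).astype("float32")

index = faiss.IndexFlatL2(dim)  # flat index: exact, brute-force L2 search
index.add(vectors)              # store the content vectors without compression

query = np.random.default_rng(1).standard_normal((1, dim)).astype("float32")
distances, ids = index.search(query, 5)  # ids of the 5 nearest stored vectors
```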
- Hierarchical Index Retrieval
- Organizes data into multiple levels (for example, document summaries and their underlying chunks).
- Executes retrieval hierarchically, narrowing the candidate set at each level.
- Hypothetical Questions
- Used to increase similarity between database chunks and queries (as with HyDE).
- LLM is used to generate specific questions for each text chunk.
- Converts these questions into vector embeddings.
- During search, matches queries against this index of question vectors.
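A conceptual sketch of a hypothetical-questions index, with stub `embed()` and `llm()` helpers standing in for real embedding and generation backends:

```python
import numpy as np

# Stubs for illustration -- swap in real embedding and generation backends.
embed = lambda text: np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(384)
llm = lambda prompt: "What material does the roof use?\nHow old is the house?"

chunks = ["Blueprint: the roof uses slate tiles.", "Deed: the house was built in 1931."]

# Index time: ask the LLM which questions each chunk answers, and index those questions.
question_index = []  # list of (question_embedding, source_chunk) pairs
for chunk in chunks:
    questions = llm(f"Write questions answered by this text:\n{chunk}").splitlines()
    question_index.extend((embed(q), chunk) for q in questions if q.strip())

# Query time: match the user query against the question embeddings, return their chunks.
def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    ranked = sorted(question_index, key=lambda pair: -float(pair[0] @ q))
    return [chunk for _, chunk in ranked[:k]]
```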
- Hypothetical Document Embeddings (HyDE)
- Used to increase similarity between database chunks and queries (as with Hypothetical Questions).
- LLM is used to generate a hypothetical response based on the query.
- Converts this response into a vector embedding.
- Compares the query vector with the hypothetical response vector.
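A conceptual HyDE sketch with the same kind of stub helpers (hypothetical names, not a specific library API): the LLM first drafts a hypothetical answer, and that answer's embedding, rather than the raw query's, drives the similarity search.

```python
import numpy as np

# Stubs for illustration -- swap in real embedding and generation backends.
embed = lambda text: np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(384)
llm = lambda prompt: "The roof is covered with slate tiles fixed to timber battens."

chunks = ["Blueprint: the roof uses slate tiles.", "Deed: the house was built in 1931."]
store = np.array([embed(c) for c in chunks])

query = "What is the roof made of?"
hypothetical_answer = llm(f"Write a short passage answering: {query}")
h = embed(hypothetical_answer)  # search with the answer's embedding, not the query's

sims = store @ h / (np.linalg.norm(store, axis=1) * np.linalg.norm(h))
best_chunk = chunks[int(np.argmax(sims))]
```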
- Small to Big Retrieval
- Improves retrieval by using smaller chunks for search and larger chunks for context.
- Smaller child chunks refer back to larger parent chunks.
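A compact sketch of small-to-big retrieval with a stub `embed()` helper: small child chunks are what gets embedded and searched, but each child points back to its larger parent chunk, which is what ends up in the prompt.

```python
import numpy as np

embed = lambda text: np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(384)

parents = ["Full roofing section of the blueprint ...", "Full plumbing section of the blueprint ..."]

# Index small child chunks, each remembering the index of its parent.
children = []  # list of (child_embedding, parent_index) pairs
for p_idx, parent in enumerate(parents):
    for child in (parent[i:i + 20] for i in range(0, len(parent), 20)):
        children.append((embed(child), p_idx))

def retrieve_parent(query: str) -> str:
    q = embed(query)
    best = max(children, key=lambda pair: float(pair[0] @ q))
    return parents[best[1]]  # hand the bigger parent chunk to the LLM as context
```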
- Re-ranking: Enhances search results in RAG pipelines by reordering initially retrieved documents, prioritizing those most semantically relevant to the query.
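A short re-ranking sketch using a cross-encoder from the `sentence-transformers` package (the model name is one common choice, not the only option): candidates returned by the first-stage vector search are re-scored against the query and reordered.

```python
from sentence_transformers import CrossEncoder

query = "What is the roof made of?"
candidates = [  # e.g. the top results from a first-stage vector search
    "Deed: the house was built in 1931.",
    "Blueprint: the roof uses slate tiles.",
    "Invoice: lime plaster, 20 bags.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```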
These metrics are used to measure the similarity between embeddings, which is crucial for evaluating how effectively RAG systems retrieve and integrate external documents or data sources. By selecting appropriate similarity metrics, you can optimize the performance and accuracy of your RAG system. Alternatively, you may develop custom metrics tailored to your specific domain or niche to capture domain-specific nuances and improve relevance.
- Cosine Similarity
- Measures the cosine of the angle between two vectors in a multi-dimensional space.
- Highly effective for comparing text embeddings where the direction of the vectors represents semantic information.
- Commonly used in RAG systems to measure semantic similarity between query embeddings and document embeddings.
- Dot Product
- Calculates the sum of the products of corresponding entries of two sequences of numbers.
- Equivalent to cosine similarity when vectors are normalized.
- Simple and efficient, often used with hardware acceleration for large-scale computations.
- Euclidean Distance
- Computes the straight-line distance between two points in Euclidean space.
- Can be used with embeddings but may lose effectiveness in high-dimensional spaces due to the "curse of dimensionality."
- Often used in clustering algorithms like K-means after dimensionality reduction.
- Jaccard Similarity
- Measures the similarity between two finite sets as the size of the intersection divided by the size of the union of the sets.
- Useful when comparing sets of tokens, such as in bag-of-words models or n-gram comparisons.
- Less applicable to continuous embeddings produced by LLMs.
Note: Cosine Similarity and Dot Product are generally seen as the most effective metrics for measuring similarity between high-dimensional embeddings.
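For concreteness, the four metrics above expressed in a few lines of NumPy:

```python
import numpy as np

a = np.array([0.2, 0.7, 0.1])
b = np.array([0.3, 0.6, 0.2])

cosine_similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
dot_product = float(a @ b)  # equals cosine similarity when both vectors are unit-normalized
euclidean_distance = float(np.linalg.norm(a - b))

tokens_a, tokens_b = {"roof", "slate", "tiles"}, {"roof", "clay", "tiles"}
jaccard_similarity = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)  # 2 / 4 = 0.5
```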
These metrics assess the quality and relevance of the generated answers from your RAG system, evaluating how accurate, contextually appropriate, and reliable they are. By applying these evaluation metrics, you can gain insights into the performance of your system and identify areas for improvement.
- Automated benchmarking
- ...
- Humans as judges
- ...
- Models as judges
- ...
These tools can assist in evaluating the performance of your RAG system, from tracking user feedback to logging query interactions and comparing multiple evaluation metrics over time.
- LangFuse: Open-source tool for tracking LLM metrics, observability, and prompt management.
- Ragas: Framework that helps evaluate RAG pipelines.
- LangSmith: A platform for building production-grade LLM applications that allows you to closely monitor and evaluate your application.
- Hugging Face Evaluate: Tool for computing metrics like BLEU and ROUGE to assess text quality.
- Weights & Biases: Tracks experiments, logs metrics, and visualizes performance.
The list below features several database systems suitable for Retrieval-Augmented Generation (RAG) applications. They cover a range of RAG use cases and support the efficient storage and retrieval of vector embeddings for generating responses or recommendations.
- Apache Cassandra: Distributed NoSQL database management system.
- MongoDB Atlas: Fully managed multi-cloud database service with integrated vector search.
- Vespa: Open-source big data processing and serving engine designed for real-time applications.
- Elasticsearch: Provides vector search capabilities along with traditional search functionalities.
- OpenSearch: Distributed search and analytics engine, forked from Elasticsearch.
- Chroma DB: An AI-native open-source embedding database.
- Milvus: An open-source vector database for AI-powered applications.
- Pinecone: A serverless vector database, optimized for machine learning workflows.
- Oracle AI Vector Search: Integrates vector search capabilities within Oracle Database for semantic querying based on vector embeddings.
- Pgvector: An open-source extension for vector similarity search in PostgreSQL.
- Azure Cosmos DB: Globally distributed, multi-model database service with integrated vector search.
- Couchbase: A distributed NoSQL cloud database.
- Lantern: A PostgreSQL vector database extension for building AI applications.
- LlamaIndex: Employs a straightforward in-memory vector store for rapid experimentation.
- Neo4j: Graph database management system.
- Qdrant: An open-source vector database designed for similarity search.
- Redis Stack: An in-memory data structure store used as a database, cache, and message broker.
- SurrealDB: A scalable multi-model database optimized for time-series data.
- Weaviate: An open-source, cloud-native vector search engine.
- FAISS: A library for efficient similarity search and clustering of dense vectors, designed to handle large-scale datasets and optimized for fast retrieval of nearest neighbors.