RAG Multimodal Demo


This project demonstrates a multimodal RAG system capable of processing and summarizing different types of data, including text, images, and tables. It uses a retriever to store and manage the processed information.

Features

  • Use Unstructured to parse images, text, and tables from documents (PDFs); see the parsing sketch after this list.
  • Summarization of images, tables, and text documents.
  • Extraction and storage of metadata for various data types.
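
As a rough sketch of the Unstructured parsing step, assuming the hi_res strategy (the file name and output directory are placeholders; the project's actual parameters come from ingest.partition_pdf_func):

from unstructured.partition.pdf import partition_pdf

# Sketch of PDF parsing with Unstructured; the demo configures these parameters
# via ingest.partition_pdf_func rather than hard-coding them like this.
elements = partition_pdf(
    filename="example.pdf",                        # placeholder input file
    strategy="hi_res",                             # required for image/table extraction
    infer_table_structure=True,                    # keep table structure (HTML) in metadata
    extract_image_block_types=["Image", "Table"],  # crop these element types to image files
    extract_image_block_output_dir="./figures",    # placeholder output directory
)
for element in elements:
    print(element.category, str(element)[:60])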

[Diagram: overview of the three RAG options]

  • Option 1: This option involves retrieving the raw image directly from the dataset and combining it with the raw table and text data. The combined raw data is then processed by a Multimodal LLM to generate an answer. This approach uses the complete, unprocessed image data in conjunction with textual information.

    • Ingestion: Multimodal embeddings
    • RAG chain: Multimodal LLM
  • Option 2: In this option, instead of using the raw image, an image summary is retrieved. This summary, along with the raw table and text data, is fed into a Text LLM to generate an answer.

    • Ingestion: Multimodal LLM (for summarization) + Text embeddings
    • RAG chain: Text LLM
  • Option 3: This option also retrieves an image summary, but unlike Option 2, it passes the raw image to a Multimodal LLM for synthesis along with the raw table and text data.

    • Ingestion: Multimodal LLM (for summarization) + Text embeddings
    • RAG chain: Multimodal LLM

For all options, we can choose to treat tables as text or images.

Common parameters:

  • ingest.clear_database: Whether to clear the database before ingesting new data.
  • ingest.partition_pdf_func: Parameters for the Unstructured partition_pdf function.
  • ingest.chunking_func: Parameters for the Unstructured chunking function.
  • ingest.metadata_keys: Unstructured metadata to use.
  • ingest.table_format: How to extract tables with Unstructured (text, html, or image).
  • ingest.image_min_size: Minimum relative size for images to be considered.
  • ingest.table_min_size: Minimum relative size for tables to be considered.
  • ingest.export_extracted: Whether to export extracted elements to a local folder.
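
For illustration only, the ingest block might be expressed along these lines in Python form (key names mirror the list above; every value here is an invented example, not the project's defaults):

ingest_config = {
    "clear_database": True,
    "partition_pdf_func": {"strategy": "hi_res", "infer_table_structure": True},
    "chunking_func": {"max_characters": 4000},
    "metadata_keys": ["filename", "page_number"],
    "table_format": "html",  # one of: text, html, image
    "image_min_size": 0.1,   # invented value: relative size threshold for images
    "table_min_size": 0.05,  # invented value: relative size threshold for tables
    "export_extracted": True,
}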

Padding around extracted images can be adjusted by specifying two environment variables "EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD" and "EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD".
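
For example (the values below are arbitrary; Unstructured reads these variables at extraction time):

import os

# Arbitrary example values: pad each extracted image crop by 20 px horizontally
# and 10 px vertically.
os.environ["EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD"] = "20"
os.environ["EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD"] = "10"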

RAG Option 1

Folder: backend/rag_1

Method:

  • Use multimodal embeddings (such as CLIP) to embed images and text.
  • Retrieve both images and text using similarity search.
  • Pass raw images and text chunks to a multimodal LLM for answer synthesis.

Backend:

  • Use OpenCLIP multimodal embeddings.
  • Use Chroma with multimodal support.
  • Use GPT-4V for final answer synthesis from a joint review of images and texts (or tables).
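
A minimal sketch of this ingestion path, assuming LangChain's OpenCLIP wrapper and Chroma (the collection name and file paths are placeholders; backend/rag_1 is the reference implementation):

from langchain_community.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Images and texts are embedded into the same CLIP space, so one similarity
# search can return both kinds of documents.
vectorstore = Chroma(
    collection_name="rag_1_demo",  # placeholder collection name
    embedding_function=OpenCLIPEmbeddings(),
)
vectorstore.add_images(uris=["figures/fig1.jpg"])         # extracted image files
vectorstore.add_texts(texts=["An extracted text chunk."])
retriever = vectorstore.as_retriever()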

RAG Option 2

Folder: backend/rag_2

Method:

  • Use a multimodal LLM (such as GPT-4V, LLaVA, or FUYU-8b) to produce text summaries from images.
  • Embed and retrieve image summaries and text chunks.
  • Pass image summaries and text chunks to a text LLM for answer synthesis.

Backend:

  • Use the multi-vector retriever with Chroma to store raw text (or tables) and images (in a docstore) along with their summaries (in a vectorstore) for retrieval.
  • Use GPT-4V for image summarization.
  • Use GPT-4 for final answer synthesis from a joint review of image summaries and texts (or tables).
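
A condensed sketch of that pattern, assuming LangChain's MultiVectorRetriever (the sample data and the in-memory docstore are placeholders; backend/rag_2 is the reference implementation):

import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Placeholder data: one summary per raw element (text, table, or image).
raw_contents = ["Full text of chunk 1...", "Full text of chunk 2..."]
summaries = ["Short summary of chunk 1.", "Short summary of chunk 2."]

id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=Chroma(collection_name="rag_2_demo", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),  # an in-memory store keeps the sketch short
    id_key=id_key,
)

# Summaries are embedded for search; raw content is returned at query time.
doc_ids = [str(uuid.uuid4()) for _ in raw_contents]
retriever.vectorstore.add_documents(
    [Document(page_content=s, metadata={id_key: i}) for s, i in zip(summaries, doc_ids)]
)
retriever.docstore.mset(list(zip(doc_ids, raw_contents)))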

Specific parameters:

  • ingest.summarize_text: Whether to summarize texts with an LLM or use raw texts for retrieval.
  • ingest.summarize_table: Whether to summarize tables with an LLM or use raw tables for retrieval.
  • ingest.vectorstore_source: The field of documents to add to the vectorstore (content or summary).
  • ingest.docstore_source: The field of documents to add to the docstore (content or summary).

In option 2, the vectorstore and docstore must be populated with text documents (text content or summary).
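
For instance, a valid option 2 setup keeps summaries in the vectorstore and raw text in the docstore (invented values; the keys follow the parameter list above):

# Hypothetical configuration values for option 2.
vectorstore_source = "summary"  # embed and search over summaries
docstore_source = "content"     # return raw text (or tables) to the LLM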

RAG Option 3

Folder: backend/rag_3

Method:

  • Use a multimodal LLM (such as GPT-4V, LLaVA, or FUYU-8b) to produce text summaries from images.
  • Embed and retrieve image summaries with a reference to the raw image.
  • Pass raw images and text chunks to a multimodal LLM for answer synthesis.

Backend:

  • Use the multi-vector retriever with Chroma to store raw text (or tables) and images (in a docstore) along with their summaries (in a vectorstore) for retrieval.
  • Use GPT-4V both for image summarization (for retrieval) and for final answer synthesis from a joint review of images and texts (or tables).

Specific parameters:

  • ingest.summarize_text: Whether to summarize texts with an LLM or use raw texts for retrieval.
  • ingest.summarize_table: Whether to summarize tables with an LLM or use raw tables for retrieval.
  • ingest.vectorstore_source: The field of documents to add to the vectorstore (content or summary).
  • ingest.docstore_source: The field of documents to add to the docstore (content or summary).

In option 3, the vectorstore must be populated with text documents (text content or summary), as in option 2. However, the docstore can be populated with either text or image documents.
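
When the docstore returns raw images, the RAG chain must hand them to the multimodal LLM as images rather than text. A hedged sketch of that step using the OpenAI-style multimodal message format (the helper function and model id are illustrative, not the project's exact code):

import base64

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

def image_file_to_data_uri(path: str) -> str:
    # Illustrative helper: encode a stored image file as a base64 data URI.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# One prompt combining retrieved text chunks with a retrieved raw image.
message = HumanMessage(
    content=[
        {"type": "text", "text": "Answer from this context: <retrieved text chunks>"},
        {"type": "image_url", "image_url": {"url": image_file_to_data_uri("figures/fig1.jpg")}},
    ]
)
llm = ChatOpenAI(model="gpt-4-vision-preview")  # illustrative model id for GPT-4V
answer = llm.invoke([message])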

Frontend

The Streamlit demo frontend comes from skaff-rag-accelerator. Please read its documentation for more details.

Installation

To set up the project, ensure you have a Python version between 3.10 and 3.11. Then install the dependencies using Poetry:

poetry install

Unstructured requires the following system dependencies:

  • poppler-utils: Needed for pdf2image.
  • tesseract-ocr: Needed for image and PDF processing.

Installation on Linux:

sudo apt update
sudo apt install -y poppler-utils tesseract-ocr

Installation on macOS:

brew update
brew install poppler tesseract

Before running the application, you need to set up the environment variables. Copy the template.env file to a new file named .env and fill in the necessary API keys and endpoints:

cp template.env .env
# Edit the .env file with your actual values

Usage

To use the RAG Multimodal Demo, follow these steps:

  1. Ingest data from PDFs and summarize the content:

    For RAG Option 1:

    make ingest_rag_1

    For RAG Option 2:

    make ingest_rag_2

    For RAG Option 3:

    make ingest_rag_3

    These commands process PDFs to extract images, text, and tables, summarize them (depending on the method), and store the information in the retriever for later retrieval.

  2. Start the backend server locally:

    make serve_backend

    This command launches the backend server, giving you access to the FastAPI documentation and the playground interfaces.

  3. Launch the Streamlit frontend interface:

    make serve_frontend

Development

To set up a development environment and install pre-commit hooks, run the following commands:

poetry install --with dev
pre-commit install

If Poetry is not installed, you can install it using the following instructions: Poetry Installation
