- app.py: The main application file for launching the service.
- requirements.txt: A list of dependencies required for the project to work.
- data/: Directory containing the data for indexing and searching.
- src/: Project source code.
  - llm/: Modules for working with large language models.
  - retrievers/: Modules for extracting relevant information.
  - utils.py: Helper functions and utilities.
To run the project, follow these steps:
- Install poppler:
  sudo apt-get install -y poppler-utils
- Create and activate a virtual environment:
  python3 -m venv venv
  source venv/bin/activate
- Install dependencies:
  pip install torch --index-url https://download.pytorch.org/whl/cu124
  pip install -r requirements.txt
- Add a .env file to the root of the project (contact us and we will provide a key: Telegram @umbilnm).
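  A minimal illustration of the .env contents, assuming the key is read from a single environment variable; the variable name below is hypothetical, so use the name supplied together with your key:

      # .env (the variable name is illustrative; use the one provided with your key)
      API_KEY=your_key_here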
- Run the application:
  streamlit run app.py
- Access the application:
  Open a web browser and navigate to http://localhost:8000.
- Installing dependencies:
  - Takes about 2 minutes.
- Launching the application:
  - Starts instantly once the dependencies are installed.
Our project includes the following key components:
Data Indexing
- Using FAISS for efficient embedding-based search (a minimal indexing sketch follows this list).
- Storing metadata and embeddings for quick access.
Working with LLM
- Integration with Pixtral-12b for generating responses.
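As a rough illustration of the indexing component, the sketch below builds a FAISS index over precomputed embeddings and keeps a parallel metadata list; the dimensionality, data, and paths are placeholders, and the index type actually used in the project may differ.

    import faiss
    import numpy as np

    dim = 512                                  # embedding size (illustrative)
    embeddings = np.random.rand(1000, dim).astype("float32")
    faiss.normalize_L2(embeddings)             # unit vectors, so inner product = cosine

    index = faiss.IndexFlatIP(dim)             # exact inner-product index
    index.add(embeddings)

    # Metadata is kept in a parallel list: row i of the index maps to metadata[i].
    metadata = [f"data/images/{i}.png" for i in range(len(embeddings))]

    query = np.random.rand(1, dim).astype("float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)       # top-5 nearest embeddings
    results = [metadata[i] for i in ids[0]]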
The main hypothesis of our solution is that combining embeddings from the textual and visual modalities significantly improves the accuracy and quality of multimodal search. We believe that integrating data from different types of sources (text descriptions and images) provides a deeper understanding of context and increases the relevance of the results.
To test this hypothesis, we developed several strategies that combine textual and visual embeddings in different ways. Each of them represents a part of the overall approach to solving the multimodal search problem.
This strategy retrieves images using textual embeddings obtained from image descriptions (summaries) generated by Pixtral-12b. It accounts for the textual context of images but does not directly use visual information.
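A minimal sketch of this strategy, assuming the Pixtral-generated summaries are embedded with a sentence-transformers model; the model name, summaries, and identifiers below are illustrative, and the embedding model actually used in the repository may differ.

    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative text encoder

    # One Pixtral-generated summary per image (placeholder data).
    image_ids = ["img_001", "img_002"]
    summaries = ["a bar chart of quarterly revenue", "a photo of a server rack"]

    summary_emb = model.encode(summaries, normalize_embeddings=True)
    query_emb = model.encode(["revenue by quarter"], normalize_embeddings=True)

    scores = summary_emb @ query_emb.T            # cosine similarity
    ranking = np.argsort(-scores[:, 0])           # best match first
    top_images = [image_ids[i] for i in ranking]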
Visual embeddings obtained through a ViT model are used to retrieve images relevant to the query. This strategy considers exclusively the visual characteristics of the images.
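How the query is matched against purely visual embeddings depends on the retriever implementation; as a stand-in, the sketch below uses a CLIP model whose image tower is a ViT, so the model name, image paths, and query handling are assumptions rather than the project's actual code.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # ViT image tower
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    images = [Image.open(p) for p in ["data/img_001.png", "data/img_002.png"]]  # placeholder paths

    with torch.no_grad():
        image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
        text_emb = model.get_text_features(**processor(text=["revenue by quarter"],
                                                       return_tensors="pt", padding=True))

    # Cosine similarity between the query and each image embedding.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    ranking = (image_emb @ text_emb.T).squeeze(-1).argsort(descending=True)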
This strategy combines the results of both modalities by intersecting the sets of images found by the textual and visual embeddings. This takes both textual and visual relevance into account, which is important for multimodal queries (both combination schemes are sketched after the next strategy).
This strategy merges the top results of the textual and visual embeddings, selecting the most relevant images from both approaches. It is effective for tasks that require multimodal analysis.
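The two combination schemes can be illustrated on plain ranked lists; the image identifiers below are placeholders, and the exact tie-breaking and merging rules used in the repository may differ.

    # Ranked results from the two retrievers (placeholder identifiers).
    text_ranking = ["img_003", "img_001", "img_007", "img_002"]
    visual_ranking = ["img_001", "img_005", "img_003", "img_009"]

    # Intersection strategy: keep only images found by both modalities,
    # ordered by the better (lower) of their two ranks.
    common = set(text_ranking) & set(visual_ranking)
    intersection = sorted(
        common,
        key=lambda img: min(text_ranking.index(img), visual_ranking.index(img)),
    )

    # Combined-top strategy: merge the top-k of each list, de-duplicated,
    # alternating between modalities so both contribute.
    k = 3
    merged, seen = [], set()
    for text_img, visual_img in zip(text_ranking[:k], visual_ranking[:k]):
        for img in (text_img, visual_img):
            if img not in seen:
                seen.add(img)
                merged.append(img)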