- app.py: The main application file for launching the service.
- requirements.txt: A list of dependencies required for the project to work.
- data/: Directory containing the data for indexing and searching.
- src/: Project source code.
  - llm/: Modules for working with large language models.
  - retrievers/: Modules for extracting relevant information.
  - utils.py: Helper functions and utilities.
To run the project, follow these steps:
- Install poppler:
  sudo apt-get install -y poppler-utils
- Create and activate a virtual environment:
  python3 -m venv venv
  source venv/bin/activate
- Install dependencies:
  pip install torch --index-url https://download.pytorch.org/whl/cu124
  pip install -r requirements.txt
- Add a .env file to the root of the project (contact us and we will provide a key: Telegram @umbilnm).
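  A minimal illustration of the .env contents, assuming the key is read from a single environment variable; the variable name below is hypothetical, so use the name supplied together with your key:

      # .env (the variable name is illustrative; use the one provided with your key)
      API_KEY=your_key_here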
- Run the application:
  streamlit run app.py
- Access the application:
  Open a web browser and navigate to http://localhost:8000.
- Installing dependencies:
  - Takes about 2 minutes.
- Launching the application:
  - Starts instantly once the dependencies are installed.
Our project includes the following key components:
Data Indexing
- Using FAISS for efficient embedding-based search (a minimal indexing sketch follows this list).
- Storing metadata and embeddings for quick access.
Working with LLM
- Integration with Pixtral-12b for generating responses.
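As a rough illustration of the indexing component, the sketch below builds a FAISS index over precomputed embeddings and keeps a parallel metadata list; the dimensionality, data, and paths are placeholders, and the index type actually used in the project may differ.

    import faiss
    import numpy as np

    dim = 512                                  # embedding size (illustrative)
    embeddings = np.random.rand(1000, dim).astype("float32")
    faiss.normalize_L2(embeddings)             # unit vectors, so inner product = cosine

    index = faiss.IndexFlatIP(dim)             # exact inner-product index
    index.add(embeddings)

    # Metadata is kept in a parallel list: row i of the index maps to metadata[i].
    metadata = [f"data/images/{i}.png" for i in range(len(embeddings))]

    query = np.random.rand(1, dim).astype("float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)       # top-5 nearest embeddings
    results = [metadata[i] for i in ids[0]]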
The main hypothesis of our solution is that combining embeddings from the textual and visual modalities significantly improves the accuracy and quality of multimodal search. We believe that integrating data from different types of sources (text descriptions and images) provides a deeper understanding of context and increases the relevance of the results.
To test this hypothesis, we developed several strategies that combine textual and visual embeddings in different ways. Each of them represents a part of the overall approach to solving the multimodal search problem.
This strategy retrieves images using textual embeddings obtained from image descriptions (summaries) generated by Pixtral-12b. It accounts for the textual context of images but does not directly use visual information.
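A minimal sketch of this strategy, assuming the Pixtral-generated summaries are embedded with a sentence-transformers model; the model name, summaries, and identifiers below are illustrative, and the embedding model actually used in the repository may differ.

    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative text encoder

    # One Pixtral-generated summary per image (placeholder data).
    image_ids = ["img_001", "img_002"]
    summaries = ["a bar chart of quarterly revenue", "a photo of a server rack"]

    summary_emb = model.encode(summaries, normalize_embeddings=True)
    query_emb = model.encode(["revenue by quarter"], normalize_embeddings=True)

    scores = summary_emb @ query_emb.T            # cosine similarity
    ranking = np.argsort(-scores[:, 0])           # best match first
    top_images = [image_ids[i] for i in ranking]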
Visual embeddings obtained through a ViT model are used to retrieve images relevant to the query. This strategy considers exclusively the visual characteristics of the images.
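How the query is matched against purely visual embeddings depends on the retriever implementation; as a stand-in, the sketch below uses a CLIP model whose image tower is a ViT, so the model name, image paths, and query handling are assumptions rather than the project's actual code.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # ViT image tower
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    images = [Image.open(p) for p in ["data/img_001.png", "data/img_002.png"]]  # placeholder paths

    with torch.no_grad():
        image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
        text_emb = model.get_text_features(**processor(text=["revenue by quarter"],
                                                       return_tensors="pt", padding=True))

    # Cosine similarity between the query and each image embedding.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    ranking = (image_emb @ text_emb.T).squeeze(-1).argsort(descending=True)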
This strategy combines the results of both modalities by intersecting the sets of images found by the textual and visual embeddings. This takes both textual and visual relevance into account, which is important for multimodal queries (both combination schemes are sketched after the next strategy).
This strategy merges the top results of the textual and visual embeddings, selecting the most relevant images from both approaches. It is effective for tasks that require multimodal analysis.
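The two combination schemes can be illustrated on plain ranked lists; the image identifiers below are placeholders, and the exact tie-breaking and merging rules used in the repository may differ.

    # Ranked results from the two retrievers (placeholder identifiers).
    text_ranking = ["img_003", "img_001", "img_007", "img_002"]
    visual_ranking = ["img_001", "img_005", "img_003", "img_009"]

    # Intersection strategy: keep only images found by both modalities,
    # ordered by the better (lower) of their two ranks.
    common = set(text_ranking) & set(visual_ranking)
    intersection = sorted(
        common,
        key=lambda img: min(text_ranking.index(img), visual_ranking.index(img)),
    )

    # Combined-top strategy: merge the top-k of each list, de-duplicated,
    # alternating between modalities so both contribute.
    k = 3
    merged, seen = [], set()
    for text_img, visual_img in zip(text_ranking[:k], visual_ranking[:k]):
        for img in (text_img, visual_img):
            if img not in seen:
                seen.add(img)
                merged.append(img)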