MultimodalQnA Application

Suppose you have a collection of videos and want to ask questions to extract insights from them. Answering those questions typically requires understanding visual cues in the videos, knowledge from the audio content, or a mix of both. The MultimodalQnA framework is designed for this purpose.

MultimodalQnA answers your questions by dynamically retrieving the most relevant multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. To do this, MultimodalQnA uses the BridgeTower model, a multimodal encoding transformer that merges visual and textual data into a unified semantic space. During ingestion, the BridgeTower model embeds both visual cues and auditory facts as text, and those embeddings are stored in a vector database. To answer a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.

The MultimodalQnA architecture is shown below:

architecture

MultimodalQnA is implemented on top of GenAIComps. The MultimodalQnA flow chart is shown below:

---
config:
  flowchart:
    nodeSpacing: 400
    rankSpacing: 100
    curve: linear
  themeVariables:
    fontSize: 50px
---
flowchart LR
    %% Colors %%
    classDef blue fill:#ADD8E6,stroke:#ADD8E6,stroke-width:2px,fill-opacity:0.5
    classDef orange fill:#FBAA60,stroke:#ADD8E6,stroke-width:2px,fill-opacity:0.5
    classDef orchid fill:#C26DBC,stroke:#ADD8E6,stroke-width:2px,fill-opacity:0.5
    classDef invisible fill:transparent,stroke:transparent;
    style MultimodalQnA-MegaService stroke:#000000
    %% Subgraphs %%
    subgraph MultimodalQnA-MegaService["MultimodalQnA-MegaService"]
        direction LR
        EM([Embedding <br>]):::blue
        RET([Retrieval <br>]):::blue
        LVM([LVM <br>]):::blue
    end
    subgraph UserInterface[" User Interface "]
        direction LR
        a([User Input Query]):::orchid
        Ingest([Ingest data]):::orchid
        UI([UI server<br>]):::orchid
    end

    TEI_EM{{Embedding service <br>}}
    VDB{{Vector DB<br><br>}}
    R_RET{{Retriever service <br>}}
    DP([Data Preparation<br>]):::blue
    LVM_gen{{LVM Service <br>}}
    GW([MultimodalQnA GateWay<br>]):::orange

    %% Data Preparation flow
    %% Ingest data flow
    direction LR
    Ingest[Ingest data] --> UI
    UI -->DP
    DP <-.-> TEI_EM

    %% Questions interaction
    direction LR
    a[User Input Query] --> UI
    UI --> GW
    GW <==> MultimodalQnA-MegaService
    EM ==> RET
    RET ==> LVM

    %% Embedding service flow
    direction LR
    EM <-.-> TEI_EM
    RET <-.-> R_RET
    LVM <-.-> LVM_gen

    direction TB
    %% Vector DB interaction
    R_RET <-.->VDB
    DP <-.->VDB




This MultimodalQnA use case performs Multimodal-RAG using LangChain, Redis VectorDB, and Text Generation Inference on Intel Gaudi2 and Intel Xeon Scalable Processors. We invite contributions from other hardware vendors to expand the example.

The Intel Gaudi2 accelerator supports both training and inference for deep learning models, in particular for LLMs. Visit Habana AI products for more details.

The table below lists, for each microservice component in the MultimodalQnA architecture, the default open source project, hardware, port, and endpoint.

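Gaudi default compose.yaml

| MicroService | Open Source Project   | HW    | Port | Endpoint                                        |
| ------------ | --------------------- | ----- | ---- | ----------------------------------------------- |
| Embedding    | Langchain             | Xeon  | 6000 | /v1/embeddings                                  |
| Retriever    | Langchain, Redis      | Xeon  | 7000 | /v1/multimodal_retrieval                        |
| LVM          | Langchain, TGI        | Gaudi | 9399 | /v1/lvm                                         |
| Dataprep     | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions |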
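As a quick reference, the defaults in the table can be turned into full URLs. The sketch below is illustrative only: `endpoint_url` is a hypothetical helper, not part of the project, and it falls back to `localhost` when `host_ip` is unset. It only prints URLs and sends no requests, since the request payloads are not described here.

```shell
# Sketch: build the default Gaudi endpoint URLs from the table above.
# endpoint_url is a hypothetical helper, not part of the project.
endpoint_url() {
  # Fall back to localhost purely for illustration.
  base="http://${host_ip:-localhost}"
  case "$1" in
    embedding)   echo "${base}:6000/v1/embeddings" ;;
    retriever)   echo "${base}:7000/v1/multimodal_retrieval" ;;
    lvm)         echo "${base}:9399/v1/lvm" ;;
    transcripts) echo "${base}:6007/v1/generate_transcripts" ;;
    captions)    echo "${base}:6007/v1/generate_captions" ;;
    *)           echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Print one line per service endpoint.
for svc in embedding retriever lvm transcripts captions; do
  printf '%-12s %s\n' "$svc" "$(endpoint_url "$svc")"
done
```

If your compose.yaml or `set_env.sh` overrides a port, adjust the corresponding line rather than relying on these defaults.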

Required Models

By default, the embedding and LVM models are set as listed below:

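| Service              | HW    | Model                                     |
| -------------------- | ----- | ----------------------------------------- |
| embedding-multimodal | Xeon  | BridgeTower/bridgetower-large-itm-mlm-itc |
| LVM                  | Xeon  | llava-hf/llava-1.5-7b-hf                  |
| embedding-multimodal | Gaudi | BridgeTower/bridgetower-large-itm-mlm-itc |
| LVM                  | Gaudi | llava-hf/llava-v1.6-vicuna-13b-hf         |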

You can choose other LVM models, such as llava-hf/llava-1.5-7b-hf and llava-hf/llava-1.5-13b-hf, as needed.
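One possible way to pick an LVM model is sketched below. The lookup mirrors the defaults in the table; `default_lvm_model` is a hypothetical helper, and `LVM_MODEL_ID` is an assumed variable name that is not confirmed here, so check the `set_env.sh` for your hardware before relying on it.

```shell
# default_lvm_model is a hypothetical helper that returns the default
# LVM model for a given hardware target, as listed in the table above.
default_lvm_model() {
  case "$1" in
    xeon)  echo "llava-hf/llava-1.5-7b-hf" ;;
    gaudi) echo "llava-hf/llava-v1.6-vicuna-13b-hf" ;;
    *)     echo "unknown hardware: $1" >&2; return 1 ;;
  esac
}

# LVM_MODEL_ID is an ASSUMED variable name; confirm it in set_env.sh.
# Keep a value the user already exported, else use the Xeon default.
LVM_MODEL_ID="${LVM_MODEL_ID:-$(default_lvm_model xeon)}"
export LVM_MODEL_ID
echo "LVM model: ${LVM_MODEL_ID}"
```

If `set_env.sh` exports its own value unconditionally, an override like this would need to run after sourcing that script.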

Deploy MultimodalQnA Service

The MultimodalQnA service can be deployed on either Intel Gaudi2 or Intel Xeon Scalable Processors.

Currently, we support deploying MultimodalQnA services with Docker Compose.

Set Up Environment Variables

To set up environment variables for deploying MultimodalQnA services, follow these steps:

  1. Set the required environment variables:

    # Example: export host_ip=$(hostname -I | awk '{print $1}')
    export host_ip="External_Public_IP"
    # Example: no_proxy="localhost,127.0.0.1,192.168.1.1"
    export no_proxy="Your_No_Proxy"
  2. If you are in a proxy environment, also set the proxy-related environment variables:

    export http_proxy="Your_HTTP_Proxy"
    export https_proxy="Your_HTTPs_Proxy"
  3. Set up other environment variables:

    Note: run only one of the commands below, matching your hardware. Otherwise, the port numbers may be set incorrectly.

    # on Gaudi
    source ./docker_compose/intel/hpu/gaudi/set_env.sh
    # on Xeon
    source ./docker_compose/intel/cpu/xeon/set_env.sh
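The steps above can be guarded against two easy mistakes: leaving `host_ip` at its placeholder value and sourcing the script for the wrong hardware. This is a minimal sketch; `select_env_script` and `check_host_ip` are hypothetical helpers, not part of the project, and the relative paths are assumed to resolve from the MultimodalQnA directory.

```shell
# select_env_script is a hypothetical helper: it maps a hardware name
# to the set_env.sh path listed in step 3, so only one is ever chosen.
select_env_script() {
  case "$1" in
    gaudi) echo "./docker_compose/intel/hpu/gaudi/set_env.sh" ;;
    xeon)  echo "./docker_compose/intel/cpu/xeon/set_env.sh" ;;
    *)     echo "usage: select_env_script gaudi|xeon" >&2; return 1 ;;
  esac
}

# check_host_ip rejects an unset value or the placeholder from step 1.
check_host_ip() {
  case "${host_ip:-}" in
    ""|External_Public_IP)
      echo "host_ip is not set to a real address" >&2
      return 1 ;;
    *)
      return 0 ;;
  esac
}

# Usage (assumed to run from the MultimodalQnA directory):
#   check_host_ip && . "$(select_env_script xeon)"
```

Sourcing with `.` (or `source`) rather than executing the script keeps the exported variables in the current shell.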

Deploy MultimodalQnA on Gaudi

Refer to the Gaudi Guide to build docker images from source.

Find the corresponding compose.yaml.

cd GenAIExamples/MultimodalQnA/docker_compose/intel/hpu/gaudi/
docker compose -f compose.yaml up -d

Note: Currently, only Habana Driver 1.17.x is supported for Gaudi.

Deploy MultimodalQnA on Xeon

Refer to the Xeon Guide for more instructions on building docker images from source.

Find the corresponding compose.yaml.

cd GenAIExamples/MultimodalQnA/docker_compose/intel/cpu/xeon/
docker compose -f compose.yaml up -d
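`docker compose up -d` returns once the containers are started, which may be before the services inside them are ready to accept requests. The sketch below polls a TCP port until it opens. `port_open` and `wait_for_port` are hypothetical helpers, not part of the project; they assume `bash` and the coreutils `timeout` command are available, and the example ports are the Gaudi defaults from the table earlier in this document.

```shell
# port_open returns 0 if a TCP connection to host $1, port $2 succeeds.
# It relies on bash's /dev/tcp pseudo-device and coreutils timeout.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# wait_for_port polls host $1, port $2 up to $3 times (default 30),
# sleeping one second after each failed attempt.
wait_for_port() {
  wfp_tries="${3:-30}"
  wfp_i=0
  while [ "$wfp_i" -lt "$wfp_tries" ]; do
    if port_open "$1" "$2"; then
      return 0
    fi
    wfp_i=$((wfp_i + 1))
    sleep 1
  done
  return 1
}

# Usage, with the default Gaudi ports as an example:
#   docker compose -f compose.yaml ps
#   wait_for_port "$host_ip" 6000 60 && echo "embedding port is open"
#   wait_for_port "$host_ip" 9399 60 && echo "LVM port is open"
```

An open port only shows that something is listening; it does not prove that the model behind it has finished loading, so check the container logs as well when a service is slow to respond.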

MultimodalQnA Demo on Gaudi2

MultimodalQnA-upload-waiting-screenshot

MultimodalQnA-upload-done-screenshot

MultimodalQnA-query-example-screenshot