
Simple Application for Multimodal AI

Table of Contents

  1. Introduction
  2. Main features
  3. Demos
  4. Installation
  5. Configs & Run
  6. Docker
  7. References

1. Introduction

A simple yet versatile application built with Gradio, featuring the integration of various open-source models from Hugging Face. The app supports a range of tasks, including Image-Text-to-Text, Visual Question Answering, and Text-to-Speech, providing an accessible interface for experimenting with these advanced machine learning models.

(cover image)

2. Main features

| Module | Source | Function |
| --- | --- | --- |
| #M1 Image-Text-to-Text | microsoft/Florence-2-large | Description Generation; Computer Vision Tasks |
| #M2 Visual Question Answering | OpenGVLab/Mini-InternVL-Chat-2B-V1-5 | Chatbot |
| #M3 Text-to-Speech | coqui/XTTS-v2 | Description Speech Generation |

Computer Vision Tasks details

| Task type | Task details | Usage |
| --- | --- | --- |
| Image Captioning | Generate a short description | !describe -s |
| | Generate a detailed description | !describe -m |
| | Generate a more detailed description | !describe -l |
| | Localize and describe salient regions | !densecap |
| Object Detection | Detect objects from text inputs | !detect obj1 obj2 ... |
| Image Segmentation | Segment objects from text inputs | !segment obj1 obj2 ... |
| Optical Character Recognition | Localize and recognize text | !ocr |
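How these chat commands are dispatched is internal to the app, but they line up naturally with Florence-2's documented task prompts. Below is a minimal, hypothetical sketch of such a mapping using the loading pattern from the microsoft/Florence-2-large model card; the command-to-prompt table and the run_task helper are illustrative assumptions, not the app's actual code (!detect and !segment, which take extra text inputs, are left out for brevity).

```python
# Hypothetical sketch: mapping chat commands to Florence-2 task prompts.
# The prompt tokens come from the microsoft/Florence-2-large model card;
# the command names mirror the table above, but the app's real dispatch
# logic may differ.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

COMMAND_TO_TASK = {
    "!describe -s": "<CAPTION>",
    "!describe -m": "<DETAILED_CAPTION>",
    "!describe -l": "<MORE_DETAILED_CAPTION>",
    "!densecap": "<DENSE_REGION_CAPTION>",
    "!ocr": "<OCR_WITH_REGION>",
}

def run_task(image: Image.Image, command: str) -> dict:
    """Run one Florence-2 task for a chat command (illustrative only)."""
    task = COMMAND_TO_TASK[command]
    inputs = processor(text=task, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation converts raw output into task-specific results
    # (captions, boxes, labels, recognized text, ...).
    return processor.post_process_generation(
        raw, task=task, image_size=(image.width, image.height)
    )
```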

Additional features

  • Voice options: You can choose the voice for the Speech Synthesizer; there are currently 2 voice options (a minimal sketch of how voice selection could be wired to XTTS-v2 follows this list):
    • David Attenborough
    • Morgan Freeman
  • Random bot: With every input image entry, a different random bot avatar is used.
    • Demo
      demo_bot.mp4
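The voice options above rely on XTTS-v2's zero-shot voice cloning, which takes a short reference clip of the target speaker. A minimal sketch with the coqui TTS Python API is shown below; the reference wav paths are assumptions, not files shipped with this repo.

```python
# Minimal sketch of driving XTTS-v2 voice cloning from a voice option.
# The reference wav paths are hypothetical placeholders.
from TTS.api import TTS

VOICE_REFERENCES = {
    "David Attenborough": "voices/david_attenborough.wav",  # hypothetical path
    "Morgan Freeman": "voices/morgan_freeman.wav",          # hypothetical path
}

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

def synthesize(text: str, voice: str, out_path: str = "speech.wav") -> str:
    """Clone the selected reference voice and speak the given text."""
    tts.tts_to_file(
        text=text,
        speaker_wav=VOICE_REFERENCES[voice],
        language="en",
        file_path=out_path,
    )
    return out_path
```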

3. Demos

Image-Text-to-Text
demo_ittt.mp4
Visual Question Answering
demo_vqa.mp4
Text-to-Speech
demo_tts.mp4

4. Installation

4.1 Tested environment

  • Ubuntu 22.04
  • Python 3.10.12
  • NVIDIA driver 555
  • CUDA 11.8
  • CuDNN8 & CuDNN9

4.2 GPU requirements

  • Capable of processing on GPU and CPU:
| | GPU | CPU |
| --- | --- | --- |
| #M1 | ✓ | ✓ |
| #M2 | ✓ | |
| #M3 | ✓ | ✓ |
  • Do you need a GPU to run this app?
    • No. You can run this app on CPU, but only the Image-Text-to-Text and Text-to-Speech modules will be available, and processing time will be longer.
  • GPU consumptions:

(GPU consumptions reference table image)

  • You can set dtype and quantization based on this table to make full use of your GPU (a rough loading sketch follows this list).
  • For example with my 6GB GPU:
    • #M1: gpu - q4 - bfp16
    • #M2: gpu - q8 - bfp16
    • #M3: cpu - fp32
      • This is the current gpu_low specs config.
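For reference, here is one way a "gpu - q4 - bfp16" style spec could be expressed with transformers and bitsandbytes; whether the app uses exactly this mechanism is an assumption, so treat it as a sketch rather than the repo's loader.

```python
# Illustrative only: expressing a "gpu - q4 - bfp16" spec (the gpu_low
# setting for #M1) with transformers + bitsandbytes. The app's own
# loading code may differ.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

MODEL_ID = "microsoft/Florence-2-large"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # q4 (use load_in_8bit=True for q8)
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    quantization_config=quant_config,
    device_map="auto",                      # place the model on the GPU
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
```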

4.3 Installation

These steps are for running the app locally; using a virtual environment (venv) is recommended.

  • CPU only: Run pip install -r requirements.cpu.txt
  • GPU:
    • Install suitable NVIDIA driver
    • Install CUDA 11.8 & CuDNN 8 or 9
    • pip install -r requirements.txt
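After the GPU install, a quick check like the following (not part of the repo) confirms that PyTorch can actually see the GPU before you start the app:

```python
# Sanity check: make sure PyTorch was installed with CUDA support and can
# see at least one GPU.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA version used by PyTorch:", torch.version.cuda)
```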

5. Configs & Run

5.1 Config files

| | File | Includes |
| --- | --- | --- |
| General configs | app_config.yaml | Module configs; App configs; Launch configs |
| #M1 configs | florence_2_large.yaml | Load configs; Warm-up configs |
| #M2 configs | mini_internvl_chat_2b_v1_5.yaml | |
| #M3 configs | xtts_v2.yaml | |

5.2 Specs configs

There are 3 profiles for specs configs:

| | cpu | gpu_low | gpu_high |
| --- | --- | --- | --- |
| #M1 | cpu - fp32 | gpu - q4 - bfp16 | gpu - fp32 |
| #M2 | | gpu - q8 - bfp16 | gpu - fp32 |
| #M3 | cpu - fp32 | cpu - fp32 | gpu - fp32 |
| GPU VRAM needed | 0 | ~6GB | > 16GB |
  • With gpu_high, #M3 will use a longer speaker voice duration for synthesizing.
  • The current default profile is gpu_low. You can set the specs profile in app_config.yaml.
  • If you want to create a custom profile, remember to add it to all module config files as well (see the sketch below this list).
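As a purely hypothetical sketch (the actual keys and layout of the YAML files may differ; only the profile names and config file names come from the tables above), a custom profile could be checked against every module config file before launch:

```python
# Hypothetical sketch: verify that a specs profile exists in every module
# config file. Key names and file locations are assumptions.
import yaml

PROFILE = "gpu_low"  # or "cpu", "gpu_high", or your custom profile name

for config_file in (
    "florence_2_large.yaml",
    "mini_internvl_chat_2b_v1_5.yaml",
    "xtts_v2.yaml",
):
    with open(config_file) as f:
        module_config = yaml.safe_load(f)
    # A custom profile must be added to all module config files as well.
    if PROFILE not in module_config:
        raise KeyError(f"{config_file} has no specs for profile '{PROFILE}'")
```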

5.3 Run the app (Local)

  • Share option: To create a temporary shareable link for others to use the app, simply set share to True under launch_config in app_config.yaml before running the app (a rough sketch of the underlying Gradio launch call is shown at the end of this section).
  • Run the app:
    • Activate venv (Optional)
    • python app.py

The app will be running at http://127.0.0.1:7860/
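Under the hood, the share option maps onto Gradio's standard launch() call, roughly like the sketch below (the Blocks body is a placeholder, not the app's actual UI):

```python
# Rough sketch of what the launch step amounts to; the real UI is built
# by app.py and is not reproduced here.
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("Simple Application for Multimodal AI")

demo.launch(
    server_port=7860,
    share=False,  # set to True (via launch_config in app_config.yaml) for a public link
)
```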

6. Docker

6.1 NVIDIA Container Toolkit

You need to install the NVIDIA Container Toolkit in order to run the GPU images with Docker.

6.2 Build new images

Remember to change the specs profile in app_config.yaml before building images.

  • Docker engine build:

    • CPU specs: docker build -f Dockerfile.cpu -t {image_name}:{tag} .
    • GPU specs: docker build -t {image_name}:{tag} .
  • Docker compose build:

    • CPU specs: docker compose -f docker-compose.cpu.yaml build
    • GPU specs: docker compose build

6.3 Run built images

  • Docker engine run:

    • CPU image: docker run -p 7860:7860 {image_name}:{tag}
    • GPU image: docker run --gpus all -p 7860:7860 {image_name}:{tag}
  • Docker compose run:

    • CPU image: docker compose -f docker-compose.cpu.yaml up
    • GPU image: docker compose up

The app will be running at http://0.0.0.0:7860/

6.4 Pre-built images

  • Docker engine run:

    • cpu: docker run --pull=always -p 7860:7860 nguyennpa412/simple-multimodal-ai:cpu
    • gpu-low: docker run --pull=always --gpus all -p 7860:7860 nguyennpa412/simple-multimodal-ai:gpu-low
    • gpu-high: docker run --pull=always --gpus all -p 7860:7860 nguyennpa412/simple-multimodal-ai:gpu-high
  • Docker compose run:

    • cpu:
      • Change image in docker-compose.cpu.yaml to nguyennpa412/simple-multimodal-ai:cpu
      • docker compose -f docker-compose.cpu.yaml up --pull=always
    • gpu-low:
      • Change image in docker-compose.yaml to nguyennpa412/simple-multimodal-ai:gpu-low
      • docker compose up --pull=always
    • gpu-high:
      • Change image in docker-compose.yaml to nguyennpa412/simple-multimodal-ai:gpu-high
      • docker compose up --pull=always

The app will be running at http://0.0.0.0:7860/

7. References

  1. B. Xiao et al., "Florence-2: Advancing a unified representation for a variety of vision tasks," arXiv preprint arXiv:2311.06242, 2023. [Online]. Available: https://arxiv.org/abs/2311.06242
  2. Z. Chen et al., "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks," arXiv preprint arXiv:2312.14238, 2023. [Online]. Available: https://arxiv.org/abs/2312.14238
  3. Z. Chen et al., "How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites," arXiv preprint arXiv:2404.16821, 2024. [Online]. Available: https://arxiv.org/abs/2404.16821
  4. E. Casanova et al., "XTTS: A Massively Multilingual Zero-Shot Text-to-Speech Model," arXiv preprint arXiv:2406.04904, 2024. [Online]. Available: https://arxiv.org/abs/2406.04904