
Simple Application for Multimodal AI

Table of Contents

  1. Introduction
  2. Main features
  3. Demos
  4. Installation
  5. Configs & Run
  6. Docker
  7. References

1. Introduction

A simple yet versatile application built with Gradio, featuring the integration of various open-source models from Hugging Face. The app supports a range of tasks, including Image-Text-to-Text, Visual Question Answering, and Text-to-Speech, providing an accessible interface for experimenting with these advanced machine learning models.

(cover image)

2. Main features

| Module | Source | Function |
| --- | --- | --- |
| #M1 Image-Text-to-Text | microsoft/Florence-2-large | Description Generation; Computer Vision Tasks |
| #M2 Visual Question Answering | OpenGVLab/Mini-InternVL-Chat-2B-V1-5 | Chatbot |
| #M3 Text-to-Speech | coqui/XTTS-v2 | Description Speech Generation |

Computer Vision Tasks details

| Task type | Task details | Usage |
| --- | --- | --- |
| Image Captioning | Generate a short description | !describe -s |
| | Generate a detailed description | !describe -m |
| | Generate a more detailed description | !describe -l |
| | Localize and describe salient regions | !densecap |
| Object Detection | Detect objects from text inputs | !detect obj1 obj2 ... |
| Image Segmentation | Segment objects from text inputs | !segment obj1 obj2 ... |
| Optical Character Recognition | Localize and recognize text | !ocr |
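How these chat commands are dispatched is internal to the app, but they line up naturally with Florence-2's documented task prompts. Below is a minimal, hypothetical sketch of such a mapping using the loading pattern from the microsoft/Florence-2-large model card; the command-to-prompt table and the run_task helper are illustrative assumptions, not the app's actual code (!detect and !segment, which take extra text inputs, are left out for brevity).

```python
# Hypothetical sketch: mapping chat commands to Florence-2 task prompts.
# The prompt tokens come from the microsoft/Florence-2-large model card;
# the command names mirror the table above, but the app's real dispatch
# logic may differ.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

COMMAND_TO_TASK = {
    "!describe -s": "<CAPTION>",
    "!describe -m": "<DETAILED_CAPTION>",
    "!describe -l": "<MORE_DETAILED_CAPTION>",
    "!densecap": "<DENSE_REGION_CAPTION>",
    "!ocr": "<OCR_WITH_REGION>",
}

def run_task(image: Image.Image, command: str) -> dict:
    """Run one Florence-2 task for a chat command (illustrative only)."""
    task = COMMAND_TO_TASK[command]
    inputs = processor(text=task, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation converts raw output into task-specific results
    # (captions, boxes, labels, recognized text, ...).
    return processor.post_process_generation(
        raw, task=task, image_size=(image.width, image.height)
    )
```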

Additional features

  • Voice options: You can choose the voice for the Speech Synthesizer; there are currently 2 voice options (a minimal sketch of how voice selection could be wired to XTTS-v2 follows this list):
    • David Attenborough
    • Morgan Freeman
  • Random bot: With every input image entry, a different random bot avatar is used.
    • Demo
      demo_bot.mp4
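The voice options above rely on XTTS-v2's zero-shot voice cloning, which takes a short reference clip of the target speaker. A minimal sketch with the coqui TTS Python API is shown below; the reference wav paths are assumptions, not files shipped with this repo.

```python
# Minimal sketch of driving XTTS-v2 voice cloning from a voice option.
# The reference wav paths are hypothetical placeholders.
from TTS.api import TTS

VOICE_REFERENCES = {
    "David Attenborough": "voices/david_attenborough.wav",  # hypothetical path
    "Morgan Freeman": "voices/morgan_freeman.wav",          # hypothetical path
}

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

def synthesize(text: str, voice: str, out_path: str = "speech.wav") -> str:
    """Clone the selected reference voice and speak the given text."""
    tts.tts_to_file(
        text=text,
        speaker_wav=VOICE_REFERENCES[voice],
        language="en",
        file_path=out_path,
    )
    return out_path
```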

3. Demos

Image-Text-to-Text
demo_ittt.mp4
Visual Question Answering
demo_vqa.mp4
Text-to-Speech
demo_tts.mp4

4. Installation

4.1 Tested environment

  • Ubuntu 22.04
  • Python 3.10.12
  • NVIDIA driver 555
  • CUDA 11.8
  • CuDNN8 & CuDNN9

4.2 GPU requirements

  • Capable of processing on GPU and CPU:
| | GPU | CPU |
| --- | --- | --- |
| #M1 | ✓ | ✓ |
| #M2 | ✓ | |
| #M3 | ✓ | ✓ |
  • Do you need a GPU to run this app?
    • No. You can run this app on CPU, but only the Image-Text-to-Text and Text-to-Speech modules will be available, and processing time will be longer.
  • GPU consumptions:

(GPU consumptions reference table image)

  • You can set dtype and quantization based on this table to make full use of your GPU (a rough loading sketch follows this list).
  • For example with my 6GB GPU:
    • #M1: gpu - q4 - bfp16
    • #M2: gpu - q8 - bfp16
    • #M3: cpu - fp32
      • This is the current gpu_low specs config.
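For reference, here is one way a "gpu - q4 - bfp16" style spec could be expressed with transformers and bitsandbytes; whether the app uses exactly this mechanism is an assumption, so treat it as a sketch rather than the repo's loader.

```python
# Illustrative only: expressing a "gpu - q4 - bfp16" spec (the gpu_low
# setting for #M1) with transformers + bitsandbytes. The app's own
# loading code may differ.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

MODEL_ID = "microsoft/Florence-2-large"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # q4 (use load_in_8bit=True for q8)
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    quantization_config=quant_config,
    device_map="auto",                      # place the model on the GPU
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
```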

4.3 Installation

These steps are for running the app locally; using a virtual environment (venv) is recommended.

  • CPU only: Run pip install -r requirements.cpu.txt
  • GPU:
    • Install suitable NVIDIA driver
    • Install CUDA 11.8 & CuDNN 8 or 9
    • pip install -r requirements.txt
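After the GPU install, a quick check like the following (not part of the repo) confirms that PyTorch can actually see the GPU before you start the app:

```python
# Sanity check: make sure PyTorch was installed with CUDA support and can
# see at least one GPU.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA version used by PyTorch:", torch.version.cuda)
```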

5. Configs & Run

5.1 Config files

| | File | Includes |
| --- | --- | --- |
| General configs | app_config.yaml | Module configs; App configs; Launch configs |
| #M1 configs | florence_2_large.yaml | Load configs; Warm-up configs |
| #M2 configs | mini_internvl_chat_2b_v1_5.yaml | |
| #M3 configs | xtts_v2.yaml | |

5.2 Specs configs

There are 3 profiles for specs configs:

| | cpu | gpu_low | gpu_high |
| --- | --- | --- | --- |
| #M1 | cpu - fp32 | gpu - q4 - bfp16 | gpu - fp32 |
| #M2 | | gpu - q8 - bfp16 | gpu - fp32 |
| #M3 | cpu - fp32 | cpu - fp32 | gpu - fp32 |
| GPU VRAM needed | 0 | ~6GB | > 16GB |
  • With gpu_high, #M3 will use a longer speaker voice duration for synthesizing.
  • The current default profile is gpu_low. You can set the specs profile in app_config.yaml.
  • If you want to create a custom profile, remember to add it to all module config files as well (see the sketch below this list).
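As a purely hypothetical sketch (the actual keys and layout of the YAML files may differ; only the profile names and config file names come from the tables above), a custom profile could be checked against every module config file before launch:

```python
# Hypothetical sketch: verify that a specs profile exists in every module
# config file. Key names and file locations are assumptions.
import yaml

PROFILE = "gpu_low"  # or "cpu", "gpu_high", or your custom profile name

for config_file in (
    "florence_2_large.yaml",
    "mini_internvl_chat_2b_v1_5.yaml",
    "xtts_v2.yaml",
):
    with open(config_file) as f:
        module_config = yaml.safe_load(f)
    # A custom profile must be added to all module config files as well.
    if PROFILE not in module_config:
        raise KeyError(f"{config_file} has no specs for profile '{PROFILE}'")
```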

5.3 Run the app (Local)

  • Share option: To create a temporary shareable link for others to use the app, simply set share to True under launch_config in app_config.yaml before running the app (a rough sketch of the underlying Gradio launch call is shown at the end of this section).
  • Run the app:
    • Activate venv (Optional)
    • python app.py

The app will be running at http://127.0.0.1:7860/
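Under the hood, the share option maps onto Gradio's standard launch() call, roughly like the sketch below (the Blocks body is a placeholder, not the app's actual UI):

```python
# Rough sketch of what the launch step amounts to; the real UI is built
# by app.py and is not reproduced here.
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("Simple Application for Multimodal AI")

demo.launch(
    server_port=7860,
    share=False,  # set to True (via launch_config in app_config.yaml) for a public link
)
```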

6. Docker

6.1 NVIDIA Container Toolkit

You need to install the NVIDIA Container Toolkit in order to run the GPU images with Docker.

6.2 Build new images

Remember to change the specs profile in app_config.yaml before building images.

  • Docker engine build:

    • CPU specs: docker build -f Dockerfile.cpu -t {image_name}:{tag} .
    • GPU specs: docker build -t {image_name}:{tag} .
  • Docker compose build:

    • CPU specs: docker compose -f docker-compose.cpu.yaml build
    • GPU specs: docker compose build

6.3 Run built images

  • Docker engine run:

    • CPU image: docker run -p 7860:7860 {image_name}:{tag}
    • GPU image: docker run --gpus all -p 7860:7860 {image_name}:{tag}
  • Docker compose run:

    • CPU image: docker compose -f docker-compose.cpu.yaml up
    • GPU image: docker compose up

The app will be running at http://0.0.0.0:7860/

6.4 Pre-built images

  • Docker engine run:

    • cpu: docker run --pull=always -p 7860:7860 nguyennpa412/simple-multimodal-ai:cpu
    • gpu-low: docker run --pull=always --gpus all -p 7860:7860 nguyennpa412/simple-multimodal-ai:gpu-low
    • gpu-high: docker run --pull=always --gpus all -p 7860:7860 nguyennpa412/simple-multimodal-ai:gpu-high
  • Docker compose run:

    • cpu:
      • Change image in docker-compose.cpu.yaml to nguyennpa412/simple-multimodal-ai:cpu
      • docker compose -f docker-compose.cpu.yaml up --pull=always
    • gpu-low:
      • Change image in docker-compose.yaml to nguyennpa412/simple-multimodal-ai:gpu-low
      • docker compose up --pull=always
    • gpu-high:
      • Change image in docker-compose.yaml to nguyennpa412/simple-multimodal-ai:gpu-high
      • docker compose up --pull=always

The app will be running at http://0.0.0.0:7860/

7. References

  1. B. Xiao et al., "Florence-2: Advancing a unified representation for a variety of vision tasks," arXiv preprint arXiv:2311.06242, 2023. [Online]. Available: https://arxiv.org/abs/2311.06242
  2. Z. Chen et al., "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks," arXiv preprint arXiv:2312.14238, 2023. [Online]. Available: https://arxiv.org/abs/2312.14238
  3. Z. Chen et al., "How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites," arXiv preprint arXiv:2404.16821, 2024. [Online]. Available: https://arxiv.org/abs/2404.16821
  4. E. Casanova et al., "XTTS: A Massively Multilingual Zero-Shot Text-to-Speech Model," arXiv preprint arXiv:2406.04904, 2024. [Online]. Available: https://arxiv.org/abs/2406.04904