Multimodal Visual Question Answering

Visual Question Answering System using ViT, GPT, BERT, CLIP (LLMs/VLMs)

Tasks and questions to ponder:

Late Fusion - process each modality separately through a language or vision model, then merge their outputs
Early Fusion, like the Chameleon VLM (requires high computation power) - merge the modalities at the input and process them together
(Fine-tuning) Final layer: classification vs. generation layer (requires some design changes)
Can we utilize a pre-trained VLM such as BLIP or MiniGPT (similar to late fusion)?
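The difference between the two fusion strategies is easiest to see in tensor shapes. The following is a minimal sketch, not code from this repository: random tensors stand in for embedded tokens, mean pooling stands in for each modality's encoder, and the sequence lengths (12 and 197) and the width 768 are assumptions chosen for illustration.

```python
import torch

# Stand-in token features (assumption: both modalities are already embedded to width 768).
text_tokens = torch.randn(1, 12, 768)    # (batch, question tokens, width)
image_tokens = torch.randn(1, 197, 768)  # (batch, image tokens, width)


def late_fusion_input(text_tokens, image_tokens):
    """Summarize each modality on its own, then merge the two summary vectors."""
    text_vec = text_tokens.mean(dim=1)    # stand-in for the language encoder output
    image_vec = image_tokens.mean(dim=1)  # stand-in for the vision encoder output
    return torch.cat([text_vec, image_vec], dim=-1)


def early_fusion_input(text_tokens, image_tokens):
    """Merge the token sequences up front so a single model sees both modalities."""
    return torch.cat([image_tokens, text_tokens], dim=1)


late = late_fusion_input(text_tokens, image_tokens)
early = early_fusion_input(text_tokens, image_tokens)
print(late.shape)   # one merged vector per example
print(early.shape)  # one long mixed sequence per example
```

In the late fusion case the merge happens on two fixed-size vectors, while early fusion hands the downstream model a sequence of 209 positions, which is one reason it tends to cost more compute.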

Models

BERT (Bidirectional Encoder Representations from Transformers): Encoder-only language model trained on masked language modeling (MLM) and next sentence prediction (NSP).
ViT (Vision Transformer): Encoder-only vision model that treats image patches as tokens for efficient image processing.
GPT-2 (Generative Pre-trained Transformer 2): Decoder-only language model designed for coherent text generation.
CLIP (Contrastive Language-Image Pre-training): Model that jointly trains an image encoder and a text encoder to align visual and textual data in a shared latent space.
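To make "image patches as tokens" concrete, the short sketch below computes how many tokens a ViT sees for a given image and patch size. The function name `vit_sequence_length` is hypothetical, and the defaults (224-pixel images, 16-pixel patches, one extra [CLS] token) are assumptions matching a patch16-224 ViT.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16, cls_token: bool = True) -> int:
    """Number of tokens a ViT processes for one square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    # Optionally count the extra [CLS] token prepended to the patch sequence.
    return num_patches + (1 if cls_token else 0)


print(vit_sequence_length())                 # 14 * 14 patches plus [CLS] -> 197
print(vit_sequence_length(cls_token=False))  # patches only -> 196
```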

Late Fusion Model (Classification)

The late fusion model for Visual Question Answering (VQA) treats the task as a classification problem. It uses separate encoders for text and image inputs, whose outputs are fused before a classifier predicts the answer.
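Treating VQA as classification requires a fixed answer set. How this repository builds that set is not stated here, so the following is only a sketch of one common approach: keep the most frequent training answers and map everything else to an unknown class. The helper `build_answer_vocab`, the `<unk>` label, and the toy answers are all hypothetical.

```python
from collections import Counter


def build_answer_vocab(answers, top_k):
    """Map the top_k most frequent (normalized) answers to class indices."""
    counts = Counter(a.strip().lower() for a in answers)
    vocab = {"<unk>": 0}  # index 0 is reserved for answers outside the vocabulary
    for answer, _ in counts.most_common(top_k):
        vocab[answer] = len(vocab)
    return vocab


train_answers = ["yes", "no", "Yes", "2", "yes", "red", "no"]
vocab = build_answer_vocab(train_answers, top_k=2)

# Encode answers as class labels; unseen answers fall back to <unk>.
labels = [vocab.get(a.strip().lower(), vocab["<unk>"]) for a in ["yes", "no", "blue"]]
print(vocab)
print(labels)
```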

Architecture:

Text Encoder: Pre-trained BERT model (bert-base-uncased).
Image Encoder: Pre-trained Vision Transformer model (google/vit-base-patch16-224-in21k).
Fusion Layer: Combines the outputs from the text and image encoders using a linear layer, followed by ReLU activation and dropout.
Classifier: A linear layer that maps the fused representation to a fixed set of answer classes.
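The fusion layer and classifier described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the code in model.py: it assumes the two encoder outputs are pooled 768-dimensional vectors that are concatenated before the linear layer, and the class name `LateFusionHead`, the hidden size 512, the dropout rate, and the 1000 answer classes are hypothetical. Random tensors stand in for the BERT and ViT outputs so the block runs without downloading weights.

```python
import torch
from torch import nn


class LateFusionHead(nn.Module):
    """Fusion layer (linear + ReLU + dropout) followed by a linear classifier."""

    def __init__(self, text_dim=768, image_dim=768, hidden_dim=512, num_answers=1000, dropout=0.1):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, text_features, image_features):
        # Assumption: the two modalities are merged by concatenation.
        fused = self.fusion(torch.cat([text_features, image_features], dim=-1))
        return self.classifier(fused)


head = LateFusionHead()
head.eval()  # disable dropout for inference

# Stand-ins for pooled encoder outputs of a batch of 4 question/image pairs.
text_features = torch.randn(4, 768)
image_features = torch.randn(4, 768)

with torch.no_grad():
    logits = head(text_features, image_features)
predicted_class = logits.argmax(dim=-1)  # one answer-class index per example
print(logits.shape, predicted_class.shape)
```

During training, logits of this shape would typically be paired with a cross-entropy loss over the answer classes.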

Generation Model

vqa_generation.ipynb

The generation model treats VQA as a sequence generation problem. It integrates separate encoders for text and image inputs and uses a decoder to generate textual answers.

Architecture:

Text Encoder: Pre-trained BERT model (bert-base-uncased).
Image Encoder: Pre-trained Vision Transformer model (google/vit-base-patch16-224-in21k).
Fusion Layer: Combines the outputs from the text and image encoders using a linear layer, followed by ReLU activation and dropout.
Text Decoder: Pre-trained GPT-2 model (gpt2) for generating the answer text.
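How the fused representation reaches the decoder is not spelled out here. One possible design, shown below purely as an assumption, projects the fused vector to the decoder width (768 for gpt2) and prepends it as a single prefix position to the answer token embeddings. The class name `PrefixProjector` and the fused size 512 are hypothetical, and random tensors stand in for real embeddings.

```python
import torch
from torch import nn


class PrefixProjector(nn.Module):
    """Turn a fused vector into one prefix embedding for a text decoder."""

    def __init__(self, fused_dim=512, decoder_dim=768):
        super().__init__()
        self.proj = nn.Linear(fused_dim, decoder_dim)

    def forward(self, fused, answer_embeds):
        prefix = self.proj(fused).unsqueeze(1)            # (batch, 1, decoder_dim)
        return torch.cat([prefix, answer_embeds], dim=1)  # (batch, 1 + tokens, decoder_dim)


projector = PrefixProjector()
fused = torch.randn(2, 512)             # stand-in for the fusion layer output
answer_embeds = torch.randn(2, 5, 768)  # stand-in for embedded answer tokens

decoder_inputs = projector(fused, answer_embeds)
print(decoder_inputs.shape)
```

With the Hugging Face GPT-2 implementation, a sequence like `decoder_inputs` could be passed through the `inputs_embeds` argument instead of `input_ids`; whether the notebook does this is not confirmed by the text above.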

Running

Create conda env

conda env create -f environment.yml

Run the app, or use Docker

# run the app directly
python app.py

# serve a fill-mask model with transformers-cli
transformers-cli serve --task=fill-mask --model=bert-base-uncased

# query the server started above
curl -X POST http://localhost:8888/forward -H "accept: application/json" -H "Content-Type: application/json" -d '{"inputs": "Today is going to be a [MASK] day"}' | jq

# build the Docker image
docker build --platform linux/amd64 -t vqa:v1 .
# check the port with docker ps, then use the curl command above to get output
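For scripted checks, the curl command above can be reproduced with the Python standard library. This is a sketch that assumes the same host, port, and /forward endpoint as the curl example; the helper name `build_forward_request` is hypothetical, and the actual send is left commented out because it needs a running server.

```python
import json
import urllib.request


def build_forward_request(text, base_url="http://localhost:8888"):
    """Build the same POST request that the curl command sends."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/forward",
        data=payload,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


request = build_forward_request("Today is going to be a [MASK] day")
print(request.get_method(), request.full_url)

# Sending requires the server to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

If the container maps the service to a different port, pass that address as `base_url`.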

kHarshit/visual-question-answering
