
CAST: Cross-modal Alignment Similarity Test for Vision Language Models

Abstract

Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks, which assess a model's understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose the Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. The test asks a model to identify similarities between two scenes using text-only, image-only, or combined inputs, and then to assess the truthfulness of the similarities it generated. Since there is no ground truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.
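Conceptually, the protocol can be sketched as below. This is an illustrative sketch only, not the repository's implementation: `query_vlm` is a hypothetical helper standing in for whichever VLM is being evaluated, and the exact prompts and scoring are assumptions.

```python
# Illustrative sketch of the CAST two-step protocol (not the repository's exact code).
# `query_vlm(prompt, images)` is a hypothetical helper that sends a prompt (and
# optional images) to the VLM under evaluation and returns its text response.

def cast_self_consistency(query_vlm, desc_a, desc_b, image_a, image_b, mode="both"):
    # Step 1: ask the model to list similarities between the two scenes,
    # conditioning on text only, images only, or both modalities.
    text_prompt = (
        f"Scene 1: {desc_a}\nScene 2: {desc_b}\n"
        "List the similarities between the two scenes."
    )
    if mode == "text":
        prompt, images = text_prompt, []
    elif mode == "image":
        prompt, images = "List the similarities between the two images.", [image_a, image_b]
    else:  # both modalities
        prompt, images = text_prompt, [image_a, image_b]

    similarities = [s for s in query_vlm(prompt, images).splitlines() if s.strip()]

    # Step 2: ask the same model to verify each similarity it generated.
    # Self-consistency is the fraction of its own claims it judges to be true.
    verified = 0
    for claim in similarities:
        answer = query_vlm(
            f"Scene 1: {desc_a}\nScene 2: {desc_b}\n"
            f"Is the following similarity true? Answer yes or no.\n{claim}",
            [image_a, image_b],
        )
        verified += answer.strip().lower().startswith("yes")

    return verified / max(len(similarities), 1)
```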

Dataset

The dataset is sub-sampled from the DOCCI dataset (Onoe et al., 2024) and consists of 100 image pairs.

We select pairs of images that are visually similar and textually interesting. The similarity of the image pairs is determined by the cosine similarity of the image embeddings extracted from CLIP (Radford et al., 2021).

Figure: CLIP similarity of the selected image pairs.

Figure: CLIP similarity vs. description length.
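As a rough illustration of the pairing criterion, the following sketch scores a candidate image pair with CLIP image embeddings from the transformers library. The specific checkpoint (openai/clip-vit-base-patch32) and the use of PIL here are assumptions, not necessarily what was used to build the subset.

```python
# Sketch: score a candidate image pair by CLIP cosine similarity.
# The checkpoint choice is an assumption; a different CLIP variant may have been used.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_pair_similarity(path_a: str, path_b: str) -> float:
    images = [Image.open(path_a), Image.open(path_b)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)  # one embedding per image
    return F.cosine_similarity(embeddings[0], embeddings[1], dim=0).item()
```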

Requirements

The code is implemented in Python 3.12. To install the required packages, run the following command:

```
pip install -r requirements.txt
```

Run

To run CAST on our generated DOCCI subset, edit configs/eval.yaml with the desired configuration and run:

```
python main.py
```
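As a quick sanity check before launching a run, the configuration can be loaded and printed. This sketch only assumes configs/eval.yaml is standard YAML and makes no assumptions about its fields.

```python
# Inspect the evaluation config before running `python main.py`.
# The schema of configs/eval.yaml is defined by the repository; this simply
# loads and prints whatever is there.
import yaml

with open("configs/eval.yaml") as f:
    config = yaml.safe_load(f)

for key, value in config.items():
    print(f"{key}: {value}")
```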
