Multimodal Large Language Models (MLLMs) hallucinate, resulting in an emerging topic of visual hallucination evaluation (VHE). This paper contributes a ChatGPT-Prompted visual hallucination evaluation Dataset (PhD) for objective VHE at a large scale. The essence of VHE is to ask an MLLM questions about specific images to assess its susceptibility to hallucination. Depending on what to ask (objects, attributes, sentiment, etc.) and how the questions are asked, we structure PhD along two dimensions, i.e., task and mode. Five visual recognition tasks, ranging from low-level (object / attribute recognition) to middle-level (sentiment / position recognition and counting), are considered. Besides a normal visual QA mode, which we term PhD-base, PhD also asks questions with inaccurate context (PhD-iac) or with incorrect context (PhD-icc), or with AI-generated counter common sense images (PhD-ccs). We construct PhD by a ChatGPT-assisted semi-automated pipeline, encompassing four pivotal modules: task-specific hallucinatory item (hitem) selection, hitem-embedded question generation, inaccurate / incorrect context generation, and counter-common-sense (CCS) image generation. With over 14k daily images, 750 CCS images and 102k VQA triplets in total, PhD reveals considerable variability in MLLMs' performance across various modes and tasks, offering valuable insights into the nature of hallucination. As such, PhD stands as a potent tool not only for VHE but may also play a significant role in the refinement of MLLMs.
In particular, we consider 4 testing modes, each covering 5 visual tasks: object recognition, attribute recognition, sentiment understanding, positional reasoning, and counting.
Note that the different modes are specifically designed to target different sources of hallucination: visual ambiguity (PhD-base), multi-modal input (PhD-iac and PhD-icc), and counter common sense (PhD-ccs). See the following figure for more details.
Hallucinatory items (hitems) refer to specific terms (words or phrases) in visual questions posed to an MLLM that lead to discrepancies between the MLLM's response and the corresponding visual content.
To illustrate, consider an image of a dining table setting that lacks a fork. Although the fork is absent, its association with the dining table makes it a potential hitem.
The PhD dataset provides hitem information for each query. For PhD-ccs, we also include a ccs_description to elucidate why the image may easily induce hallucinations.
This information makes it clear why the PhD questions are well suited to exposing hallucinations, an aspect currently missing from other hallucination datasets.
The statistics of the dataset and some examples are shown below. Images of PhD-base, PhD-iac, and PhD-icc are sourced from the COCO dataset, which ensures that MLLMs have been exposed to these images. Despite this, they can still produce incorrect answers, which reflects hallucinations in low-level visual tasks.
- PhD-base: Shown in (c) with the red and green blocks. It can be regarded as a normal visual question answering task (normal question and image), except that we additionally indicate the hallucinatory item (hitem) in the question (see data.json).
- PhD-iac: Shown in (c) with the yellow block. Each PhD-base question is combined with inaccurate context, i.e., context containing noisy information unrelated to the image.
- PhD-icc: Shown in (c) with the purple block. Similar to PhD-iac, the question is combined with incorrect context, i.e., context that directly conflicts with the image.
- PhD-ccs: Shown in (d). Though the question itself is normal, the image is AI-generated and counter-common-sense in the real world.
PhD is a continuously evolving dataset, and we will continue to update and refine it. If you have any questions or suggestions, please feel free to contact us.
- PhD-base, PhD-iac, and PhD-icc use COCO 2014 images (both train and val). You can download the images directly from the COCO website.
- PhD-ccs uses our AI-generated images. You can download them into CCS_images from the following link: Google Drive.
For your convenience in evaluation, please organize the data in the following format.
images/
    COCO/
        train2014/
            COCO_train2014_000000000139.jpg
            COCO_train2014_000000000164.jpg
            ...
        val2014/
            COCO_val2014_000000000139.jpg
            COCO_val2014_000000000164.jpg
            ...
    CCS_images/
        0.png
        1.png
        ...
data_base_cxt.json
data_ccs.json
# This file contains the data of PhD-base, PhD-iac, PhD-icc and PhD-ccs.
# The file can be read as an array of dicts in JSON format.
import json
data = json.load(open('data.json', encoding='utf-8'))

# Each PhD-base / iac / icc sample includes the following keys:
"""
· image_id: the COCO image id of the test image.
· task: one of the 5 tasks.
· yes_question: question whose ground-truth answer is yes.
· no_question: question whose ground-truth answer is no.
· hitem: the hallucinatory item.
· gt: the ground truth.
· subject: the subject being asked about.
· context: {"iac": inaccurate context, "icc": incorrect context}
"""
- If you want to run the PhD-base mode, simply use the question (yes_ / no_).
- For PhD-iac and PhD-icc, use the context field to obtain the inaccurate or incorrect context, then combine it with the question. For example: context["iac"] + " In case there is an inconsistency between the context and the image content, you should follow the image. " + question (see the loading examples below).
- For PhD-ccs, each sample includes the following keys:
"""
· image_id: indicate id of our generated images.
· ccs_description: specific the reason why the image is counter-common-sense.
· yes_question: question which answer is yes.
· no_question: question which answer is no.
· task: one of the 5 tasks.
"""
import json
data = json.load(open('data.json', encoding='utf-8'))
# Examples: Loading PhD-base, PhD-iac, PhD-icc, and PhD-ccs
# PhD-base
phd_base = [{'image_id': sample['image_id'],
             'yes_question': sample['yes_question'],
             'no_question': sample['no_question']}
            for sample in data if 'ccs_description' not in sample]

# PhD-iac: prepend the inaccurate context and the instruction to each question
instruction = " In case there is an inconsistency between the context and the image content, you should follow the image. "
phd_iac = []
for sample in data:
    if 'ccs_description' in sample:
        continue
    yes_question = sample["context"]["iac"] + instruction + sample['yes_question']
    no_question = sample["context"]["iac"] + instruction + sample['no_question']
    phd_iac.append({'image_id': sample['image_id'],
                    'yes_question': yes_question,
                    'no_question': no_question})

# PhD-icc: prepend the incorrect context and the instruction to each question
phd_icc = []
for sample in data:
    if 'ccs_description' in sample:
        continue
    yes_question = sample["context"]["icc"] + instruction + sample['yes_question']
    no_question = sample["context"]["icc"] + instruction + sample['no_question']
    phd_icc.append({'image_id': sample['image_id'],
                    'yes_question': yes_question,
                    'no_question': no_question})

# PhD-ccs
phd_ccs = [{'image_id': sample['image_id'],
            'yes_question': sample['yes_question'],
            'no_question': sample['no_question']}
           for sample in data if 'ccs_description' in sample]
The image_id should be combined with the image path to load the image, e.g., images/COCO/val2014/COCO_val2014_{image_id}.jpg.
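For instance, a minimal path-building sketch could look like the one below (the split directory and any zero-padding of image_id are assumptions; adapt them to how image_id is stored in your copy of the data):

```python
import os

def coco_image_path(image_id, split='val2014', root='images/COCO'):
    # Assumes image_id is already the zero-padded string used in COCO file names,
    # and that split is 'train2014' or 'val2014' depending on where the image lives.
    return os.path.join(root, split, f'COCO_{split}_{image_id}.jpg')

def ccs_image_path(image_id, root='images/CCS_images'):
    # Assumes PhD-ccs images are stored as <image_id>.png under CCS_images.
    return os.path.join(root, f'{image_id}.png')
```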
As described in the paper, we propose a novel evaluation metric, the PhD Index (PhD score), to evaluate the performance of MLLMs on the PhD dataset.
Simply put, the PhD Index is the F1 value (harmonic mean) of the recall rates for yes answers and no answers. It is designed to be sensitive to a model's tendency to output yes or no, providing a nuanced understanding of the model's performance.
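As a minimal sketch (our own illustration; see the paper for the formal definition), the PhD Index can be computed from the two recalls as follows:

```python
def phd_index(yes_recall, no_recall):
    """Harmonic mean (F1) of the recall on yes-questions and no-questions.

    A model biased toward always answering "yes" (or "no") gets one recall
    near 1 and the other near 0, so its PhD Index stays low.
    """
    if yes_recall + no_recall == 0:
        return 0.0
    return 2 * yes_recall * no_recall / (yes_recall + no_recall)
```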
For the evaluation results, please refer to the experiment section of the paper, as well as the supplementary materials.
If you find this work useful, please consider giving this repository a star and citing our paper as follows:
@misc{liu2024phd,
title={PhD: A Prompted Visual Hallucination Evaluation Dataset},
author={Jiazhen Liu and Yuhan Fu and Ruobing Xie and Runquan Xie and Xingwu Sun and Fengzong Lian and Zhanhui Kang and Xirong Li},
year={2024},
eprint={2403.11116},
archivePrefix={arXiv},
primaryClass={cs.CV}
}