Multimodal Large Language Models (MLLMs) hallucinate, resulting in an emerging topic of visual hallucination evaluation (VHE). This paper contributes a ChatGPT-Prompted visual hallucination evaluation Dataset (PhD) for objective VHE at a large scale. The essence of VHE is to ask an MLLM questions about specific images to assess its susceptibility to hallucination. Depending on what to ask (objects, attributes, sentiment, etc.) and how the questions are asked, we structure PhD along two dimensions, i.e., task and mode. Five visual recognition tasks, ranging from low-level (object / attribute recognition) to middle-level (sentiment / position recognition and counting), are considered. Besides a normal visual QA mode, which we term PhD-base, PhD also asks questions with inaccurate context (PhD-iac) or with incorrect context (PhD-icc), or with AI-generated counter common sense images (PhD-ccs). We construct PhD by a ChatGPT-assisted semi-automated pipeline, encompassing four pivotal modules: task-specific hallucinatory item (hitem) selection, hitem-embedded question generation, inaccurate / incorrect context generation, and counter-common-sense (CCS) image generation. With over 14k daily images, 750 CCS images and 102k VQA triplets in total, PhD reveals considerable variability in MLLMs' performance across various modes and tasks, offering valuable insights into the nature of hallucination. As such, PhD stands as a potent tool not only for VHE but may also play a significant role in the refinement of MLLMs.
In particular, we consider 4 testing modes, each covering 5 visual tasks: object recognition, attribute recognition, sentiment understanding, positional reasoning, and counting.
Note that the different modes are specifically designed to target different sources of hallucination: visual ambiguity (PhD-base), multi-modal input (PhD-iac and PhD-icc), and counter common sense (PhD-ccs). See the following figure for more details.
Hallucinatory items (hitems) refer to specific terms (words or phrases) in visual questions posed to an MLLM that lead to discrepancies between the MLLM's response and the corresponding visual content.
To illustrate, consider an image of a dining table setting that lacks a fork. Although the fork is absent, its association with the dining table makes it a potential hitem.
The PhD dataset provides hitem information for each query. For PhD-ccs, we also include a ccs_description to elucidate why the image may easily induce hallucinations.
This information makes it clear why the PhD questions are well suited to exposing hallucinations, an aspect currently missing from other hallucination datasets.
The statistics of the dataset and some examples are shown below. Images of PhD-base, PhD-iac, and PhD-icc are sourced from the COCO dataset, which ensures that MLLMs have been exposed to these images. Despite this, they can still produce incorrect answers, which reflects hallucinations in low-level visual tasks.
- PhD-base: Shown in (c) with the red and green blocks. It can be regarded as a normal visual question answering task (normal question and image), except that we additionally indicate the hallucinatory item (hitem) in the question (see data.json).
- PhD-iac: Shown in (c) with the yellow block. Each PhD-base question is combined with inaccurate context, i.e., context containing noisy information unrelated to the image.
- PhD-icc: Shown in (c) with the purple block. Similar to PhD-iac, the question is combined with incorrect context, i.e., context that directly conflicts with the image.
- PhD-ccs: Shown in (d). Though the question itself is normal, the image is AI-generated and counter-common-sense in the real world.
PhD is a continuously evolving dataset, and we will continue to update and refine it. If you have any questions or suggestions, please feel free to contact us.
- PhD-base, PhD-iac, and PhD-icc use COCO 2014 images (both train and val). You can download the images directly from the COCO website.
- PhD-ccs uses our AI-generated images. You can download them into CCS_images from the following link: Google Drive.
For your convenience in evaluation, please organize the data in the following format.
images/
    COCO/
        train2014/
            COCO_train2014_000000000139.jpg
            COCO_train2014_000000000164.jpg
            ...
        val2014/
            COCO_val2014_000000000139.jpg
            COCO_val2014_000000000164.jpg
            ...
    CCS_images/
        0.png
        1.png
        ...
data_base_cxt.json
data_ccs.json
# This file contains the data of PhD-base, PhD-iac, PhD-icc and PhD-ccs.
# The file can be read as an array of dicts in JSON format.
import json
data = json.load(open('data.json', encoding='utf-8'))

# Each PhD-base / iac / icc sample includes the following keys:
"""
· image_id: the COCO image id of the test image.
· task: one of the 5 tasks.
· yes_question: question whose ground-truth answer is yes.
· no_question: question whose ground-truth answer is no.
· hitem: the hallucinatory item.
· gt: the ground truth.
· subject: the subject being asked about.
· context: {"iac": inaccurate context, "icc": incorrect context}
"""
- If you want to run the PhD-base mode, simply use the question (yes_ / no_).
- For PhD-iac and PhD-icc, use the context field to obtain the inaccurate or incorrect context, then combine it with the question. For example: context["iac"] + " In case there is an inconsistency between the context and the image content, you should follow the image. " + question (see the loading examples below).
- For PhD-ccs, each sample includes the following keys:
"""
· image_id: indicate id of our generated images.
· ccs_description: specific the reason why the image is counter-common-sense.
· yes_question: question which answer is yes.
· no_question: question which answer is no.
· task: one of the 5 tasks.
"""
import json
data = json.load(open('data.json', encoding='utf-8'))
# Examples: Loading PhD-base, PhD-iac, PhD-icc, and PhD-ccs
# PhD-base
phd_base = [{'image_id': sample['image_id'],
             'yes_question': sample['yes_question'],
             'no_question': sample['no_question']}
            for sample in data if 'ccs_description' not in sample]

# PhD-iac: prepend the inaccurate context and the instruction to each question
instruction = " In case there is an inconsistency between the context and the image content, you should follow the image. "
phd_iac = []
for sample in data:
    if 'ccs_description' in sample:
        continue
    yes_question = sample["context"]["iac"] + instruction + sample['yes_question']
    no_question = sample["context"]["iac"] + instruction + sample['no_question']
    phd_iac.append({'image_id': sample['image_id'],
                    'yes_question': yes_question,
                    'no_question': no_question})

# PhD-icc: prepend the incorrect context and the instruction to each question
phd_icc = []
for sample in data:
    if 'ccs_description' in sample:
        continue
    yes_question = sample["context"]["icc"] + instruction + sample['yes_question']
    no_question = sample["context"]["icc"] + instruction + sample['no_question']
    phd_icc.append({'image_id': sample['image_id'],
                    'yes_question': yes_question,
                    'no_question': no_question})

# PhD-ccs
phd_ccs = [{'image_id': sample['image_id'],
            'yes_question': sample['yes_question'],
            'no_question': sample['no_question']}
           for sample in data if 'ccs_description' in sample]
The image_id should be combined with the image path to load the image, e.g., images/COCO/val2014/COCO_val2014_{image_id}.jpg.
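For instance, a minimal path-building sketch could look like the one below (the split directory and any zero-padding of image_id are assumptions; adapt them to how image_id is stored in your copy of the data):

```python
import os

def coco_image_path(image_id, split='val2014', root='images/COCO'):
    # Assumes image_id is already the zero-padded string used in COCO file names,
    # and that split is 'train2014' or 'val2014' depending on where the image lives.
    return os.path.join(root, split, f'COCO_{split}_{image_id}.jpg')

def ccs_image_path(image_id, root='images/CCS_images'):
    # Assumes PhD-ccs images are stored as <image_id>.png under CCS_images.
    return os.path.join(root, f'{image_id}.png')
```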
As described in the paper, we propose a novel evaluation metric, the PhD Index (PhD score), to evaluate the performance of MLLMs on the PhD dataset.
Simply put, the PhD Index is the F1 value (harmonic mean) of the recall rates for yes answers and no answers. It is designed to be sensitive to a model's tendency to output yes or no, providing a nuanced understanding of the model's performance.
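As a minimal sketch (our own illustration; see the paper for the formal definition), the PhD Index can be computed from the two recalls as follows:

```python
def phd_index(yes_recall, no_recall):
    """Harmonic mean (F1) of the recall on yes-questions and no-questions.

    A model biased toward always answering "yes" (or "no") gets one recall
    near 1 and the other near 0, so its PhD Index stays low.
    """
    if yes_recall + no_recall == 0:
        return 0.0
    return 2 * yes_recall * no_recall / (yes_recall + no_recall)
```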
For the evaluation results, please refer to the experiment section of the paper, as well as the supplementary materials.
If you find this work useful, please consider giving this repository a star and citing our paper as follows:
@misc{liu2024phd,
title={PhD: A Prompted Visual Hallucination Evaluation Dataset},
author={Jiazhen Liu and Yuhan Fu and Ruobing Xie and Runquan Xie and Xingwu Sun and Fengzong Lian and Zhanhui Kang and Xirong Li},
year={2024},
eprint={2403.11116},
archivePrefix={arXiv},
primaryClass={cs.CV}
}