
PhD: A Prompted Visual Hallucination Evaluation Dataset

Introduction

Multimodal Large Language Models (MLLMs) hallucinate, resulting in an emerging topic of visual hallucination evaluation (VHE). This paper contributes a ChatGPT-Prompted visual hallucination evaluation Dataset (PhD) for objective VHE at a large scale. The essence of VHE is to ask an MLLM questions about specific images to assess its susceptibility to hallucination. Depending on what to ask (objects, attributes, sentiment, etc.) and how the questions are asked, we structure PhD along two dimensions, i.e., task and mode. Five visual recognition tasks, ranging from low-level (object / attribute recognition) to middle-level (sentiment / position recognition and counting), are considered. Besides a normal visual QA mode, which we term PhD-base, PhD also asks questions with inaccurate context (PhD-iac) or with incorrect context (PhD-icc), or with AI-generated counter common sense images (PhD-ccs). We construct PhD by a ChatGPT-assisted semi-automated pipeline, encompassing four pivotal modules: task-specific hallucinatory item (hitem) selection, hitem-embedded question generation, inaccurate / incorrect context generation, and counter-common-sense (CCS) image generation. With over 14k daily images, 750 CCS images and 102k VQA triplets in total, PhD reveals considerable variability in MLLMs' performance across various modes and tasks, offering valuable insights into the nature of hallucination. As such, PhD stands as a potent tool not only for VHE but may also play a significant role in the refinement of MLLMs.

Mode and Task

In particular, we consider 4 testing modes, each applied across 5 visual tasks: object recognition, attribute recognition, sentiment understanding, positional reasoning, and counting.

Note that the different modes are designed to probe different sources of hallucination: visual ambiguity (PhD-base), multi-modal input (PhD-iac and PhD-icc), and counter-common-sense images (PhD-ccs). See the following figure for more details.

[Figure: the four evaluation modes and the hallucination sources they target]

The meaning of hitem

Hallucinatory items (hitems) refer to specific terms (words or phrases) in visual questions posed to an MLLM that lead to discrepancies between the MLLM's response and the corresponding visual content.

To illustrate, consider an image of a dining table setting that lacks a fork. Although the fork is absent, its association with the dining table makes it a potential hitem.

The PhD dataset provides hitem information for each query. For PhD-ccs, we also include a ccs_description field that explains why the image may easily induce hallucinations.
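
For illustration, here is what a hypothetical PhD-base entry for the fork example could look like. All field values below are invented; the real schema is documented under Data Organization.

# A hypothetical PhD-base entry for the fork example above.
# Values are illustrative only; see "Data Organization" below for the schema.
sample = {
    "image_id": "000000000139",            # COCO image id (format assumed)
    "task": "object recognition",          # task name is an assumption
    "yes_question": "Is there a dining table in the image?",
    "no_question": "Is there a fork in the image?",
    "hitem": "fork",                       # absent item strongly associated with the scene
    "gt": "no",                            # ground truth (wording assumed)
    "subject": "dining table",
    "context": {"iac": "...", "icc": "..."},
}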

This is why PhD questions are well suited to surfacing hallucinations, an aspect currently missing from other hallucination datasets.

Showcases

The statistics of the dataset and some examples are shown below. Images for PhD-base, PhD-iac, and PhD-icc are sourced from the COCO dataset, which ensures that MLLMs have been exposed to these images. Despite this, they can still generate incorrect answers, reflecting hallucination in low-level visual tasks.

[Figure: dataset statistics and example questions; panels (c) and (d) are referenced below]
  • PhD-base: Shown in (c) with the red and green blocks. Essentially a normal visual question answering task (a normal question on a normal image), except that we additionally annotate the hallucinatory element (hitem) of each question (see data.json).
  • PhD-iac: Shown in (c) with the yellow block. Each PhD-base question is combined with an inaccurate context, i.e., context containing noisy information unrelated to the image.
  • PhD-icc: Shown in (c) with the purple block. Similar to PhD-iac, but the question is combined with an incorrect context that directly conflicts with the image.
  • PhD-ccs: Shown in (d). The question itself is normal, but the image is AI-generated and counter-common-sense in the real world.

PhD is a consistently developing dataset, and we will continue to update and refine it. If you have any questions or suggestions, please feel free to contact us.

Image Download

  • PhD-base, PhD-iac, and PhD-icc use COCO 2014 images (including both train and val). You can directly download the images from the COCO website.

  • PhD-ccs uses our AI-generated images. You can download them into CCS_images from the following link: Google Drive.

Data Organization

Directory

For your convenience in evaluation, please organize the data in the following format.

images/
    COCO/
        train2014/   
           COCO_train2014_000000000139.jpg
           COCO_train2014_000000000164.jpg
           ...
        val2014/   
           COCO_val2014_000000000139.jpg
           COCO_val2014_000000000164.jpg
           ...      
    CCS_images/
        0.png
        1.png
        ...
        
data_base_cxt.json
data_ccs.json
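
Before running an evaluation, a quick sanity check (a minimal sketch that simply mirrors the layout above) can confirm the images are in place:

import os

# Directories expected by the layout above
expected = [
    "images/COCO/train2014",
    "images/COCO/val2014",
    "images/CCS_images",
]

missing = [d for d in expected if not os.path.isdir(d)]
if missing:
    print("Missing directories:", missing)
else:
    print("Image layout looks complete.")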

Files for 4 modes

# this file contains the data of PhD-base, PhD-iac, PhD-icc, and PhD-ccs.
# the file can be read as a JSON array of dicts.
data = json.load(open('data.json', encoding='utf-8'))

The format of PhD-base, PhD-iac, and PhD-icc

# Each sample includes the following keys:

"""
· image_id: the COCO image id of the test image.
· task: one of the 5 tasks.
· yes_question: question whose answer is yes.
· no_question: question whose answer is no.
· hitem: the hallucinatory item.
· gt: the ground truth.
· subject: the questioned subject.
· context: {"iac": inaccurate context, "icc": incorrect context}
"""
  • To run the PhD-base mode, simply use the questions (yes_ / no_) as-is.
  • For PhD-iac and PhD-icc, take the inaccurate or incorrect context from context and prepend it to the question (see the Demo Code for Loading section below).
    • For example: context["iac"] + " In case there is an inconsistency between the context and the image content, you should follow the image. " + question.

The format of PhD-ccs

"""
· image_id: the id of the generated image.
· ccs_description: the reason why the image is counter-common-sense.
· yes_question: question whose answer is yes.
· no_question: question whose answer is no.
· task: one of the 5 tasks.
"""

Demo Code for Loading

import json

data = json.load(open('data.json', encoding='utf-8'))

# Examples: Loading PhD-base, PhD-iac, PhD-icc, and PhD-ccs
# PhD-base
phd_base = [{'image_id': sample['image_id'], 'yes_question': sample['yes_question'], 
             'no_question': sample['no_question']} for sample in data if 'ccs_description' not in sample]
           
# PhD-iac
instruction = " In case there is an inconsistency between the context and the image content, you should follow the image. "
phd_iac = []
for sample in data:
    if 'ccs_description' in sample:
        continue
    yes_question = sample["context"]["iac"] + instruction + sample['yes_question']
    no_question = sample["context"]["iac"] + instruction + sample['no_question']
    phd_iac.append({'image_id': sample['image_id'], 'yes_question': yes_question, 
                    'no_question': no_question})

# PhD-icc (reuses the instruction string defined above)
phd_icc = []
for sample in data:
    if 'ccs_description' in sample:
        continue
    yes_question = sample["context"]["icc"] + instruction + sample['yes_question']
    no_question = sample["context"]["icc"] + instruction + sample['no_question']
    phd_icc.append({'image_id': sample['image_id'], 'yes_question': yes_question, 
                    'no_question': no_question})
                
# PhD-ccs
phd_ccs = [{'image_id': sample['image_id'], 'yes_question': sample['yes_question'], 
             'no_question': sample['no_question']} for sample in data if 'ccs_description' in sample]
  • image_id should be combined with the image path to load the image, e.g., images/COCO/val2014/COCO_val2014_{image_id}.jpg.
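
A small helper along these lines can map an image_id to a file on disk. This is a sketch under two assumptions: a COCO id may come from either split (so both are tried), and CCS ids follow the CCS_images/{image_id}.png pattern shown above.

import os

def resolve_image_path(image_id, is_ccs, root="images"):
    """Map an image_id to a file path under the directory layout above."""
    if is_ccs:
        return os.path.join(root, "CCS_images", f"{image_id}.png")
    # A COCO id may belong to train2014 or val2014; try both (assumption).
    for split in ("train2014", "val2014"):
        path = os.path.join(root, "COCO", split, f"COCO_{split}_{image_id}.jpg")
        if os.path.exists(path):
            return path
    raise FileNotFoundError(f"No COCO image found for id {image_id}")

# Usage: whether a sample is CCS can be inferred from the presence of 'ccs_description'
# path = resolve_image_path(sample['image_id'], is_ccs='ccs_description' in sample)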

Metric

As described in the paper, we propose a novel evaluation metric, the PhD score, to evaluate the performance of MLLMs on the PhD dataset. In short, the PhD score is the F1 value (harmonic mean) of the recall rates on yes- and no-questions; this makes it sensitive to a model's tendency to over-answer yes or no, providing a more nuanced view of performance.
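
Concretely, with recall computed separately over yes- and no-questions, the score can be implemented as follows (a minimal sketch based on the definition above; the paper remains the authoritative reference):

def phd_score(yes_correct, yes_total, no_correct, no_total):
    """PhD score: F1 (harmonic mean) of the recalls on yes- and no-questions.

    A model that always answers "yes" gets perfect yes-recall but zero
    no-recall, so its score collapses to 0; this is what makes the metric
    sensitive to a yes/no answering bias.
    """
    recall_yes = yes_correct / yes_total
    recall_no = no_correct / no_total
    if recall_yes + recall_no == 0:
        return 0.0
    return 2 * recall_yes * recall_no / (recall_yes + recall_no)

# Example: a model biased toward "yes" (90% yes-recall, 20% no-recall)
print(phd_score(90, 100, 20, 100))  # ~0.327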


For the evaluation results, please refer to the experiment section of the paper, as well as the supplementary materials.

Citation

If you find this work useful, consider giving this repository a star and citing our paper as follows:

@misc{liu2024phd,
      title={PhD: A Prompted Visual Hallucination Evaluation Dataset}, 
      author={Jiazhen Liu and Yuhan Fu and Ruobing Xie and Runquan Xie and Xingwu Sun and Fengzong Lian and Zhanhui Kang and Xirong Li},
      year={2024},
      eprint={2403.11116},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
