Code for the paper - "YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models"
We are thrilled to announce that our paper has been accepted to the EMNLP 2024 Main Proceedings as a Long Paper! 🎉
Abstract:
Understanding satire and humor is a challenging task even for current Vision-Language models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being satirical), and Completion (given one half of the image, selecting the other half from two given options such that the complete image is satirical), and release YesBut, a high-quality dataset of 2547 images (1084 satirical and 1463 non-satirical) in different artistic styles, to evaluate these tasks. Each satirical image in the dataset depicts a normal scenario, along with a conflicting scenario which is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut dataset in zero-shot settings, with respect to both automated and human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research.
Check out https://yesbut-dataset.github.io/ for all details and resources on the work!
Our paper is now live and accessible on:
- arXiv: Read it here
- HuggingFace 🤗: Check it out here (40+ upvotes already!)

If you find the paper interesting, don't forget to give it an upvote!
Check out our fun and engaging video explaining the paper in simple terms! Watch it on YouTube:
📺 Watch Video
The YesBut Dataset is available on HuggingFace Datasets! 🤗 Get it here!
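Once you have the dataset ID from the link above, loading it should look like the following minimal sketch. The ID, split name, and feature name below are placeholders, not confirmed values; substitute the actual ones from the Hub page.

```python
from datasets import load_dataset

# Placeholder dataset ID -- replace with the actual ID from the
# HuggingFace Datasets link above.
ds = load_dataset("<hf-username>/yesbut")

print(ds)                 # inspect available splits and features
sample = ds["train"][0]   # assumes a "train" split exists
sample["image"].show()    # assumes an "image" feature of type Image
```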
Stay tuned for updates!

You can also download the dataset via the Google Drive links below ⬇️ (a scripted-download sketch follows the list):
- https://drive.google.com/file/d/1s5K0FlUOKUKknhKh9runmjDKouIAVxwM/view?usp=sharing - contains the 283 images manually downloaded (and then manually filtered) from the posts of the 'X' (formerly Twitter) handle @_yesbut_.
- https://drive.google.com/file/d/1fHthLYNfcRFE4wEyWCMOUZHRVw3_ctNB/view?usp=sharing - Satirical Images annotated in Stage 3
- https://drive.google.com/file/d/1Tzs4OcEJK469myApGqOUKPQNUtVyTRDy/view?usp=sharing - Non-Satirical Images annotated in Stage 3
- https://drive.google.com/file/d/1YhXMEEiZnuv_VxORtEBR7JR3guhLBFjy/view?usp=sharing - Satirical Images annotated in Stage 4
- https://drive.google.com/file/d/1i4Fy01uBZ_2YGPzyVArZjijleNbt8xRu/view?usp=sharing - Non-Satirical Images annotated in Stage 4
- https://drive.google.com/file/d/1YcikUqusUp_Lj0Y11-GiaZYaZ2c2qgCf/view?usp=sharing - 119 Satirical Real Photographs following the 'Yes, But' Theme.
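If you prefer scripted downloads, here is a minimal sketch using the third-party gdown package. The file IDs are taken from the Drive links above; the local output filenames are our own illustrative labels, not the actual archive names.

```python
# pip install gdown
import gdown

# File IDs extracted from the Google Drive links above.
# Output filenames are illustrative labels only.
drive_files = {
    "twitter_posts.zip": "1s5K0FlUOKUKknhKh9runmjDKouIAVxwM",
    "stage3_satirical.zip": "1fHthLYNfcRFE4wEyWCMOUZHRVw3_ctNB",
    "stage3_non_satirical.zip": "1Tzs4OcEJK469myApGqOUKPQNUtVyTRDy",
    "stage4_satirical.zip": "1YhXMEEiZnuv_VxORtEBR7JR3guhLBFjy",
    "stage4_non_satirical.zip": "1i4Fy01uBZ_2YGPzyVArZjijleNbt8xRu",
    "real_photographs.zip": "1YcikUqusUp_Lj0Y11-GiaZYaZ2c2qgCf",
}

for filename, file_id in drive_files.items():
    gdown.download(id=file_id, output=filename, quiet=False)
```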
YesBut_Stage2_Annotation.csv - Second Stage Annotation results for YesBut

Links for running the SOTA VL Models:
- LLaVA - https://github.com/haotian-liu/LLaVA
- MiniGPT4 - https://github.com/Vision-CAIR/MiniGPT-4
- Kosmos-2 - https://github.com/microsoft/unilm/tree/master/kosmos-2
- GPT4 - https://platform.openai.com/docs/guides/vision (We use the gpt-4-vision-preview API; see the sketch after this list)
- Gemini - https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/gemini#gemini-1.0-pro-vision
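As an illustration of a zero-shot query, here is a minimal sketch of Satirical Image Detection against the gpt-4-vision-preview endpoint. The prompt wording is ours for illustration, not the exact prompt used in the paper.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_satirical(image_path: str) -> str:
    # Encode the local image as a base64 data URL.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                # Illustrative prompt -- not the exact wording from the paper.
                {"type": "text", "text": "Is this image satirical? Answer Yes or No."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=10,
    )
    return response.choices[0].message.content

print(is_satirical("example.jpg"))
```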
Links for calculating evaluation metric values for the Satirical Understanding Task (a combined usage sketch follows the list):
- BLEU - https://huggingface.co/spaces/evaluate-metric/bleu
- METEOR - https://huggingface.co/spaces/evaluate-metric/meteor
- ROUGE-L - https://pypi.org/project/py-rouge/ (F1-Score)
- BERTScore - https://github.com/Tiiiger/bert_score
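Putting the four metrics together, a minimal sketch using the libraries linked above. The example strings are made up, and the exact evaluation settings in the paper may differ.

```python
# pip install evaluate nltk py-rouge bert-score
import evaluate
import rouge
from bert_score import score as bert_score

# Made-up example pair of a model-generated and a reference description.
generated = ["a man accepts a party invite but spends the evening on his phone"]
references = ["a person happily accepts an invitation, yet ends up ignoring everyone"]

# BLEU and METEOR via HuggingFace evaluate
bleu = evaluate.load("bleu").compute(
    predictions=generated, references=[[r] for r in references])
meteor = evaluate.load("meteor").compute(
    predictions=generated, references=references)

# ROUGE-L F1 via py-rouge (averaged over the corpus)
rouge_l = rouge.Rouge(metrics=["rouge-l"]).get_scores(generated, references)

# BERTScore F1
_, _, f1 = bert_score(generated, references, lang="en")

print(bleu["bleu"], meteor["meteor"], rouge_l["rouge-l"]["f"], f1.mean().item())
```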
generate_using_dalle3.ipynb - contains the code for generating images using DALL-E 3
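The notebook itself is not reproduced here, but generating an image with DALL-E 3 via the OpenAI images API looks roughly like the sketch below; the prompt is illustrative, so see the notebook for the actual prompts used.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt -- see generate_using_dalle3.ipynb for the real ones.
result = client.images.generate(
    model="dall-e-3",
    prompt="A two-panel 'Yes, But' style satirical illustration of everyday life",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # URL of the generated image
```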
Human_Eval_Results.csv - contains Human Evaluation outcomes (majority vote per sample) of the Satirical Understanding Task on 30 images (10 images randomly sampled from each of Annotation Stages 2, 3, and 4), along with the corresponding human-written as well as model-generated (by 5 SOTA VL Models) overall image descriptions. Some columns and column headers are elaborated below; a parsing sketch follows the list.
- image_filename - corresponds to images present in the human_eval_images folder
- For columns related to Appropriate Length, Correctness, Faithfulness, and Visual Completeness, a blank value means that the annotator does not think that the corresponding aspect is followed by the description, and vice versa.
- order_of_overall_img_descriptions - this column contains the list of models (or whether the description is human-written) to which the 6 descriptions correspond, in that order. The values mean the following:
  - humanannotation - Human-Written Description
  - minigpt - Description generated using MiniGPT4
  - kosmos - Description generated using Kosmos-2
  - llava - Description generated using LLaVA
  - gpt4vision - Description generated using GPT4
  - gemini - Description generated using Gemini
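A minimal sketch for reading the file with pandas. It assumes order_of_overall_img_descriptions stores a stringified Python list and that the aspect column headers match the names listed above; both are assumptions, so adjust to the actual CSV layout.

```python
import ast
import pandas as pd

df = pd.read_csv("Human_Eval_Results.csv")

# Map the 6 descriptions in a row back to their sources. Assumes the
# column holds a stringified list like "['humanannotation', 'minigpt', ...]".
row = df.iloc[0]
order = ast.literal_eval(row["order_of_overall_img_descriptions"])
print(row["image_filename"], order)

# Blank cells mean the aspect was judged NOT followed, so the fraction of
# non-blank cells gives a per-aspect pass rate. Column names are assumed.
for aspect in ["Appropriate Length", "Correctness", "Faithfulness", "Visual Completeness"]:
    if aspect in df.columns:
        print(aspect, df[aspect].notna().mean())
```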