Vision-Language Model-based PolyFormer for Recognizing Visual Questions with Multiple Answer Groundings

Introduction

This paper presents a method that combines the Vision-and-Language Transformer (ViLT) with the PolyFormer model to tackle the Single Answer Grounding Challenge on the VizWiz VQA-AnswerTherapy dataset. First, ViLT takes the input question and image and predicts the number of unique answers. PolyFormer then takes ViLT's output together with the image and produces a visual answer mask for each answer. If these masks overlap, the answers share a common grounding; if they do not, the question has multiple groundings. Our approach achieved an F1 score of 81.71 on the test-dev set and 80.72 on the VizWiz Grand Challenge test set, placing our team among the top 3 submissions in the competition.
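
The overlap-based decision rule can be illustrated with a short sketch. This is not the exact implementation from this repository: the helper name `has_common_grounding`, the mask format, and the zero-overlap threshold are assumptions, and the toy masks stand in for PolyFormer outputs.

```python
import numpy as np

def has_common_grounding(masks, overlap_threshold=0.0):
    """Decide whether binary answer masks share a common grounding.

    masks: list of boolean numpy arrays of shape (H, W), one per unique
    answer (e.g. produced by a referring-segmentation model such as
    PolyFormer). Returns True if all masks overlap, i.e. the question
    has a single answer grounding.
    """
    if len(masks) <= 1:
        return True  # one unique answer trivially has a single grounding
    intersection = np.logical_and.reduce(masks)
    smallest = min(int(m.sum()) for m in masks)
    # Fraction of the smallest mask covered by the common intersection.
    overlap_ratio = intersection.sum() / max(smallest, 1)
    return overlap_ratio > overlap_threshold

# Toy example: overlapping masks imply a single grounding.
m1 = np.zeros((4, 4), dtype=bool); m1[1:3, 1:3] = True
m2 = np.zeros((4, 4), dtype=bool); m2[2:4, 2:4] = True
m3 = np.zeros((4, 4), dtype=bool); m3[0, 0] = True
print(has_common_grounding([m1, m2]))  # True  (masks overlap)
print(has_common_grounding([m1, m3]))  # False (multiple groundings)
```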

Proposed Approach

[Figure: Overview of the proposed pipeline: ViLT predicts the number of unique answers, then PolyFormer generates a grounding mask for each answer.]

Installation

Here is the list of libraries used in this project:
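
The exact dependency list is not reproduced in this snapshot. As a rough, assumed baseline only (not the repository's official requirements), an environment for ViLT and PolyFormer experiments typically includes packages along these lines:

```
torch
torchvision
transformers   # provides the ViLT model and processor
numpy
Pillow
opencv-python
```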

Inference
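
As a minimal sketch of the first stage, the snippet below loads a ViLT question-answering checkpoint from Hugging Face Transformers and runs it on an image-question pair. The public `dandelin/vilt-b32-finetuned-vqa` checkpoint, the image path, and the question are placeholders; the actual pipeline would load this project's fine-tuned answer-count classifier and then pass the predicted answers to PolyFormer for mask generation.

```python
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Placeholder checkpoint: the project would load its own fine-tuned weights.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("example.jpg")       # hypothetical input image
question = "What color is the mug?"     # hypothetical input question

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
predicted = model.config.id2label[logits.argmax(-1).item()]
print(predicted)

# The predicted answer(s) would then be passed, together with the image,
# to PolyFormer to generate one grounding mask per answer, followed by
# the overlap check described in the Introduction.
```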

Contact

If you have any questions, please feel free to contact Dai Tran (daitran@skku.edu).
