This code implements the prediction of visual scanpaths along with their corresponding natural language explanations in three different tasks (three different datasets) with two different architectures:
- Free-viewing: the prediction of scanpaths when looking at salient or important objects in the given image. (OSIE)
- Visual Question Answering: the prediction of scanpaths while humans perform general tasks, e.g., visual question answering, to reflect their attending and reasoning processes. (AiR-D)
- Visual search: the prediction of scanpaths during the search for a given target object, to reflect goal-directed behavior under target-present and target-absent conditions. (COCO-Search18 Target-Present and Target-Absent)
[2024/07] GazeXplain code and datasets initially released.
We introduce GazeXplain, a novel scanpath explanation task to understand human visual attention. We provide ground-truth explanations on various eye-tracking datasets and develop a model architecture for predicting scanpaths and generating natural language explanations.
This example reveals how observers strategically investigate a scene to find out whether the person is walking on the sidewalk. Fixations (circles) start centrally, locate a driving car, then shift to the sidewalk to find the person, and finally move down to confirm whether the person is walking. By annotating observers' scanpaths with detailed explanations, we enable a deeper understanding of the what and why behind fixations, providing insights into human decision-making and task performance.
For the ScanMatch evaluation metric, we adopt part of the GazeParser package.
We adopt the implementations of SED and STDE from VAME as two of our evaluation metrics; both are described in Visual Attention Models.
More specifically, we adopt the evaluation metrics provided in Scanpath and Gazeformer.
Our checkpoint implementation is based on the one from updown-baseline, slightly modified to accommodate our pipeline.
- Python 3.10
- PyTorch 2.1.2 (along with torchvision)
- We also provide the conda environment file environment.yml. You can directly run
$ conda env create -f environment.yml
to create the same environment in which we successfully ran our code.
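After creating the environment, a quick sanity check can confirm that the interpreter and PyTorch versions match the ones listed above. The following is a minimal sketch, not part of the repository: the function name check_environment is hypothetical, and the comparison simply tests the version strings stated in the requirements.

```python
import importlib.util
import sys

# Versions stated in the requirements list above.
EXPECTED_PYTHON = (3, 10)
EXPECTED_TORCH = "2.1.2"


def check_environment():
    """Return a list of human-readable mismatches (empty if everything matches)."""
    problems = []
    if sys.version_info[:2] != EXPECTED_PYTHON:
        problems.append(
            f"Python {sys.version_info.major}.{sys.version_info.minor} found, "
            f"expected {EXPECTED_PYTHON[0]}.{EXPECTED_PYTHON[1]}"
        )
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch

        # torch.__version__ may carry a build suffix such as "+cu121".
        if not torch.__version__.startswith(EXPECTED_TORCH):
            problems.append(
                f"PyTorch {torch.__version__} found, expected {EXPECTED_TORCH}"
            )
    if importlib.util.find_spec("torchvision") is None:
        problems.append("torchvision is not installed")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["Environment matches the stated requirements."]:
        print(line)
```

Other versions may well work; the check only reports differences from the configuration stated above.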
Our GazeXplain dataset is released! You can download the dataset from Link.
This dataset contains the explanations of visual scanpaths in three different scanpath datasets (OSIE, AiR-D, COCO-Search18).
To process the data, you can follow the instructions provided in Scanpath and Gazeformer.
For handling the SS cluster, you can refer to Gazeformer and Target-absent-Human-Attention.
More specifically, you can run the following scripts to process the data.
$ python ./src/preprocess/${dataset}/preprocess_fixations.py
$ python ./src/preprocess/${dataset}/feature_extractor.py
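If several datasets need to be processed, the two commands above can be generated in a loop. The sketch below is illustrative only: build_preprocess_commands is a hypothetical helper, and the dataset folder names must be replaced with the actual directories under ./src/preprocess/, which are not listed here.

```python
import shlex
import sys

# Script names come from the commands above; dataset folder names are placeholders.
PREPROCESS_SCRIPTS = ("preprocess_fixations.py", "feature_extractor.py")


def build_preprocess_commands(datasets, python=sys.executable):
    """Return one argv list per (dataset, script) pair, in the order shown above."""
    commands = []
    for dataset in datasets:
        for script in PREPROCESS_SCRIPTS:
            commands.append([python, f"./src/preprocess/{dataset}/{script}"])
    return commands


if __name__ == "__main__":
    # "<dataset>" is a placeholder, not a real folder name.
    for argv in build_preprocess_commands(["<dataset>"]):
        print(shlex.join(argv))
        # To actually run a step: subprocess.run(argv, check=True)
```

Printing the commands first makes it easy to verify the paths before running anything; fixation preprocessing is kept ahead of feature extraction, matching the order given above.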
We structure <dataset_root> as follows
We set all the corresponding hyper-parameters in opt.py.
The train_explanation_alignment.py script dumps checkpoints into the folder specified by --log_root (default = ./runs/). You can also set the other hyper-parameters in opt.py or define them in bash/train.sh.
- --datasets: Folder to the dataset, e.g., <dataset_root>.
- --epoch: The number of total epochs.
- --start_rl_epoch: The epoch at which reinforcement learning starts to be used.
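To illustrate how these flags fit together, here is a minimal argparse sketch. It is not the actual opt.py: the argument types, the choice to make --datasets required, and the example values 10 and 5 are all assumptions made for illustration. Only the flag names and the ./runs/ default come from the description above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags described above; not the actual opt.py."""
    parser = argparse.ArgumentParser(description="Sketch of the training options")
    # Making this flag required is an assumption.
    parser.add_argument("--datasets", type=str, required=True,
                        help="Folder to the dataset, e.g. <dataset_root>")
    parser.add_argument("--log_root", type=str, default="./runs/",
                        help="Folder where checkpoints are dumped")
    # The integer type and the absence of defaults are assumptions.
    parser.add_argument("--epoch", type=int,
                        help="Number of total epochs")
    parser.add_argument("--start_rl_epoch", type=int,
                        help="Epoch at which reinforcement learning starts")
    return parser


if __name__ == "__main__":
    # Example values are arbitrary and chosen only for the demonstration.
    args = build_parser().parse_args(
        ["--datasets", "<dataset_root>", "--epoch", "10", "--start_rl_epoch", "5"]
    )
    print(args)
```

In this sketch, reinforcement learning would begin halfway through training; consult opt.py for the real defaults.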
You can use the following command to train your own network; the trained model can then be evaluated on the test split.
$ sh bash/train.sh
For inference, we provide the pretrained model, and you can directly run the following command to evaluate the performance of the pretrained model on the test split.
$ sh bash/test.sh
If you use our code or data, please cite our paper:
@inproceedings{xianyu:2024:gazexplain,
author = {Xianyu Chen and Ming Jiang and Qi Zhao},
title = {GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024}
}