This is a repository for the following models and data:
- SCOUT and SCOUT+ models for task- and context-aware driver gaze prediction;
- corrected and annotated ground truth for DR(eye)VE dataset;
- extra annotations for drivers' actions and context for DR(eye)VE, BDD-A, and LBW datasets.
More information can be found in these papers:
- I. Kotseruba, J.K. Tsotsos, "Understanding and Modeling the Effects of Task and Context on Drivers' Gaze Allocation", IV, 2024. paper | arXiv.
- I. Kotseruba, J.K. Tsotsos, "SCOUT+: Towards practical task-driven drivers’ gaze prediction", IV, 2024. paper | arXiv
- I. Kotseruba, J.K. Tsotsos, "Data limitations for modeling top-down effects on drivers’ attention", IV, 2024. paper | arXiv
Contents:
- SCOUT model description
- SCOUT+ model description
- Annotations for DR(eye)VE, BDD-A, and LBW datasets
- Installation and running instructions
- Citation
SCOUT is a model for drivers' gaze prediction that uses task and context information (represented as a set of numeric values and text labels) to modulate the output of the model, simulating top-down attentional mechanisms.
Since the model is aware of the driver's actions and the context, it can anticipate maneuvers and attend to the relevant elements of the context, unlike bottom-up models, which are more reactive. The qualitative results of SCOUT and two state-of-the-art models demonstrate this on scenarios from DR(eye)VE involving maneuvers at intersections. For example, when making turns at unsignalized intersections and during merging, the model correctly identifies intersecting roads and neighbouring lanes, respectively, as areas that the driver should examine.
SCOUT+ is an extension of SCOUT that uses a map and route images instead of task and context labels, which is more similar to the information available to the human driver.
SCOUT+ achieves results similar to SCOUT's without relying on precise labels and vehicle sensor information.
The `extra_annotations` folder contains additional annotations for the datasets, as described below.
`extra_annotations/DReyeVE/gaze_data` contains `.txt` files (one for each video in the dataset) with the following columns:
- `frame_etg` - frame index of the eye-tracking glasses (ETG) video;
- `frame_gar` - frame index of the rooftop camera (GAR) video;
- `X`, `Y` - gaze coordinates in the ETG video;
- `X_gar`, `Y_gar` - gaze coordinates in the GAR video;
- `event_type` - type of data point: fixation, saccade, blink, or error;
- `code` - timestamp;
- `loc` - text labels for gaze location: scene (windshield), in-vehicle (with subcategories, such as speedometer, dashboard, passenger, mirrors, etc.), out-of-frame (gaze is out of the GAR camera view), and NA (for blinks, saccades, and errors).
Note that in these files, ETG and GAR videos are temporally realigned. As a result, the correspondences between ETG and GAR frame indices are different from the original files supplied with DR(eye)VE. We recomputed all homographies between pairs of ETG and GAR frames (available here) and manually corrected all outliers.
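As a minimal sketch, one of these files can be loaded and filtered with pandas; the file name and the whitespace delimiter below are assumptions, so check the actual files for the exact naming and format:

```python
import pandas as pd

# Load gaze data for one DR(eye)VE video (path and delimiter are assumptions;
# inspect the actual .txt files to confirm the format).
gaze = pd.read_csv("extra_annotations/DReyeVE/gaze_data/01.txt", sep=r"\s+")

# Keep only fixations that fall on the scene (windshield), dropping
# saccades, blinks, errors, and in-vehicle glances.
scene_fix = gaze[(gaze["event_type"] == "fixation") & (gaze["loc"] == "scene")]

# Gaze coordinates in the rooftop (GAR) camera frame.
print(scene_fix[["frame_gar", "X_gar", "Y_gar"]].head())
```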
To generate the new saliency maps for DR(eye)VE, we did the following:
- filtered out saccades, blinks, and fixations to the car interior;
- pushed fixations outside of the scene frame bounds to the image boundary to preserve the direction and elevation of the drivers' gaze;
- re-aggregated fixations over a 1 s interval (±12 frames) around each frame using a motion-compensated saliency method based on optical flow.
For more details, see `scripts/DReyeVE_ground_truth`.
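As an illustration of the second step above, out-of-frame fixations can be clamped to the image border with a couple of lines of NumPy; this is only a sketch, and the 1920×1080 frame size used here is an assumption:

```python
import numpy as np

def clamp_to_frame(xy, width=1920, height=1080):
    """Push fixations that fall outside the scene frame to the nearest
    image border. The frame size is an assumption; use the actual GAR
    video resolution."""
    xy = np.asarray(xy, dtype=float)
    xy[:, 0] = np.clip(xy[:, 0], 0, width - 1)   # horizontal coordinate
    xy[:, 1] = np.clip(xy[:, 1], 0, height - 1)  # vertical coordinate
    return xy

# Example: fixations far outside the frame are moved to the border.
print(clamp_to_frame([[-150.0, 540.0], [2500.0, -30.0]]))
```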
SCOUT+ uses street graphs from OpenStreetMap and valhalla to map-match the GPS coordinates to the street network. See `scripts/maps/README.md` for more details.
- `extra_annotations/BDD-A/video_labels.xlsx` contains video-level labels indicating the recording time, time of day, location, weather, and quality issues;
- `extra_annotations/BDD-A/exclude_videos.json` is a list of videos that are excluded from training/evaluation due to missing data or recording quality issues;
- `extra_annotations/BDD-A/vehicle_data` contains Excel spreadsheets with GPS and heading data, as well as annotations for maneuvers and intersections (see the next section);
- `extra_annotations/BDD-A/route_maps` contains `.png` images of OpenStreetMap maps of the local area around the route recorded in each video. See `scripts/maps/README.md` for more details.
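A minimal sketch of combining these files to build a filtered list of BDD-A videos; the `video` column name and the assumption that the JSON holds a flat list of video names are ours, not guaranteed by the files:

```python
import json
import pandas as pd

# Video-level labels and the exclusion list described above.
labels = pd.read_excel("extra_annotations/BDD-A/video_labels.xlsx")
with open("extra_annotations/BDD-A/exclude_videos.json") as f:
    excluded = set(json.load(f))

# Keep only videos that are not excluded (the 'video' column name is an assumption).
usable = labels[~labels["video"].isin(excluded)]
print(f"{len(usable)} of {len(labels)} BDD-A videos are usable")
```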
- `extra_annotations/LBW/video_labels.xlsx` contains video-level labels indicating the time of day, location, and weather for each video;
- `extra_annotations/LBW/train_test.json` is the train/val/test split used in our experiments;
- `extra_annotations/LBW/gaze_data` is a set of Excel spreadsheets with gaze information with the following fields:
  - `subj_id`, `vid_id`, `frame_id` - subject, video, and frame ids;
  - `segm_id` - segment id (in LBW, some frames are missing; frames with consecutive ids belong to the same segment);
  - `X`, `Y` - gaze location in the image plane;
  - left and right eye coordinates in 2D and 3D.
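A sketch of how the LBW split and gaze spreadsheets might be loaded; the spreadsheet file name below is a placeholder, and the structure of the JSON split is an assumption:

```python
import json
import pandas as pd

# Train/val/test split used in the experiments (structure is an assumption).
with open("extra_annotations/LBW/train_test.json") as f:
    split = json.load(f)
print(split.keys() if isinstance(split, dict) else len(split))

# Gaze data for one video (the file name is a placeholder).
gaze = pd.read_excel("extra_annotations/LBW/gaze_data/subj_01_vid_01.xlsx")

# Frames with consecutive ids share a segm_id; group them into segments.
for segm_id, segment in gaze.groupby("segm_id"):
    print(segm_id, segment["frame_id"].min(), segment["frame_id"].max())
```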
We used a combination of processing and manual labeling to identify maneuvers (lane changes and turns) and intersections for each route. This information has been added to the vehicle data for each video in every dataset.
We converted the vehicle information in BDD-A to match the format of DR(eye)VE. Since LBW does not provide vehicle data, it was approximated and is saved in the same format.
Task and context annotations are saved with the vehicle data in `extra_annotations/<dataset>/vehicle_data`, which contains Excel spreadsheets (one for each video in the dataset) with the following columns:
- `frame` - frame id;
- `speed` - ego-vehicle speed (km/h);
- `acc` - ego-vehicle acceleration (m/s²) derived from speed;
- `course` - ego-vehicle heading;
- `lat`, `lon` - original GPS coordinates;
- `lat_m`, `lon_m` - map-matched GPS coordinates;
- `lat action` - labels for lateral actions (left/right turn, left/right lane change, U-turn);
- `context` - type of intersection (signalized, unsignalized), ego-vehicle priority (right-of-way, yield), and starting frame (the frame where the driver first looked towards the intersection). These three values are separated by semicolons, e.g. `unsignalized;right-of-way;1731`.
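For instance, the `context` column can be split into its three parts after loading a vehicle-data spreadsheet; the file path below is a placeholder:

```python
import pandas as pd

# Vehicle data for one video (the file name is a placeholder).
veh = pd.read_excel("extra_annotations/DReyeVE/vehicle_data/01.xlsx")

# Split 'context' into intersection type, ego-vehicle priority, and starting
# frame, e.g. "unsignalized;right-of-way;1731".
ctx = veh["context"].dropna().str.split(";", expand=True)
ctx.columns = ["intersection", "priority", "start_frame"]
ctx["start_frame"] = ctx["start_frame"].astype(int)
print(ctx.head())
```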
Utility functions for DR(eye)VE, LBW, BDD-A, and MAAD allow printing various dataset statistics and creating data structures for evaluation.
There are also functions to convert gaze and vehicle data from the different datasets to the common format described above.
See `data_utils/README.md` for more information.
- Download the DR(eye)VE dataset following the instructions on the official webpage.
- Download the BDD-A dataset following the instructions on the official webpage.
- Create environment variables `DREYEVE_PATH` for DR(eye)VE and `BDDA_PATH` for BDD-A (e.g. add the line `export DREYEVE_PATH=/path/to/dreyeve/` to the `~/.bashrc` file).
- Extract frames from DR(eye)VE or BDD-A (see the `scripts` folder; requires ffmpeg). A minimal sketch of this step is shown after this list.
- Download the new ground truth from here and extract the archives inside `extra_annotations/DReyeVE/new_ground_truth/`. Copy the new ground truth to the DR(eye)VE dataset using `scripts/copy_DReyeVE_gt.sh`.
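A minimal sketch of the frame-extraction step, calling ffmpeg from Python; the repository's own scripts should be preferred, and the recursive search pattern and video file name used here are assumptions about the DR(eye)VE layout:

```python
import os
import subprocess
from pathlib import Path

# Extract frames for every GAR video found under the dataset root.
# The "video_garmin.avi" name and directory layout are assumptions.
dreyeve_root = Path(os.environ["DREYEVE_PATH"])
for video in sorted(dreyeve_root.glob("**/video_garmin.avi")):
    out_dir = video.parent / "frames"
    out_dir.mkdir(exist_ok=True)
    # Dump all frames as numbered JPEGs (000001.jpg, 000002.jpg, ...).
    subprocess.run(
        ["ffmpeg", "-i", str(video), str(out_dir / "%06d.jpg")],
        check=True,
    )
```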
The instructions below use Docker. To build the container, run the `docker/build_docker.py` script in the `docker` folder.
Update the paths to the datasets (DR(eye)VE or BDD-A), `extra_annotations`, and the SCOUT code folders in the `docker/run_docker.py` script, then run it: `docker/run_docker.py`. Note: see the comments in the script for available command-line options.
If you prefer not to use Docker, the dependencies are listed in `docker/requirements.txt`.
To use the pretrained Video Swin Transformer, download the pretrained weights by running `download_weights.sh` inside the `pretrained_weights` folder.
To train the model, run the following inside the Docker container:
`python3 train.py --config <config_yaml> --save_dir <save_dir>`
`--save_dir` is the path where the trained model and results will be saved; if it is not provided, a directory with the current datetime stamp is created automatically.
See the comments in `configs/SCOUT.yaml` and `configs/SCOUT+.yaml` for the available model parameters.
To test a trained model, run:
`python3 test.py --config_dir <path_to_dir> --evaluate --save_images`
- `--config_dir` is the path to the trained model directory, which must contain the config file and checkpoints;
- `--evaluate` - if this option is specified, predictions for the best checkpoint are evaluated and the results are saved to an Excel file in the provided `config_dir` folder;
- `--save_images` - if this option is specified, the predicted saliency maps are saved to the `config_dir/results/` folder.
The following pretrained weights are available here:
- SCOUT (with task) trained on DR(eye)VE or BDD-A
- SCOUT+ (with map) trained on DR(eye)VE or BDD-A
To use the pretrained weights, download them and place them in `train_runs/best_model/`.
The implementation of KL divergence in the DR(eye)VE metrics code produces incorrect results. The script `test_saliency_metrics.py` demonstrates the discrepancies between the DR(eye)VE implementation and two other KL divergence implementations. For evaluating SCOUT and other models, we follow the Fahimi & Bruce implementation. See also the supplementary materials for more details.
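For reference, below is a minimal KL divergence sketch in the spirit of common saliency-metric code (our own simplified version, not the exact Fahimi & Bruce code): both maps are normalized to probability distributions and a small epsilon guards against log(0) and division by zero.

```python
import numpy as np

def kl_divergence(saliency_map, gt_saliency_map, eps=2.2204e-16):
    """KL divergence between a ground-truth saliency map and a prediction.
    Both maps are normalized to sum to 1; eps avoids log(0) and division
    by zero. A sketch of a common formulation, not the exact code used
    in this repository."""
    p = saliency_map.astype(np.float64)
    q = gt_saliency_map.astype(np.float64)
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return np.sum(q * np.log(eps + q / (p + eps)))

# Lower is better; identical maps give a value close to zero.
pred = np.random.rand(36, 64)
gt = pred.copy()
print(kl_divergence(pred, gt))
```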
If you use the models or data from this repository, please consider citing the following papers:
@inproceedings{2024_IV_SCOUT,
author = {Kotseruba, Iuliia and Tsotsos, John K.},
title = {Understanding and modeling the effects of task and context on drivers' gaze allocation},
booktitle = {IV},
year = {2024}
}
@inproceedings{2024_IV_SCOUT+,
author = {Kotseruba, Iuliia and Tsotsos, John K.},
title = {{SCOUT+: Towards practical task-driven drivers’ gaze prediction}},
booktitle = {IV},
year = {2024}
}
@inproceedings{2024_IV_data,
author = {Kotseruba, Iuliia and Tsotsos, John K.},
title = {Data limitations for modeling top-down effects on drivers’ attention},
booktitle = {IV},
year = {2024}
}
References for the DR(eye)VE, BDD-A, MAAD, and LBW datasets:
@article{2018_PAMI_Palazzi,
author = {Palazzi, Andrea and Abati, Davide and Calderara, Simone and Solera, Francesco and Cucchiara, Rita},
title = {{Predicting the driver's focus of attention: The DR(eye)VE Project}},
journal = {IEEE TPAMI},
volume = {41},
number = {7},
pages = {1720--1733},
year = {2018}
}
@inproceedings{2018_ACCV_Xia,
author = {Xia, Ye and Zhang, Danqing and Kim, Jinkyu and Nakayama, Ken and Zipser, Karl and Whitney, David},
title = {Predicting driver attention in critical situations},
booktitle = {ACCV},
pages = {658--674},
year = {2018}
}
@inproceedings{2021_ICCVW_Gopinath,
author = {Gopinath, Deepak and Rosman, Guy and Stent, Simon and Terahata, Katsuya and Fletcher, Luke and Argall, Brenna and Leonard, John},
title = {{MAAD: A Model and Dataset for ``Attended Awareness'' in Driving}},
booktitle = {ICCVW},
pages = {3426--3436},
year = {2021}
}
@inproceedings{2022_ECCV_Kasahara,
author = {Kasahara, Isaac and Stent, Simon and Park, Hyun Soo},
title = {{Look Both Ways: Self-supervising driver gaze estimation and road scene saliency}},
booktitle = {ECCV},
pages = {126--142},
year = {2022}
}