The ability to converse with humans and follow natural language commands is crucial for intelligent unmanned aerial vehicles (a.k.a. drones). It can relieve people of the burden of holding a controller all the time, allow multitasking, and make drone control more accessible to people with disabilities or whose hands are occupied. To this end, we introduce Aerial Vision-and-Dialog Navigation (AVDN), the task of navigating a drone via natural language conversation. We build a drone simulator with a continuous photorealistic environment and collect a new AVDN dataset of over 3k recorded navigation trajectories with asynchronous human-human dialogs between commanders and followers. The commander provides the initial navigation instruction and further guidance on request, while the follower navigates the drone in the simulator and asks questions when needed. During data collection, the followers' attention on the drone's visual observation is also recorded. Based on the AVDN dataset, we study the task of aerial navigation from (full) dialog history and propose an effective Human Attention Aided Transformer model (HAA-Transformer), which learns to predict both navigation waypoints and human attention.
Todos:
- Data released
- Train code uploaded
- Inference code uploaded and checkpoint released
- Eval.ai challenge setup
- Dataset format explanation in detail
Based on the AVDN dataset, we are hosting an ICCV 2023 Challenge (co-located with the ICCV 2023 CLVL workshop) for the Aerial Navigation from Dialog History (ANDH) task on Eval.ai: https://eval.ai/web/challenges/challenge-page/2049/overview
Download xView data
Our AVDN dataset uses satellite images from the xView dataset. Follow the instructions at https://challenge.xviewdataset.org/data-download to download the xView dataset.
Then move the xView images into the AVDN directory (assuming the xView images are at ./XVIEW_images):
mkdir -p Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/train_images
cp -r XVIEW_images/*.tif Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/train_images/
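After copying, a quick sanity check can confirm the images landed in the right place. Below is a minimal Python sketch (the path matches the commands above) that simply counts the copied .tif files:

# Minimal sanity check: count the xView .tif images copied above.
from pathlib import Path

image_dir = Path("Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/train_images")
tif_files = sorted(image_dir.glob("*.tif"))
print(f"Found {len(tif_files)} .tif images in {image_dir}")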
Download AVDN datasets (https://sites.google.com/view/aerial-vision-and-dialog/home):
mkdir -p Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/annotations
gdown 1xUHnrYaNGe_IBG7W1ecaf6U2cyuBfYLr -O Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/annotations/train_data.json
gdown 1mtT3AVJQNEbjKkH6aINX3kj7ROADkBET -O Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/annotations/val_seen_data.json
gdown 17fVSHmuB3EFHkfNRZle6kgVcvZcumsJr -O Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/annotations/val_unseen_data.json
gdown 14BijI07ukKCSDh3T_RmUG83z6Oa75M-U -O Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/annotations/test_unseen_data.json
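A detailed dataset format explanation is still a todo (see above); in the meantime, a minimal sketch like the one below can be used to peek at the annotation structure. It assumes each annotation file is a JSON list of records, which is an assumption rather than documented behavior:

# Sketch: load an AVDN annotation file and list the fields of one record.
import json

path = "Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/annotations/train_data.json"
with open(path) as f:
    data = json.load(f)  # assumption: the file is a JSON list of records

print(f"{len(data)} records in {path}")
print("Fields of the first record:", sorted(data[0].keys()))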
Download pre-trained xview-yolov3 weights and configuration file
mkdir -p Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/pretrain_weights
gdown 1Ke-pA5jpq1-fsEwAch_iRCtJHx6rQc-Z -O Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/pretrain_weights/best.pt
gdown 1n6RMWcHAbS6DA7BBug6n5dyN6NPjiPjh -O Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/pretrain_weights/yolo_v3.cfg
Download the training checkpoints corresponding to the experiments in the AVDN paper
mkdir -p Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/et_haa/ckpts/
mkdir -p Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/lstm_haa/ckpts/
gdown 1fA6ckLVA-gsiOmWmOMkqJggTLbiJpFBI -O Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/et_haa/ckpts/best_val_unseen
gdown 1RYjo_vc5m5ZRUcjIFojZjke8RhlfX90I -O Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/lstm_haa/ckpts/best_val_unseen
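Once PyTorch is installed (see the next step), the downloads can be verified with the sketch below, which tries to deserialize each file with torch.load on CPU. Whether each file holds a raw state dict or a wrapped training checkpoint is an assumption; the sketch only inspects the top-level object:

# Sketch: confirm the downloaded weights/checkpoints deserialize correctly.
import torch

paths = [
    "Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/pretrain_weights/best.pt",
    "Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/et_haa/ckpts/best_val_unseen",
    "Aerial-Vision-and-Dialog-Navigation/datasets/AVDN/lstm_haa/ckpts/best_val_unseen",
]
for p in paths:
    obj = torch.load(p, map_location="cpu")  # no GPU needed for this check
    summary = sorted(obj.keys())[:5] if isinstance(obj, dict) else type(obj).__name__
    print(p, "->", summary)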
Install requirements
pip install torch==1.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
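Before launching training, it is worth confirming that the pinned PyTorch build and CUDA are visible; a minimal check:

# Sketch: verify the pinned PyTorch/torchvision builds and CUDA visibility.
import torch
import torchvision

print("torch:", torch.__version__)              # expected: 1.11.0+cu113
print("torchvision:", torchvision.__version__)  # expected: 0.12.0+cu113
print("CUDA available:", torch.cuda.is_available())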
Run training or evaluation:
The script scripts/avdn_paper/run_et_haa.sh includes commands for training and evaluating the Human Attention Aided Transformer (HAA-Transformer) model.
The script scripts/avdn_paper/run_lstm_haa.sh includes commands for training and evaluating the Human Attention Aided LSTM (HAA-LSTM) model.
cd Aerial-Vision-and-Dialog-Navigation/src
# For Human Attention Aided Transformer model
bash scripts/avdn_paper/run_et_haa.sh
# For Human Attention Aided LSTM model
bash scripts/avdn_paper/run_lstm_haa.sh
@inproceedings{fan-etal-2023-aerial,
title = "Aerial Vision-and-Dialog Navigation",
author = "Fan, Yue and
Chen, Winson and
Jiang, Tongzhou and
Zhou, Chun and
Zhang, Yi and
Wang, Xin Eric",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.190",
doi = "10.18653/v1/2023.findings-acl.190",
pages = "3043--3061",
}