Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation

This repository contains:

the implementation of navigation agents for our paper: Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation;
a dataset for pre-training on the outdoor VLN task.

Data

In this project, we use the Touchdown dataset and the StreetLearn dataset. More details regarding these two datasets can be found here.

Our pre-training dataset is built upon StreetLearn. The guiding instructions for the outdoor VLN task are provided in touchdown/datasets/.

To download the panoramas, please refer to Touchdown Dataset and StreetLearn Dataset.

Requirements & Setup

Python 3.6
PyTorch 1.7.0
Texar

We conducted our experiments on Ubuntu 18.04 with Titan RTX GPUs.

Please run the following lines to download the code and install Texar:

git clone https://github.com/VegB/VLN-Transformer/
cd VLN-Transformer/
pip install [--user] -e .  # install Texar
cd touchdown/

Quick Start

Train VLN agent from scratch

Training can be performed with the following command:

python main.py --dataset [DATASET] --img_feat_dir [IMG_DIR] --model [MODEL] --exp_name [EXP_NAME]

DATASET is the dataset for outdoor navigation. This repo currently supports the following three datasets:
- touchdown is a dataset for outdoor VLN whose instructions are written by human annotators;
- manh50 is a subset of StreetLearn whose instructions are generated by the Google Maps API;
- manh50_mask has the same trajectories as manh50, but the instructions are style-modified (the style transfer proposed in our paper).
IMG_DIR contains the encoded panoramas for DATASET. After you get access to the panoramas, please encode them accordingly. Each file in this directory should be a NumPy file [PANO_ID].npy that represents the panorama with the corresponding pano_id. The encoding process is described in the Touchdown paper, Section D.1.
MODEL is the navigation agent: rconcat for RCONCAT or vlntrans for VLN Transformer.
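
Before launching a run, it can help to confirm that IMG_DIR really holds one [PANO_ID].npy file per panorama. The following is a minimal sketch, not part of this repo: the helper name missing_feature_files and the placeholder pano ids are assumptions made for illustration, and the check only looks at file names, not at the encoded features themselves.

```python
import os
import tempfile


def missing_feature_files(img_dir, pano_ids):
    """Return the pano ids that have no [PANO_ID].npy file in img_dir."""
    missing = []
    for pano_id in pano_ids:
        feature_path = os.path.join(img_dir, f"{pano_id}.npy")
        if not os.path.isfile(feature_path):
            missing.append(pano_id)
    return missing


if __name__ == "__main__":
    # Demo with placeholder ids and an empty file; real files hold encoded panoramas.
    with tempfile.TemporaryDirectory() as img_dir:
        open(os.path.join(img_dir, "pano_a.npy"), "wb").close()
        print(missing_feature_files(img_dir, ["pano_a", "pano_b"]))  # ['pano_b']
```

In practice, pano_ids would come from whatever source lists the panoramas used by DATASET.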

More parameters and usage are listed here.

Note that vlntrans uses BERT (bert-base-uncased) to encode the instruction, which takes a lot of GPU memory, so you may need to adjust the batch size to fit the model on your GPU. In our experiments, we use three Titan RTX GPUs and a batch size of 30. This is the command we use to pre-train VLN Transformer on our instruction-style-modified dataset:

CUDA_VISIBLE_DEVICES="0,1,2" python main.py --dataset 'manh50_mask' --img_feat_dir '/data/manh50_features_mean/' --model 'vlntrans' --batch_size 30 --max_num_epochs 15 --exp_name 'pretrain_mask'
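
When sweeping over batch sizes to find one that fits on your GPU, it may be convenient to assemble the command in Python rather than by hand. The sketch below is not part of this repo; build_train_command is a hypothetical helper, and it uses only flags that appear in the commands in this README.

```python
import shlex


def build_train_command(dataset, img_feat_dir, model, exp_name,
                        batch_size=None, max_num_epochs=None):
    """Assemble the argument list for a main.py training run."""
    command = [
        "python", "main.py",
        "--dataset", dataset,
        "--img_feat_dir", img_feat_dir,
        "--model", model,
        "--exp_name", exp_name,
    ]
    # Optional flags are added only when a value is given.
    if batch_size is not None:
        command += ["--batch_size", str(batch_size)]
    if max_num_epochs is not None:
        command += ["--max_num_epochs", str(max_num_epochs)]
    return command


command = build_train_command(
    dataset="manh50_mask",
    img_feat_dir="/data/manh50_features_mean/",
    model="vlntrans",
    exp_name="pretrain_mask",
    batch_size=30,
    max_num_epochs=15,
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

The returned list can be passed to subprocess.run, with CUDA_VISIBLE_DEVICES set through its env argument.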

Train VLN agent on top of pre-trained models

The VLN agent can be fine-tuned starting from a pre-trained model:

python main.py --dataset [DATASET] --img_feat_dir [IMG_DIR] --model [MODEL] --resume_from [PRETRAINED_MODEL] --resume [RESUME_OPTION]

PRETRAINED_MODEL specifies the pre-trained model;
RESUME_OPTION specifies the checkpoint to load:
- latest: the most recent ckpt;
- TC_best: the ckpt with the best TC score on dev set;
- SPD_best: the ckpt with the best SPD score on dev set.
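
To make the three options concrete, here is a minimal sketch of which epoch each one would point to, given a log of dev-set scores. It is an illustration, not the repo's checkpointing code: select_checkpoints and the record layout are hypothetical, the numbers are made up, and it assumes TC (task completion) is higher-is-better while SPD (shortest-path distance) is lower-is-better.

```python
def select_checkpoints(dev_history):
    """Map each resume option to the epoch it would refer to.

    dev_history is a list of dicts with keys 'epoch', 'TC' and 'SPD'
    (a hypothetical layout chosen for this example).
    """
    if not dev_history:
        raise ValueError("dev_history is empty")
    return {
        "latest": max(dev_history, key=lambda r: r["epoch"])["epoch"],
        # Assumption: higher TC is better.
        "TC_best": max(dev_history, key=lambda r: r["TC"])["epoch"],
        # Assumption: lower SPD is better.
        "SPD_best": min(dev_history, key=lambda r: r["SPD"])["epoch"],
    }


# Made-up dev scores, for illustration only.
history = [
    {"epoch": 1, "TC": 9.5, "SPD": 21.0},
    {"epoch": 2, "TC": 12.0, "SPD": 19.5},
    {"epoch": 3, "TC": 11.0, "SPD": 18.0},
]
print(select_checkpoints(history))  # {'latest': 3, 'TC_best': 2, 'SPD_best': 3}
```

As the example shows, the three options can refer to different epochs of the same run.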

Evaluate outdoor VLN performance

We can evaluate the agent's navigation performance on the test set and dev set with the following command:

python main.py --test True --dataset [DATASET] --img_feat_dir [IMG_DIR] --model [MODEL] --resume_from [PRETRAINED_MODEL] --resume [RESUME_OPTION] --CLS [True/False] --DTW [True/False]
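
The flags --test, --CLS and --DTW take an explicit True or False value. How main.py parses them is not shown here; the sketch below only illustrates one common argparse pattern for such flags. The str2bool converter is an assumption for this example, not code from this repo. A converter of this kind is often used because Python's built-in bool("False") evaluates to True.

```python
import argparse


def str2bool(value):
    """Convert command-line text such as 'True' or 'False' to a bool."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


parser = argparse.ArgumentParser()
for flag in ("--test", "--CLS", "--DTW"):
    parser.add_argument(flag, type=str2bool, default=False)

# Parse an explicit argument list instead of sys.argv for the demo.
args = parser.parse_args(["--test", "True", "--CLS", "True", "--DTW", "False"])
print(args.test, args.CLS, args.DTW)  # True True False
```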

The pre-trained models for VLN Transformer, RCONCAT and GA can be downloaded from here. Please place them in checkpoints/.

To reproduce the results in our paper, please use the following commands:

CUDA_VISIBLE_DEVICES="0" python main.py --test True --dataset 'touchdown' --img_feat_dir [IMG_DIR] --model 'rconcat' --resume_from [PRETRAINED_MODEL] --resume 'TC_best' --CLS True --DTW True
CUDA_VISIBLE_DEVICES="1" python main.py --test True --dataset 'touchdown' --img_feat_dir [IMG_DIR] --model 'ga' --resume_from [PRETRAINED_MODEL] --resume 'TC_best' --CLS True --DTW True
CUDA_VISIBLE_DEVICES="2" python main.py --test True --dataset 'touchdown' --img_feat_dir [IMG_DIR] --model 'vlntrans' --batch_size 30 --resume_from [PRETRAINED_MODEL] --resume 'TC_best' --CLS True --DTW True

PRETRAINED_MODEL specifies the pre-trained model:
- vanilla: Navigation agent trained on touchdown dataset without pre-training on auxiliary datasets.
- finetuned_manh50: Pre-trained on manh50 dataset, and finetuned on touchdown dataset.
- finetuned_mask: Pre-trained on manh50_mask dataset, and finetuned on touchdown dataset.

Citing our work

@misc{zhu2020multimodal,
    title={Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation},
    author={Wanrong Zhu and Xin Wang and Tsu-Jui Fu and An Yan and Pradyumna Narayana and Kazoo Sone and Sugato Basu and William Yang Wang},
    year={2020},
    eprint={2007.00229},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Acknowledgements

The code and data can't be built without streetlearn, speaker_follower, touchdown, and Texar. We also thank @Jiannan Xiang for his contribution in reproducing the Touchdown task.
