
Iterative Vision-and-Language Navigation (IVLN part)

Jacob Krantz*, Shurjo Banerjee*, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, and Jesse Thomason

[Paper] [Project Page] [GitHub (IVLN part)] [GitHub (IVLN-CE part)]

This is the official implementation of Iterative Vision-and-Language Navigation (IVLN) in discrete environments, a paradigm for evaluating language-guided agents that navigate a persistent environment over time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the agent's memory at the beginning of every episode, testing the ability to perform cold-start navigation with no prior information. However, deployed robots occupy the same environment for long periods of time. The IVLN paradigm addresses this disparity by training and evaluating VLN agents that maintain memory across tours of scenes consisting of up to 100 ordered instruction-following Room-to-Room (R2R) episodes, each defined by an individual language instruction and a target path. This repository implements the Iterative Room-to-Room (IR2R) benchmark.
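To make the paradigm concrete, here is a minimal illustrative sketch (in Python-style pseudocode) of the tour-based evaluation loop. It is not the repository's actual API; the agent and environment interfaces below (reset_memory, load_episode, act, step) are hypothetical names used only to show where memory persists and where it is cleared.

def evaluate_tours(agent, env, tours):
    # Each tour is an ordered list of R2R episodes set in one scene,
    # and each episode pairs a language instruction with a target path.
    for tour in tours:
        agent.reset_memory()                 # memory is cleared only between tours
        for episode in tour:                 # up to ~100 ordered episodes per tour
            obs = env.load_episode(episode)
            done = False
            while not done:
                action = agent.act(obs, episode["instruction"])
                obs, done = env.step(action)
            # Unlike standard VLN, the agent keeps the scene memory it has
            # accumulated and carries it into the next episode of the tour.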

Installation

  1. Install requirements:
conda create --name vlnhamt python=3.8.5
conda activate vlnhamt
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
  2. Download the R2R data from Dropbox, including processed annotations, features, and pretrained models. Put the data under the datasets directory.

  3. Download the Matterport3D adjacency maps and angle features from Dropbox. Put the files under the datasets directory.

  4. Download the tour files for the original VLN-R2R data and the Prevalent augmented data from GDrive. Put the files under the iterative-vln directory. The resulting directory structure should look like the following (a quick sanity check is sketched after the tree):

iterative-vln
|___finetune_src
|___datasets
|   |___R2R
|   |___total_adj_list.json
|   |___angle_feature.npy
|   |___...
|___tours_iVLN.json
|___tours_iVLN_prevalent.json
|___...
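Optionally, the short Python check below (not part of the repository) can confirm that the expected files are in place and that the GPU build of PyTorch installed in step 1 is usable. It assumes it is run from the iterative-vln directory and that the tour files are plain JSON.

import json
import os

import torch

# Files and directories the README expects after the downloads above.
expected = [
    "datasets/R2R",
    "datasets/total_adj_list.json",
    "datasets/angle_feature.npy",
    "tours_iVLN.json",
    "tours_iVLN_prevalent.json",
]
for path in expected:
    print(f"{path}: {'found' if os.path.exists(path) else 'MISSING'}")

# PyTorch 1.7.1 with CUDA 10.1 wheels were installed in step 1.
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# The tour files are JSON; report how many top-level entries one defines.
with open("tours_iVLN.json") as f:
    print("tours_iVLN.json entries:", len(json.load(f)))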

Run the HAMT baseline

HAMT Teacher-forcing IL

cd finetune_src
bash scripts/run_r2r_il.sh

Run the TourHAMT model

cd finetune_src
bash scripts/iter_train_sep_hist_weight.sh

Model architecture: [IVLN TourHAMT architecture figure]

TourHAMT Variations

cd finetune_src
bash scripts/iter_train.sh                  # extended memory with previous episodes only
bash scripts/iter_train_hist.sh             # also train the history encoder
bash scripts/iter_train_sep_hist.sh         # with a previous-history identifier and a trained history encoder

Citation

If you find this work useful, please consider citing:

@article{krantz2022iterative,
    title     = {Iterative Vision-and-Language Navigation},
    author    = {Krantz, Jacob and Banerjee, Shurjo and Zhu, Wang and Corso, Jason and Anderson, Peter and Lee, Stefan and Thomason, Jesse},
    year      = {2022},
    publisher = {arXiv},
    url       = {https://arxiv.org/abs/2210.03087},
}

Acknowledgement

Parts of the code are built upon HAMT, pytorch-image-models, UNITER, and Recurrent-VLN-BERT. Thanks to the authors for their great work!

License

This codebase is MIT licensed. Trained models and task datasets are considered data derived from the Matterport3D (MP3D) scene dataset; Matterport3D-based task datasets and trained models are distributed under the Matterport3D Terms of Use and the CC BY-NC-SA 3.0 US license.
