Abstract: In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT), a pioneering solution rooted in the paradigm of causal inference. By delving into both observable and unobservable confounders within vision, language, and history, we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules to promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally, to capture global confounder features, we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning, which is also shown to be effective in improving cross-modal representations during pre-training. Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and SOON) underscore the superiority of our proposed method over previous state-of-the-art approaches.
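For readers unfamiliar with the two adjustments named above, their textbook forms from causal inference are sketched below. The symbols X (input), Y (prediction), Z (observed confounder), and M (mediator) are generic placeholders chosen for this sketch, not the paper's notation, and the exact formulation used inside BACL and FACL may differ.

```latex
% Back-door adjustment: Z is an observed confounder set that satisfies the back-door criterion.
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, z)\, P(z)

% Front-door adjustment: M mediates the effect of X on Y; the confounder itself may be unobserved.
P(Y \mid do(X)) = \sum_{m} P(m \mid X) \sum_{x'} P(Y \mid m, x')\, P(x')
```

The back-door form applies when the confounder can be observed and stratified over, while the front-door form is the usual route when it cannot, which matches the observable/unobservable split described in the abstract.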
-
Install Matterport3D Simulator: Start by installing the Matterport3D simulator from the official repository.
-
Install Python Dependencies: Run the following command to install the necessary Python packages. Make sure to match the versions in requirements.txt to avoid compatibility issues, particularly when loading pre-trained weights for fine-tuning.
pip install -r requirements.txt
-
Install en_core_web_sm: Run the following command:
pip install spacy
wget https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.0/en_core_web_sm-2.3.0.tar.gz
pip install en_core_web_sm-2.3.0.tar.gz
-
Install nltk_data: Run the following command to use the NLTK Downloader to obtain the resource:
python
>>> import nltk
>>> nltk.download('wordnet')
-
Download Resources:
- Datasets, Features, and Trained Weights: Available here.
- METER Pre-training (Optional): If you wish to pre-train GOAT using METER, download the model meter_clip16_224_roberta_pretrain.ckpt from here.
- EnvEdit Weights (Optional): Available here.
- RoBERTa Tokenizer: If direct access to Hugging Face models is restricted, manually download roberta-base from Hugging Face and store it locally under datasets/pretrained/roberta.
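As a minimal sketch of the offline fallback described for the tokenizer, the snippet below prefers the local copy when it exists and otherwise falls back to the Hub id. The helper names `resolve_roberta_source` and `load_roberta_tokenizer` are hypothetical and are not part of this repository; how GOAT's own code locates the tokenizer is not shown in this README. Only `AutoTokenizer.from_pretrained` is a real `transformers` call.

```python
from pathlib import Path

LOCAL_ROBERTA = "datasets/pretrained/roberta"  # local path named in this README
HUB_ROBERTA = "roberta-base"                   # Hugging Face model id


def resolve_roberta_source(local_dir=LOCAL_ROBERTA):
    """Return the local directory if it exists and is non-empty, else the Hub id.

    Hypothetical helper for illustration only.
    """
    path = Path(local_dir)
    if path.is_dir() and any(path.iterdir()):
        return str(path)
    return HUB_ROBERTA


def load_roberta_tokenizer(local_dir=LOCAL_ROBERTA):
    """Load the tokenizer from whichever source resolve_roberta_source picks."""
    # Imported lazily so the path check above works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(resolve_roberta_source(local_dir))
```

Calling `resolve_roberta_source()` before training is a quick way to confirm which source would be used without triggering a download.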
Ensure your datasets directory follows this structure:
datasets
├── R2R
│   ├── annotations
│   │   ├── pretrain_map
│   │   └── RxR
│   ├── connectivity
│   ├── features
│   ├── speaker
│   ├── navigator
│   ├── pretrain
│   ├── test
│   └── id_paths.json
├── REVERIE
│   ├── annotations
│   │   └── pretrain
│   ├── speaker
│   └── features
├── SOON
│   ├── annotations
│   ├── speaker
│   └── features
├── RxR
├── EnvEdit
└── pretrained
    ├── METER
    └── roberta
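Before training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not a script shipped with this repository: the function name `missing_entries` is hypothetical, and the list of expected paths is copied from the directory tree above. Entries for optional resources (EnvEdit, METER) may legitimately be absent.

```python
from pathlib import Path

# Expected entries, copied from the directory tree above (relative to datasets/).
EXPECTED = [
    "R2R/annotations/pretrain_map",
    "R2R/annotations/RxR",
    "R2R/connectivity",
    "R2R/features",
    "R2R/speaker",
    "R2R/navigator",
    "R2R/pretrain",
    "R2R/test",
    "R2R/id_paths.json",
    "REVERIE/annotations/pretrain",
    "REVERIE/speaker",
    "REVERIE/features",
    "SOON/annotations",
    "SOON/speaker",
    "SOON/features",
    "RxR",
    "EnvEdit",
    "pretrained/METER",
    "pretrained/roberta",
]


def missing_entries(root="datasets"):
    """Return the expected relative paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("datasets/ layout looks complete.")
```

The check only tests for existence; it does not validate the contents of any feature or annotation file.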
To pre-train the model, navigate to the pre-training source directory and run the provided shell script. Replace r2r in the script name with the desired dataset name as needed.
cd pretrain_src
bash run_r2r_goat.sh
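The dataset substitution can be sketched as a small helper. This is an illustration under assumptions: the function `pretrain_command` is hypothetical, and the pattern `run_<dataset>_goat.sh` is inferred from the R2R script name; only `run_r2r_goat.sh` is confirmed by this README, so check that the script for another dataset actually exists before running it.

```python
from pathlib import Path

# Dataset names taken from the abstract (R2R, REVERIE, RxR, SOON), lowercased.
DATASETS = ("r2r", "reverie", "rxr", "soon")


def pretrain_command(dataset, src_dir="pretrain_src"):
    """Return (command, working directory) for a dataset's pre-training script.

    Assumes the naming pattern run_<dataset>_goat.sh, which is only
    confirmed for r2r in this README.
    """
    name = dataset.lower()
    if name not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {DATASETS}")
    return ["bash", f"run_{name}_goat.sh"], Path(src_dir)
```

The returned pair can be passed to `subprocess.run(cmd, cwd=cwd, check=True)`, which mirrors the `cd pretrain_src` followed by `bash run_r2r_goat.sh` sequence shown above.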
-
Extract BACL Features:
Navigate to the map navigation source directory and execute the scripts to extract BACL features. Refer to the documentation for more details.
cd map_nav_src
bash do_utils/extract_room_type.bash
python do_intervention.py
-
Extract FACL features:
Run the following script to extract FACL features, and store them in the respective features directory for each dataset.
cd map_nav_src
bash scripts/run_r2r_goat_CFPextract.sh
To fine-tune the model, use the command below:
cd map_nav_src
bash scripts/run_r2r_goat.sh
Note: in our experiments, using the speaker together with causal intervention proved critical.
For model validation, execute the following:
cd map_nav_src
bash scripts/run_r2r_goat_valid.sh
- Panoramic trajectory visualization is provided by Speaker-Follower.
- Top-down maps for Matterport3D are available in NRNS.
- Instructions for extracting image features from Matterport3D scenes can be found in VLN-HAMT.
We extend our gratitude to all the authors for their significant contributions and for sharing their resources.
- Clean the code for SOON.
- Release the features and weights.
This project builds upon the work found in MP3DSim, DUET, EnvDrop, and METER. Some augmented datasets and features are from PREVALENT, RxR-Marky, and EnvEdit.
We express our sincere thanks to these authors for their outstanding work and generosity in sharing their resources.
If you find our work useful in your research, please consider citing:
@InProceedings{Wang2024GOAT,
author = {Wang, Liuyi and He, Zongtao and Dang, Ronghao and Shen, Mengjiao and Liu, Chengju and Chen, Qijun},
title = {Vision-and-Language Navigation via Causal Learning},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2024}
}