SPA: 3D SPatial-Awareness Enables Effective Embodied Representation

Haoyi Zhu, Honghui Yang, Yating Wang, Jiange Yang, Limin Wang, Tong He

SPA is a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. It leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understanding. We also present the most comprehensive evaluation of embodied representation learning to date, covering 268 tasks across 8 simulators with diverse policies in both single-task and language-conditioned multi-task scenarios.

🥳 NEWS:

Jan. 2025: SPA has been accepted to ICLR 2025!
Oct. 2024: Codebase and pre-trained checkpoints are released! Paper is available on arXiv.

🔭 Project Structure

Our codebase draws significant inspiration from the excellent Lightning Hydra Template. The directory structure of this project is organized as follows:

├── .github                   <- Github Actions workflows
│
├── configs                   <- Hydra configs
│   ├── callbacks                         <- Callbacks configs
│   ├── data                              <- Data configs
│   ├── debug                             <- Debugging configs
│   ├── experiment                        <- Experiment configs
│   ├── extras                            <- Extra utilities configs
│   ├── hydra                             <- Hydra configs
│   ├── local                             <- Local configs
│   ├── logger                            <- Logger configs
│   ├── model                             <- Model configs
│   ├── paths                             <- Project paths configs
│   ├── trainer                           <- Trainer configs
│   │
│   └── train.yaml            <- Main config for training
│
├── data                   <- Project data
│
├── logs                   <- Logs generated by hydra and lightning loggers
│
├── scripts                <- Shell or Python scripts
│
├── spa                    <- Source code of SPA
│   ├── data                     <- Data scripts
│   ├── models                   <- Model scripts
│   ├── utils                    <- Utility scripts
│   │
│   └── train.py                 <- Run SPA pre-training
│
├── .gitignore                <- List of files ignored by git
├── .project-root             <- File for inferring the position of project root directory
├── requirements.txt          <- File for installing python dependencies
├── setup.py                  <- File for installing project as a package
└── README.md

🔨 Installation

Basics

# clone project
git clone https://github.com/HaoyiZhu/SPA.git
cd SPA

# create conda environment
conda create -n spa python=3.11 -y
conda activate spa

# install PyTorch, please refer to https://pytorch.org/ for other CUDA versions
# e.g. cuda 11.8:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# install basic packages
pip3 install -r requirements.txt

SPA

# (optional) if you want to use SPA's volume decoder
cd libs/spa-ops
pip install -e .
cd ../..

# install SPA, so that you can import from anywhere
pip install -e .

🌟 Usage

Example of Using SPA Pre-trained Encoder

We provide pre-trained SPA weights for feature extraction. The checkpoints are available on 🤗Hugging Face. You don't need to manually download the weights, as SPA will automatically handle this if needed.

import torch

from spa.models import spa_vit_base_patch16, spa_vit_large_patch16

image = torch.rand((1, 3, 224, 224))  # range in [0, 1]

# Example usage of SPA-Large (recommended)
# or you can use `spa_vit_base_patch16` for SPA-base
model = spa_vit_large_patch16(pretrained=True)
model.eval()

# Freeze the model
model.freeze()

# (Recommended) move to CUDA
image = image.cuda()
model = model.cuda()

# Obtain the [CLS] token
cls_token = model(image)  # torch.Size([1, 1024])

# Obtain the reshaped feature map concatenated with [CLS] token
feature_map_cat_cls = model(
    image, feature_map=True, cat_cls=True
)  # torch.Size([1, 2048, 14, 14])

# Obtain the reshaped feature map without [CLS] token
feature_map_wo_cls = model(
    image, feature_map=True, cat_cls=False
)  # torch.Size([1, 1024, 14, 14])

Note: The inputs will be automatically resized to 224 x 224 and normalized within the SPA ViT encoder.
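
The shapes in the comments above follow from standard ViT patch arithmetic. Below is a small sketch, assuming a 16 x 16 patch size (taken from the `patch16` model names) and the 1024-dimensional embedding shown for SPA-Large. The helper `spa_output_shapes` is hypothetical and not part of the SPA package; it only reproduces the shape bookkeeping and does not call the model.

```python
def spa_output_shapes(batch, embed_dim, image_size=224, patch_size=16):
    """Expected output shapes for the three call modes shown above.

    Hypothetical helper for illustration only; it does not run SPA.
    """
    grid = image_size // patch_size  # 224 // 16 = 14 patches per side
    return {
        "cls_token": (batch, embed_dim),
        # With cat_cls=True the channel count doubles, which is consistent
        # with the [CLS] token being concatenated along the channel axis.
        "feature_map_cat_cls": (batch, 2 * embed_dim, grid, grid),
        "feature_map_wo_cls": (batch, embed_dim, grid, grid),
    }


shapes = spa_output_shapes(batch=1, embed_dim=1024)
print(shapes["cls_token"])            # (1, 1024)
print(shapes["feature_map_cat_cls"])  # (1, 2048, 14, 14)
print(shapes["feature_map_wo_cls"])   # (1, 1024, 14, 14)
```

A check like this can be handy when sizing the input layer of a downstream policy head before loading the actual weights.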

🚀 Pre-Training

Example of Pre-Training on ScanNet

We give an example of pre-training SPA on the ScanNet v2 dataset.

Prepare the dataset
- Download the ScanNet v2 dataset.
- Pre-process and extract RGB-D images following PonderV2. The preprocessed data should be put under data/scannet/.
- Pre-generate metadata for fast data loading. The following command will generate metadata under data/scannet/metadata.
```
python scripts/generate_scannet_metadata.py
```
Run the following command for pre-training. Remember to modify hyper-parameters such as number of nodes and GPU devices according to your machines.
```
python spa/train.py experiment=spa_pretrain_vitl trainer.num_nodes=5 trainer.devices=8
```

💡 SPA Large-Scale Evaluation

VC-1 Evaluation

We evaluate SPA on VC-1's MetaWorld, Adroit, DMControl, and TriFinger benchmarks. We maintain a forked version of the VC-1 repository that includes the code and configuration for evaluating SPA.

- Clone the forked VC-1 repo, and follow the instructions in the CortexBench README to set up the MuJoCo and TriFinger environments and to download the required datasets.
- Create a configuration for SPA, <spa_model>.yaml (e.g., spa_vit_large.yaml for SPA-Large), in <vc-1_path>/vc_models/src/vc_models/conf/model.
- To run the VC-1 evaluation for SPA, pass the model config as a parameter (embedding=<spa_model>) to each of the benchmarks in cortexbench.

🎉 Gotchas

Override any config parameter from command line

This codebase is based on Hydra, which allows for convenient configuration overriding:

python spa/train.py trainer.max_epochs=20 seed=300

Note: You can also add new parameters with the + sign.

python spa/train.py +some_new_param=some_new_value
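
To make the override syntax concrete, here is a deliberately simplified sketch of how a dotted `key=value` override and a `+key=value` addition could be applied to a nested config. This is not Hydra's implementation: the function `apply_override` and the starting values are hypothetical, values are kept as plain strings, and Hydra's real parser also handles types, lists, and config groups.

```python
def apply_override(cfg, override):
    """Apply one `key=value` or `+key=value` override to a nested dict.

    Simplified illustration only; values stay strings and no type
    conversion is attempted.
    """
    add_new = override.startswith("+")
    key, value = override.lstrip("+").split("=", 1)
    *parents, leaf = key.split(".")

    node = cfg
    for name in parents + [leaf]:
        if name not in node and not add_new:
            # Without "+", only keys already present may be overridden.
            raise KeyError(f"'{key}' is not in the config; prefix it with '+' to add it")
        if name != leaf or len(parents) == 0:
            pass
        if name is not leaf:
            node = node.setdefault(name, {})
    node[leaf] = value
    return cfg


# Hypothetical starting config, for illustration only.
cfg = {"trainer": {"max_epochs": "10"}, "seed": "42"}
for item in ["trainer.max_epochs=20", "seed=300", "+some_new_param=some_new_value"]:
    apply_override(cfg, item)

print(cfg)
# {'trainer': {'max_epochs': '20'}, 'seed': '300', 'some_new_param': 'some_new_value'}
```

The `+` prefix matters because a plain override of a key that is absent from the config is rejected rather than silently added.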

Train on CPU, GPU, multi-GPU and TPU

# train on CPU
python spa/train.py trainer=cpu

# train on 1 GPU
python spa/train.py trainer=gpu

# train on TPU
python spa/train.py +trainer.tpu_cores=8

# train with DDP (Distributed Data Parallel) (4 GPUs)
python spa/train.py trainer=ddp trainer.devices=4

# train with DDP (Distributed Data Parallel) (8 GPUs in total: 4 per node, 2 nodes)
python spa/train.py trainer=ddp trainer.devices=4 trainer.num_nodes=2

# simulate DDP on CPU processes
python spa/train.py trainer=ddp_sim trainer.devices=2

# accelerate training on mac
python spa/train.py trainer=mps
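
In the DDP commands above, `trainer.devices` counts devices per node, so the total number of GPUs is `devices * num_nodes`. A minimal sketch of that arithmetic follows; the helper name `total_gpus` is hypothetical and introduced only for illustration.

```python
def total_gpus(devices_per_node, num_nodes=1):
    """Total number of GPUs (one DDP process each) across all nodes."""
    return devices_per_node * num_nodes


print(total_gpus(4))      # 4  (trainer.devices=4)
print(total_gpus(4, 2))   # 8  (trainer.devices=4 trainer.num_nodes=2)
print(total_gpus(8, 5))   # 40 (trainer.num_nodes=5 trainer.devices=8, as in the ScanNet example)
```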

Train with mixed precision

# train with pytorch native automatic mixed precision (AMP)
python spa/train.py trainer=gpu +trainer.precision=16

Use different tricks available in PyTorch Lightning

# gradient clipping may be enabled to avoid exploding gradients
python spa/train.py trainer.gradient_clip_val=0.5

# run validation loop 4 times during a training epoch
python spa/train.py +trainer.val_check_interval=0.25

# accumulate gradients
python spa/train.py trainer.accumulate_grad_batches=10

# terminate training after 12 hours
python spa/train.py +trainer.max_time="00:12:00:00"

Note: PyTorch Lightning provides 40+ useful trainer flags.
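
The `max_time` string above uses a `DD:HH:MM:SS` layout, which is why `"00:12:00:00"` means 12 hours rather than 12 minutes. The sketch below only illustrates how such a string maps to a duration; `parse_max_time` is a hypothetical helper, not a Lightning function.

```python
from datetime import timedelta


def parse_max_time(text):
    """Convert a "DD:HH:MM:SS" string into a timedelta."""
    days, hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


limit = parse_max_time("00:12:00:00")
print(limit)                  # 12:00:00
print(limit.total_seconds())  # 43200.0
```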

Easily debug

# runs 1 epoch in default debugging mode
# changes logging directory to `logs/debugs/...`
# sets level of all command line loggers to 'DEBUG'
# enforces debug-friendly configuration
python spa/train.py debug=default

# run 1 train, val and test loop, using only 1 batch
python spa/train.py debug=fdr

# print execution time profiling
python spa/train.py debug=profiler

# try overfitting to 1 batch
python spa/train.py debug=overfit

# raise exception if there are any numerical anomalies in tensors, like NaN or +/-inf
python spa/train.py +trainer.detect_anomaly=true

# use only 20% of the data
python spa/train.py +trainer.limit_train_batches=0.2 \
+trainer.limit_val_batches=0.2 +trainer.limit_test_batches=0.2

Note: Visit configs/debug/ for different debugging configs.

Resume training from checkpoint

python spa/train.py ckpt_path="/path/to/ckpt/name.ckpt"

Note: The checkpoint can be either a path or a URL.

Note: Currently, loading a checkpoint doesn't resume the logger experiment, but this will be supported in a future Lightning release.

Create a sweep over hyperparameters

# this will run 9 experiments one after the other,
# each with different combination of seed and learning rate
python spa/train.py -m seed=100,200,300 model.optimizer.lr=0.0001,0.00005,0.00001

Note: Hydra composes configs lazily at job launch time. If you change code or configs after launching a job/sweep, the final composed configs might be impacted.
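
The count of 9 experiments in the sweep above comes from the Cartesian product of the listed values (3 seeds x 3 learning rates). A short sketch that enumerates those combinations with the standard library; it only illustrates the combinatorics and does not invoke Hydra, and the ordering shown is illustrative.

```python
from itertools import product

seeds = [100, 200, 300]
learning_rates = [0.0001, 0.00005, 0.00001]

# One job per (seed, learning rate) pair.
jobs = [
    {"seed": seed, "model.optimizer.lr": lr}
    for seed, lr in product(seeds, learning_rates)
]

print(len(jobs))  # 9
print(jobs[0])    # {'seed': 100, 'model.optimizer.lr': 0.0001}
```

Adding a third swept parameter multiplies the job count again, so grid sweeps grow quickly.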

Execute all experiments from folder

python spa/train.py -m 'exp_maniskill2_act_policy/maniskill2_task@maniskill2_task=glob(*)'

Note: Hydra provides special syntax for controlling the behavior of multiruns; see the Hydra documentation for details. The command above executes all task experiments from configs/exp_maniskill2_act_policy/maniskill2_task.

Execute run for multiple different seeds

python spa/train.py -m seed=100,200,300 trainer.deterministic=True

Note: trainer.deterministic=True makes PyTorch more deterministic but may reduce performance.

For more instructions, refer to the official documentation for PyTorch Lightning, Hydra, and Lightning Hydra Template.

📚 License

This repository is released under the MIT license.

✨ Acknowledgement

Our work is primarily built upon PointCloudMatters, PonderV2, UniPAD, PyTorch Lightning, Hydra, Lightning Hydra Template, RLBench, PerAct, LIBERO, Meta-World, ACT, Diffusion Policy, DP3, TIMM, VC1, R3M. We extend our gratitude to all these authors for their generously open-sourced code and their significant contributions to the community.

Contact Haoyi Zhu if you have any questions or suggestions.

📝 Citation

@article{zhu2024spa,
    title = {SPA: 3D Spatial-Awareness Enables Effective Embodied Representation},
    author = {Zhu, Haoyi and Yang, Honghui and Wang, Yating and Yang, Jiange and Wang, Limin and He, Tong},
    journal = {arXiv preprint arXiv:2410.08208},
    year = {2024},
}
