Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy

This repository is the official implementation of Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy.

Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. As illustrated in the figure below, GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks.

We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach 3D-LOTUS leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++ (see figure below), a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs and the object grounding accuracy of VLMs. 3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation.

Installation

See INSTALL.md for detailed installation instructions.

Dataset

The dataset is available on Dropbox. Place it in the data/gembench folder. The dataset structure is as follows:

- data
    - gembench
        - train_dataset
            - microsteps: 567M, initial configurations for each episode
            - keysteps_bbox: 160G, extracted keysteps data
            - keysteps_bbox_pcd: (used to train 3D-LOTUS)
                - voxel1m: 10G, processed point clouds
                - instr_embeds_clip.npy: instructions encoded by CLIP text encoder
            - motion_keysteps_bbox_pcd: (used to train 3D-LOTUS++ motion planner)
                - voxel1m: 2.8G, processed point clouds
                - action_embeds_clip.npy: action names encoded by CLIP text encoder
        - val_dataset
            - microsteps: 110M, initial configurations for each episode
            - keysteps_bbox_pcd:
                - voxel1m: 941M, processed point clouds
        - test_dataset
            - microsteps: 2.2G, initial configurations for each episode
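As a quick sanity check after downloading, the layout above can be verified with a short script. This is a minimal sketch and not part of the repository: the function name `missing_dataset_paths` is hypothetical, and it only tests that the entries listed above exist, not that their contents are complete.

```python
from pathlib import Path

# Relative paths copied from the dataset listing above.
EXPECTED_PATHS = [
    "train_dataset/microsteps",
    "train_dataset/keysteps_bbox",
    "train_dataset/keysteps_bbox_pcd/voxel1m",
    "train_dataset/keysteps_bbox_pcd/instr_embeds_clip.npy",
    "train_dataset/motion_keysteps_bbox_pcd/voxel1m",
    "train_dataset/motion_keysteps_bbox_pcd/action_embeds_clip.npy",
    "val_dataset/microsteps",
    "val_dataset/keysteps_bbox_pcd/voxel1m",
    "test_dataset/microsteps",
]


def missing_dataset_paths(root="data/gembench"):
    """Return the expected entries that do not exist under `root`.

    Hypothetical helper: it checks existence only, not file sizes or contents.
    """
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_dataset_paths()
    if missing:
        print("Missing entries:")
        for p in missing:
            print("  " + p)
    else:
        print("Dataset layout looks complete.")
```

Run it from the repository root so that the default `data/gembench` path resolves correctly.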

3D-LOTUS Policy

Training

Train the 3D-LOTUS policy end-to-end on the GemBench train split. Training takes about 14 hours on a single A100 GPU.

sbatch job_scripts/train_3dlotus_policy.sh

The trained checkpoints are available here. Put them in the folder data/experiments/gembench/3dlotus/v1.

Evaluation

# both validation and test splits
sbatch job_scripts/eval_3dlotus_policy.sh

The evaluation script evaluates the 3D-LOTUS policy on the validation (seed100) and test splits of GemBench. It skips any task whose results are already saved in data/experiments/gembench/3dlotus/v1/preds/, so remove those results first if you want to re-evaluate a task.
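One cautious way to force a full re-evaluation is to move the existing predictions aside instead of deleting them. The sketch below is an illustration, not part of the repository: `archive_preds` is a hypothetical helper, and it assumes the evaluation script recreates the preds folder when it is missing.

```python
import time
from pathlib import Path


def archive_preds(preds_dir="data/experiments/gembench/3dlotus/v1/preds"):
    """Rename the predictions folder so the next evaluation starts from scratch.

    Hypothetical helper. Returns the backup path, or None if there was
    nothing to archive.
    """
    preds = Path(preds_dir)
    if not preds.exists():
        return None
    # Timestamped sibling folder, e.g. preds_backup_20240101_120000
    stamp = time.strftime("%Y%m%d_%H%M%S")
    backup = preds.with_name(f"{preds.name}_backup_{stamp}")
    preds.rename(backup)
    return backup
```

Renaming keeps the earlier results available for comparison; delete the backup manually once it is no longer needed.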

We use the validation set to select the best checkpoint. The following script summarizes results on the validation split.

python scripts/summarize_val_results.py data/experiments/gembench/3dlotus/v1/preds/seed100/results.jsonl
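For ad hoc inspection of a results.jsonl file, a generic per-group average can be computed as sketched below. This does not reproduce scripts/summarize_val_results.py: the record schema is not documented here, so the caller passes the field names explicitly, and the names used in the usage note are placeholders.

```python
import json
from collections import defaultdict


def mean_by_group(jsonl_path, group_key, value_key):
    """Average `value_key` per `group_key` over the records of a JSON Lines file.

    Hypothetical helper: both keys are supplied by the caller because the
    actual field names in results.jsonl are not assumed here.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            totals[record[group_key]] += float(record[value_key])
            counts[record[group_key]] += 1
    return {group: totals[group] / counts[group] for group in totals}
```

For example, if each record happened to carry a task identifier and a success rate, `mean_by_group(path, "task", "sr")` would return the mean success rate per task; check the actual keys in your file before relying on this.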

The following script summarizes results on the test splits for the four generalization levels:

python scripts/summarize_tst_results.py data/experiments/gembench/3dlotus/v1/preds 150000

3D-LOTUS++ Policy with LLM and VLM

Download the llama3-8B model following the instructions here, then update the corresponding path in the configuration file genrobo3d/configs/rlbench/robot_pipeline.yaml.

Training

Train the 3D-LOTUS++ motion planning policy on the GemBench train split. Training takes about 14 hours on a single A100 GPU.

sbatch job_scripts/train_3dlotusplus_motion_planner.sh

The trained checkpoints are available here. Put them in the folder data/experiments/gembench/3dlotusplus/v1.

Evaluation

We have three evaluation modes:

- groundtruth task planner + groundtruth object grounding + automatic motion planner
- groundtruth task planner + automatic object grounding + automatic motion planner
- automatic task planner + automatic object grounding + automatic motion planner

See the comments in the following script:

# both validation and test splits
sbatch job_scripts/eval_3dlotusplus_policy.sh

Citation

If you use our GemBench benchmark or find our code helpful, please cite our work:

@inproceedings{garcia24gembench,
    author    = {Ricardo Garcia and Shizhe Chen and Cordelia Schmid},
    title     = {Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy},
    booktitle = {preprint},
    year      = {2024}
}
