[October 02 2024] Cutting-edge papers from 2024 are now available!
- Survey Paper
- Language-conditioned Reinforcement Learning
- Language-conditioned Imitation Learning
- Diffusion Policy
- Neuro-symbolic
- Empowered by LLMs
- Empowered by VLMs
- Comparative Analysis
This repository is primarily based on the survey paper
Language-conditioned Learning for Robotic Manipulation: A Survey
Hongkuan Zhou,
Xiangtong Yao,
Oier Mees,
Yuan Meng,
Dhruv Shah,
Ted Xiao,
Yonatan Bisk,
Edward Johns,
Mohit Shridhar,
Kai Huang,
Zhenshan Bing,
Alois Knoll
@article{zhou2023language,
  title={Language-conditioned Learning for Robotic Manipulation: A Survey},
  author={Zhou, Hongkuan and Yao, Xiangtong and Meng, Yuan and Sun, Siming and Bing, Zhenshan and Huang, Kai and Knoll, Alois},
  journal={arXiv preprint arXiv:2312.10807},
  year={2023}
}
- From language to goals: Inverse reinforcement learning for vision-based instruction following [paper]
- Grounding English commands to reward functions [paper]
- Learning to understand goal specifications by modelling reward [paper]
- Beating Atari with natural language guided reinforcement learning [paper] [code]
- Using natural language for reward shaping in reinforcement learning [paper]
- Gated-attention architectures for task-oriented language grounding [paper] [code]
- Mapping instructions and visual observations to actions with reinforcement learning [paper]
- Modular multitask reinforcement learning with policy sketches [paper]
- Representation learning for grounded spatial reasoning [paper]
- LanCon-Learn: Learning with language to enable generalization in multi-task manipulation [paper] [code]
- PixL2R: Guiding reinforcement learning using natural language by mapping pixels to rewards [paper] [code]
- Learning from symmetry: Meta-reinforcement learning with symmetrical behaviors and language instructions [paper] [website]
- Meta-reinforcement learning via language instructions [paper] [code] [website]
- Learning language-conditioned robot behavior from offline data and crowd-sourced annotation [paper]
- Concept2Robot: Learning manipulation concepts from instructions and human demonstrations [paper]
- Language conditioned imitation learning over unstructured data [paper] [code] [website]
- BC-Z: Zero-shot task generalization with robotic imitation learning [paper]
- What matters in language-conditioned robotic imitation learning over unstructured data [paper] [code] [website]
- Grounding language with visual affordances over unstructured data [paper] [code] [website]
- Language-conditioned imitation learning with base skill priors under unstructured data [paper] [code] [website]
- Pay attention! - Robustifying a deep visuomotor policy through task-focused visual attention [paper]
- Language-conditioned imitation learning for robot manipulation tasks [paper]
- Multimodal Diffusion Transformer for Learning from Play [paper]
- Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models [paper] [code] [website]
- PlayFusion: Skill Acquisition via Diffusion from Language-Annotated Play [paper] [website]
- ChainedDiffuser: Unifying Trajectory Diffusion and Keypose Prediction for Robotic Manipulation [paper] [code] [website]
- GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields [paper] [code] [website]
- DNAct: Diffusion Guided Multi-Task 3D Policy Learning [paper] [website]
- 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations [paper] [code] [website]
- Vision-Language Foundation Models as Effective Robot Imitators [paper]
- OpenVLA: An Open-Source Vision-Language-Action Model [paper] [code] [website]
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper]
- 3D-VLA: A 3D Vision-Language-Action Generative World Model [paper] [code] [website]
- Octo: An Open-Source Generalist Robot Policy [paper] [code] [website]
- Hierarchical understanding in robotic manipulation: A knowledge-based framework [paper]
- Semantic Grasping Via a Knowledge Graph of Robotic Manipulation: A Graph Representation Learning Approach [paper]
- Knowledge Acquisition and Completion for Long-Term Human-Robot Interactions using Knowledge Graph Embedding [paper]
- Tell Me Dave: Context-sensitive grounding of natural language to manipulation instructions [paper]
- Neuro-symbolic procedural planning with commonsense prompting [paper]
- Reinforcement Learning Based Navigation with Semantic Knowledge of Indoor Environments [paper]
- Learning Neuro-Symbolic Skills for Bilevel Planning [paper]
- Learning Neuro-symbolic Programs for Language Guided Robot Manipulation [paper] [code] [website]
- Long-term robot manipulation task planning with scene graph and semantic knowledge [paper]
- SayPlan: Grounding large language models using 3D scene graphs for scalable task planning [paper]
- Language models as zero-shot planners: Extracting actionable knowledge for embodied agents [paper]
- Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents [paper]
- ProgPrompt: Generating situated robot task plans using large language models [paper]
- Robots that ask for help: Uncertainty alignment for large language model planners [paper]
- Task and motion planning with large language models for object rearrangement [paper]
- Do as I can, not as I say: Grounding language in robotic affordances [paper]
- The 2014 international planning competition: Progress and trends [paper]
- Robot task planning via deep reinforcement learning: a tabletop object sorting application [paper]
- Robot task planning and situation handling in open worlds [paper] [code] [website]
- Embodied Task Planning with Large Language Models [paper] [code] [website]
- Text2Motion: From natural language instructions to feasible plans [paper] [website]
- Large language models as commonsense knowledge for large-scale task planning [paper] [code] [website]
- AlphaBlock: Embodied finetuning for vision-language reasoning in robot manipulation [paper]
- Learning to reason over scene graphs: a case study of finetuning GPT-2 into a robot language model for grounded task planning [paper] [code]
- Scaling up and distilling down: Language-guided robot skill acquisition [paper] [code] [website]
- STAP: Sequencing task-agnostic policies [paper] [code] [website]
- Inner monologue: Embodied reasoning through planning with language models [paper] [website]
- Rearrangement: A challenge for embodied AI [paper]
- The ThreeDWorld transport challenge: A visually guided task and motion planning benchmark for physically realistic embodied AI [paper]
- Tidy up my room: Multi-agent cooperation for service tasks in smart environments [paper]
- A quantifiable stratification strategy for tidy-up in service robotics [paper]
- TidyBot: Personalized robot assistance with large language models [paper]
- Housekeep: Tidying virtual households using commonsense reasoning [paper]
- Building cooperative embodied agents modularly with large language models [paper]
- Socratic models: Composing zero-shot multimodal reasoning with language [paper]
- Voyager: An open-ended embodied agent with large language models [paper]
- Translating natural language to planning goals with large-language models [paper]
- CLIPort: What and where pathways for robotic manipulation [paper] [code] [website]
- Transporter networks: Rearranging the visual world for robotic manipulation [paper] [code] [website]
- Simple but effective: CLIP embeddings for embodied AI [paper]
- Instruct2Act: Mapping multi-modality instructions to robotic actions with large language model [paper] [code]
- LATTE: Language trajectory transformer [paper] [code]
- Embodied Task Planning with Large Language Models [paper] [code] [website]
- PaLM-E: An embodied multimodal language model [paper] [website]
- Socratic models: Composing zero-shot multimodal reasoning with language [paper]
- Pretrained language models as visual planners for human assistance [paper] [code]
- Open-world object manipulation using pre-trained vision-language models [paper] [website]
- Robotic skill acquisition via instruction augmentation with vision-language models [paper] [website]
- Language reward modulation for pretraining reinforcement learning [paper] [code]
- Vision-language models as success detectors [paper]
- A Generalist Agent [paper]
- RT-1: Robotics Transformer for Real-World Control at Scale [paper] [code] [website]
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [paper]
- Vision-Language Foundation Models as Effective Robot Imitators [paper]
- OpenVLA: An Open-Source Vision-Language-Action Model [paper] [code] [website]
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper]
- 3D-VLA: A 3D Vision-Language-Action Generative World Model [paper] [code] [website]
Simulator | Description |
---|---|
PyBullet | With its origins rooted in the Bullet physics engine, PyBullet transcends the boundaries of conventional simulation platforms, offering a wealth of tools and resources for tasks ranging from robot manipulation and locomotion to computer-aided design analysis. |
MuJoCo | MuJoCo, short for "Multi-Joint dynamics with Contact", originates from the vision of creating a physics engine tailored for simulating articulated and deformable bodies. It has evolved into an essential tool for exploring diverse domains, from robot locomotion and manipulation to human movement and control. |
CoppeliaSim | CoppeliaSim is formerly known as V-REP (Virtual Robot Experimentation Platform). It offers a comprehensive environment for simulating and prototyping robotic systems, enabling users to create, analyze, and optimize a wide spectrum of robotic applications. Its origins as an educational tool have evolved into a full-fledged simulation framework, revered for its versatility and user-friendly interface. |
NVIDIA Omniverse | NVIDIA Omniverse offers real-time physics simulation and lifelike rendering, creating a virtual environment for comprehensive testing and fine-tuning of robotic manipulation algorithms and control strategies, all prior to their actual deployment in the physical realm. |
Unity | Unity is a cross-platform game engine developed by Unity Technologies. Renowned for its user-friendly interface and powerful capabilities, Unity has become a cornerstone in the worlds of video games, augmented reality (AR), virtual reality (VR), and also simulations. |
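The simulators above all expose a similar low-level pattern: connect to a physics server, load robot and object assets, then step the simulation while issuing control commands. Below is a minimal sketch of that pattern in PyBullet (assuming the `pybullet` and `pybullet_data` packages are installed); it is illustrative only and not taken from any benchmark in this list.

```python
import pybullet as p
import pybullet_data

# Start a headless physics server; use p.GUI instead for a visualization window.
client = p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())  # bundled example assets
p.setGravity(0, 0, -9.81)

# Load a ground plane and a fixed-base Franka Panda (URDF ships with pybullet_data).
plane_id = p.loadURDF("plane.urdf")
panda_id = p.loadURDF("franka_panda/panda.urdf", useFixedBase=True)

# Drive the first arm joint toward a target angle and simulate for one second
# (240 steps at the default 240 Hz timestep).
p.setJointMotorControl2(panda_id, jointIndex=0,
                        controlMode=p.POSITION_CONTROL, targetPosition=0.5)
for _ in range(240):
    p.stepSimulation()

print("joint 0 position:", p.getJointState(panda_id, 0)[0])
p.disconnect()
```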
Benchmark | Simulation Engine | Manipulator | RGB Obs. | Depth Obs. | Mask Obs. | Tool Use | Multi-agent | Long-horizon
---|---|---|---|---|---|---|---|---
CALVIN | PyBullet | Franka Panda | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
Meta-world | MuJoCo | Sawyer | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
LEMMA | NVIDIA Omniverse | UR10 & UR5 | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
RLBench | CoppeliaSim | Franka Panda | ✅ | ✅ | ✅ | ❌ | ❌ | ✅
VIMAbench | PyBullet | UR5 | ✅ | ❌ | ❌ | ❌ | ❌ | ✅
LoHoRavens | PyBullet | UR5 | ✅ | ✅ | ❌ | ❌ | ❌ | ✅
ARNOLD | NVIDIA Isaac Gym | Franka Panda | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
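Regardless of the engine, a language-conditioned episode in these benchmarks follows the same outer loop: reset the scene, pass the instruction together with the available observation modalities (RGB, and where supported depth or masks) to the policy, and step until success or a time limit. The sketch below shows that loop against a generic gym-style interface; the observation keys, the `policy.act` signature, and the `success` flag are hypothetical placeholders, not the actual API of CALVIN, RLBench, VIMAbench, or any other benchmark listed here.

```python
def rollout(env, policy, instruction: str, max_steps: int = 200) -> bool:
    """Run one episode conditioned on a natural-language instruction."""
    obs = env.reset()
    for _ in range(max_steps):
        # Observation modalities mirror the table: RGB is always present,
        # depth/masks only in benchmarks that provide them.
        action = policy.act(rgb=obs["rgb"],
                            depth=obs.get("depth"),
                            instruction=instruction)
        obs, reward, done, info = env.step(action)
        if done:
            return bool(info.get("success", reward > 0))
    return False
```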
Model | Year | Benchmark | Simulation Engine | Language Module | Perception Module | Real World Experiment | LLM | Reinforcement Learning | Imitation Learning |
---|---|---|---|---|---|---|---|---|---|
DREAMCELL | 2019 | # | - | LSTM | * | ❌ | ❌ | ❌ | ✅ |
PixL2R | 2020 | Meta-World | MuJoCo | LSTM | CNN | ❌ | ❌ | ✅ | ❌ |
Concept2Robot | 2020 | # | PyBullet | BERT | ResNet-18 | ❌ | ❌ | ❌ | ✅ |
LanguagePolicy | 2020 | # | CoppeliaSim | GloVe | Faster R-CNN | ❌ | ❌ | ❌ | ✅
LOReL | 2021 | Meta-World | MuJoCo | DistilBERT | CNN | ✅ | ❌ | ❌ | ✅
CARE | 2021 | Meta-World | MuJoCo | RoBERTa | * | ❌ | ✅ | ✅ | ❌ |
MCIL | 2021 | # | MuJoCo | MUSE | CNN | ❌ | ❌ | ❌ | ✅ |
BC-Z | 2021 | # | - | MUSE | ResNet18 | ✅ | ❌ | ❌ | ✅ |
CLIPort | 2021 | # | PyBullet | CLIP | CLIP/ResNet | ✅ | ❌ | ❌ | ✅
LanCon-Learn | 2022 | Meta-World | MuJoCo | GloVe | * | ❌ | ❌ | ✅ | ✅
MILLON | 2022 | Meta-World | MuJoCo | GloVe | * | ✅ | ❌ | ✅ | ❌
PaLM-SayCan | 2022 | # | - | PaLM | ViLD | ✅ | ✅ | ✅ | ✅ |
ATLA | 2022 | # | PyBullet | BERT-Tiny | CNN | ❌ | ✅ | ✅ | ❌ |
HULC | 2022 | CALVIN | PyBullet | MiniLM-L3-v2 | CNN | ❌ | ❌ | ❌ | ✅
PerAct | 2022 | RLBench | CoppeliaSim | CLIP | ViT | ✅ | ❌ | ❌ | ✅
RT-1 | 2022 | # | - | USE | EfficientNet-B3 | ✅ | ✅ | ❌ | ❌ |
LATTE | 2023 | # | CoppeliaSim | DistilBERT, CLIP | CLIP | ✅ | ❌ | ❌ | ❌
DIAL | 2022 | # | - | CLIP | CLIP | ✅ | ✅ | ❌ | ✅ |
R3M | 2022 | # | - | DistilBERT | ResNet | ✅ | ❌ | ❌ | ✅
Inner Monologue | 2022 | # | - | CLIP | CLIP | ✅ | ✅ | ❌ | ❌ |
NLMap | 2023 | # | - | CLIP | ViLD | ✅ | ✅ | ❌ | ✅ |
Code as Policies | 2023 | # | - | GPT3, Codex | ViLD | ✅ | ✅ | ❌ | ❌ |
PROGPROMPT | 2023 | Virtualhome | Unity3D | GPT-3 | * | ✅ | ✅ | ❌ | ❌ |
Language2Reward | 2023 | # | MuJoCo MPC | GPT-4 | * | ✅ | ✅ | ✅ | ❌ |
LfS | 2023 | Meta-World | MuJoCo | Cons. Parser | * | ✅ | ❌ | ✅ | ❌ |
HULC++ | 2023 | CALVIN | PyBullet | MiniLM-L3-v2 | CNN | ✅ | ❌ | ❌ | ✅ |
LEMMA | 2023 | LEMMA | NVIDIA Omniverse | CLIP | CLIP | ❌ | ❌ | ❌ | ✅ |
SPIL | 2023 | CALVIN | PyBullet | MiniLM-L3-v2 | CNN | ✅ | ❌ | ❌ | ✅ |
PaLM-E | 2023 | # | PyBullet | PaLM | ViT | ✅ | ✅ | ❌ | ✅ |
LAMP | 2023 | RLBench | CoppeliaSim | ChatGPT | R3M | ❌ | ✅ | ✅ | ❌
MOO | 2023 | # | - | OWL-ViT | OWL-ViT | ✅ | ❌ | ❌ | ✅ |
Instruct2Act | 2023 | VIMAbench | PyBullet | ChatGPT | CLIP | ❌ | ✅ | ❌ | ❌
VoxPoser | 2023 | # | SAPIEN | GPT-4 | OWL-ViT | ✅ | ✅ | ❌ | ❌
SuccessVQA | 2023 | # | IA Playroom | Flamingo | Flamingo | ✅ | ✅ | ❌ | ❌ |
VIMA | 2023 | VIMAbench | PyBullet | T5 model | ViT | ✅ | ✅ | ❌ | ✅ |
TidyBot | 2023 | # | - | GPT-3 | CLIP | ✅ | ✅ | ❌ | ❌ |
Text2Motion | 2023 | # | - | GPT-3, Codex | * | ✅ | ✅ | ✅ | ❌ |
LLM-GROP | 2023 | # | Gazebo | GPT-3 | * | ✅ | ✅ | ❌ | ❌ |
Scaling Up | 2023 | # | MuJoCo | CLIP, GPT-3 | ResNet-18 | ✅ | ✅ | ❌ | ✅ |
Socratic Models | 2023 | # | - | RoBERTa, GPT-3 | CLIP | ✅ | ✅ | ❌ | ❌ |
SayPlan | 2023 | # | - | GPT-4 | * | ✅ | ✅ | ❌ | ❌ |
RT-2 | 2023 | # | - | PaLI-X, PaLM-E | PaLI-X, PaLM-E | ✅ | ✅ | ❌ | ❌ |
KNOWNO | 2023 | # | PyBullet | PaLM-2L | * | ✅ | ✅ | ❌ | ❌ |
Diffusion Policy | 2023 | Push-T | MuJoCo | - | CNN | ✅ | ❌ | ❌ | ✅ |
MDT | 2023 | CALVIN | PyBullet | CLIP | CLIP | ❌ | ❌ | ✅ | ✅ |
Scaling Up | 2023 | # | MuJoCo | CLIP | CLIP | ✅ | ❌ | ❌ | ✅ |
PlayFusion | 2023 | CALVIN | PyBullet | Sentence-BERT | ResNet-18 | ✅ | ❌ | ❌ | ✅
ChainedDiffuser | 2023 | RLBench | CoppeliaSim | CLIP | CLIP | ✅ | ❌ | ❌ | ✅
GNFactor | 2023 | RLBench | CoppeliaSim | CLIP | NeRF | ✅ | ❌ | ❌ | ✅
DNAct | 2024 | RLBench | CoppeliaSim | CLIP | NeRF, PointNext | ✅ | ❌ | ❌ | ✅
3D Diffuser Actor | 2024 | CALVIN | PyBullet | CLIP | CLIP | ✅ | ❌ | ❌ | ✅ |
RoboFlamingo | 2024 | CALVIN | PyBullet | OpenFlamingo | OpenFlamingo | ❌ | ✅ | ❌ | ✅ |
OpenVLA | 2024 | Open X-Embodiment | - | Llama 2 7B | DinoV2 & SigLIP | ✅ | ✅ | ❌ | ✅ |
RT-X | 2024 | Open X-Embodiment | - | PaLI-X/PaLM-E | PaLI-X/PaLM-E | ✅ | ✅ | ❌ | ✅
PIVOT | 2024 | Open X-Embodiment | - | GPT-4/Gemini | GPT-4/Gemini | ✅ | ✅ | ❌ | ❌ |
3D-VLA | 2024 | RLBench & CALVIN | CoppeliaSim & PyBullet | 3D-LLM | 3D-LLM | ❌ | ✅ | ❌ | ✅
Octo | 2024 | Open X-Embodiment | - | T5 | CNN | ✅ | ✅ | ❌ | ✅ |
ECoT | 2024 | BridgeData V2 | - | Llama 2 7B | DinoV2 & SigLIP | ✅ | ✅ | ❌ | ✅ |
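Many of the rows above use a CLIP text encoder as the language module and a CLIP (or CLIP-derived) backbone as the perception module. As a rough sketch of what those two columns mean in practice, the snippet below embeds an instruction and a camera image with the Hugging Face `transformers` CLIP implementation; the checkpoint name, image path, and fusion comment are illustrative assumptions rather than the setup of any specific model in the table.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

instruction = "pick up the red block and place it in the drawer"
image = Image.open("tabletop.png")  # placeholder path for the current camera frame

inputs = processor(text=[instruction], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

text_emb = outputs.text_embeds    # (1, 512) language-conditioning vector
image_emb = outputs.image_embeds  # (1, 512) visual feature vector
# A language-conditioned policy would typically fuse these two embeddings
# (concatenation, FiLM, cross-attention, ...) before predicting an action.
```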