This repository compiles seminal and cutting-edge papers on the application of video in robotics. It is continually updated, and contributions are welcome: if you come across a relevant paper that should be included, please open an issue.
- Towards Generalist Robot Learning from Internet Video: A Survey
  - Robert McCarthy, Daniel C.H. Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G. Thuruthel, Zhibin Li
  - Paper
- Learning by Watching: A Review of Video-based Learning Approaches for Robot Manipulation
  - Chrisantus Eze, Christopher Crick
  - Paper
- Let Me Help You! Neuro-Symbolic Short-Context Action Anticipation
- Generalization with Lossy Affordances: Leveraging Broad Offline Data for Learning Visuomotor Tasks
- VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
- SOAR: Autonomous Improvement of Instruction Following Skills via Foundation Models
- HRP: Human Affordances for Robotic Pre-Training
  - Mohan Kumar Srirama, Sudeep Dasari, Shikhar Bahl, Abhinav Gupta
  - Paper
  - Robotics: Science and Systems 2024
  - CMU
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
- This&That: Language-Gesture Controlled Video Generation for Robot Planning
- Policy Composition From and For Heterogeneous Robot Learning
- Simultaneous Localization and Affordance Prediction for Tasks in Egocentric Video
  - Zachary Chavis, Hyun Soo Park, Stephen J. Guy
  - Paper
  - Department of Computer Science and Engineering, University of Minnesota
- Flow as the Cross-domain Manipulation Interface
- R+X: Retrieval and Execution from Everyday Human Videos
- Octo: An Open-Source Generalist Robot Policy
  - Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, Sergey Levine
  - Paper
  - Website
  - Code
  - UC Berkeley || Stanford || Carnegie Mellon University || Google DeepMind
- RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation
- Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
- OpenVLA: An Open-Source Vision-Language-Action Model
  - Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn
  - Paper
  - Website
  - Code
  - Stanford University || UC Berkeley || Toyota Research Institute || Google DeepMind || Physical Intelligence || MIT
- Video Language Planning
- Manipulate-Anything: Automating Real-World Robots using Vision-Language Models
- Dreamitate: Real-World Visuomotor Policy Learning via Video Generation
- Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
- Large-Scale Actionless Video Pre-Training via Discrete Diffusion for Efficient Policy Learning
  - Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, Xuelong Li
  - Paper
  - Website
  - Hong Kong University of Science and Technology || Shanghai Artificial Intelligence Laboratory || Shanghai Jiao Tong University || Northwestern Polytechnical University || Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd.
- ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data
- Vision-based Manipulation from Single Human Video with Open-World Object Graphs
- Learning to Act from Actionless Videos through Dense Correspondences
- Track2Act: Predicting Point Tracks from Internet Videos Enables Diverse Zero-shot Manipulation
- DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
  - Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Youngwoon Lee, Marius Memmel, Sungjae Park, Ilija Radosavovic, Kaiyuan Wang, Albert Zhan, Kevin Black, Cheng Chi, Kyle Beltran Hatch, Shan Lin, Jingpei Lu, Jean Mercat, Abdul Rehman, Pannag R Sanketi, Archit Sharma, Cody Simpson, Quan Vuong, Homer Rich Walke, Blake Wulfe, Ted Xiao, Jonathan Heewon Yang, Arefeh Yavary, Tony Z. Zhao, Christopher Agia, Rohan Baijal, Mateo Guaman Castro, Daphne Chen, Qiuyu Chen, Trinity Chung, Jaimyn Drake, Ethan Paul Foster, Jensen Gao, David Antonio Herrera, Minho Heo, Kyle Hsu, Jiaheng Hu, Donovon Jackson, Charlotte Le, Yunshuang Li, Kevin Lin, Roy Lin, Zehan Ma, Abhiram Maddukuri, Suvir Mirchandani, Daniel Morton, Tony Nguyen, Abigail O'Neill, Rosario Scalise, Derick Seale, Victor Son, Stephen Tian, Emi Tran, Andrew E. Wang, Yilin Wu, Annie Xie, Jingyun Yang, Patrick Yin, Yunchu Zhang, Osbert Bastani, Glen Berseth, Jeannette Bohg, Ken Goldberg, Abhinav Gupta, Abhishek Gupta, Dinesh Jayaraman, Joseph J Lim, Jitendra Malik, Roberto Martín-Martín, Subramanian Ramamoorthy, Dorsa Sadigh, Shuran Song, Jiajun Wu, Michael C. Yip, Yuke Zhu, Thomas Kollar, Sergey Levine, Chelsea Finn
  - Paper
  - Website
  - See the website for additional resources.
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models
  - Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, Anthony Brohan, Antonin Raffin, Archit Sharma, Arefeh Yavary, Arhan Jain, Ashwin Balakrishna, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Blake Wulfe, Brian Ichter, Cewu Lu, Charles Xu, Charlotte Le, Chelsea Finn, Chen Wang, Chenfeng Xu, Cheng Chi, Chenguang Huang, Christine Chan, Christopher Agia, Chuer Pan, Chuyuan Fu, Coline Devin, Danfei Xu, Daniel Morton, Danny Driess, Daphne Chen, Deepak Pathak, Dhruv Shah, Dieter Büchler, Dinesh Jayaraman, Dmitry Kalashnikov, Dorsa Sadigh, Edward Johns, Ethan Foster, Fangchen Liu, Federico Ceola, Fei Xia, Feiyu Zhao, Felipe Vieira Frujeri, Freek Stulp, Gaoyue Zhou, Gaurav S. Sukhatme, Gautam Salhotra, Ge Yan, Gilbert Feng, Giulio Schiavi, Glen Berseth, Gregory Kahn, Guangwen Yang, Guanzhi Wang, Hao Su, Hao-Shu Fang, Haochen Shi, Henghui Bao, Heni Ben Amor, Henrik I Christensen, Hiroki Furuta, Homanga Bharadhwaj, Homer Walke, Hongjie Fang, Huy Ha, Igor Mordatch, Ilija Radosavovic, Isabel Leal, Jacky Liang, Jad Abou-Chakra, Jaehyung Kim, Jaimyn Drake, Jan Peters, Jan Schneider, Jasmine Hsu, Jay Vakil et al. (192 additional authors not shown)
  - Paper
  - Website
  - Code
- BridgeData V2: A Dataset for Robot Learning at Scale
- Ego4DSounds: A diverse egocentric dataset with high action-audio correspondence
- RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot