CoRL 2019 Conference Summary


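CoRL (Conference on Robot Learning) 2019 was held in Osaka, Japan, from October 29 to November 1, 2019. This was the third edition. The second was held in Zurich, Switzerland, and I was lucky enough to attend that one as well, but back then I was a complete newbie: my research area was SLAM, I knew nothing about reinforcement learning, and I had no idea what I wanted to work on, so I cast a wide net and listened to everything. This year I have a clearer picture of what I want to do, so I listened with more focus.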

Acceptance rate

  • 2018: 75 of 237 submitted papers were accepted, an acceptance rate of 31.6%
  • 2019: 110 of 398 submitted papers were accepted, an acceptance rate of 27.6%




  • Perception and manipulation
  • Planning and control
  • Reinforcement learning
  • Imitation learning
  • Human-robot interaction


Best Paper Award

3B-03: A Divergence Minimization Perspective on Imitation Learning Methods (Imitation learning) Seyed Kamyar Seyed Ghasemipour, Richard Zemel, Shixiang Gu

Best Paper Award Finalist

3F-04: Disentangled Relational Representations for Explaining and Learning from Demonstration (Human-Robot Interaction) Yordan Hristov, Daniel Angelov, Michael Burke, Alex Lascarides, Subramanian Ramamoorthy

Best System Paper Award

1B-03: Learning to Manipulate Object Collections Using Grounded State Representations (Perception and manipulation) Matthew Wilson, Tucker Hermans

Best System Paper Award Finalist

1C-07: PyRoboLearn: A Python Framework for Robot Learning Practitioners (poster) Brian Delhaisse, Leonel Rozo, Darwin G. Caldwell

Best Presentation Award

2B-02: Bayesian Optimization Meets Riemannian Manifolds in Robot Learning (Reinforcement learning) Noémie Jaquier, Leonel Rozo, Sylvain Calinon, Mathias Bürger




  • Connectivity Guaranteed Multi-robot Navigation via Deep Reinforcement Learning
  • Predictive Safety Network for Resource-constrained Multi-agent Systems
  • PIC: Permutation Invariant Critic for Multi-Agent Deep Reinforcement Learning
  • Macro-Action-Based Deep Multi-Agent Reinforcement Learning
  • Multi-Agent Reinforcement Learning with Multi-Step Generative Models [about multi-arm coordination, but the idea is interesting and worth a look]
  • Learning from My Partner's Actions: Roles in Decentralized Robot Teams
  • Graph neural networks
    • Learning Decentralized Controllers for Robot Swarms with Graph Neural Networks
    • Graph Policy Gradients for Large Scale Robot Control


  • Visual navigation
    • Learning to Navigate Using Mid-level Visual Priors
    • Combining Optimal Control and Learning for Visual Navigation in Novel Environments
    • Learning Navigation Subroutines from Egocentric Videos
  • Language-and-vision navigation
    • Conditional Driving from Natural Language Instructions
    • Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight
    • Vision-and-Dialog Navigation
    • Language-guided Semantic Mapping and Mobile Manipulation in Partially Observable Environments


  • MAME: Model-Agnostic Meta-Exploration
  • Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning


  • An Online Learning Procedure for Feedback Linearization Control without Torque Measurements
  • Model-based planning with energy based models
  • On-Policy Robot Imitation Learning from a Converging Supervisor
  • Data Efficient Reinforcement Learning for Legged Robots


  • TuneNet: One-Shot Residual Tuning for System Identification and Sim-to-Real Robot Task Transfer
  • Multi-Agent Manipulation via Locomotion using Hierarchical Sim2Real


  • Graph-Structured Visual Imitation


  • Two Stream Networks for Self-supervised Ego-Motion Estimation
  • Robust Semi-Supervised Monocular Depth Estimation with Reprojected Distances


MAME: Model-Agnostic Meta-Exploration

Authors: Swaminathan Gurumurthy, Sumit Kumar, Katia Sycara

Affiliation: Robotics Institute, Carnegie Mellon University (CMU)


Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning

Authors: Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, Sergey Levine

Affiliation: UC Berkeley


  • Multi-task reinforcement learning promises a single policy that can solve multiple tasks more efficiently than learning the tasks individually. The current multi-task RL benchmarks are DM Lab and Atari, which have two drawbacks:
    • limited to game settings, with no realistic use cases
    • little efficiency to be gained on disjoint games
  • Meta-RL promises to acquire new tasks efficiently by leveraging experience from past tasks. The current meta-RL benchmarks are the usual MuJoCo setups, which have three drawbacks:
    • task distributions are very narrow
    • adaptation is only to new variations of the same task
    • they are better characterized as "multi-goal" benchmarks
  • So the goal of this paper is to enable meta-RL to generalize to new, distinct skills and to evaluate that generalization performance
    • large, diverse task set --> generalization to new tasks
  • What sets Meta-World apart: it is a new multi-task and meta-RL benchmark with a wide range of tasks, built to study how meta-RL accelerates the acquisition of new tasks.
    • It is still a manipulation benchmark (50 robotic manipulation tasks).
    • Meta-RL algorithms are evaluated in five different modes.

PyRoboLearn: A Python Framework for Robot Learning Practitioners

Authors: Brian Delhaisse, Leonel Rozo, Darwin G. Caldwell


On-Policy Robot Imitation Learning from a Converging Supervisor


  • Future work
    • develop finite-sample and worst-case guarantees
    • study trade-offs between supervision quality and quantity
    • experiments on higher-dimensional tasks with image-space observations

Model-based planning with energy based models



  • Problem the paper tries to tackle

    • learn a model for planning
    • use an energy-based model (EBM)
      • good online model learning
      • allows efficient planning as inference
      • natural exploration for learning models
  • Trajectory Modeling

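    • train an EBM to model state transitions $s_t, s_{t+1}$

    • Note to self: I did not understand the algorithm, and asking at the venue did not clear it up either. I will come back and fill this in once I have brushed up on EBMs.

  • Experiments were run in simulated environments: Particles, Maze, Reacher. There is also an extra goal-free exploration task.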
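For reference, the generic energy-based form of a transition model looks like the following. This is standard EBM background only, not the paper's specific algorithm (which these notes do not cover); the symbols $E_\theta$ and $Z_\theta$ are notation introduced here.

```latex
$$
p_\theta(s_{t+1} \mid s_t) = \frac{\exp\big(-E_\theta(s_t, s_{t+1})\big)}{Z_\theta(s_t)},
\qquad
Z_\theta(s_t) = \int \exp\big(-E_\theta(s_t, s')\big)\, \mathrm{d}s'
$$
```

Low energy corresponds to likely transitions. Under this view, "planning as inference" can be read as searching for state sequences whose summed transition energy is low.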

Data Efficient Reinforcement Learning for Legged Robots


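Connectivity Guaranteed Multi-robot Navigation via Deep Reinforcement Learning

Authors: Juntong Lin, Xuyun Yang, Peiwei Zheng, Hui Cheng


Content: Roughly, the task is to keep multiple robots connected (Euclidean distance below some threshold) while they navigate. The paper proposes a CSPF (constraint satisfying parametric function). The simulation environment is the Virtual policy extended environment (VP2E), which seems to be much like ours. They also ran real-world experiments (the setup is very similar to ours: OptiTrack for global localization and a TX2 as the onboard processor, although our vehicles are smaller; theirs are RoboMaster robots, which are fairly large). The scenes and the task both look fairly simple, and the arena is small: the robots reach the goal after a few steps, so the advantage of the algorithm is not obvious in the real-world setting.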
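To make the connectivity constraint concrete, here is a minimal sketch of how one might check it. It assumes that "connected" means the communication graph is connected when two robots are linked whenever their Euclidean distance is below a threshold; the notes do not say whether the paper uses this or a stricter pairwise condition, and the function name `is_connected` and parameter `d_max` are made up for illustration.

```python
import math


def is_connected(positions, d_max):
    """Return True if the distance-threshold graph over `positions` is connected.

    positions: list of (x, y) tuples, one per robot.
    d_max: hypothetical communication threshold; an edge exists when the
           Euclidean distance between two robots is strictly below it.
    """
    n = len(positions)
    if n == 0:
        return True
    seen = {0}
    stack = [0]
    # Depth-first search starting from robot 0.
    while stack:
        i = stack.pop()
        for j in range(n):
            if j not in seen and math.dist(positions[i], positions[j]) < d_max:
                seen.add(j)
                stack.append(j)
    return len(seen) == n


chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
broken = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(is_connected(chain, 1.5), is_connected(broken, 1.5))
```

A check like this could serve as a constraint signal during training or as an evaluation metric; how the paper actually enforces it through the CSPF is not covered in these notes.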

Macro-Action-Based Deep Multi-Agent Reinforcement Learning


An Online Learning Procedure for Feedback Linearization Control without Torque Measurements

Affiliation: Sapienza University of Rome


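  • Background

    • Model-based control requires an accurate model.
    • Estimating the dynamic parameters of the system is challenging, especially when joint friction and payloads are taken into account.
    • Identification is done offline and has to be repeated over time.
    • Model identification draws on many techniques: regression of dynamic parameters, parametric and non-parametric learning.
  • Motivation

    • Traditional methods

      • offline
      • no online adjustment of the dynamic parameters
      • require force measurements
      • noisy sensor data has to be filtered, which degrades performance
    • Proposed method

      • online: the robot improves its performance while executing the task
      • no force measurements needed (encoders only)
  • Steps of the online algorithm:

    • Start from an initial reference trajectory.
    • Apply the desired acceleration computed with the nominal model.
    • Measure the state actually reached, which will differ from the ideal one (unmodeled dynamics).
    • Compute the acceleration the real system actually produced in reaching the new state.
    • Use the nominal model and the acceleration error to reconstruct the model mismatch.
  • Flowchart of the online learning procedure

    • Simulation experiments on a 7-DOF manipulator; no real-robot experiments.

    • Main limitations:

      • No underactuation: a fully feedback linearizable system is required.
      • No static contact: a change in the joint velocities and positions is required.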
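The online steps listed above can be illustrated on a toy 1-DOF joint. This is only a sketch under my own assumptions, not the authors' implementation: the true plant is `m_true * qdd + b_true * qd = u`, the nominal model knows only a (wrong) inertia `m_nom`, and the measured acceleration is taken as given (in practice it would be estimated from encoder readings). All names and numbers are hypothetical.

```python
def nominal_torque(a_des, m_nom):
    """Feedback-linearizing command from the nominal model: u = m_nom * a_des."""
    return m_nom * a_des


def true_acceleration(u, qd, m_true, b_true):
    """Toy 'real' plant with unmodeled viscous friction: m*qdd + b*qd = u."""
    return (u - b_true * qd) / m_true


def mismatch_sample(a_des, qdd_meas, m_nom):
    """Reconstruct the model mismatch from the nominal model and the acceleration error.

    The applied torque u = m_nom * a_des produced qdd_meas, while the nominal
    model would need m_nom * qdd_meas for that acceleration, so the missing
    torque is m_nom * (a_des - qdd_meas). No torque sensor is involved.
    """
    return m_nom * (a_des - qdd_meas)


# Assumed toy numbers.
m_nom, m_true, b_true = 1.0, 2.0, 0.5
qd, a_des = 0.4, 3.0

u = nominal_torque(a_des, m_nom)                 # command from the nominal model
qdd = true_acceleration(u, qd, m_true, b_true)   # what the plant actually does
delta = mismatch_sample(a_des, qdd, m_nom)       # one training sample for the mismatch
print(u, qdd, delta)
```

In this toy case the reconstructed sample equals the true unmodeled torque `(m_true - m_nom) * qdd + b_true * qd`. A regressor fitted online to such samples could then be added to the nominal command, which is the general idea of the procedure as I understood it.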

Learning from My Partner's Actions: Roles in Decentralized Robot Teams

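Content: This paper takes a very unusual angle. The setting is two robot arms cooperating on a task, where each arm takes its partner's action into account when deciding, i.e. $\pi_1(a_1 | s_1, a_2)$. Understanding the meaning behind a teammate's actions is hard: a robot can have many different reasons for choosing an action. The talk mentioned a paper, "what do you think I think you think", whose idea is to reason about strategies. This paper argues that if the team is divided into roles, and the actions of each role carry a distinct meaning, then taking the teammate's actions into account becomes feasible. In the task above, one arm is the speaker and the other is the listener: the speaker acts normally, and the listener's policy is $\pi_2(a_2 | s_2, a_1)$. But that is not enough!

  • Robots also need to switch roles (some teams become unstable under fixed roles)
  • How to switch roles (switch at a predefined frequency)
  • How effective roles are (as the frequency increases, the decentralized team converges to the centralized team)
  • What the speaker should do (good speakers exaggerate to convey information; my guess is that this means emphasizing the useful information)


  • If the two arms are controlled independently, the opposing force keeps growing; with roles, it stays at a low level.
  • Role-switching frequency: the higher the frequency, the higher the success rate. Switching in real time is best, which is effectively centralized control. x-axis: number of obstacles; y-axis: success rate.
  • Dynamic roles: switch at a fixed frequency. The higher the frequency, the better.
  • Static roles with different pairings: speaker-speaker succeeds less often than speaker-listener. x-axis: number of obstacles; y-axis: success rate.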
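A tiny sketch of the speaker/listener structure with role switching at a predefined period. The two linear policies are placeholders I made up purely to show the information flow (the listener sees the speaker's action, the speaker does not see the listener's); they are not the controllers from the paper.

```python
def roles_at(step, period):
    """Roles of (arm 1, arm 2) at a given step; roles swap every `period` steps."""
    first_is_speaker = (step // period) % 2 == 0
    return ("speaker", "listener") if first_is_speaker else ("listener", "speaker")


def speaker_policy(s):
    # Hypothetical policy: acts on its own observation only.
    return -0.5 * s


def listener_policy(s, partner_action):
    # Hypothetical policy: conditions on its own observation AND the partner's action.
    return -0.5 * s + 0.5 * partner_action


def team_step(s1, s2, step, period):
    """Compute both actions; the current speaker moves first, the listener reacts."""
    role1, _ = roles_at(step, period)
    if role1 == "speaker":
        a1 = speaker_policy(s1)
        a2 = listener_policy(s2, a1)
    else:
        a2 = speaker_policy(s2)
        a1 = listener_policy(s1, a2)
    return a1, a2


for t in range(4):
    print(t, roles_at(t, period=2), team_step(1.0, 2.0, t, period=2))
```

Shrinking `period` toward 1 makes the two arms exchange the speaker role almost every step, which matches the remark that high-frequency switching approaches centralized control.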


  • No human-robot experiments. (The authors mentioned that this method is actually well suited to human-robot collaboration.)
    • learn what role humans use
    • influence humans for mutual adaptation


  • When teams share control, communication is necessary.

  • Humans implicitly communicate through actions.

  • This paper:

    • Formalized shared control as a learning problem
    • Introduced roles for implicit communication
    • Explained how roles work in theory and practice

Worst Cases Policy Gradients



  • WCPG extends actor-critics to optimize for risk-sensitive criterion

  • increases generalization and robustness

  • Put together, the algorithm is distributional RL + CVaR + DDPG
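Since CVaR is the risk-sensitive ingredient here, a short sketch of how it can be computed may help. It assumes the critic outputs a Gaussian return distribution (mean and standard deviation), for which the lower-tail CVaR has a closed form; whether this matches the paper's exact parameterization is not something these notes confirm. Function names are made up.

```python
from statistics import NormalDist


def gaussian_cvar(mean, std, alpha):
    """Expected return over the worst alpha-fraction of outcomes of N(mean, std^2).

    Closed form for the lower tail: mean - std * pdf(z_alpha) / alpha,
    where z_alpha is the alpha-quantile of the standard normal.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    standard = NormalDist()
    z_alpha = standard.inv_cdf(alpha)
    return mean - std * standard.pdf(z_alpha) / alpha


def empirical_cvar(returns, alpha):
    """Sample-based counterpart: average of the lowest alpha-fraction of returns."""
    k = max(1, int(alpha * len(returns)))
    worst = sorted(returns)[:k]
    return sum(worst) / k


print(gaussian_cvar(0.0, 1.0, 0.5))
print(empirical_cvar([1.0, 2.0, 3.0, 4.0], 0.5))
```

A small `alpha` focuses the objective on the worst outcomes, while `alpha` close to 1 approaches the ordinary expected return.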


Graph Policy Gradients for Large Scale Robot Control

Content: This paper uses graph neural networks to control large drone swarms in a decentralized way. Designing analytic decentralized controllers for desired swarm behavior is challenging, except for simple cases. The challenges of applying RL to multi-robot control are:

  • Credit attribution with a centralized reward scheme

  • Robots only partially sense their environment

  • Existing methods do not scale beyond 3-5 robots

  • A large number of training samples is required


However, training a large number of robots is still the bottleneck, so this paper exploits permutation invariance to reduce training time.

Using permutation invariance has two implications:

  • Keeping the graph fixed during an episode reduces the number of training episodes required, even as the robots evolve in space and time during training.

  • Filters learned with a small number of robots can be extended to a larger swarm, provided the local topology is constant.

  • Caveats

    • assumes the robots have perfect state information

    • assumes ideal, noise-free control

  • Future work

    • homogeneous robots -> heterogeneous robots

    • other policy optimizers

  • Summary

    • GPG provides on-policy, continuous control for large numbers of robots
    • exploiting local structure gives high sample efficiency
    • zero-shot policy transfer from n to N robots (n << N)
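The two implications above can be checked numerically on a plain polynomial graph filter, the basic building block of graph convolutional layers. This is my own minimal sketch, not the GPG code: `graph_filter`, the random graph, and all sizes are assumptions. Note that for per-robot outputs the property is strictly "equivariance" (relabeling the robots relabels the outputs in the same way), which is what the talk referred to as permutation invariance.

```python
import numpy as np


def graph_filter(S, X, H):
    """Polynomial graph filter: Y = sum_k S^k X H[k].

    S: (n, n) graph shift operator (here an adjacency matrix).
    X: (n, f_in) per-robot features.
    H: list of (f_in, f_out) weight matrices, shared by all robots.
    """
    Y = np.zeros((X.shape[0], H[0].shape[1]))
    Z = X
    for Hk in H:
        Y += Z @ Hk   # contribution of the k-hop aggregated features
        Z = S @ Z     # aggregate one more hop
    return Y


rng = np.random.default_rng(0)
n, f_in, f_out = 5, 3, 2
upper = np.triu(rng.random((n, n)) < 0.5, 1)
S = (upper + upper.T).astype(float)              # random undirected graph
X = rng.normal(size=(n, f_in))
H = [rng.normal(size=(f_in, f_out)) for _ in range(3)]

# Relabel the robots with a random permutation matrix P.
P = np.eye(n)[rng.permutation(n)]
Y = graph_filter(S, X, H)
Y_perm = graph_filter(P @ S @ P.T, P @ X, H)
print(np.allclose(Y_perm, P @ Y))

# The same weights H apply unchanged to a much larger swarm.
N = 50
S_big = np.zeros((N, N))
X_big = rng.normal(size=(N, f_in))
print(graph_filter(S_big, X_big, H).shape)
```

Because the weights `H` do not depend on the number of robots, nothing has to be retrained when moving from n to N; whether the transferred policy performs well still depends on the local topology staying similar, as noted above.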

Learning Decentralized Controllers for Robot Swarms with Graph Neural Networks

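Affiliation: Penn Engineering


  • Flocking

    • The robots are controlled through acceleration in 2D; position r and velocity v are updated, and the communication radius is R.
    • Local observations let the agents
      • align velocities
      • maintain regular spacing
    • Each agent can observe the positions and velocities of its neighbors.
  • Delayed Aggregation GNN

    • All robots share the same local policy.

    • Aggregation helps when communication is limited

    • The simulation environment is AirSim.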
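The flocking setup can be sketched as a double integrator with a communication radius. The controller below is a hand-written velocity-alignment rule used only to illustrate the state, the neighborhood, and the update; it is not the learned GNN controller from the paper, and the names, gain, and time step are assumptions.

```python
import numpy as np


def neighbor_mask(r, R):
    """Boolean (n, n) matrix: True where two distinct agents are within radius R."""
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    return (dist < R) & ~np.eye(len(r), dtype=bool)


def alignment_accel(r, v, R, gain=1.0):
    """Acceleration pulling each agent's velocity toward its neighbors' mean velocity."""
    A = neighbor_mask(r, R).astype(float)
    degree = A.sum(axis=1, keepdims=True)
    # Agents with no neighbors keep their own velocity as the "mean".
    mean_v = np.divide(A @ v, degree, out=v.copy(), where=degree > 0)
    return gain * (mean_v - v)


def step(r, v, a, dt):
    """Double-integrator update: acceleration changes velocity, velocity changes position."""
    v_new = v + dt * a
    r_new = r + dt * v_new
    return r_new, v_new


r = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
v = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
a = alignment_accel(r, v, R=2.0)
r_next, v_next = step(r, v, a, dt=0.1)
print(a)
print(v_next)
```

Each agent uses only information from agents within R, which is the same locality constraint the learned decentralized controller has to respect.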


Kenji Doya

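Topic: Reinforcement Learning in Machines and the Brain

Affiliation: Okinawa Institute of Science and Technology


The first part of the talk covered reinforcement learning for robot control. His group actually started working on robot control very early (for example learn to walk (185) and learn to stand up (2001), as well as his student Paavo Parmas using reinforcement learning on a self-balancing vehicle, learn to bounce up and balance; my impression is that all of this work is online learning driven directly by the TD error). He summarized the differences between model-based and model-free RL: model-free RL is hard to use on real robots, mainly because it is slow.

On-policy methods are stable but sample-inefficient; off-policy methods are sample-efficient but unstable; hence actor-critic.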
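As a reminder of what "driven directly by the TD error" means, here is the textbook tabular TD(0) update. It is a generic sketch, not code from any of the works mentioned in the talk, and the tiny value table is made up.

```python
def td_error(reward, v_s, v_next, gamma, done=False):
    """Temporal-difference error: r + gamma * V(s') - V(s), with no bootstrap at episode end."""
    bootstrap = 0.0 if done else gamma * v_next
    return reward + bootstrap - v_s


def td0_update(V, s, reward, s_next, alpha, gamma, done=False):
    """One online TD(0) step on a value table V (dict: state -> value). Returns the TD error."""
    delta = td_error(reward, V[s], V[s_next], gamma, done)
    V[s] += alpha * delta
    return delta


V = {0: 0.0, 1: 1.0}
delta = td0_update(V, s=0, reward=0.5, s_next=1, alpha=0.1, gamma=0.9)
print(delta, V[0])
```

In an actor-critic method the same error signal is reused to update the policy, which is how the critic helps stabilize learning.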



Chelsea Finn


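Content: Finn began with the origins of her PhD topic and gave a learning example: a robot learns one task in one environment, starting from scratch, yet the humans end up doing far more work than the robot (picking up and placing balls, collecting data). Carrying on this way is not viable (it relies on detailed supervision and guidance). If we want to build robots that can understand and interact with the world, we need to rethink things from the outset. She showed a clip of a baby rolling around a room and playing with all kinds of toys, which led to the key question: Can we learn reusable models from raw sensor inputs in diverse environments with minimal supervision?

  • Sharing datasets across labs: step 1: collect a dataset; step 2: evaluate if it's useful (the RoboNet work)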

Anca Dragan

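Topic: How to assume people are (approximately) optimal and get away with it

Content: Modeling human behavior. The task: a drone has to fly through a door while avoiding a person. If the person keeps walking in a consistent way, the drone can detect this, build a model, and adjust its route to go around the person. Then a cup of coffee is suddenly spilled on the floor, and the person has to change course midway to avoid it (noisy-rationality: too narrow). For the drone to keep working, it needs to do something extra: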

  • how do you leverage the model when it's right, but become conservative when it's wrong?

  • Add a rationality coefficient $\beta$ to estimate apparent rationality. Intuitively, if the human appears too suboptimal to the model, be skeptical of the model.

  • human rationality = model confidence

  • mind the gap (real human behavior and noisy rationality assumption)
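A small sketch of how a rationality coefficient can double as model confidence. It assumes the common Boltzmann (noisy-rational) model, where the human picks action a with probability proportional to exp(β·Q(a)), and does a Bayesian update over a grid of candidate β values. The Q-values, the grid, and the function names are made up for illustration; the talk's exact formulation is not in these notes.

```python
import math


def boltzmann(q_values, beta):
    """Noisy-rational action distribution: P(a) proportional to exp(beta * Q(a))."""
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def update_beta_posterior(prior, betas, q_values, action):
    """Bayes update of the belief over beta after observing one human action."""
    unnormalized = [p * boltzmann(q_values, b)[action] for p, b in zip(prior, betas)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


betas = [0.1, 10.0]   # hypothetical grid: "model is unreliable" vs "model is reliable"
prior = [0.5, 0.5]
q_values = [1.0, 0.0]  # hypothetical: action 0 is what the model expects

after_expected = update_beta_posterior(prior, betas, q_values, action=0)
after_surprise = update_beta_posterior(prior, betas, q_values, action=1)
print(after_expected)
print(after_surprise)
```

When the observed action looks suboptimal under the model, the posterior shifts toward the low β, the predicted human motion becomes more spread out, and a planner using it would behave more conservatively.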

Angela Schoellig

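Topic: Machine Learning in the Closed Loop: Safety and Performance Guarantees for Robot Learning

Affiliation: University of Toronto


  • There are two main approaches to robotics: control and machine learning. Each has its strengths and weaknesses; machine learning approaches do not consider the worst case and are not safe enough. A system must be able to learn and adapt while remaining safe and data-efficient, so models and data have to be combined.
  • She divided the problem space into several regions, and you can check which one your own task falls into. When sufficient prior information is available, model-based control and planning works for both multi-robot and single-robot settings, however hard the task level is.
  • Related papers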

    • [ICRA10]learn triple flips

    • [ICRA16, CDC17,RAL18,ECC19]Multi-task learning with training phase

    • [ECC15, IJRR16, JFR16, RAL19]Probabilistic Learning Models for Continuous Improvement
  • Robot Learning: once the setting gets harder along every axis, special techniques are needed, such as Fast Adaptation & Long-term Learning

  • Related papers

    • [ICRA17, IROS18, RAL18, JACSP19, RAL19]Dealing with Changing Dynamics

    • [RAL18]Multi-Robot, Multi-Task Transfer

  • Robot Control & Decision Making

Jan Peters

Topic: Learning motor skills on real robot systems


  • The scientific-robotics worldview

  • The deep learning worldview

  • What robot learning needs to solve

  • Research