[TOC]

CoRL 2019 会议总结

1.会议简介

2019年的CoRL(Conference on Robot Learning)会议在日本大阪举行，时间是2019.10.29-11.1。这是第3届，第2届在瑞士苏黎世举办，我也有幸去参加了一下，但当时是个彻头彻尾的萌新，科研方向是SLAM，对于强化学习这件事根本不了解，也不知道自己要做什么，就广撒网==什么都在听。今年对于自己要做的事逐渐清晰，所以听的时候有一些侧重。

大阪要比北京暖和很多（白天有10多度度甚至20多度），温差也相对小一些，一落地就立马冬装换秋装，转了2趟地铁线，终于在天刚擦黑的时候到达了酒店（Senri Hankyu Hotel Osaka）。

Senri Hankyu Hotel Osaka

酒店距离本次开会的地点（Senri Life Science Center）大概800m的距离，距离千里中央站（Senri Chuo Station）有5分钟步行路程。整体都不错，早晨有自助早餐（chagall餐厅）或者日本料理（Tsuruya餐厅），有可以自由交流的独立空间，如果在夏季去开会的话还可以享受室外游泳池。先上几张酒店图，个人评价还不错~

Senri Life Science Center

会议的地点Senri Life Science Center在新大阪，感觉比较偏，离繁华都市（比如道顿堀）有30分钟的地铁距离，开会主要集中在Senri Life Science Center的5层。

会议容量是420人，采取single track形式，因为人数比较多，所以组织方设置了2个大厅，2号大厅有同步直播，听和看是没有问题的，就是么得办法提问。学生注册会议会有优惠，不过去取材料的时候要记得带上学生卡，组织方需要看一下。中午会提供lunch bag，比较凉，味道还不错。每天晚上都有晚宴，不过我没去参加，去繁华都市体验了一下大阪的特色美食。

CoRL每年都会有一个WiML(Women in Mechaine Learning)的小会议，主要面向女性，不过男性也可以参加，主要是大家互相认识一下。这次安排在第一天会议的中午，Finn大佬来做了报告，不过没有说到元学习的东西，主要是分享了自己博士课题的背景和相关的一些工作。

除了Finn做了报告之外，还有Anca Dragan，她是CoRL的常客，去年也见到她了。她的报告是无人机如何建模人并且躲避他。

sony公司做了一个展览，3D vision sensing摄像头自带处理器，直接产生一个八叉树地图，还带地图拼接的那种，实时性也很高，做SLAM的可以关注下。
还有这个机械手也很强，只有漏出来的那一个小电机，全是机械结构实现的，能够抓取不同种类的物体，我去跟他握了一下手，劲超级大，但是不感觉突兀，这才是真机械！我难以置信还反复问他你这里没有任何处理器？那个人说只有机械结构，只能点赞。

2.文章概况

Acceptance rate

2018：237 篇文章中接收了75篇，接收率是31.6%
2019：358篇文章中接收了120篇，接收率是27.6%

Proceedings

https://drive.google.com/drive/folders/17I7wD1WrKRNqaRFjEIzeNrXkA0FvHoNb

Program

https://www.robot-learning.org/home/program

Session

Perception and manipulation
Planning and control
Reinforcement learning
Imitation learning
Human-robot interaction

Award

Best Paper Award

3B-03: A Divergence Minimization Perspective on Imitation Learning Methods (Imitation learning) Seyed Kamyar Seyed Ghasemipour, Richard Semel, Shixiang Gu

Best Paper Award Finalist

3F-04: Disentangled Relational Representations for Explaining and Learning from Demonstration (Human-Robot Interaction) Yordan Hristov, Daniel Angelov, Michael Burke, Alex Lascarides, Subramanian Ramamoorthy

Best System Paper Award

1B-03: Learning to Manipulate Object Collections Using Grounded State Representations (Perception and manipulation) Matthew Wilson, Tucker Hermans

Best System Paper Award Finalist

1C-07: PyRoboLearn: A Python Framework for Robot Learning Practitioners Code (poster) Brian Delhaisse, Leonel Rozo, Darwin G. Caldwell

Best Presentation Award

2B-02: Bayesian Optimization Meets Riemannian Manifolds in Robot Learning (Reinforcement learning) Noemie Jaquier, Leonel Rozo, Sylvain Calinon, Mathias Burger

3.文章列表

整理了一下跟自己课题组相关的方向，题目先甩上来。发现一个特点，基本每篇文章都有附件，要么是视频，要么是代码，视频偏多一些。这可能也反映了作者的态度吧，以后投文章也要记得拍个视频出来，毕竟做机器人的，没有实物等于白说。

多机

Connectivity Guaranteed Multi-robot Navigation via Deep Reinforcement Learning
Predictive Safety Network for Resource-constrained Multi-agent Systems
PIC: Permutation Invariant Critic for Multi-Agent Deep Reinforcement Learning
Macro-Action-Based Deep Multi-Agent Reinforcement Learning
Multi-Agent Reinforcement Learning with Multi-Step Generative Models[多机械臂协同，但是思路很有趣，可以看看]
Learning from My Partner's Actions: Roles in Decentralized Robot Teams
图神经网络
- Learning Decentralized Controllers for Robot Swarms with Graph Neural Networks
- Graph Policy Gradients for Large Scale Robot Control

视觉导航
- Learning to Navigate Using Mid-level Visual Priors
- Combining Optimal Control and Learning for Visual Navigation in Novel Environments
- Learning Navigation Subroutines from Egocentric Videos
语言-视觉导航
- Conditional Driving from Natural Language Instructions
- Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight
- Vision-and-Dialog Navigation
- Language-guided Semantic Mapping and Mobile Manipulation in Partially Observable Environments

元学习

MAME: Model-Agnostic Meta-Exploration
Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning Code

在线学习

An Online Learning Procedure for Feedback Linearization Control without Torque Measurements
Model-based planning with energy based models Code
On-Policy Robot Imitation Learning from a Converging Supervisor
Data Efficient Reinforcement Learning for Legged Robots

Sim-to-Real

TuneNet: One-Shot Residual Tuning for System Identification and Sim-to-Real Robot Task Transfer
Multi-Agent Manipulation via Locomotion using Hierarchical Sim2Real

图神经网络

Graph-Structured Visual Imitation

位姿估测

Two Stream Networks for Self-supervised Ego-Motion Estimation
Robust Semi-Supervised Monocular Depth Estimation with Reprojected Distances

4.文章解析

MAME：Model-Agnostic Meta-Exploration

作者：Swaminathan Gurumurthy Sumit Kumar Katia Sycara

组织：Robotics Institute, Carnegie Mellon University (CMU)

内容：

Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning Code

作者： Tianhe Yu, Deirdre Quillen, Zhanpeng He], Ryan Julian, Karol Hausman, Chelsea Finn, Sergey Levine

组织：UC berkeley

内容：

multi-task reinforcement learning的目的是学习一个policy，能够更有效地处理很多tasks，并且要比单独地学习task要好（promise: learn s single policy that can solve multiple tasks more efficiency than learning the tasks individually）。现在的multi-task RL的benchmark有DM Lab和Atari，有两个缺点：
- 都是游戏设定，缺少实际应用（limited to game setting and lack of realistic use cases）
- 迁移到不相关的游戏上不好使（little effciency to be gained on disjoint games）
meta-RL的目的是利用以往的experiences有效地学习新的tasks（promise: efficiently acquire new tasks by leveraging experiences from past tasks）。现在的Meta-RL的benchmark就是mujoco那些设定，有3个缺点：
- 任务的分布非常限制（task distributions are very narrow）
- 适应的是相同任务的新变化而已（adaptation to new variations of the same task）
- 称之为“multi-goal” benchmark更合适（better characterized as “multi-goal” benchmarks）
所以本文的目标是使meta-RL能够泛化新的不同的技能上并且去评价泛化性能（goal：enable meta-RL to generalize to new distinct skills and evaluate the generalization performance）
- large, diverse task set-->generalization to new tasks
Meta-world的独特之处在于：一个有很多task的多任务和元强化的数据集，为了研究元强化学习如何加速新任务的学习（A new multi-task and meta-RL benchmark with a wide range of tasks to study how meta-RL accelerates acquisition of new tasks），
- 还是抓取数据集。（50 robotics manipulation tasks）
- 用5种不同的模式评价meta-RL算法。（on five different modes）

PyRoboLearn: A Python Framework for Robot Learning Practitioners Code

作者：Brian Delhaisse, Leonel Rozo, Darwin G. Caldwell

内容：这是一篇推广文，介绍python的新库pyrobolearn。机器人有很多仿真环境，比如：ROS,GAZEBO,MuJoCo,Bullet,RBDL(mark)，仿真环境的语言是C++>Matlab>Python;机器学习的种类特别多，框架就是tensorflow和pytorch，再加个gym框架，语言是Python>C++>Matlab。那么，大一统思想自然而然出现了。PyRoboLearn从ROS订阅一些话题进各种simulator->对应world和robot->定义state和action->创建出env，定义policy->recorder（类似于跑一段数据？）和interface（传感器）->定义task。然后就开始训练和测试。overview和示例使用如下图所示：

On-Policy Robot Imitation Learning from a Converging Supervisor

内容：

- future work - develop finite-sample + worst case guarantees - study trade-offs between supervision quality and quantity - Experiments on higher-dimensional tasks with image-space observations

Model-based planning with energy based models Code

组织：MIT

内容：

要解决的问题（problem try to tackle）
- 学习一个planning的model
- 用energy-based model
  - good online model learning
  - allows efficient planning as inference
  - natural exploration for learning models
Trajectory Modeling
- train a EBM to model state transitions $s_t, s_{t+1}$
- mark 没有看懂算法，会场上去问了也么得搞懂，等我补习一下EBM回来填坑。
仿真环境做了实验，Particles, Maze, Reacher。还有一个额外的no goal的探索任务。

Data Efficient Reinforcement Learning for Legged Robots

内容：

Connectivity Guaranteed Multi-robot Navigation via Deep Reinforcement Learning Video

作者： Juntong Lin, Xuyun Yang, Peiwei zheng, Hui Cheng

组织：中山大学

内容：大概的任务是让多个机器人能够保持一定的连接（欧氏距离小于某个阈值），提出了一个CSPF（constraint satisfying parametric function），仿真环境用的是Virtual policy extended environment (VP2E) ，跟我们的好像是差不多的，然后在实际环境上也做了实验(setup跟我们很像，也是optitrack来做全局定位，车载处理器也是TX2，不过我们的车更小，他的车是robomaster，有点大。)，感觉场景比较简单，任务也比较简单，场地很小，机器人跑几步就到达目标了，算法的优势在实际环境中并不明显。

Macro-Action-Based Deep Multi-Agent Reinforcement Learning

内容：

An Online Learning Procedure for Feedback Linearization Control without Torque Measurements

组织：SAPIENZA University of ROMA

内容：

背景
- model-based control 需要精确的建模。
- 估计系统的动力学参数很有挑战性，特别是考虑关节摩擦和负载。
- 离线实现，随着时间需要重复这个过程。
- 模型辨识需要很多技术：动力学参数回归，参数化和非参数化学习
动机
- 传统方法
  - 离线
  - 没有系统动力学参数在线调整
  - 需要测量力
  - 对有噪声的传感器数据进行过滤，性能随之下降
- 我们的方法
  - 在线，机器人可以边提高性能边执行任务。
  - 不需要力测量（只需要编码器）
在线算法的步骤：
- 给定一个初始的参考轨迹
- 执行nominal model得到的理想加速度（desired acceleration）
- 测量一下当前处于什么状态，肯定是与理想的不一样（未建模的动力学 unmodeled dynamics）
- 计算达到新状态所需要的真实系统加速度
- 用nominal model和加速度误差来重建模型不匹配误差（model dismatch）
在线学习的流程图
- 在一个7自由度的机械臂上做了仿真实验，没有真实实验。
- 主要的限制在于：
  - 不支持欠驱动。fully feedback linearizable system is required (no underactuation)
  - 不支持静态接触。a change in the joint velocities and positions (no static contact)

Learning from My Partner's Actions: Roles in Decentralized Robot Teams

内容：这篇文章思路很是清奇，任务设定是2个机械臂配合完成一个任务，做决策的时候要考虑合作者的action，也就是$\pi_1(a_1 | s_1, a_2)$。理解队友的行为背后的含义是很难的：机器人会有很多不同的理由来选择行为。提到了一篇文章“what do you think I think you think”，思路是做策略推理，本文的观点是如果团队划分了角色，并且每个角色的行为有独特的含义，那么是可以考虑队友行为的。之前的任务，一个机械臂是speaker，一个是listener，speaker正常操作，listener的策略是$\pi_2(a_2 | s_2, a_1)$。但是这还不够！

机器人还需要变换角色（some teams become unstable under fixed roles）
怎么变换角色（switch at a prefined frequency）
角色多有效（as frequency increses，decentralized team converges to centralized team）
speaker要干啥（good speaker exaggerate to convey information，我猜测他想说强调有用信息吧）

做了一个真实环境的实验，几个实验结果对比。

如果两个机械臂独立控制，抵抗力就会越来越大；如果是有角色的，那么抵抗力保持在较低的水平。
换角色的频率，越高成功率越高，实时的话最高，其实就是集中控制。横轴是障碍数，纵轴是成功率
动态角色。按照一定频率切换。频率越高越好。
保持静态角色，组合不一样。speaker-speaker的成功率就不如speaker-listener组合。横轴是障碍数，纵轴是成功率

局限性：

没有在人机上进行实验。（作者提到这种方法其实用于人机合作比较好)
- 机器学习出人扮演的角色 learn what role humans use
- 影响人的行为，互相适应。influence humans for mutual adaptation

总结：

当团队分享控制的时候，通信很必要。（When team share control, communication is necessary）
人类通过action隐含着通信。（Human implicitly communicate through actions）
本文：
- 把分享控制变成了一个学习问题。（Formalized shared control as a learning problem）
- 在implicit communication 引入角色。（Introduced roles for implicit communication）
- 从理论和实践上解释角色是怎么工作的。（Explained how roles work in theory and practice）

Worst Cases Policy Gradients

内容：做自动驾驶的人可以看看，问题设置一张图就能说明白：

所以是要考虑最坏的情况该怎么办？

WCPG extends actor-critics to optimize for risk-sensitive criterion
increases generalization and robustness
算法合起来就是值分布RL+CVaR+DDPG

仿真环境就是2D交通流，2个场景是左转弯和高速公路并线，实验结果如下：

Graph Policy Gradients for Large Scale Robot Control

内容：这篇用图神经网络来做大规模无人机集群控制，要实现分布式控制，Designing analytic decentralized controllers for desired swarm behavior is challenging, except for simple cases。把RL用在多机控制上的挑战有：

集中奖励机制的信用归属？（Credit attribution with centralized reward scheme）
部分感知环境（Robots only partially sense their environment）
现有的算法大于3-5机器人就搞不定（Existing methods do not scale beyond 3-5 robots）
需要大量的训练样本（Large number of training samples required）

我们的方法：N个机器人组成一个N节点的图。

但是，训练大量的机器人还是bottleneck，本文利用排列不变性（permutation invariance）来减少训练时间。

用了permutation invariance，有2个implications：

在每一步中把图固定可以减少训练步数。（Permutation invariance can be used to reduce number of training episodes required by keeping the graph fixed during an episode even as the robots evolve in space and time during training.）
用少量机器人学出来的filter可以扩展到大规模集群，加入局部拓扑不变的话。(Permutation invariance also implies that filters learned with a small number of robots can be extended to a larger swarm if the local topology is constant.)

局限（caveats）
- 假设机器人有完美的状态信息
- 假设控制是理想的无噪声
未来工作
- 同构机器人转异构机器人（Homogeneous robots->Heterogeneous robots）
- 其他的策略优化器
总结
- GPG给大规模的机器人提供了一个on policy，连续的控制
- 通过local structure可以得到很高的样本效率
- zero-shot的策略迁移，从n到N (n << N)

Learning Decentralized Controllers for Robot Swarms with Graph Neural Networks

组织：Penn Engineering

内容：

Flocking
- 机器人的控制是用加速度，2D的，更新位置r和速度v，通信半径R。
- 局部观测使agents
  - align velocities
  - maintain regular spacing
- agent可以观测到邻居的位置和速度
Delayed Aggregation GNN
- 所有的机器人共享局部策略
- Aggregation helps when communication is limited
- 仿真环境是Airsim

5.Keynote

Kenji Doya

主题: Reinforcement learning in Meachines and the Brain

组织：Okinawa Institute Science of Technology

内容：脑科学和强化学习是共同发展的，上一张总结图，有一个工作是多巴胺作为reward在猴子身上做实验。

首先一部分工作是用强化学习来做机器人的控制，他们组实际上很早就开始做机器人控制（比如185年的learn to walk和2001年的learn to stand up，以及他的学生Pavvo Parmas用强化学习做自平衡车 learn to Bounce up and balance，感觉他们的工作都是在线学习，用TD Error直接做的），他总结了一个model-based和model-free RL的区别，model-free RL在真实机器人上很难用，主要是慢。

on-policy 稳定但是样本效率低，off-policy 样本效率高但是不稳定，所以有actor-critic。

然后他们组还在研究强化学习和脑结构的一些关系，1999年发表了文章，认为每一个脑结构都完成了特殊的功能，比如监督学习，强化学习和无监督学习，看起来他们是在小鼠和猴子身上做实验。

最后提出了自动化AI的danger：

Chelsea Finn

主题：如何加快学习

内容：Finn从自己的博士课题由来讲起，举了一个学习的例子，机器人从0开始在1个环境中学习1个任务（learn one task in one environment, starting from the scratch）,但是人做的工作比机器人要多得多（捡球放球，收集数据），这样搞下去是不可行的（rely on detailed supervision and guidance）。如果我们想要造一个能够理解世界并且与世界交互的机器人，那么我们需要从起点（outset）再好好思考一下。放了一个动图：婴儿在屋子里滚来滚去自己玩各种玩具，引出几个关键词：Can we learn reusable models from ==raw sensor inputs== in diverse environments with minimal supervision?

实验室之间共享数据集：step1: Collect a dataset step2: evaluate if it's useful (RoboNet那个工作)

Anca Dragan

主题：How to assume people are（approximately）optimal and get away with it

内容：对人的行为进行建模，任务是：一架无人机要飞过一个门，需要躲避一个人。人如果按照某个行为一直走，无人机检测出来以后，可以进行建模，然后修正自己的路线，绕开这个人。这时候地上突然撒了一杯咖啡，人走着走着就需要改变自己的路线，绕开这杯咖啡，（noisy-rationality: too narrow）无人机要继续能work的话就需要做一些操作：

how do you leverage the model when it's right, but become conservative when it's wrong?
加一个rationality coefficient $\beta$ ,来estimate apparent rationality.物理意义是如果发现人的行为跟模型不符合，对模型保持怀疑（If the human appears too soboptimal to the model, be skeptical of the model）。
human rationality = model confidence
mind the gap （real human behavior and noisy rationality assumption）

Angela Schoellig

主题：Machine learning in the Closed Loop: Safety and Performance Guarantees for Robot Learning

组织：University of Toronto

内容：

做机器人有两大方式：控制和机器学习。各有优缺点，机器学习的方式没有考虑最坏的例子，安全性也不够。系统必须可以learn和adapt，要safety和data efficiency，所以要结合model和data

划分了几个板块，大家可以对号入座，看看自己的任务在哪个板块下。对于有充分的先验信息的情况，无论任务层次有多难，多机和单机都可以采用model-based control and planning。

相关论文
- [ICRA10]learn triple flips
- [ICRA16, CDC17,RAL18,ECC19]Multi-task learning with training phase
- [ECC15, IJRR16, JFR16, RAL19]Probabilistic Learning Models for Continuous Improvement
Robot Learning:各种buff升级之后，需要做一些特殊手段，比如Fast Adaptation & Long-term Learning

相关论文
- [ICRA17, IROS18, RAL18, JACSP19, RAL19]Dealing with Changing Dynamics
- [RAL18]Multi-Robot, Multi-Task Transfer
Robot Control & Decision Making

Jan Peters

主题：Learning motor skills on real robot systems

内容：

科学的机器人的worldview
深度学习的worldview
机器人学习要解决
research

Files

CoRL2019.md

Latest commit

History