# improved Cross-Entropy Method for trajectory optimization
[presentation and experiments at https://martius-lab.github.io/iCEM/]
Abstract: Trajectory optimizers for model-based reinforcement learning, such as the Cross-Entropy Method (CEM), can yield compelling results even in high-dimensional control tasks and sparse-reward environments. However, their sampling inefficiency prevents them from being used for real-time planning and control. We propose an improved version of the CEM algorithm for fast planning, with novel additions including temporally-correlated actions and memory, requiring 2.7-22x fewer samples and yielding a performance increase of 1.2-10x in high-dimensional control problems.
- Install via the provided Pipfile with `pipenv install`, then run `pipenv shell` to activate the virtualenv.
- Inside the `icem` folder run `python main.py settings/[env]/[json]`
- To render all environments, set `"render": true` in `iCEM/icem/settings/defaults/gt_default_env.json`
The iCEM controller file is located here and it contains the following additions, which you can also extract and add to your codebase:
- colored-noise, line 68:
  It uses the package `colorednoise`, which generates `num_sim_traj` temporally correlated action sequences along the planning-horizon dimension `h`.
  The parameter to change depending on the task is `noise_beta`, and it has an intuitive meaning: higher β for low-frequency control (FETCH PICK&PLACE, RELOCATE, etc.) and lower β for high-frequency control (HALFCHEETAH RUNNING). A minimal sampling sketch is given below the table.
| | iCEM/CEM with ground truth | iCEM with PlaNet |
|---|---|---|
| horizon h | 30 | 12 |
| colored-noise exponent β | 0.25 HALFCHEETAH RUNNING | 0.25 CHEETAH RUN |
| | 2.0 HUMANOID STANDUP | 0.25 CARTPOLE SWINGUP |
| | 2.5 DOOR | 2.5 WALKER WALK |
| | 2.5 DOOR (sparse reward) | 2.5 CUP CATCH |
| | 3.0 FETCH PICK&PLACE | 2.5 REACHER EASY |
| | 3.5 RELOCATE | 2.5 FINGER SPIN |
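As an illustration (a sketch, not the repository's exact code), colored action-sequence noise can be generated with `colorednoise` roughly as follows; `mean` and `std` stand for the current CEM distribution parameters over the horizon:

```python
import colorednoise
import numpy as np

def sample_colored_action_sequences(mean, std, num_sim_traj, noise_beta):
    """Sample num_sim_traj action sequences with temporally correlated noise.

    mean, std: arrays of shape (h, action_dim), the current CEM distribution.
    noise_beta: exponent of the 1/f^beta power spectrum (0 = white noise).
    """
    h, action_dim = mean.shape
    # powerlaw_psd_gaussian correlates samples along the last axis, so we
    # generate (num_sim_traj, action_dim, h) and then move the horizon axis.
    noise = colorednoise.powerlaw_psd_gaussian(
        noise_beta, size=(num_sim_traj, action_dim, h))
    noise = noise.transpose(0, 2, 1)  # -> (num_sim_traj, h, action_dim)
    return mean[None] + std[None] * noise
```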
- clipping actions at boundaries, line 79:
  Instead of sampling from a truncated normal distribution, we sample from the unmodified normal distribution (or colored-noise distribution) and clip the results to lie inside the permitted action interval. This allows maximal actions to be sampled more frequently.
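A hedged sketch of this step, reusing the sampling helper above (`action_low` and `action_high` are placeholders for the bounds of the action box):

```python
import numpy as np

# sample from the unmodified (colored) distribution and clip to the action box;
# boundary actions are therefore proposed more often than with truncation
samples = sample_colored_action_sequences(mean, std, num_sim_traj, noise_beta)
samples = np.clip(samples, action_low, action_high)
```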
- decay of population size, line 126:
  Since the standard deviation of the CEM-distribution shrinks at every CEM-iteration, we introduce an exponential decrease of the population size by a fixed factor γ: `num_sim_traj` now becomes `max(self.elites_size * 2, int(num_sim_traj / self.factor_decrease_num))`.
  The max operation ensures that the population size is at least double the elite size.
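For illustration, a sketch of how the population size evolves over the inner iterations (`factor_decrease_num` plays the role of γ; the numbers in the usage comment are made up):

```python
def decayed_population_sizes(num_sim_traj, elites_size, factor_decrease_num, iterations):
    """Return the population size used at each inner CEM iteration."""
    sizes = []
    for _ in range(iterations):
        sizes.append(num_sim_traj)
        # shrink the population, but never below twice the number of elites
        num_sim_traj = max(elites_size * 2, int(num_sim_traj / factor_decrease_num))
    return sizes

# e.g. decayed_population_sizes(128, 10, 1.25, 4) -> [128, 102, 81, 64]
```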
- keep previous elites, line 143:
  We store the elite-set generated at each inner CEM-iteration and add a small fraction of it (`fraction_elites_reused`) to the pool of the next iteration, instead of discarding the elite-set after each CEM-iteration.
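A minimal sketch of this reuse, with `elite_actions` holding the previous iteration's elite action sequences and `fresh_samples` the newly drawn ones (both placeholder names):

```python
import numpy as np

def add_reused_elites(fresh_samples, elite_actions, fraction_elites_reused):
    """Append a fraction of the previous elite-set to the new sample pool."""
    num_reused = int(fraction_elites_reused * len(elite_actions))
    return np.concatenate([fresh_samples, elite_actions[:num_reused]], axis=0)
```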
- shift previous elites, line 131:
  We store a small fraction of the elite-set of the last CEM-iteration, shift each sequence by one timestep and append a random action at the end, so that it can be reused in the next environment step.
  This is done with the function `elites_2_action_sequences`.
  The reason for not shifting the entire elite-set in both cases is that it would shrink the variance of CEM drastically in the first CEM-iteration, because the last elites would quite likely dominate the new samples and have small variance. We use `fraction_elites_reused` = 0.3 in all experiments.
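A hedged sketch of the shifting step (the repository implements this in `elites_2_action_sequences`; the uniform random-action sampler here is a placeholder assumption):

```python
import numpy as np

def shift_elites(elite_actions, fraction_elites_reused, action_low, action_high):
    """Shift a fraction of the kept elites by one timestep for the next env step."""
    num_reused = int(fraction_elites_reused * len(elite_actions))
    kept = elite_actions[:num_reused]                     # (n, h, action_dim)
    random_last = np.random.uniform(
        action_low, action_high, size=(len(kept), 1, kept.shape[-1]))
    # drop the first (already executed) action and append a fresh random one
    return np.concatenate([kept[:, 1:], random_last], axis=1)
```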
- execute best action, line 163:
  The purpose of the original CEM algorithm is to estimate an unknown probability distribution; using CEM as a trajectory optimizer detaches it from this original purpose. In the MPC context we are interested in the best possible action to execute.
  For this reason, we execute the first action of the best seen action sequence, rather than the first mean action, which was actually never evaluated.
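In sketch form (placeholder names; assuming `costs` holds the evaluated cost of every action sequence seen during this MPC step):

```python
import numpy as np

# pick the single best action sequence that was actually evaluated ...
best_idx = np.argmin(costs)                    # or argmax of returns
best_sequence = all_sampled_sequences[best_idx]
# ... and execute only its first action, instead of the never-evaluated mean
action_to_execute = best_sequence[0]
```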
- add mean to samples (at last iCEM-iteration), line 87:
  We decided to add the mean of the iCEM distribution as a sample for two reasons:
  - as the dimensionality of the action space increases, it gets more and more difficult to sample an action sequence close to the mean of the distribution;
  - executing the mean might be beneficial for many tasks that require “clean” action sequences, like manipulation, object-reaching, or any linear trajectory in state-space.

  In practice, we add the mean only at the last iteration, for reasons explained in the paper (section E.2), and we simply substitute it for one of the samples:
  `sampled_from_distribution[0] = self.mean`
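As a sketch (loop variables assumed), the substitution is guarded by the iteration index:

```python
# only in the final inner iCEM iteration (see paper, section E.2)
if cem_iteration == num_cem_iterations - 1:
    sampled_from_distribution[0] = self.mean.copy()
```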
In the figure below we present the ablations and additions of the improvements mentioned above, for all environments and a selection of budgets. As we use the same hyperparameters for all experiments, in some environments a few of the ablated versions perform slightly better, but overall our final version has the best performance.
As we can see, not all components are equally helpful in the different environments, since each environment poses different challenges. For instance, in HUMANOID STANDUP the optimizer can easily get stuck in a local optimum corresponding to a sitting posture. Keeping balance in a standing position is also not trivial, since small errors can lead to unrecoverable states. In the FETCH PICK&PLACE environment, on the other hand, the initial exploration is critical, since the agent receives a meaningful reward only if it is moving the box. There, colored noise, keeping elites, and shifting elites are the most important components.