Authors: John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, Pieter Abbeel
Year: 2015
Algorithm: TRPO
-
Problem
The policy improvement process in policy gradient methods contains many inefficient update steps, because it is difficult to choose a step size that reliably improves the policy.
-
Hypothesis
- Minimizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes.
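For reference, the paper's Theorem 1 gives the lower bound behind this claim (notation as in the original, with $\epsilon = \max_{s,a}\lvert A_\pi(s,a)\rvert$):

$$\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \qquad C = \frac{4\epsilon\gamma}{(1-\gamma)^2},$$

where $L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a\mid s)\, A_\pi(s,a)$ is the surrogate objective. Maximizing the penalized surrogate on the right-hand side at every iteration yields a monotonically non-decreasing sequence of policies.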
-
Methods
-
Two ways to update the policy using the surrogate objective:
-
Optimize a surrogate objective with a penalty on the KL divergence between the new and old policies (the theoretically justified form).
-
Enforce a constraint (i.e., a trust region constraint) on the KL divergence between the new policy and the old policy during each update; this is the form TRPO uses in practice (see the constrained problem written out after this list).
The penalty formulation is the one that guarantees monotonic improvement, i.e., each new policy is no worse than the old one; the trust-region constraint is a practical approximation of it that allows larger steps.
(For the derivation of this objective and proofs of the related theorems, please refer to the original paper, or read the TRPO docs from OpenAI Spinning Up.)
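Roughly, the practical trust-region update from the paper solves the following sample-based constrained problem, where $\delta$ is the fixed KL step-size limit:

$$\max_{\theta}\; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta_{\mathrm{old}}}(a\mid s)}\, A_{\theta_{\mathrm{old}}}(s,a)\right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot\mid s)\,\big\|\,\pi_{\theta}(\cdot\mid s)\big)\right] \le \delta.$$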
-
Algorithm from the paper:
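Since the algorithm figure is not reproduced here, below is a minimal sketch (not the authors' code) of the core update machinery: conjugate gradient for the natural-gradient direction, a step scaled to the KL limit, and a backtracking line search on the surrogate. It assumes the caller supplies `surrogate`, `grad`, and `fvp` (Fisher-vector product) callables; all names and the toy quadratic check are illustrative.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at 0)
    p = g.copy()
    rdotr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rdotr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_rdotr = r @ r
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

def trpo_step(theta, surrogate, grad, fvp, max_kl=1e-2, backtrack=0.8, max_backtracks=10):
    """One trust-region update: natural-gradient direction, then a line search."""
    g = grad(theta)
    step_dir = conjugate_gradient(fvp, g)
    # Scale the step so the quadratic KL approximation 0.5 * s^T F s hits max_kl.
    sFs = step_dir @ fvp(step_dir)
    full_step = np.sqrt(2.0 * max_kl / (sFs + 1e-8)) * step_dir
    old_val = surrogate(theta)
    # Backtracking line search: shrink the step until the surrogate improves
    # (real TRPO also re-checks the exact KL constraint here).
    for i in range(max_backtracks):
        new_theta = theta + (backtrack ** i) * full_step
        if surrogate(new_theta) > old_val:
            return new_theta
    return theta  # no improving step found; keep the old policy

if __name__ == "__main__":
    # Toy check: maximize a quadratic "surrogate" with an identity Fisher matrix.
    target = np.array([1.0, -2.0, 0.5])
    surrogate = lambda th: -np.sum((th - target) ** 2)  # maximized at `target`
    grad = lambda th: -2.0 * (th - target)
    fvp = lambda v: v                                   # identity Fisher matrix
    theta = np.zeros(3)
    for _ in range(200):
        theta = trpo_step(theta, surrogate, grad, fvp)
    print(theta)  # approaches `target`
```

In the full algorithm, `surrogate`, `grad`, and `fvp` would be estimated from sampled trajectories under the old policy rather than given in closed form.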
-