Authors: John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, Pieter Abbeel
Year: 2016
Algorithm: GAE
(Note: This summary omits the derivation of each formula. For further details, please refer to the original paper. A clear explanation of GAE can also be found in the blog post here.)
-
Problems
- Two main challenges for policy gradient methods:
- The large number of samples required
- Difficulty in obtaining stable and steady improvement
- High bias is more harmful than high variance: it can cause the algorithm to fail to converge, or to converge to a poor solution that is not even a local optimum.
-
Proposed solution
- For the first challenge: use value functions to reduce the variance of policy gradient estimates, at the cost of some bias, via an estimator of the advantage function (a concrete form is sketched right after this list).
- For the second challenge: use a trust region optimization procedure, as in TRPO (Trust Region Policy Optimization), for both the policy and the value function.
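Concretely, the variance-reduced gradient estimate averages advantage-weighted score functions over a batch of N trajectories, where the advantage estimator $\hat{A}_t$ is what GAE supplies:

$$
\hat{g} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{\infty} \hat{A}_t^{\,n}\, \nabla_\theta \log \pi_\theta\!\left(a_t^n \mid s_t^n\right)
$$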
-
GAE (Generalized Advantage Estimator)
-
What it is: A family of policy gradient estimators, parameterized by γ ∈ [0, 1] and λ ∈ [0, 1]
-
Goal: Significantly reduce variance while maintaining a tolerable level of bias
-
(That is, the estimator reduces the variance of the policy gradient at the cost of introducing some bias.)
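Concretely, the paper defines GAE as a discounted, exponentially-weighted sum of TD residuals of an approximate value function V:

$$
\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^V
$$

Setting λ = 0 recovers the one-step TD residual (low variance, but biased whenever V is inaccurate), while λ = 1 gives the discounted return minus a baseline, $\sum_{l \ge 0} \gamma^l r_{t+l} - V(s_t)$ (unbiased regardless of the accuracy of V, but high variance); intermediate values of λ trade the two off.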
-
A general summary of policy gradient methods
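For reference, the unifying form discussed in the paper is

$$
g = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \Psi_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]
$$

where Ψ_t may be, among other choices, the total trajectory reward, the reward following action a_t (optionally minus a baseline), the state-action value Q^π, the advantage A^π, or the TD residual; choosing the advantage yields nearly the lowest possible variance.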
-
Define gamma-just for an estimator
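In the paper, an advantage estimator $\hat{A}_t$ is called γ-just if it leaves the discounted policy gradient unbiased, i.e.

$$
\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\!\left[\hat{A}_t(s_{0:\infty}, a_{0:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]
= \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\!\left[A^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]
$$

If $\hat{A}_t$ is γ-just for all t, then summing $\hat{A}_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ over time gives an unbiased estimate of the discounted policy gradient $g^\gamma$.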
-
Producing an accurate estimator
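This part of the paper builds GAE out of k-step advantage estimators, each of which bootstraps off the value function after k steps:

$$
\hat{A}_t^{(k)} := \sum_{l=0}^{k-1} \gamma^l\, \delta_{t+l}^V = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})
$$

GAE(γ, λ) is then the exponentially-weighted average $(1-\lambda)\big(\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \cdots\big)$, which simplifies to the sum of discounted TD residuals given above.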
-
Using the generalized advantage estimator, the discounted policy gradient is thus:
$$
g^\gamma \approx \mathbb{E}\!\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}\right]
= \mathbb{E}\!\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^V\right]
$$

where equality holds when λ = 1.
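As a concrete illustration (this is not code from the paper; the function name, array layout, and default values of gamma and lam are assumptions for the sketch), here is a minimal way to compute the GAE advantages that enter this gradient for one finished trajectory:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for a single trajectory.

    rewards: array of length T with r_0 ... r_{T-1}
    values:  array of length T + 1 with V(s_0) ... V(s_T);
             the last entry bootstraps the tail (use 0.0 if s_T is terminal)
    Returns an array of length T with the advantage estimate for each step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially-weighted sum via the recursion A_t = delta_t + gamma * lam * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

The backward recursion is equivalent to the (truncated) exponentially-weighted sum of TD residuals; the resulting advantages can then be plugged into the gradient above or into the TRPO surrogate mentioned later.
-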
Value function estimation
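The paper fits a neural-network value function by regression onto the empirical discounted returns $\hat{V}_n$, with a trust-region constraint that limits how far the value function can move from its previous fit on each batch; roughly,

$$
\operatorname*{minimize}_{\phi} \; \sum_{n=1}^{N} \left\lVert V_\phi(s_n) - \hat{V}_n \right\rVert^2
\quad \text{subject to} \quad
\frac{1}{N} \sum_{n=1}^{N} \frac{\left\lVert V_\phi(s_n) - V_{\phi_{\text{old}}}(s_n) \right\rVert^2}{2\sigma^2} \le \epsilon
$$

where σ² is the mean squared error of the previous value function on the batch; the paper solves this constrained problem approximately with the conjugate gradient method.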
-
Use TRPO to update the policy network
(For details about TRPO, please refer to the summary here)
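For orientation (the linked TRPO summary has the full derivation), the policy update solves a constrained surrogate problem of roughly the following form, with the GAE advantages $\hat{A}_n$ plugged in:

$$
\operatorname*{maximize}_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \frac{\pi_\theta(a_n \mid s_n)}{\pi_{\theta_{\text{old}}}(a_n \mid s_n)}\, \hat{A}_n
\quad \text{subject to} \quad
\frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_n) \,\big\|\, \pi_\theta(\cdot \mid s_n)\big) \le \epsilon
$$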
-