Proximal policy optimization was introduced in the paper "Proximal Policy Optimization Algorithms" by Schulman et al.
The modification introduced by PPO to the vanilla PG algorithm seems very simple and easy to implement, but there are a lot of other crucial details that also need to be taken into consideration. Check out the brilliant blog post "The 37 Implementation Details of Proximal Policy Optimization" for a thorough discussion of all the details.
The problem that the authors are trying to solve is to come up with an algorithm
that allows us to take the biggest possible update step on the policy parameters
before throwing out the collected rollout data. The approach taken in this paper
is to allow for multiple update steps that, when combined, would approximate
this maximum possible update. Note that after we update the policy parameters
once, the data in the rollout buffer is no longer sampled from the current policy, so the standard policy gradient estimate is no longer valid.
It seems like we could compute the objective to update the new policy weights using the data collected with the old policy weights, as long as we correct with the importance sampling weight $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, giving the surrogate objective $L(\theta) = \mathbb{E}_t\left[\rho_t(\theta) \, \hat{A}_t\right]$, where $\hat{A}_t$ is the advantage estimate.
Thus, we could perform multiple update steps using the collected rollout data.
However, note that, in order to compute the correct gradient estimate, the
actions (and visited states) have to be sampled under the current policy $\pi_\theta$, while our rollout was collected under $\pi_{\theta_{old}}$; the importance weight only corrects for the mismatch in the action distribution, not for the mismatch in the distribution of visited states.
How bad is that?
It turns out that if $\pi_\theta$ stays close to $\pi_{\theta_{old}}$, the surrogate objective remains a good local approximation of the true objective, so the additional update steps are still useful.
There are two different proximal policy algorithms, each using a different
heuristic to try to ensure that $\pi_\theta$ does not drift too far away from $\pi_{\theta_{old}}$:
- PPO-Penalty - constrains the KL divergence between the two distributions by adding it as a penalty to the objective: $L^{KLPEN}(\theta) = \mathbb{E}_t\left[\rho_t(\theta)\hat{A}_t - \beta \, \mathrm{KL}\!\left[\pi_{\theta_{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$
- PPO-CLIP - clips the objective function if $\pi_\theta$ deviates too much from $\pi_{\theta_{old}}$: $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\!\left(\rho_t(\theta)\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$
The algorithm implemented here is PPO-CLIP augmented with a check for
early stopping. If the mean KL divergence between the updated policy and $\pi_{\theta_{old}}$ exceeds a given threshold, the learning phase is stopped early.
In Algorithm 1 of the PPO paper the authors state that they use "fixed-length trajectory segments" for training the agent. What this means is that there are two phases to training:
- Phase 1 - Rollout phase. The agent performs a fixed-length $T$-step rollout, collecting (state, action, reward) triples.
- Phase 2 - Learning phase. The agent performs multiple PPO updates to the policy weights using the collected data.
Once the learning phase is over the agent starts a new rollout phase but
continues to step the environment from where it left off, i.e. the environment
is not reset at the beginning of the rollout phase. This allows PPO to learn in
long-horizon tasks where episodes could be extremely long (think 100K steps). In
addition, rollout is performed on
$N$ environments in parallel, allowing for more efficient data collection.
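As a rough illustration, the rollout phase could look like the sketch below, assuming a Gymnasium vector environment with auto-reset and a hypothetical `agent.act` helper that samples actions from the current policy; the environment name and the values of $N$ and $T$ are purely illustrative:

import gymnasium as gym
import numpy as np

N, T = 4, 8                                # parallel environments and segment length (toy values)
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(N)])
obs, _ = envs.reset(seed=0)                # reset only once, before the very first rollout

def rollout(obs):
    segment = []                           # will hold T steps from all N environments
    for _ in range(T):
        acts = agent.act(obs)              # sample actions from the current policy (hypothetical helper)
        next_obs, rewards, terminated, truncated, _ = envs.step(acts)
        done = np.logical_or(terminated, truncated)    # the episode ended at this step
        segment.append((obs, acts, rewards, done))
        obs = next_obs                     # finished sub-environments are auto-reset by the vector env
    return segment, obs                    # hand back obs so the next rollout continues from here

The returned `obs` is fed into the next call to `rollout`, which is what keeps the environments stepping across phase boundaries instead of resetting them.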
After the rollout phase is over the experiences are stored in the following
tensors, each with leading dimensions $(N, T)$: observations, actions, rewards, and done.
The done tensor indicates whether each of the observations is a terminal state
for the environment, i.e. whether the environment was terminated or truncated at
that step. Note that each fixed-length segment could contain multiple episodes
concatenated one after another. An example tensor of observations is:
obs = [
[ o_11, o_12, o_13, o_14, o_15,| o_71, o_72, o_73]
[ o_21, o_22, o_23, o_24, o_25, o_26, o_27, o_28]
[ o_31, o_32, o_33,| o_51, o_52, o_53,| o_81, o_82]
[ o_41, o_42, o_43, o_44,| o_61, o_62, o_63, o_64]
]
We have 4 trajectories each consisting of 8 time-steps. The corresponding done
tensor indicates which of the observations are from terminal states:
done = [
[ False, False, False, False,  True, False, False, False]
[ False, False, False, False, False, False, False, False]
[ False, False,  True, False, False,  True, False, False]
[ False, False, False,  True, False, False, False, False]
]
One more issue that needs to be addressed when using fixed-length segments is
the computation of the advantage (or return). Vanilla policy gradient estimates
the return using a Monte-Carlo estimate, which requires a full-episode rollout
giving all of the rewards from the given episode. However, a fixed-length
segment may cut an episode short, so in order to compute the advantage for the
last time-steps of the segment we have to bootstrap the missing tail of the
return with the value function.
The estimator used is a truncated version of the generalized advantage estimator introduced in the paper "High-Dimensional Continuous Control Using Generalized Advantage Estimation" by Schulman et al.: $\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l \, \delta_{t+l}$,
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the temporal-difference error, $\gamma$ is the discount factor, $\lambda \in [0, 1]$ controls the bias-variance trade-off of the estimator, and $T$ is the length of the segment. Whenever the done flag is set at time-step $t$, the bootstrap term $\gamma V(s_{t+1})$ is dropped and the sum is truncated, so that advantages never leak across episode boundaries.
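A minimal sketch of this computation, assuming the value predictions `values` for every stored observation and a bootstrap value `last_val` for the observation that follows the segment were recorded during the rollout (the names, as well as the $\gamma$ and $\lambda$ defaults, are illustrative):

import torch

def compute_gae(rewards, values, done, last_val, gamma=0.99, lam=0.95):
    N, T = rewards.shape
    adv = torch.zeros_like(rewards)
    gae = torch.zeros(N)
    next_val = last_val                            # V of the step right after the segment
    for t in reversed(range(T)):
        nonterminal = 1.0 - done[:, t].float()     # cut the recursion at episode boundaries
        delta = rewards[:, t] + gamma * next_val * nonterminal - values[:, t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[:, t] = gae
        next_val = values[:, t]
    returns = adv + values                         # regression targets for the value network
    return adv, returns

Note that this sketch treats truncated episodes the same way as terminated ones (no bootstrapping across the boundary), which is a common simplification.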
During the learning phase we optimize the objective for $K$ epochs, iterating over the collected experiences in mini-batches.
At every iteration, before computing the loss, the advantages are normalized on
the mini-batch level to have zero mean and unit variance. This usually boosts
performance as it stabilizes the training of the neural network. This
modification makes use of a constant baseline for all (state, action) pairs in
the batch and effectively rescales the learning rate by a factor of $1/\sigma$, where $\sigma$ is the standard deviation of the advantages in the mini-batch:
adv = (adv - adv.mean()) / (adv.std() + 1e-8)
The loss is further augmented with an entropy regularization term calculated over the mini-batch. Trying to maximize the entropy has the effect of pushing the policy distribution to be more random, preventing it from becoming a delta function and thus increasing exploration during training.
entropy_loss = distributions.Categorical(logits=pi(obs)).entropy().mean()
Finally, at the end of every epoch we check the KL divergence between the newest and the original policy and stop the learning phase if the threshold is reached.
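Here is one possible implementation of the `minibatch_loader` helper used in the loop below; the helper and its default batch size are assumptions of this write-up, not something prescribed by the paper:

import torch

def minibatch_loader(tensors, batch_size=256):
    n = tensors[0].shape[0]
    idx = torch.randperm(n)                        # fresh random order on every pass over the data
    for start in range(0, n, batch_size):
        batch_idx = idx[start:start + batch_size]
        yield tuple(t[batch_idx] for t in tensors)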
import torch
import torch.nn.functional as F
from torch import distributions

# flatten the (N, T) leading dimensions so that every row is a single time-step
obs, acts, adv = obs.reshape(-1, *obs.shape[2:]), acts.reshape(-1), adv.reshape(-1)

with torch.no_grad():
    logp_old = -F.cross_entropy(pi(obs), acts, reduction="none")   # log probs of the taken actions under the old weights

for _ in range(K):
    for o, a, ad, lp_old in minibatch_loader([obs, acts, adv, logp_old]):
        ad = (ad - ad.mean()) / (ad.std() + 1e-8)                  # normalize on the mini-batch level
        logits = pi(o)
        logp = -F.cross_entropy(logits, a, reduction="none")
        rho = (logp - lp_old).exp()                                # importance sampling ratio
        loss = torch.min(rho * ad, torch.clamp(rho, 1 - eps, 1 + eps) * ad)   # eps is the clip range, e.g. 0.2
        entropy_loss = distributions.Categorical(logits=logits).entropy().mean()
        loss = -loss.mean() - c * entropy_loss                     # c ~ [0.01, 0.1]
        optim.zero_grad()
        loss.backward()
        optim.step()

    with torch.no_grad():
        logp = -F.cross_entropy(pi(obs), acts, reduction="none")   # log probs with the newest weights
    kl = (logp_old - logp).mean()                                  # approx. KL(pi_old || pi_new)
    if kl > threshold:
        break
In addition to optimizing the policy we also need to optimize the weights of the value network. The value network is optimized by minimizing the mean squared error between the value net predictions and the calculated returns. Just like the clipped objective for the policy network, we also clip the value loss before updating the parameters, by limiting how far the new value predictions can move from the ones recorded during the rollout.
And again we perform multiple epochs of mini-batch updates, this time on the parameters of the value network.
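A sketch of one common way to do the value clipping (the exact formulation used here is an assumption): the new predictions are kept within $\epsilon$ of the values recorded during the rollout, and the pessimistic maximum of the clipped and unclipped squared errors is minimized.

import torch

# `vf` is the value network, `v_old` are the value predictions recorded during the rollout,
# `ret` are the computed returns, and `o` is a mini-batch of observations; names are illustrative.
v = vf(o).squeeze(-1)
v_clipped = v_old + torch.clamp(v - v_old, -eps, eps)            # keep predictions close to the old ones
vf_loss = torch.max((v - ret) ** 2, (v_clipped - ret) ** 2).mean()
vf_optim.zero_grad()
vf_loss.backward()
vf_optim.step()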