The objective that we are trying to maximize is the expected return starting
from any state drawn from the initial state distribution $\rho_0$:

$$J(\theta) = \mathbb{E}_{s_0 \sim \rho_0,\; \tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} r_t \right]$$

Here $\pi_\theta$ is the policy parameterized by $\theta$, and $\tau = (s_0, a_0, r_0, s_1, \ldots)$ is a trajectory obtained by following the policy in the environment.
The formula for updating the policy network, given by the policy gradient theorem, is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q^{\pi_\theta}(s_t, a_t) \right]$$

where $\pi_\theta(a \mid s)$ is the probability of selecting action $a$ in state $s$, and $Q^{\pi_\theta}(s, a)$ is the expected return after taking action $a$ in state $s$ and following the policy afterwards.
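For intuition, the expectation form above can be motivated by the log-derivative trick. Here $p_\theta(\tau)$ denotes the probability of a trajectory under the policy and the environment dynamics, and $R(\tau)$ is the total return of the trajectory; this is only a sketch of the standard derivation:

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) R(\tau) \, d\tau = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, R(\tau) \, d\tau = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log p_\theta(\tau) \, R(\tau) \right]$$

Because the transition probabilities do not depend on $\theta$, the gradient $\nabla_\theta \log p_\theta(\tau)$ reduces to $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.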
In vanilla policy gradient (VPG) the return is calculated as a Monte-Carlo
estimate. We perform a full episode rollout and collect the states, actions,
and rewards $(s_t, a_t, r_t)$ for $t = 0, 1, \ldots, T-1$.
Finally we compute the gradient as:

$$\hat{g} = \frac{1}{T} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{R}_t$$

where $\hat{R}_t$ is the return collected from time-step $t$ onward.
Each of the states was visited using the current policy and the state distribution function of the environment, so for each state-action pair we can compute an estimate of the gradient. Then, we average over the samples, much like a mini-batch update in supervised learning.
Note that calculating the gradient this way is only valid if the samples were collected using the latest policy parameters. If the policy parameters change, then the collected samples are no longer samples from the expectation we are estimating. Thus, in theory, we are allowed to update the policy only once before we throw away the data. Because of this inefficient use of the data, people usually perform a single episode rollout before estimating the gradient and updating the model parameters.
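Below is a minimal sketch of such a single-episode rollout. It assumes a classic Gym-style environment (`env.reset()` / `env.step()`; newer versions of that API return slightly different tuples) and a policy network `pi` that maps an observation to action logits. Both names, as well as the helper `collect_episode`, are placeholders for illustration.

import torch

def collect_episode(env, pi):
    """Roll out one full episode with the current policy and return the on-policy data."""
    obs, acts, rewards = [], [], []
    o, done = env.reset(), False
    while not done:
        logits = pi(torch.as_tensor(o, dtype=torch.float32))
        # Sample an action from the categorical distribution defined by the logits.
        a = torch.distributions.Categorical(logits=logits).sample().item()
        next_o, r, done, _ = env.step(a)
        obs.append(o)
        acts.append(a)
        rewards.append(r)
        o = next_o
    return obs, acts, rewards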
Compute the discounted cumulative reward-to-go $\hat{R}_t$ at every time-step $t$.
When computing the return we discount future rewards by a discount factor
$\gamma \in (0, 1)$. Multiplying the rewards by the discount factor can be
interpreted as encouraging the agent to focus more on rewards that are closer
in time. It can also be thought of as a means of reducing variance, because
rewards that are further in the future contribute more variance to the
estimate. The cumulative return at time-step $t$ is computed as the discounted
sum of all future rewards starting from the current time-step:

$$\hat{R}_t = \sum_{t'=t}^{T-1} \gamma^{t' - t} \, r_{t'}$$

The discounted cumulative returns for all time-steps of the episode can be
computed with a single matrix multiplication between the rewards vector and a
special Toeplitz matrix.
returns = rewards @ toeplitz
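As a concrete sketch of the line above, the lower-triangular Toeplitz matrix can be built explicitly and multiplied with the rewards vector. The helper name `discounted_returns` and the default $\gamma = 0.99$ are illustrative choices, not fixed by the text.

import torch

def discounted_returns(rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Compute the reward-to-go for one episode as `rewards @ toeplitz`."""
    T = rewards.shape[0]
    idx = torch.arange(T)
    # exponents[i, j] = i - j; the matrix entry is gamma**(i - j) for i >= j and 0 otherwise,
    # so that returns[t] = sum_{i >= t} gamma**(i - t) * rewards[i].
    exponents = idx.unsqueeze(1) - idx.unsqueeze(0)
    toeplitz = torch.where(exponents >= 0, gamma ** exponents.float(), torch.zeros(T, T))
    return rewards @ toeplitz

For example, `discounted_returns(torch.tensor([1., 1., 1.]), gamma=0.5)` gives `[1.75, 1.5, 1.0]`. The same matrix can also be constructed with `scipy.linalg.toeplitz` if you prefer to stay in NumPy.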
The single-sample Monte-Carlo estimate of the returns can have a lot of variance. A good approach for reducing the variance of the estimate is to subtract a baseline from the return. We can baseline the returns with the value function, approximating it with a second neural network that is trained concurrently with the policy. The formula for the gradient would become:

$$\hat{g} = \frac{1}{T} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \big( \hat{R}_t - V_\phi(s_t) \big)$$

where $V_\phi(s_t)$ is the value of state $s_t$ predicted by the second network with parameters $\phi$.
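A minimal sketch of how that second network could be trained is shown below, assuming `obs` is a float tensor of shape `(T, obs_dim)` and `returns` holds the reward-to-go targets. The helper name `update_baseline`, the network architecture, and the learning rate are all illustrative, not prescribed by the text.

import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim = 4  # illustrative observation size
vf = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
vf_optim = torch.optim.Adam(vf.parameters(), lr=1e-3)

def update_baseline(obs, returns):
    """Fit V(s) to the Monte-Carlo returns and produce baselined returns for the policy loss."""
    values = vf(obs).squeeze(-1)            # V(s_t) for every state in the rollout
    advantages = returns - values.detach()  # (R_t - V(s_t)), used in place of R_t
    vf_loss = F.mse_loss(values, returns)   # regression target: the observed returns
    vf_optim.zero_grad()
    vf_loss.backward()
    vf_optim.step()
    return advantages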
One simple trick used in practice is to center the final returns and normalize them to have a standard deviation of 1.0 before performing the backward pass. This usually boosts performance as it stabilizes the training of the neural network.
returns = (returns - returns.mean()) / returns.std()
This modification makes use of a constant baseline (the mean return) at all time-steps and
effectively rescales the learning rate by a factor of $1/\sigma$, where $\sigma$ is the standard deviation of the returns.
One issue that might arise with longer episodes is that the collected observations are too large to fit on the GPU at once. In this case you need to iterate through the data using mini-batches, but remember that you are still allowed to update the parameters only once! Thus, when iterating, the gradients are only accumulated, and a single optimization step is performed at the end.
import torch
import torch.nn.functional as F

# pi (the policy network), optim (its optimizer), minibatch_loader, and the
# collected obs, acts and returns are assumed to be defined.
optim.zero_grad()
for o, a, r in minibatch_loader(zip(obs, acts, returns)):
    # move o, a and r to device
    r = (r - r.mean()) / r.std()  # normalize on the mini-batch level
    logits = pi(o)
    # F.cross_entropy returns -log pi(a|s), so minimizing this loss maximizes E[log pi(a|s) * R].
    neg_logp = F.cross_entropy(logits, a, reduction="none")
    loss = (neg_logp * r).mean()
    loss.backward()  # gradients accumulate across mini-batches
    # maybe do gradient clipping here
optim.step()  # a single optimization step at the end