My understanding of DRL - PyTorch
- DCGANs in PyTorch using Atari game frames from OpenAI Gym
- CartPole Balancing using Cross-Entropy: Model-Free, Policy-Based, On-Policy
	- Play N episodes using our current model and environment.
- Calculate the total reward for every episode and decide on a reward boundary. Usually, we use some percentile of all rewards, such as 50th or 70th.
- Throw away all episodes with a reward below the boundary.
- Train on the remaining "elite" episodes using observations as the input and issued actions as the desired output.
	- Repeat from step 1 until we are satisfied with the result.
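The elite-episode filtering at the heart of the steps above can be sketched as follows. This is a minimal illustration, not the full training loop; the `Episode` container and function names are assumptions for demonstration.

```python
import numpy as np
from collections import namedtuple

# Illustrative container: total reward plus the (obs, action) steps of one episode
Episode = namedtuple("Episode", ["total_reward", "steps"])

def filter_elite(episodes, percentile=70):
    """Keep only episodes whose total reward meets the percentile boundary."""
    rewards = [e.total_reward for e in episodes]
    boundary = np.percentile(rewards, percentile)
    elite = [e for e in episodes if e.total_reward >= boundary]
    return elite, boundary

# Toy batch of five episodes with dummy step lists
batch = [Episode(total_reward=r, steps=[]) for r in [10, 30, 50, 70, 90]]
elite, boundary = filter_elite(batch, percentile=70)
print(boundary)                              # 66.0 (70th percentile)
print([e.total_reward for e in elite])       # [70, 90]
```

The surviving episodes' observations and actions would then be fed to the network as inputs and target labels for supervised training.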
- FrozenLake using Cross-Entropy: Model-Free, Policy-Based, On-Policy, with Discounted Reward
	- Reward of 1 on completion, 0 on failure.
	- Calculating the discounted reward of each episode to penalize long episodes.
	- Keeping better-performing episodes around for longer to tackle the slippery randomness of the environment.
	- Needs more than 10K iterations to produce acceptable results.
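The discounting tweak above can be sketched as scaling the episode's final reward by gamma raised to the episode length, so a win reached in fewer steps scores higher. This is a minimal sketch; the gamma value and function name are illustrative assumptions.

```python
GAMMA = 0.9  # discount factor (the exact value is an assumption)

def discounted_reward(total_reward, n_steps, gamma=GAMMA):
    """Scale the episode reward by gamma**n_steps so shorter episodes rank higher."""
    return total_reward * gamma ** n_steps

# A FrozenLake win (reward 1) in 8 steps outranks the same win in 20 steps
short = discounted_reward(1.0, 8)    # 0.9**8  ~ 0.4305
long = discounted_reward(1.0, 20)    # 0.9**20 ~ 0.1216
print(round(short, 4), round(long, 4))
```

Ranking episodes by this discounted value (instead of the raw 0/1 reward) gives the percentile filter something meaningful to separate, since most raw rewards are identical.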