Combining improvements in deep Q-learning for fast and stable training, with a modular, configurable agent.
Pranjal Tandon's Pytorch Soft Actor Critic is used as a baseline. I've added the following optional components on top of it (a short, illustrative sketch of each follows the list):
- Asynchronous environment rollouts and parameter updates, based on a combination of Horgan et al.'s Ape-X pipeline and Petrenko et al.'s SampleFactory. Discussed here
- He et al.'s variant of n-step returns: the sampled n-step return is used as a lower bound (in practice, a penalty) on Q predictions to accelerate convergence
- Hindsight Experience Replay: a data-augmentation technique for goal-directed environments. It creates synthetic experiences in which we pretend the goal state we actually reached was the goal we wanted all along, and recomputes the rewards accordingly.
- A discrete policy for SAC based on Wah Loon Keng's work: the Gumbel-Softmax trick gives a differentiable rsample of a discrete distribution, which is fed to the critic.
- Kuznetsov et al.'s Truncated Mixture of Continuous Distributional Quantile Critics (TQC): an ensemble of Q networks predicts quantiles of an approximate return distribution trained with quantile regression; overestimation bias is handled by dropping the top-N target quantiles. Based on SamsungLabs' PyTorch port
- A state-dependent exploration method based on Raffin & Stulp's gSDE, making SAC more robust on environments that act like low-pass filters
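Asynchronous rollouts: the bullet above only describes the architecture at a high level, so here is a minimal queue-based sketch of the actor/learner split. The `env_factory`/`policy_factory` helpers and the `.act()` method are assumptions of mine; the real Ape-X/SampleFactory-style pipeline adds a proper replay buffer, batched inference, and prioritization.

```python
import itertools
import torch.multiprocessing as mp

def rollout_worker(env_factory, policy_factory, sample_queue, weight_queue):
    """Rollout process: steps its own env copy and ships transitions to the learner."""
    env, policy = env_factory(), policy_factory()
    obs = env.reset()
    while True:
        while not weight_queue.empty():               # non-blocking pull of the freshest parameters
            policy.load_state_dict(weight_queue.get())
        action = policy.act(obs)                      # hypothetical .act() helper
        next_obs, reward, done, _ = env.step(action)
        sample_queue.put((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs

def learner_loop(policy, update_fn, sample_queue, weight_queues, sync_every=1000):
    """Learner process: consumes transitions, runs updates, broadcasts new weights."""
    for step in itertools.count(1):
        update_fn(sample_queue.get())                 # in practice: insert into a replay buffer and sample batches
        if step % sync_every == 0:
            state = {k: v.cpu() for k, v in policy.state_dict().items()}
            for q in weight_queues:                   # one weight queue per rollout worker
                q.put(state)

# wiring sketch: sample_q = mp.Queue(); weight_qs = [mp.Queue() for _ in range(n_workers)]
# workers = [mp.Process(target=rollout_worker, args=(make_env, make_policy, sample_q, wq))
#            for wq in weight_qs]
```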
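N-step lower bound: a sketch of the penalty described in the n-step bullet. The function name and the `weight` coefficient are mine, not the repo's.

```python
import torch

def lower_bound_penalty(q_pred, nstep_return, weight=1.0):
    """Squared-hinge penalty for Q predictions that fall below the sampled n-step return.

    The observed n-step return is (approximately) a lower bound on the optimal
    Q value, so violations are pushed back up, which speeds up early learning.
    """
    violation = torch.clamp(nstep_return - q_pred, min=0.0)
    return weight * violation.pow(2).mean()

# usage inside the critic update (names are placeholders):
# critic_loss = mse_loss(q_pred, td_target) + lower_bound_penalty(q_pred, nstep_return)
```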
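Hindsight Experience Replay: a sketch of the relabelling step with the common "future" strategy. The dictionary keys follow a gym GoalEnv-style layout and are illustrative, not the repo's actual buffer format.

```python
import numpy as np

def her_relabel(episode, compute_reward, k=4, rng=np.random):
    """Augment an episode with hindsight goals ('future' relabelling strategy).

    `episode` is a list of dicts with gym-GoalEnv-style keys ('obs', 'action',
    'next_achieved_goal', 'desired_goal', ...); the key names are illustrative.
    `compute_reward(achieved_goal, goal)` recomputes the reward for a substituted goal.
    """
    augmented = []
    horizon = len(episode)
    for t, step in enumerate(episode):
        for _ in range(k):
            # pretend a goal reached later in the same episode was the desired goal all along
            future = rng.randint(t, horizon)
            new_goal = episode[future]['next_achieved_goal']
            relabeled = dict(step)
            relabeled['desired_goal'] = new_goal
            relabeled['reward'] = compute_reward(step['next_achieved_goal'], new_goal)
            augmented.append(relabeled)
    return augmented
```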
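Discrete policy: a sketch of the differentiable discrete sample via Gumbel-Softmax. The temperature and the straight-through (`hard=True`) choice are illustrative defaults, not necessarily what the repo uses.

```python
import torch.nn.functional as F

def discrete_rsample(logits, tau=1.0):
    """Differentiable sample from a categorical policy using Gumbel-Softmax.

    With hard=True the forward pass returns a one-hot action (what the critic
    sees), while gradients flow through the soft relaxation back to the logits.
    """
    action_one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    log_probs = F.log_softmax(logits, dim=-1)
    return action_one_hot, log_probs

# e.g. logits = policy_net(states)          # shape (batch, n_actions)
#      a, logp = discrete_rsample(logits)   # a is one-hot and differentiable w.r.t. logits
```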
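Truncated Quantile Critics: a sketch of the truncation step that handles overestimation bias, assuming a (batch, n_nets, n_quantiles) layout for the ensemble's target predictions.

```python
import torch

def truncated_target_quantiles(target_quantiles, drop_per_net):
    """Pool target quantiles from the critic ensemble and drop the largest ones.

    `target_quantiles` is assumed to have shape (batch, n_nets, n_quantiles).
    Sorting the pooled atoms and discarding the top `drop_per_net * n_nets`
    of them is how TQC controls overestimation bias.
    """
    batch, n_nets, n_quantiles = target_quantiles.shape
    pooled = target_quantiles.reshape(batch, n_nets * n_quantiles)
    sorted_q, _ = torch.sort(pooled, dim=1)
    keep = n_nets * (n_quantiles - drop_per_net)
    return sorted_q[:, :keep]

# the kept atoms are then discounted into the TD target, and each critic's
# predicted quantiles are regressed onto them with the quantile Huber loss
```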
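State-dependent exploration: a rough sketch in the spirit of gSDE. The real method ties the noise to the policy network's last-layer features and learns the noise scale; this toy version fixes the scale and only shows why the action signal stays smooth between resamples.

```python
import torch

class StateDependentNoise:
    """gSDE-style exploration sketch (shapes and API are assumptions, not the repo's).

    The noise matrix is resampled only every `resample_every` steps; in between,
    exploration noise is a deterministic function of the policy features, which
    keeps the action signal smooth on environments that act like low-pass filters.
    """

    def __init__(self, feature_dim, action_dim, log_std=-0.5, resample_every=64):
        self.std = torch.full((feature_dim, action_dim), log_std).exp()
        self.resample_every = resample_every
        self.steps = 0
        self.exploration_mat = torch.randn_like(self.std) * self.std

    def __call__(self, features, mean_action):
        if self.steps % self.resample_every == 0:
            # resample the exploration matrix only periodically
            self.exploration_mat = torch.randn_like(self.std) * self.std
        self.steps += 1
        noise = features @ self.exploration_mat      # noise depends on the current state's features
        return mean_action + noise
```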
The state of the art in deep RL has largely been reached by ramping up scale. But with enough effort, patience, and time spent optimizing pipelines, roughly 80-90% of state-of-the-art results can be achieved on commodity hardware.
I'm setting out to build such a pipeline from scratch, to learn the intricacies of writing fast reinforcement learning pipelines and to combine improvements from published work into general algorithmic speedups.
I will start with simple classic-control environments, then ramp up to standard benchmarks like RoboSchool, and then to pixel-based environments like Atari.
My goal is to have a single algorithm solve all of these out of the box with the same set of hyperparameters.
main.py
configures the experiments. I haven't set up argparse or config-file loading yet (it's on the todo list); for now, all configuration is done by editing the config instances in main.py and then running it.
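As a purely illustrative example of that workflow (the class, field, and entry-point names below are hypothetical and do not correspond to the actual objects in this repo), a run is configured by editing values in main.py and executing it:

```python
from dataclasses import dataclass

# illustrative only: the real config classes and fields in this repo may differ
@dataclass
class ExperimentConfig:
    env_name: str = "Pendulum-v0"      # which environment to run
    n_step: int = 3                    # n-step return horizon
    use_her: bool = False              # toggle Hindsight Experience Replay
    num_rollout_workers: int = 4       # asynchronous environment workers

if __name__ == "__main__":
    config = ExperimentConfig()
    config.use_her = True              # edit fields here, then run main.py
    # run_experiment(config)           # hypothetical entry point
```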
This was tested on Windows 10 with torch 1.3.0.