Combining improvements in deep Q-learning for fast and stable training, with a modular, configurable agent.
Pranjal Tandon's Pytorch Soft Actor Critic is used as a baseline. I've added the following optional components on top of it (a short, illustrative sketch of each follows the list):
- Asynchronous environment rollouts and parameter updates, based on a combination of Horgan et al.'s Ape-X pipeline and Petrenko et al.'s SampleFactory. Discussed here
- He et al.'s variant of n-step returns: the sampled n-step return is used as a lower bound (in practice, a penalty) on Q predictions to accelerate convergence
- Hindsight Experience Replay: a data-augmentation technique for goal-directed environments. It creates synthetic experiences in which we pretend the goal state we actually reached was the goal we wanted all along, and recomputes the rewards accordingly.
- A discrete policy for SAC based on Wah Loon Keng's work: the Gumbel-Softmax trick gives a differentiable rsample of a discrete distribution, which is fed to the critic.
- Kuznetsov et al.'s Truncated Mixture of Continuous Distributional Quantile Critics (TQC): an ensemble of Q networks predicts quantiles of an approximate return distribution trained with quantile regression; overestimation bias is handled by dropping the top-N target quantiles. Based on SamsungLabs' PyTorch port
- A state-dependent exploration method based on Raffin & Stulp's gSDE, making SAC more robust on environments that act like low-pass filters
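Asynchronous rollouts: the bullet above only describes the architecture at a high level, so here is a minimal queue-based sketch of the actor/learner split. The `env_factory`/`policy_factory` helpers and the `.act()` method are assumptions of mine; the real Ape-X/SampleFactory-style pipeline adds a proper replay buffer, batched inference, and prioritization.

```python
import itertools
import torch.multiprocessing as mp

def rollout_worker(env_factory, policy_factory, sample_queue, weight_queue):
    """Rollout process: steps its own env copy and ships transitions to the learner."""
    env, policy = env_factory(), policy_factory()
    obs = env.reset()
    while True:
        while not weight_queue.empty():               # non-blocking pull of the freshest parameters
            policy.load_state_dict(weight_queue.get())
        action = policy.act(obs)                      # hypothetical .act() helper
        next_obs, reward, done, _ = env.step(action)
        sample_queue.put((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs

def learner_loop(policy, update_fn, sample_queue, weight_queues, sync_every=1000):
    """Learner process: consumes transitions, runs updates, broadcasts new weights."""
    for step in itertools.count(1):
        update_fn(sample_queue.get())                 # in practice: insert into a replay buffer and sample batches
        if step % sync_every == 0:
            state = {k: v.cpu() for k, v in policy.state_dict().items()}
            for q in weight_queues:                   # one weight queue per rollout worker
                q.put(state)

# wiring sketch: sample_q = mp.Queue(); weight_qs = [mp.Queue() for _ in range(n_workers)]
# workers = [mp.Process(target=rollout_worker, args=(make_env, make_policy, sample_q, wq))
#            for wq in weight_qs]
```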
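N-step lower bound: a sketch of the penalty described in the n-step bullet. The function name and the `weight` coefficient are mine, not the repo's.

```python
import torch

def lower_bound_penalty(q_pred, nstep_return, weight=1.0):
    """Squared-hinge penalty for Q predictions that fall below the sampled n-step return.

    The observed n-step return is (approximately) a lower bound on the optimal
    Q value, so violations are pushed back up, which speeds up early learning.
    """
    violation = torch.clamp(nstep_return - q_pred, min=0.0)
    return weight * violation.pow(2).mean()

# usage inside the critic update (names are placeholders):
# critic_loss = mse_loss(q_pred, td_target) + lower_bound_penalty(q_pred, nstep_return)
```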
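Hindsight Experience Replay: a sketch of the relabelling step with the common "future" strategy. The dictionary keys follow a gym GoalEnv-style layout and are illustrative, not the repo's actual buffer format.

```python
import numpy as np

def her_relabel(episode, compute_reward, k=4, rng=np.random):
    """Augment an episode with hindsight goals ('future' relabelling strategy).

    `episode` is a list of dicts with gym-GoalEnv-style keys ('obs', 'action',
    'next_achieved_goal', 'desired_goal', ...); the key names are illustrative.
    `compute_reward(achieved_goal, goal)` recomputes the reward for a substituted goal.
    """
    augmented = []
    horizon = len(episode)
    for t, step in enumerate(episode):
        for _ in range(k):
            # pretend a goal reached later in the same episode was the desired goal all along
            future = rng.randint(t, horizon)
            new_goal = episode[future]['next_achieved_goal']
            relabeled = dict(step)
            relabeled['desired_goal'] = new_goal
            relabeled['reward'] = compute_reward(step['next_achieved_goal'], new_goal)
            augmented.append(relabeled)
    return augmented
```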
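Discrete policy: a sketch of the differentiable discrete sample via Gumbel-Softmax. The temperature and the straight-through (`hard=True`) choice are illustrative defaults, not necessarily what the repo uses.

```python
import torch.nn.functional as F

def discrete_rsample(logits, tau=1.0):
    """Differentiable sample from a categorical policy using Gumbel-Softmax.

    With hard=True the forward pass returns a one-hot action (what the critic
    sees), while gradients flow through the soft relaxation back to the logits.
    """
    action_one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    log_probs = F.log_softmax(logits, dim=-1)
    return action_one_hot, log_probs

# e.g. logits = policy_net(states)          # shape (batch, n_actions)
#      a, logp = discrete_rsample(logits)   # a is one-hot and differentiable w.r.t. logits
```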
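Truncated Quantile Critics: a sketch of the truncation step that handles overestimation bias, assuming a (batch, n_nets, n_quantiles) layout for the ensemble's target predictions.

```python
import torch

def truncated_target_quantiles(target_quantiles, drop_per_net):
    """Pool target quantiles from the critic ensemble and drop the largest ones.

    `target_quantiles` is assumed to have shape (batch, n_nets, n_quantiles).
    Sorting the pooled atoms and discarding the top `drop_per_net * n_nets`
    of them is how TQC controls overestimation bias.
    """
    batch, n_nets, n_quantiles = target_quantiles.shape
    pooled = target_quantiles.reshape(batch, n_nets * n_quantiles)
    sorted_q, _ = torch.sort(pooled, dim=1)
    keep = n_nets * (n_quantiles - drop_per_net)
    return sorted_q[:, :keep]

# the kept atoms are then discounted into the TD target, and each critic's
# predicted quantiles are regressed onto them with the quantile Huber loss
```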
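State-dependent exploration: a rough sketch in the spirit of gSDE. The real method ties the noise to the policy network's last-layer features and learns the noise scale; this toy version fixes the scale and only shows why the action signal stays smooth between resamples.

```python
import torch

class StateDependentNoise:
    """gSDE-style exploration sketch (shapes and API are assumptions, not the repo's).

    The noise matrix is resampled only every `resample_every` steps; in between,
    exploration noise is a deterministic function of the policy features, which
    keeps the action signal smooth on environments that act like low-pass filters.
    """

    def __init__(self, feature_dim, action_dim, log_std=-0.5, resample_every=64):
        self.std = torch.full((feature_dim, action_dim), log_std).exp()
        self.resample_every = resample_every
        self.steps = 0
        self.exploration_mat = torch.randn_like(self.std) * self.std

    def __call__(self, features, mean_action):
        if self.steps % self.resample_every == 0:
            # resample the exploration matrix only periodically
            self.exploration_mat = torch.randn_like(self.std) * self.std
        self.steps += 1
        noise = features @ self.exploration_mat      # noise depends on the current state's features
        return mean_action + noise
```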
The state of the art in deep RL has largely been reached by ramping up scale. But with enough effort, patience, and time spent optimizing pipelines, roughly 80-90% of state-of-the-art results can be achieved on commodity hardware.
I'm setting out to build such a pipeline from scratch, to learn the intricacies of writing fast reinforcement learning pipelines and to combine improvements from published work into general algorithmic speedups.
I will start with simple classic-control environments, then ramp up to standard benchmarks like RoboSchool, and then to pixel-based environments like Atari.
My goal is to have a single algorithm solve all of these out of the box with the same set of hyperparameters.
main.py
configures the experiments. I haven't set up argparse or config-file loading yet (it's on the todo list); for now, all configuration is done by editing the config instances in main.py and then running it.
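As a purely illustrative example of that workflow (the class, field, and entry-point names below are hypothetical and do not correspond to the actual objects in this repo), a run is configured by editing values in main.py and executing it:

```python
from dataclasses import dataclass

# illustrative only: the real config classes and fields in this repo may differ
@dataclass
class ExperimentConfig:
    env_name: str = "Pendulum-v0"      # which environment to run
    n_step: int = 3                    # n-step return horizon
    use_her: bool = False              # toggle Hindsight Experience Replay
    num_rollout_workers: int = 4       # asynchronous environment workers

if __name__ == "__main__":
    config = ExperimentConfig()
    config.use_her = True              # edit fields here, then run main.py
    # run_experiment(config)           # hypothetical entry point
```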
This was tested on Windows 10 with torch 1.3.0.