Deep RL Zoo

A collection of Deep Reinforcement Learning algorithms implemented in PyTorch, heavily based on OpenAI's Spinning Up.

The collection is divided into two sets: single-agent and multi-agent algorithms.

Setup

The project runs on Python 3.11. To install the dependencies, run:

pip install -r requirements.txt
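
If you prefer an isolated setup, you can first create and activate a virtual environment (assuming Python 3.11 is available as python3.11 on your system), then run the install command above:

python3.11 -m venv .venv
source .venv/bin/activate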

Running experiments

Each experiment can be run directly with its default settings:

python zoo/single/sac.py

or it can be run through run.py:

python -m run sac

The latter allows running n experiments with different seeds (0, ..., n-1) at once. For example, to run 5 experiments with the SAC agent (using default settings):

python -m run sac -n 5

To customize experiment settings, check out each algorithm file for more details. For example, here are the arguments used in SAC:

usage: sac.py [-h] [--env ENV] [--exp-name EXP_NAME] [--seed SEED]
              [--hidden-sizes HIDDEN_SIZES [HIDDEN_SIZES ...]] [--lr LR]
              [--epochs EPOCHS] [--steps-per-epoch STEPS_PER_EPOCH]
              [--max-ep-len MAX_EP_LEN] [--buffer-size BUFFER_SIZE]
              [--batch-size BATCH_SIZE] [--start-step START_STEP]
              [--update-every UPDATE_EVERY] [--update-after UPDATE_AFTER]
              [--gamma GAMMA] [--tau TAU] [--ent-coeff ENT_COEFF]
              [--adjust-ent-coeff] [--ent-coeff-init ENT_COEFF_INIT]
              [--ent-target ENT_TARGET] [--test-episodes TEST_EPISODES] [--save]
              [--save-every SAVE_EVERY] [--render] [--plot]

Soft Actor-Critic

optional arguments:
  -h, --help            show this help message and exit
  --env ENV             Environment ID
  --exp-name EXP_NAME   Experiment name
  --seed SEED           Seed for RNG
  --hidden-sizes HIDDEN_SIZES [HIDDEN_SIZES ...]
                        Sizes of policy & Q networks' hidden layers
  --lr LR               Learning rate for policy, Q networks & entropy coefficient
                        optimizers
  --epochs EPOCHS       Number of epochs
  --steps-per-epoch STEPS_PER_EPOCH
                        Maximum number of steps for each epoch
  --max-ep-len MAX_EP_LEN
                        Maximum episode/trajectory length
  --buffer-size BUFFER_SIZE
                        Replay buffer size
  --batch-size BATCH_SIZE
                        Minibatch size
  --start-step START_STEP
                        Start step to begin action selection according to policy
                        network
  --update-every UPDATE_EVERY
                        Parameters update frequency
  --update-after UPDATE_AFTER
                        Number of steps after which update is allowed
  --gamma GAMMA         Discount factor
  --tau TAU             Soft (Polyak averaging) update coefficient
  --ent-coeff ENT_COEFF
                        Entropy regularization coefficient
  --adjust-ent-coeff    Whether to enable automating entropy adjustment scheme
  --ent-coeff-init ENT_COEFF_INIT
                        Initial value for automating entropy adjustment scheme
  --ent-target ENT_TARGET
                        Desired entropy, used for automating entropy adjustment
  --test-episodes TEST_EPISODES
                        Number of episodes to test the deterministic policy at the
                        end of each epoch
  --save                Whether to save the final model
  --save-every SAVE_EVERY
                        Model saving frequency
  --render              Whether to render the training result
  --plot                Whether to plot the training statistics
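
For instance, a customized SAC run might look like the following (the environment ID HalfCheetah-v4 and the hyperparameter values here are only illustrative; any environment ID accepted by --env and any of the arguments listed above can be combined):

python zoo/single/sac.py --env HalfCheetah-v4 --epochs 100 --hidden-sizes 256 256 --seed 1 --save --plot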

Plotting results

usage: plot.py [-h] [--log-dirs LOG_DIRS [LOG_DIRS ...]] [-x [{epoch,total-env-interacts} ...]] [-y [Y_AXIS ...]]
               [-s SAVEDIR]

Results plotting

optional arguments:
  -h, --help            show this help message and exit
  --log-dirs LOG_DIRS [LOG_DIRS ...]
                        Directories for saving log files
  -x [{epoch,total-env-interacts} ...], --x-axis [{epoch,total-env-interacts} ...]
                        Horizontal axes to plot
  -y [Y_AXIS ...], --y-axis [Y_AXIS ...]
                        Vertical axes to plot
  -s SAVEDIR, --savedir SAVEDIR
                        Directory to save plotting results
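
For example, to plot the results of two runs against the number of environment interactions and save the figures (the script location and directory names here are placeholders; use -y to pick which logged statistics to plot):

python path/to/plot.py --log-dirs path/to/log/dir1 path/to/log/dir2 -x total-env-interacts -s path/to/save/dir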

Testing policy

The resulting policy can be tested via the following command

python -m run test_policy --log-dir path/to/the/log/dir

where path/to/the/log/dir is the path to the log directory, which stores the model file, config file, etc. For more details, check out the following:

usage: test_policy.py [-h] --log-dir LOG_DIR [--eps EPS] [--max-ep-len MAX_EP_LEN]
                      [--render]

Policy testing

optional arguments:
  -h, --help            show this help message and exit
  --log-dir LOG_DIR     Path to the log directory, which stores model file, config file,
                        etc
  --eps EPS             Number of episodes
  --max-ep-len MAX_EP_LEN
                        Maximum length of an episode
  --render              Whether to render the experiment
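
For example, to roll out the saved policy for 10 test episodes with rendering (the log directory path is the placeholder from above):

python -m run test_policy --log-dir path/to/the/log/dir --eps 10 --render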

Some results

[Figure: performance vs. total environment interactions, SAC on HalfCheetah]

References

[1] Josh Achiam. Spinning Up in Deep Reinforcement Learning. 2018.
[2] Richard S. Sutton & Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[3] Volodymyr Mnih et al. Playing Atari with Deep Reinforcement Learning. arXiv preprint, arXiv:1312.5602, 2013.
[4] Volodymyr Mnih et al. Human-Level Control Through Deep Reinforcement Learning. Nature, 2015.
[5] Hado van Hasselt, Arthur Guez, David Silver. Deep Reinforcement Learning with Double Q-learning. AAAI 2016.
[6] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas. Dueling Network Architectures for Deep Reinforcement Learning. arXiv preprint, arXiv:1511.06581, 2015.
[7] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint, arXiv:1707.06347, 2017.
[8] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel. Trust Region Policy Optimization. ICML 2015.
[9] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel. High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR 2016.
[10] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller. Deterministic Policy Gradient Algorithms. ICML 2014.
[11] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra. Continuous Control with Deep Reinforcement Learning. ICLR 2016.
[12] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, Sergey Levine. Reinforcement Learning with Deep Energy-Based Policies. ICML 2017.
[13] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, Igor Mordatch. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NIPS 2017.
[14] Eric Jang, Shixiang Gu, Ben Poole. Categorical Reparameterization with Gumbel-Softmax. ICLR 2017.
[15] Chris J. Maddison, Andriy Mnih, Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. ICLR 2017.
