A collection of Deep Reinforcement Learning algorithms implemented in PyTorch, heavily based on OpenAI's Spinning Up.
The collection is divided into two sets:
- Single-agent methods:
- Multi-agent methods:
  - Consisting of MADDPG.
  - Training & testing environment: Multi Particle Environments (MPE) from PettingZoo.
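PettingZoo exposes the MPE scenarios through both its AEC and parallel APIs. As a quick orientation, here is a minimal random-action rollout using the parallel API; the scenario name and version suffix are only examples and depend on the installed PettingZoo release:

```python
# Minimal MPE rollout via PettingZoo's parallel API (illustrative only; the
# scenario and version suffix depend on the installed PettingZoo release).
from pettingzoo.mpe import simple_spread_v3

env = simple_spread_v3.parallel_env(max_cycles=25)
observations, infos = env.reset(seed=0)

while env.agents:
    # Sample a random action for every live agent; a trained MADDPG policy would go here.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)

env.close()
```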
The project runs on Python 3.11. To install the dependencies, run the command
pip install -r requirements.txt
Each experiment with default settings can be run directly as
python zoo/single/sac.py
or can be run through run.py
python -m run sac
The latter enables running n experiments with different seeds (0, ..., n-1) at once. For example, to perform 5 experiments with the SAC agent (with default settings), run the command
python -m run sac -n 5
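Conceptually, this amounts to launching the same script once per seed. A rough sketch of the assumed behaviour (the actual run.py may differ, e.g. by running experiments in parallel):

```python
# Rough sketch of what `python -m run sac -n 5` is assumed to do:
# run the SAC script once for each seed 0..4. The real run.py may differ.
import subprocess

for seed in range(5):
    subprocess.run(
        ["python", "zoo/single/sac.py", "--seed", str(seed)],
        check=True,
    )
```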
To customize experiment settings, check out each algorithm file for more details. For example, here are the arguments used in SAC:
usage: sac.py [-h] [--env ENV] [--exp-name EXP_NAME] [--seed SEED]
[--hidden-sizes HIDDEN_SIZES [HIDDEN_SIZES ...]] [--lr LR]
[--epochs EPOCHS] [--steps-per-epoch STEPS_PER_EPOCH]
[--max-ep-len MAX_EP_LEN] [--buffer-size BUFFER_SIZE]
[--batch-size BATCH_SIZE] [--start-step START_STEP]
[--update-every UPDATE_EVERY] [--update-after UPDATE_AFTER]
[--gamma GAMMA] [--tau TAU] [--ent-coeff ENT_COEFF]
[--adjust-ent-coeff] [--ent-coeff-init ENT_COEFF_INIT]
[--ent-target ENT_TARGET] [--test-episodes TEST_EPISODES] [--save]
[--save-every SAVE_EVERY] [--render] [--plot]
Soft Actor-Critic
optional arguments:
-h, --help show this help message and exit
--env ENV Environment ID
--exp-name EXP_NAME Experiment name
--seed SEED Seed for RNG
--hidden-sizes HIDDEN_SIZES [HIDDEN_SIZES ...]
Sizes of policy & Q networks' hidden layers
--lr LR Learning rate for policy, Q networks & entropy coefficient
optimizers
--epochs EPOCHS Number of epochs
--steps-per-epoch STEPS_PER_EPOCH
Maximum number of steps for each epoch
--max-ep-len MAX_EP_LEN
Maximum episode/trajectory length
--buffer-size BUFFER_SIZE
Replay buffer size
--batch-size BATCH_SIZE
Minibatch size
--start-step START_STEP
Start step to begin action selection according to policy
network
--update-every UPDATE_EVERY
Parameters update frequency
--update-after UPDATE_AFTER
Number of steps after which update is allowed
--gamma GAMMA Discount factor
--tau TAU Soft (Polyak averaging) update coefficient
--ent-coeff ENT_COEFF
Entropy regularization coefficient
--adjust-ent-coeff Whether to enable automating entropy adjustment scheme
--ent-coeff-init ENT_COEFF_INIT
Initial value for automating entropy adjustment scheme
--ent-target ENT_TARGET
Desired entropy, used for automating entropy adjustment
--test-episodes TEST_EPISODES
Number of episodes to test the deterministic policy at the
end of each epoch
--save Whether to save the final model
--save-every SAVE_EVERY
Model saving frequency
--render Whether to render the training result
--plot Whether to plot the training statistics
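For example, a customized run might look like the following (the environment ID here is only a placeholder for any continuous-control Gym/Gymnasium ID):

python zoo/single/sac.py --env Pendulum-v1 --epochs 100 --seed 1 --save --plot

The --adjust-ent-coeff, --ent-coeff-init and --ent-target flags refer to SAC's automatic entropy (temperature) adjustment. A minimal PyTorch sketch of that update is shown below for reference; the variable names are illustrative and may not match those used in sac.py:

```python
import math
import torch

# Hedged sketch of SAC's automatic entropy-coefficient (temperature) update.
# Names are illustrative and may not match sac.py.
ent_coeff_init = 1.0      # --ent-coeff-init
ent_target = -6.0         # --ent-target, commonly -dim(action space)

log_alpha = torch.tensor(math.log(ent_coeff_init), requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=1e-3)

def update_entropy_coeff(log_prob: torch.Tensor) -> torch.Tensor:
    """One gradient step on log(alpha), given log-probs of actions sampled from the policy."""
    alpha_loss = -(log_alpha * (log_prob + ent_target).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().detach()  # entropy coefficient plugged into the actor/critic losses
```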
Training statistics can be plotted with plot.py. Its arguments are listed below:
usage: plot.py [-h] [--log-dirs LOG_DIRS [LOG_DIRS ...]] [-x [{epoch,total-env-interacts} ...]] [-y [Y_AXIS ...]]
[-s SAVEDIR]
Results plotting
optional arguments:
-h, --help show this help message and exit
--log-dirs LOG_DIRS [LOG_DIRS ...]
Directories for saving log files
-x [{epoch,total-env-interacts} ...], --x-axis [{epoch,total-env-interacts} ...]
Horizontal axes to plot
-y [Y_AXIS ...], --y-axis [Y_AXIS ...]
Vertical axes to plot
-s SAVEDIR, --savedir SAVEDIR
Directory to save plotting results
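For example, assuming plot.py is dispatched through run.py like the other scripts, and noting that the y-axis metric name below is only a placeholder for whatever the logger actually records:

python -m run plot --log-dirs path/to/log1 path/to/log2 -x total-env-interacts -y avg-return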
The resulting policy can be tested via the following command
python -m run test_policy --log-dir path/to/the/log/dir
where path/to/the/log/dir is the path to the log directory, which stores the model file, config file, etc. For more details, see the usage below:
usage: test_policy.py [-h] --log-dir LOG_DIR [--eps EPS] [--max-ep-len MAX_EP_LEN]
[--render]
Policy testing
optional arguments:
-h, --help show this help message and exit
--log-dir LOG_DIR Path to the log directory, which stores model file, config file,
etc
--eps EPS Number of episodes
--max-ep-len MAX_EP_LEN
Maximum length of an episode
--render Whether to render the experiment
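For example, to run 10 evaluation episodes with rendering:

python -m run test_policy --log-dir path/to/the/log/dir --eps 10 --render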
[1] Joshua Achiam. Spinning Up in Deep Reinforcement Learning. 2018.
[2] Richard S. Sutton & Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[3] Volodymyr Mnih, et al. Playing Atari with Deep Reinforcement Learning. arXiv preprint, 2013.
[4] Volodymyr Mnih, et al. Human-level Control through Deep Reinforcement Learning. Nature, 2015.
[5] Hado van Hasselt, Arthur Guez, David Silver. Deep Reinforcement Learning with Double Q-learning. AAAI 2016.
[6] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas. Dueling Network Architectures for Deep Reinforcement Learning. arXiv preprint arXiv:1511.06581, 2015.
[7] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
[8] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel. Trust Region Policy Optimization. ICML 2015.
[9] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel. High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR 2016.
[10] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller. Deterministic Policy Gradient Algorithms. JMLR 2014.
[11] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra. Continuous control with deep reinforcement learning. ICLR 2016.
[12] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, Sergey Levine. Reinforcement Learning with Deep Energy-Based Policies. ICML, 2017.
[13] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, Igor Mordatch. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NIPS 2017.
[14] Eric Jang, Shixiang Gu, Ben Poole. Categorical Reparameterization with Gumbel-Softmax. ICLR 2017.
[15] Chris J. Maddison, Andriy Mnih, Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. ICLR 2017.