PPO on MuJoCo benchmarks

This example trains a PPO agent (Proximal Policy Optimization Algorithms) on MuJoCo benchmarks from OpenAI Gym.

We follow the training and evaluation settings of Deep Reinforcement Learning that Matters, which provides thorough, highly tuned benchmark results.
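
At its core, PPO maximizes a clipped surrogate objective over the ratio between the new and old policies. The following is a minimal NumPy sketch of that objective for illustration only; it is not the ChainerRL implementation this example uses, the function name is hypothetical, and clip_eps=0.2 follows the paper's default:

import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    # ratio: pi_new(a|s) / pi_old(a|s) for a batch of transitions.
    # advantage: advantage estimates, e.g. from GAE.
    # clip_eps: clipping width; 0.2 is the paper's default (an assumption here).
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the elementwise minimum lower-bounds the unclipped objective
    # and discourages updates that move the ratio outside the clip range.
    return np.mean(np.minimum(unclipped, clipped))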

Requirements

  • MuJoCo Pro 1.5
  • mujoco_py>=1.50, <2.1

Running the Example

To run the training example:

python train_ppo.py [options]

We provide models pretrained with this script for all the domains listed in the Results section. To load a pretrained model:

python train_ppo.py --demo --load-pretrained --env HalfCheetah-v2 --gpu -1

Useful Options

  • --gpu. Specifies the GPU device ID. If you do not have a GPU on your machine, run the example with --gpu -1, e.g. python train_ppo.py --gpu -1.
  • --env. Specifies the environment, e.g. python train_ppo.py --env HalfCheetah-v2.
  • --render. Renders the states in a GUI window.
  • --seed. Specifies the random seed.
  • --outdir. Specifies the output directory to which the results are written.
  • --demo. Runs an evaluation instead of training the agent.
  • --load-pretrained. Loads the pretrained model. Note that --load and --load-pretrained cannot be used together.

To view the full list of options, either view the code or run the example with the --help option.
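
For reference, options like these are commonly defined with Python's argparse. The sketch below is hypothetical: the flags match the ones documented above, but the defaults and help strings are assumptions and may not match what train_ppo.py actually defines:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--gpu', type=int, default=0,
                    help='GPU device ID; -1 runs on CPU (default is an assumption)')
parser.add_argument('--env', type=str, default='Hopper-v2',
                    help='OpenAI Gym MuJoCo environment ID (default is an assumption)')
parser.add_argument('--seed', type=int, default=0,
                    help='random seed')
parser.add_argument('--outdir', type=str, default='results',
                    help='directory to which results are written')
parser.add_argument('--render', action='store_true',
                    help='render the states in a GUI window')
parser.add_argument('--demo', action='store_true',
                    help='run an evaluation instead of training')
parser.add_argument('--load-pretrained', action='store_true',
                    help='load the pretrained model')
args = parser.parse_args()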

Known differences

Results

Each score is the average return +/- standard error over 100 evaluation episodes after 2M training steps.
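
For reference, a mean +/- standard error statistic over a set of episode returns can be computed as follows (a generic NumPy sketch, not the repository's evaluation code):

import numpy as np

def mean_and_standard_error(returns):
    returns = np.asarray(returns, dtype=float)
    mean = returns.mean()
    # Standard error: sample standard deviation divided by sqrt(n).
    stderr = returns.std(ddof=1) / np.sqrt(len(returns))
    return mean, stderr

# Example: mean_and_standard_error(list_of_100_episode_returns)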

Reported scores are taken from Table 1 of Deep Reinforcement Learning that Matters.

ChainerRL scores are based on 20 trials, each with a different random seed, obtained with the following command:

python train_ppo.py --gpu -1 --seed [0-19] --env [env]

Environment   ChainerRL Score   Reported Score
HalfCheetah   2404+/-185        2201+/-323
Hopper        2719+/-67         2790+/-62
Walker2d      2994+/-113        N/A
Swimmer       111+/-4           N/A

Training times

These training times were obtained by running train_ppo.py on a single CPU with no GPU.

Environment   ChainerRL Time
HalfCheetah   2.054 hours
Hopper        2.057 hours
Swimmer       2.051 hours
Walker2d      2.065 hours

Statistic                       Value
Mean time across all domains    2.057 hours
Fastest domain                  Swimmer (2.051 hours)
Slowest domain                  Walker2d (2.065 hours)
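
As a quick sanity check, the mean above is just the average of the four per-domain times (values copied from the table):

times = {'HalfCheetah': 2.054, 'Hopper': 2.057,
         'Swimmer': 2.051, 'Walker2d': 2.065}
mean_hours = sum(times.values()) / len(times)
print(round(mean_hours, 3))  # 2.057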

Learning Curves

The shaded region represents one standard deviation of the average evaluation over 20 trials.

[Learning curves for HalfCheetah-v2, Hopper-v2, Walker2d-v2, and Swimmer-v2]