# Lunar Lander with Reinforcement Learning

This repository trains three reinforcement learning agents on the Lunar Lander environment:

- Soft Actor-Critic (SAC)
- Deep Q-Learning (DQN)
- Proximal Policy Optimization (PPO)
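A minimal training sketch for the best-performing configuration below (SAC on the continuous task), assuming `stable-baselines3` and `gymnasium[box2d]` are installed. The environment id and save path are illustrative, and the hyperparameters are library defaults rather than the exact settings behind the results in this README:

```python
import gymnasium as gym
from stable_baselines3 import SAC

# SAC requires a continuous action space; continuous=True selects the
# continuous variant of Lunar Lander. On Gymnasium >= 1.0 the environment
# id is "LunarLander-v3".
env = gym.make("LunarLander-v2", continuous=True)

model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=300_000)  # roughly the step count in the table below
model.save("sac_lunar_lander")        # illustrative save path
```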

## Results

Hardware: Google Colab T4 GPU

| Model | Discrete Actions | Average Reward | Training Time (h:mm:ss) | Total Training Steps |
|-------|------------------|----------------|-------------------------|----------------------|
| PPO   | No               | 266.01         | 1:35:29                 | 501,747              |
| PPO   | Yes              | 223.38         | 2:07:30                 | 501,721              |
| SAC   | No               | 278.36         | 1:21:13                 | 299,998              |
| DQN   | Yes              | 155.64         | 1:59:15                 | 999,999              |

## Training Notes

- Set a nonzero `ent_coef` for PPO, as the entropy bonus encourages exploration of other actions; Stable Baselines3 defaults the value to 0.0 (More Information). A sketch setting it follows this list.
- Do not set `eval_freq` too low: learning can become unstable when it is interrupted too frequently by evaluation. A value of 10,000 steps or more worked well.
- Stable Baselines3's DQN parameters `exploration_initial_eps` and `exploration_final_eps` control how exploratory the model is at the beginning and end of training (see the DQN sketch after this list).
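A sketch combining the first two notes. The `ent_coef` value of 0.01, the save path, and the environment id are illustrative assumptions, not settings taken from this README:

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback

env = gym.make("LunarLander-v2")
eval_env = gym.make("LunarLander-v2")

# Evaluate no more often than every 10,000 steps so that evaluation does not
# interrupt learning too frequently.
eval_callback = EvalCallback(
    eval_env,
    eval_freq=10_000,
    n_eval_episodes=5,
    best_model_save_path="./best_ppo",  # illustrative save path
)

# ent_coef defaults to 0.0 in Stable Baselines3; a small positive value adds
# an entropy bonus that encourages exploration.
model = PPO("MlpPolicy", env, ent_coef=0.01, verbose=1)
model.learn(total_timesteps=500_000, callback=eval_callback)
```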
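And a sketch for the DQN exploration schedule: Stable Baselines3 anneals epsilon linearly from `exploration_initial_eps` to `exploration_final_eps` over the first `exploration_fraction` of training. The values shown are illustrative, not the ones used for the results above:

```python
import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make("LunarLander-v2")

model = DQN(
    "MlpPolicy",
    env,
    exploration_initial_eps=1.0,  # fully random actions at the start
    exploration_final_eps=0.05,   # mostly greedy actions by the end
    exploration_fraction=0.2,     # anneal over the first 20% of training steps
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```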

## Finding Theta Blog Posts