# Lunar Lander with Reinforcement Learning

This repository trains three reinforcement learning agents on the Lunar Lander environment:

- Soft Actor-Critic (SAC)
- Deep Q-Learning (DQN)
- Proximal Policy Optimization (PPO)
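A minimal training sketch for the best-performing configuration below (SAC on the continuous task), assuming `stable-baselines3` and `gymnasium[box2d]` are installed. The environment id and save path are illustrative, and the hyperparameters are library defaults rather than the exact settings behind the results in this README:

```python
import gymnasium as gym
from stable_baselines3 import SAC

# SAC requires a continuous action space; continuous=True selects the
# continuous variant of Lunar Lander. On Gymnasium >= 1.0 the environment
# id is "LunarLander-v3".
env = gym.make("LunarLander-v2", continuous=True)

model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=300_000)  # roughly the step count in the table below
model.save("sac_lunar_lander")        # illustrative save path
```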

## Results

Hardware: Google Colab T4 GPU

| Model | Discrete Actions | Average Reward | Training Time (h:mm:ss) | Total Training Steps |
|-------|------------------|----------------|-------------------------|----------------------|
| PPO   | No               | 266.01         | 1:35:29                 | 501,747              |
| PPO   | Yes              | 223.38         | 2:07:30                 | 501,721              |
| SAC   | No               | 278.36         | 1:21:13                 | 299,998              |
| DQN   | Yes              | 155.64         | 1:59:15                 | 999,999              |

## Training Notes

- Set a nonzero `ent_coef` for PPO, as the entropy bonus encourages exploration of other actions; Stable Baselines3 defaults the value to 0.0 (More Information). A sketch setting it follows this list.
- Do not set `eval_freq` too low: learning can become unstable when it is interrupted too frequently by evaluation. A value of 10,000 steps or more worked well.
- Stable Baselines3's DQN parameters `exploration_initial_eps` and `exploration_final_eps` control how exploratory the model is at the beginning and end of training (see the DQN sketch after this list).
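A sketch combining the first two notes. The `ent_coef` value of 0.01, the save path, and the environment id are illustrative assumptions, not settings taken from this README:

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback

env = gym.make("LunarLander-v2")
eval_env = gym.make("LunarLander-v2")

# Evaluate no more often than every 10,000 steps so that evaluation does not
# interrupt learning too frequently.
eval_callback = EvalCallback(
    eval_env,
    eval_freq=10_000,
    n_eval_episodes=5,
    best_model_save_path="./best_ppo",  # illustrative save path
)

# ent_coef defaults to 0.0 in Stable Baselines3; a small positive value adds
# an entropy bonus that encourages exploration.
model = PPO("MlpPolicy", env, ent_coef=0.01, verbose=1)
model.learn(total_timesteps=500_000, callback=eval_callback)
```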
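And a sketch for the DQN exploration schedule: Stable Baselines3 anneals epsilon linearly from `exploration_initial_eps` to `exploration_final_eps` over the first `exploration_fraction` of training. The values shown are illustrative, not the ones used for the results above:

```python
import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make("LunarLander-v2")

model = DQN(
    "MlpPolicy",
    env,
    exploration_initial_eps=1.0,  # fully random actions at the start
    exploration_final_eps=0.05,   # mostly greedy actions by the end
    exploration_fraction=0.2,     # anneal over the first 20% of training steps
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```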

## Finding Theta Blog Posts