PG_LunarLander_v2

A simple policy gradient solving LunarLander-v2 without a baseline, with temporal structure (reward-to-go).
However, the returns collected over each batch of trajectories are normalized by their mean and standard deviation.
This helps training, performing much like a mean baseline.
The environment was solved at episode 11840.
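
The repository's code is not reproduced here, but the trick is simple to sketch. Below is a minimal NumPy/PyTorch sketch, not the repo's actual implementation: the helper names reward_to_go and policy_gradient_loss and the discount factor gamma=0.99 are assumptions. Reward-to-go supplies the temporal structure, and standardizing the batch of returns supplies the implicit mean baseline.

```python
import numpy as np
import torch

# Minimal sketch, assuming gamma=0.99; reward_to_go and
# policy_gradient_loss are hypothetical names, not from the repo.

def reward_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go for one trajectory: each step is credited
    only with the rewards that come after it (the temporal structure)."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def policy_gradient_loss(log_probs, batch_returns):
    """REINFORCE loss with the batch of returns standardized.

    Subtracting the batch mean re-centers the returns, which acts much
    like a mean baseline; dividing by the standard deviation also
    rescales the gradient.
    """
    returns = torch.as_tensor(np.concatenate(batch_returns))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(log_probs * returns).mean()
```

Here log_probs would be the concatenated log π(a_t|s_t) for every step in the batch, and batch_returns the matching per-trajectory reward-to-go arrays, so the loss matches the gradient estimator described above.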

What I observed here:

  1. a simple policy gradient with a small learning rate is good enough for this environment, although it takes many episodes to reach a high score
  2. learning speed would likely improve with an actor-critic architecture (a minimal sketch follows this list)
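
To make that second observation concrete, here is a minimal, hypothetical actor-critic sketch in PyTorch; the ActorCritic class, layer sizes, and 0.5 value-loss weight are assumptions, not the repo's code. A learned value head replaces the batch normalization as the baseline, so advantages are available per step rather than per batch.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared-body actor-critic for LunarLander-v2
    (8-dim observation, 4 discrete actions)."""
    def __init__(self, obs_dim=8, n_actions=4, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_actions)  # policy head (logits)
        self.v = nn.Linear(hidden, 1)           # value head (baseline)

    def forward(self, obs):
        h = self.body(obs)
        return self.pi(h), self.v(h).squeeze(-1)

def actor_critic_loss(logits, values, actions, returns):
    # Advantage = return - learned baseline; detach the critic so the
    # policy loss does not backpropagate through the value head.
    advantages = returns - values.detach()
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    policy_loss = -(log_probs * advantages).mean()
    value_loss = nn.functional.mse_loss(values, returns)
    return policy_loss + 0.5 * value_loss
```

Because the critic provides a state-dependent baseline at every step, the gradient estimate has lower variance than the batch-normalized returns above, which is why training would be expected to speed up.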

Reward plot

(Figures: reward_plot, reward_plot2)
