about steps related to the reward #11

Open
congling opened this issue Sep 3, 2016 · 6 comments
congling commented Sep 3, 2016

Hi Kosuke,
I've tried your model on the Breakout game. The performance was amazing: the average score went up to 520 after 80M steps, far better than any other model I've tried.
But the average score didn't go up much after 80M. Sometimes a game takes a huge number of steps: when only one or two bricks are left, the ball bounces between the paddle and the wall but never hits the remaining bricks.
Do you think it would be better to add the number of steps per game as a penalty when computing R? Something like:
R -= beta * sqrt(steps)
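To be concrete, something like the following is what I have in mind. This is just a sketch; `beta` and the per-episode `steps` counter are names I'm inventing here, not variables from the repo:

```python
import math

# Sketch only: fold a step-count penalty into the n-step discounted returns.
# `beta` and `episode_steps` are illustrative names, not existing repo variables.
def discounted_returns(rewards, bootstrap_value, episode_steps,
                       gamma=0.99, beta=0.01):
    # Penalize long games by shrinking the bootstrapped return.
    R = bootstrap_value - beta * math.sqrt(episode_steps)
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return returns

# Example: a 3-step rollout after 1200 steps of the current episode.
print(discounted_returns([0.0, 1.0, 0.0], bootstrap_value=0.5, episode_steps=1200))
```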
Thanks

BTW, I've made a couple of changes:
1. Changed ACTION_SIZE = 4, because Breakout has 4 actions in ALE (a quick way to check this is sketched after the code below).
2. If a life is lost, treat it as terminal:

# if not terminal_end:
if lives == new_lives and not terminal_end:
    # no life lost and episode not over: bootstrap from the current value estimate
    R = self.local_network.run_value(sess, self.game_state.s_t)
else:
    # a life was lost (or the episode ended): leave R at 0.0 and record the new life count
    # print("lives dropped from %d to %d" % (lives, new_lives))
    lives = new_lives
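For change 1, here is a quick standalone way to check the action count with the ALE Python bindings. This is not code from this repo, and the ROM path is just an example:

```python
from ale_python_interface import ALEInterface  # standalone ALE bindings

ale = ALEInterface()
ale.loadROM(b'breakout.bin')  # example ROM path (str vs. bytes depends on the ALE version)

# Breakout's minimal action set has 4 entries (NOOP, FIRE, RIGHT, LEFT),
# even though ALE defines 18 legal actions overall.
minimal_actions = ale.getMinimalActionSet()
print(minimal_actions)       # e.g. [0 1 3 4]
print(len(minimal_actions))  # 4
```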


miyosuda commented Sep 3, 2016

But the average score didn't go up much after 80M.

With the default "MAX_TIME_STEP" constant in constants.py, training runs for 100M steps.
The learning rate is annealed to 0.0 over those 100M steps, so the learning rate around 80M is already very small.
One thing to try is a bigger MAX_TIME_STEP, like 150M or 200M, but it might not help.
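For reference, the annealing is roughly of this form. This is a simplified sketch, not a copy of the repo's code, and 7e-4 is just an example initial learning rate:

```python
def anneal_learning_rate(initial_learning_rate, global_t, max_time_step):
    # Decay the learning rate towards 0.0 as global_t approaches max_time_step.
    learning_rate = initial_learning_rate * (max_time_step - global_t) / float(max_time_step)
    return max(learning_rate, 0.0)  # clamp to 0.0 once max_time_step is exceeded

# At 80M out of 100M steps, only 20% of the initial learning rate remains:
print(anneal_learning_rate(7e-4, 80 * 10**6, 100 * 10**6))  # -> 0.00014
```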

BTW, @Itsukara has also reported the same result: the score stops improving around 80M.
http://itsukara.hateblo.jp/entry/2016/08/02/190029

Maybe he is using A3C-FF mode?

He also tried treating life loss as terminal, and learning was faster
(reaching a score of 400 around 18M steps?).
https://cdn-ak.f.st-hatena.com/images/fotolife/I/Itsukara/20160824/20160824034536.png
http://itsukara.hateblo.jp/entry/2016/08/11/003715

Do you think it would be better to add the number of steps per game as a penalty when computing R?

Ah, I see. That might help.

There is another A3C implementation which reports a higher Breakout score.

https://github.com/ppwwyyxx/tensorpack/tree/master/examples/Atari2600

Their A3C training code doesn't seem to be released yet, but please check it once it is released.


congling commented Sep 4, 2016

Thanks for your reply. @Itsukara's improvement looks quite good; I'm trying it now.


miyosuda commented Sep 4, 2016

@congling
Sorry, my mistake: his graph
https://cdn-ak.f.st-hatena.com/images/fotolife/I/Itsukara/20160824/20160824034536.png
was the graph for "life loss as -1 reward."

And he said that he was using A3C-FF when recording this graph.
Thanks, @Itsukara!

@sahiliitm

Btw, the code @congling used is not equivalent to resetting the game on a lost life. :p All it does is say: if you just lost a life, ignore the value estimate of the next state and treat it as 0.
A more correct approach to resetting the simulation on a lost life is:

    if terminal or (RESET_ON_LOST_LIFE and old_lives != new_lives):
        terminal_end = True

on this line.
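To make the difference concrete, here is a small self-contained sketch of the two variants (RESET_ON_LOST_LIFE, old_lives, and new_lives are illustrative names, as above; this is not the repo's actual code):

```python
RESET_ON_LOST_LIFE = True  # hypothetical switch, as in the snippet above

def bootstrap_and_reset(terminal, old_lives, new_lives, value_estimate):
    life_lost = old_lives != new_lives

    # @congling's change: only the bootstrap value is zeroed on life loss;
    # the rollout keeps running from the current frame.
    R_zeroed_only = 0.0 if (terminal or life_lost) else value_estimate

    # Setting terminal_end additionally makes the training thread finish the
    # episode (reset the game state and the episode statistics).
    terminal_end = terminal or (RESET_ON_LOST_LIFE and life_lost)
    R_reset = 0.0 if terminal_end else value_estimate

    return R_zeroed_only, R_reset, terminal_end

# A life was just lost mid-game: both variants return R = 0.0,
# but only the second one also ends the episode.
print(bootstrap_and_reset(terminal=False, old_lives=5, new_lives=4, value_estimate=0.7))
```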

@1601214542

@congling Hello, I am a little confused about where to add the lives code in the released code. I have a problem training on Breakout: I only reached a score of 50 at 18M steps. Can you give me your implementation code? Thank you.

@xiaoschannel

@1601214542 Hey, I have this problem too! Hmm. Is 50 a training score or a test score (where you get 5 lives to spare)? How often do you see it?
