Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature-request] N-step returns for TD methods #47

Open
araffin opened this issue Jun 8, 2020 · 5 comments
Open

[feature-request] N-step returns for TD methods #47

araffin opened this issue Jun 8, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@araffin
Copy link
Member

araffin commented Jun 8, 2020

Originally posted by @partiallytyped in hill-a/stable-baselines#821
"
N-step returns allow for much better stability, and improve performance when training DQN, DDPG etc, so it will be quite useful to have this feature.

A simple implementation of this would be as a wrapper around ReplayBuffer so it would work with both Prioritized and Uniform sampling. The wrapper keeps a queue of observed experiences compute the returns and add the experience to the buffer.
"

Roadmap: v1.1+ (see #1 )

@araffin araffin added the enhancement New feature or request label Jun 8, 2020
@araffin
Copy link
Member Author

araffin commented Jun 8, 2020

@partiallytyped I thought about that one, and we just need to change the sampling not the storage, no? (as a first approximation)

What I mean: at sampling time, we could re-create the trajectory (until a done is found or the buffer ends) by simply going through the indexes.

@m-rph
Copy link
Contributor

m-rph commented Jun 9, 2020

This approach sounds better than what I initially came up with, seems to have fewer moving parts and will be easier to reason about. I will get on it once V1.0 is released.

@araffin araffin added this to the v1.1 milestone Jun 9, 2020
@m-rph
Copy link
Contributor

m-rph commented Jun 19, 2020

How would you like this to be implemented? As a wrapper around the buffer, as a derived class from the buffer, or as it's own object that adheres to the buffer API?

@araffin
Copy link
Member Author

araffin commented Jun 19, 2020

A class that derives from the replay buffer class seems the natural option I would say.

@araffin
Copy link
Member Author

araffin commented Jul 23, 2021

As an update, I have an experimental version of SAC + Peng Q-Lambda in the contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/tree/feat/peng-q-lambda
I'm using an adapted version of the HER replay buffer (storing things by episodes) which can probably easily be updated for an n-step buffer (in fact, lambda=1 is the n-step version).
I also had to hack a bit SAC in order to have access to actor and target q-value.

Original repo by @robintyh1: https://github.com/robintyh1/icml2021-pengqlambda

@araffin araffin removed this from the v1.1 milestone Nov 4, 2021
Shunian-Chen pushed a commit to Shunian-Chen/AIPI530 that referenced this issue Nov 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants