[feature-request] N-step returns for TD methods #47
Comments
@partiallytyped I thought about that one, and as a first approximation we just need to change the sampling, not the storage, no? What I mean: at sampling time, we could re-create the trajectory (until a done is found or the buffer ends) by simply walking through the indices.
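To make the sampling-time idea concrete, here is a minimal sketch in plain Python. It assumes transitions are stored in insertion order in flat lists; the function name `n_step_from_index` and its parameters are hypothetical and are not part of the Stable-Baselines3 API. It also ignores the wrap-around of a circular buffer, which a real implementation would have to handle.

```python
def n_step_from_index(rewards, dones, start, n_steps, gamma, end):
    """Accumulate a discounted return starting at index `start`.

    Stops after `n_steps` transitions, at the first done flag, or when
    `end` (one past the last valid index) is reached, whichever is first.
    Returns (discounted_return, index_of_last_transition, steps_used).
    """
    ret, discount = 0.0, 1.0
    idx, steps = start, 0
    while steps < n_steps and idx < end:
        ret += discount * rewards[idx]
        discount *= gamma
        steps += 1
        if dones[idx]:
            break  # episode ended: do not cross into the next episode
        idx += 1
    return ret, start + steps - 1, steps


# Illustrative data: the episode ends at index 2.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
dones = [False, False, True, False, False]
ret, last, steps = n_step_from_index(rewards, dones, 0, 3, 0.5, len(rewards))
```

Under these assumptions, the TD target for the sampled transition would be `ret + gamma**steps * (1 - dones[last]) * Q(next_obs[last])`, so the learner needs `steps` and `last` in addition to the return.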
This approach sounds better than what I initially came up with, seems to have fewer moving parts and will be easier to reason about. I will get on it once V1.0 is released.
How would you like this to be implemented? As a wrapper around the buffer, as a class derived from the buffer, or as its own object that adheres to the buffer API?
A class that derives from the replay buffer class seems the natural option, I would say.
As an update, I have an experimental version of SAC + Peng Q-Lambda in the contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/tree/feat/peng-q-lambda Original repo by @robintyh1: https://github.com/robintyh1/icml2021-pengqlambda
Originally posted by @partiallytyped in hill-a/stable-baselines#821
"
N-step returns allow for much better stability and improve performance when training DQN, DDPG, etc., so it will be quite useful to have this feature.
A simple implementation of this would be a wrapper around ReplayBuffer, so that it works with both Prioritized and Uniform sampling. The wrapper keeps a queue of observed experiences, computes the returns, and adds the resulting experience to the buffer.
"
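The wrapper described in the quoted post could look roughly like the sketch below. Everything here is an assumption made for illustration: `ListBuffer` is a hypothetical stand-in for a replay buffer exposing `add(obs, action, reward, next_obs, done)`, and `NStepWrapper` is not an existing Stable-Baselines class.

```python
from collections import deque


class ListBuffer:
    """Hypothetical stand-in for a replay buffer that exposes add()."""

    def __init__(self):
        self.storage = []

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))


class NStepWrapper:
    """Queue transitions and push n-step transitions to the wrapped buffer."""

    def __init__(self, buffer, n_steps, gamma):
        self.buffer = buffer
        self.n_steps = n_steps
        self.gamma = gamma
        self.queue = deque()

    def _flush_oldest(self):
        # Fold the rewards of all queued transitions into the oldest one.
        obs, action, _, _, _ = self.queue[0]
        ret = sum(self.gamma ** k * t[2] for k, t in enumerate(self.queue))
        # The n-step transition ends where the newest queued one ends.
        _, _, _, last_next_obs, last_done = self.queue[-1]
        self.buffer.add(obs, action, ret, last_next_obs, last_done)
        self.queue.popleft()

    def add(self, obs, action, reward, next_obs, done):
        self.queue.append((obs, action, reward, next_obs, done))
        if done:
            # Episode ended: emit every pending (shorter) return.
            while self.queue:
                self._flush_oldest()
        elif len(self.queue) == self.n_steps:
            self._flush_oldest()


buf = ListBuffer()
wrapper = NStepWrapper(buf, n_steps=3, gamma=0.5)
for t in range(4):
    wrapper.add(obs=t, action=0, reward=1.0, next_obs=t + 1, done=(t == 3))
```

In this sketch every stored transition either spans exactly `n_steps` steps or ends with `done=True`, so the learner could bootstrap with a fixed `gamma**n_steps` discount. Because only `add` is intercepted, the wrapped buffer is free to sample uniformly or by priority. Time-limit truncation is not handled here.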
Roadmap: v1.1+ (see #1)