[RLlib] APPO enhancements (new API stack) vol 01: Add circular buffer #48798
Conversation
LGTM. Awesome changes. Some small nits.
rllib/algorithms/appo/appo.py
Outdated
@@ -193,6 +198,14 @@ def training(
                on before updating the target networks and tune the kl loss
                coefficients. NOTE: This parameter is only applicable when using the
                Learner API (enable_rl_module_and_learner=True).
            circular_buffer_num_batches: The number of train batches that fit
                into the circular buffer. Each such train batch can be sampled for
                training max. `circular_buffer_iterations_per_batch_K` times.
The `circular_buffer_iterations_per_batch_K` is irritating, as it is described nowhere. Instead, we have a very similarly named attribute below.
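For context, a minimal sketch of how the two settings are meant to be used together. This is illustrative only: the parameter names follow this diff and may not match the final API, which is exactly the naming concern above.

from ray.rllib.algorithms.appo import APPOConfig

# Sketch only -- parameter names are tentative (see naming discussion above).
config = (
    APPOConfig()
    .environment("CartPole-v1")
    .training(
        # N: how many train batches the circular buffer holds.
        circular_buffer_num_batches=4,
        # K: how often each batch may be sampled before eviction.
        circular_buffer_iterations_per_batch=2,
    )
)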
"APPO! In order to train on multi-GPU, use " | ||
"`config.learners(num_learners=[number of GPUs], " | ||
"num_gpus_per_learner=1)`. To scale the throughput of batch-to-GPU-" | ||
"pre-loading on each of your `Learners`, set " |
Super nice descriptions! This helps users! We might even add an extra section to the docs, "Tuning your applications", where we describe some such scenarios for the different algorithms.
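For instance, the setup this error message suggests would look roughly like the following sketch (the learner count is a placeholder; pick it to match your available GPUs):

from ray.rllib.algorithms.appo import APPOConfig

config = (
    APPOConfig()
    .environment("CartPole-v1")
    .learners(
        num_learners=2,          # e.g., one Learner per available GPU
        num_gpus_per_learner=1,  # each Learner gets its own GPU
    )
)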
@@ -28,6 +29,11 @@ class APPOLearner(IMPALALearner):

    @override(IMPALALearner)
    def build(self):
        self._learner_thread_in_queue = CircularBuffer(
            num_batches=self.config.circular_buffer_num_batches_N,
            iterations_per_batch=self.config.circular_buffer_iterations_per_batch_K,
Ah okay, so it does exist :)
rllib/algorithms/appo/utils.py
Outdated
The buffer holds at most N batches, which are sampled at random (uniformly).
If full and a new batch is added, the oldest batch is discarded. Also, each batch
currently in the buffer can be sampled at most K times (after which it is also
discrded).
discrded -> discarded ;)
fixed
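To make the documented semantics concrete -- at most N batches, uniform random sampling, oldest evicted when full, each batch sampled at most K times -- here is a minimal, self-contained sketch. Illustrative only: the actual `CircularBuffer` in `rllib/algorithms/appo/utils.py` differs in details such as blocking behavior, timeouts, and metrics.

import random
import threading
from collections import deque


class SimpleCircularBuffer:
    """Holds at most N batches; each batch is sampled at most K times.

    When full, adding a new batch evicts the oldest one. Sampling picks a
    batch uniformly at random and increments its use count k; once a batch
    has been sampled K times, it is discarded.
    """

    def __init__(self, num_batches: int, iterations_per_batch: int):
        self.N = num_batches
        self.K = iterations_per_batch
        # Each entry is a mutable [batch, k] pair; deque(maxlen=N) drops the
        # oldest entry automatically when a new one is appended to a full buffer.
        self._buffer = deque(maxlen=self.N)
        self._lock = threading.Lock()

    def add(self, batch):
        with self._lock:
            self._buffer.append([batch, 0])

    def sample(self):
        with self._lock:
            if not self._buffer:
                return None  # empty (the real buffer waits instead)
            idx = random.randrange(len(self._buffer))
            entry = self._buffer[idx]
            entry[1] += 1  # increase k += 1 for this batch
            batch, k = entry
            # Batch has reached its max sample count K -> discard it.
            if k >= self.K:
                del self._buffer[idx]
            return batch

Note how eviction happens in two places: implicitly in `add()` (capacity N) and explicitly in `sample()` (use count K), which matches the docstring above.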
# A valid entry (w/ a batch whose k has not reached K yet) was dropped.
if dropped_entry is not None and dropped_entry[0] is not None:
    dropped_ts += dropped_entry[0].env_steps() * (
Does this still rely on `MultiAgentBatch`es instead of episodes? Or has it already passed the learner connector?
This is post-learner connector. So everything in here is a ready-to-be-trained (multi-agent) batch.
while True:
    # Only initially, the buffer may be empty -> Just wait for some time.
    if len(self._buffer) == 0:
        time.sleep(0.001)
So rare to find something like this in code :D
:) This sleep will only ever be used at the very beginning, when the buffer is still empty but the Learner thread has already started sampling from it.
        time.sleep(0.001)
        continue
    # Sample a random buffer index.
    with self._lock:
We use the locks so that the buffer is not accessed by multiple threads at once while some are pushing in and others are sampling from it, correct?
Correct, that would mess with the buffer-discard logic.
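To make the threading concern concrete, a toy producer/consumer sketch (reusing the hypothetical `SimpleCircularBuffer` from above): without the lock, an eviction inside `add()` could race with the index lookup inside `sample()`.

import threading
import time

buf = SimpleCircularBuffer(num_batches=4, iterations_per_batch=2)

def producer():
    # Stand-in for the thread(s) pushing ready-to-train batches.
    for i in range(10):
        buf.add({"batch_id": i})
        time.sleep(0.01)

def consumer():
    # Stand-in for the Learner thread sampling from the buffer.
    for _ in range(20):
        batch = buf.sample()
        if batch is None:
            time.sleep(0.001)  # only needed while the buffer is still empty

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()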
# Increase k += 1 for this batch.
assert k is not None
entry[1] += 1
Sweet. Really simple and smooth implementation!
)
# Add the batch directly to the circular buffer.
if isinstance(self._learner_thread_in_queue, CircularBuffer):
If the batch has already passed the learner connector, could there be any transformations in the connector that were made with a module that becomes stale after the batch has spent some time in the buffer?
Could be. I think that's by design, though (and v-trace is there for the defense :) ).
For example, the paper talks about doing the target-net forward pass on the batch before adding it to the buffer. However, we don't do this; we perform a new target forward pass each time we compute a loss. I don't think it's crucial, though, that we do it this way. We would only save a few (cheap, non-gradient) forward passes.
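Schematically, the two options contrasted here (stub functions only, to illustrate the trade-off; `target_net` and `loss_fn` are placeholders, not RLlib APIs):

def target_net(batch):
    # Stub for the (cheap, non-gradient) target-network forward pass.
    return {"v_target": 0.0}

def loss_fn(batch, target_out):
    # Stub loss computation.
    return 0.0

batch = {"obs": [0, 1, 2]}

# Option A (IMPACT paper): run the target net once, before buffering; the
# precomputed outputs may go stale while the batch sits in the buffer.
precomputed = target_net(batch)

# Option B (this PR): recompute the target-net pass at every loss
# computation on a sampled batch; fresher, at the cost of a few extra
# non-gradient forward passes.
loss = loss_fn(batch, target_net(batch))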
self.metrics.log_value(
    (ALL_MODULES, LEARNER_THREAD_ENV_STEPS_DROPPED),
    ts_dropped,
    reduce="sum",
So nice with the different aggregation methods. Just put it in and aggregate.
Yeah, that's the idea: Push and forget.
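A minimal sketch of that push-and-forget pattern (assuming RLlib's `MetricsLogger`; the key name here is made up):

from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

logger = MetricsLogger()
# With reduce="sum", each pushed value simply accumulates until the
# metrics are reduced/reported -- callers can push and forget.
logger.log_value("env_steps_dropped", 25, reduce="sum")
logger.log_value("env_steps_dropped", 50, reduce="sum")
print(logger.peek("env_steps_dropped"))  # 75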
APPO enhancements (new API stack) vol 01: Add circular buffer

Why are these changes needed?

Adds the circular buffer described in the IMPACT paper to the (new stack) implementation.

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I have added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.