[RLlib] New ConnectorV2 API #04: Changes to Learner/LearnerGroup API to allow updating from Episodes. #41235
Conversation
rllib/core/learner/learner.py
Outdated
*,
minibatch_size: Optional[int] = None,
num_iters: int = 1,
batch: Optional[MultiAgentBatch] = None,
Happy to discuss the alternative: provide two different (mutually exclusive?) methods that the user/algo can decide to call: `update_from_batch` (for algos that do NOT require episode processing, such as DQN) or `update_from_episodes` (for algos that require a view on the sampled episodes, e.g. for vf-bootstrapping, v-trace, etc.).
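For illustration, a minimal sketch of what these two separate entry points could look like on the Learner. The method names follow this discussion; the type stand-ins and exact signatures are assumptions, not the actual RLlib API:

```python
from typing import Any, Dict, List, Optional

# Illustrative stand-ins for the actual RLlib types.
MultiAgentBatch = Any
EpisodeType = Any
ResultDict = Dict[str, Any]


class LearnerSketch:
    """Simplified sketch of the proposed split, not the actual RLlib Learner."""

    def update_from_batch(
        self,
        batch: MultiAgentBatch,
        *,
        minibatch_size: Optional[int] = None,
        num_iters: int = 1,
    ) -> ResultDict:
        # Algos that train directly from a ready-made batch (e.g. DQN)
        # would implement/use only this entry point.
        raise NotImplementedError

    def update_from_episodes(
        self,
        episodes: List[EpisodeType],
        *,
        minibatch_size: Optional[int] = None,
        num_iters: int = 1,
    ) -> ResultDict:
        # Algos that need a view on the sampled episodes (e.g. PPO for
        # vf-bootstrapping, IMPALA for v-trace) would implement/use this one.
        raise NotImplementedError
```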
I like it if the two methods are separated. I don't think there would be a case where a specific algorithm's learner would have both methods implemented, i.e. DQN would only implement `update_from_batch`, and PPO would only implement `update_from_episodes`. This is much cleaner than mixing both into one function. The user will have to deal with less cognitive load if they are separated.
I separated them in the LearnerGroup and Learner APIs:
- update_from_batch(async=False|True)
- update_from_episodes(async=False|True)
Also, I think it's nicer to have the `async_update` bool option as an extra arg (instead of a separate method) for better consistency and less code bloat.
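A rough sketch of how the `async_update` flag could dispatch inside a single entry point; the class and helper names below are illustrative stand-ins, not the actual RLlib LearnerGroup:

```python
from typing import Any, Dict, Optional

ResultDict = Dict[str, Any]


class LearnerGroupSketch:
    """Illustrative only: one entry point per input type, async as a bool flag."""

    def update_from_batch(
        self,
        batch: Any,
        *,
        async_update: bool = False,
        minibatch_size: Optional[int] = None,
        num_iters: int = 1,
    ) -> ResultDict:
        if async_update:
            # Non-blocking: queue the update on the remote Learner workers and
            # return whatever results of earlier updates are ready.
            return self._async_update(
                batch=batch, minibatch_size=minibatch_size, num_iters=num_iters
            )
        # Blocking: run the update and return its results directly.
        return self._sync_update(
            batch=batch, minibatch_size=minibatch_size, num_iters=num_iters
        )

    # Placeholders for the actual (a)synchronous update machinery.
    def _sync_update(self, **kwargs) -> ResultDict:
        return {}

    def _async_update(self, **kwargs) -> ResultDict:
        return {}
```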
rllib/core/learner/learner.py
Outdated
raise ValueError(
"Batch contains module ids that are not in the learner: "
f"{missing_module_ids}"
if batch is not None:
In the alternative design (two update methods), we could then avoid these rather ugly if-blocks.
rllib/core/learner/learner.py
Outdated
@@ -1309,12 +1319,12 @@ def update(
metrics_per_module=defaultdict(dict, **metrics_per_module),
)
self._check_result(result)
# TODO (sven): Figure out whether `compile_metrics` should be forced
# TODO (sven): Figure out whether `compile_results` should be forced
typo
rllib/core/learner/learner.py
Outdated
# to return all numpy/python data, then we can skip this conversion
# step here.
results.append(convert_to_numpy(result))

batch = self._set_slicing_by_batch_id(batch, value=False)
self._set_slicing_by_batch_id(batch, value=False)
`batch` is never used.
rllib/core/learner/learner.py
Outdated
@@ -1330,6 +1340,34 @@ def update(
# dict.
return reduce_fn(results)

@OverrideToImplementCustomLogic
def _preprocess_train_data(self, *, batch, episodes) -> Tuple[Any, Any]:
Not sure this should be private or public?
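For context, a minimal sketch of how an algorithm-specific Learner might override this hook. The keyword-only signature follows the diff above; the subclass name and the episode-extension helper are hypothetical:

```python
from typing import Any, List, Optional, Tuple


class MyEpisodeBasedLearner:
    """Hypothetical algorithm Learner overriding the preprocessing hook."""

    def _preprocess_train_data(
        self,
        *,
        batch: Optional[Any] = None,
        episodes: Optional[List[Any]] = None,
    ) -> Tuple[Any, Any]:
        # Example of algorithm-specific preprocessing: extend each episode by
        # one artificial timestep so that value bootstrapping / GAE has a
        # "next" observation to work with; otherwise pass the data through
        # unchanged (the default implementation would simply be a no-op).
        if episodes is not None:
            episodes = [self._add_bootstrap_timestep(e) for e in episodes]
        return batch, episodes

    def _add_bootstrap_timestep(self, episode: Any) -> Any:
        # Placeholder for the actual episode-extension logic.
        return episode
```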
rllib/core/learner/learner_group.py
Outdated
*,
minibatch_size: Optional[int] = None,
num_iters: int = 1,
batch: Optional[MultiAgentBatch] = None,
same discussion as above.
rllib/core/learner/learner_group.py
Outdated
]
)
)
# TODO (sven): Implement the case in which both batch and episodes might
If we have mutually exclusive methods, `update_from_batch` and `update_from_episodes`, this case (both batch AND episodes provided by the user) would not exist anyway.
for module_id in rl_module_ckpt_dirs.keys()
):
raise ValueError(
f"module_id {module_id} was specified in both "
Bug: This error message will NOT have the correct missing `module_id`.
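A possible shape of the fix, sketched stand-alone: collect all offending ids first so the message reports the actual culprits, instead of whatever the loop variable last pointed at. `modules_to_load` is an assumption standing in for the collection the ids are compared against (not visible in the quoted diff):

```python
from typing import Collection, Mapping


def check_no_duplicate_module_ids(
    modules_to_load: Collection[str],
    rl_module_ckpt_dirs: Mapping[str, str],
) -> None:
    """Sketch of a fix: gather ALL conflicting ids, then report them."""
    conflicting = [
        module_id
        for module_id in rl_module_ckpt_dirs.keys()
        if module_id in modules_to_load
    ]
    if conflicting:
        raise ValueError(
            f"module_ids {conflicting} were specified in both "
            "`modules_to_load` and `rl_module_ckpt_dirs`, but may only "
            "appear in one of the two."
        )
```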
@@ -1201,6 +1223,39 @@ def update(
# dict.
return reduce_fn(results)

@OverrideToImplementCustomLogic
If there is any neural network inference, does it happen here or in the connector?
Good question! The answer is: sometimes both.
For example: If you have some preprocessing needs for your training data (no matter whether episodes or batches), then you might want to do some preprocessing on this data (e.g. clip rewards, extend episodes by one artificial timestep for v-trace or GAE) and then perform a pre-forward pass through your network (e.g. to get the value estimates). For that pre-forward pass, you'll need to call your connector first to make sure this batch has all custom-required data formats (e.g. LSTM zero-padding). Only after all these preprocessing steps will you be able to continue with the regular `forward_train` + loss + ... procedure.
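Sketched as plain Python, the ordering described above might look roughly like this; all helper names are placeholders, only the sequence of steps is the point:

```python
from typing import Any, Dict, List

# Illustrative-only stubs; only the ORDER of the four steps matters here.
def preprocess_train_data(episodes: List[Any]) -> List[Any]:
    # e.g. clip rewards, append an artificial bootstrap timestep per episode
    return episodes

def learner_connector(episodes: List[Any]) -> Dict[str, Any]:
    # turn episodes into a train batch with all custom-required formats
    # (e.g. LSTM zero-padding)
    return {"obs": [], "actions": []}

def compute_value_estimates(batch: Dict[str, Any]) -> List[float]:
    # pre-forward pass through the value head (for GAE / v-trace targets)
    return [0.0 for _ in batch["obs"]]

def forward_train_and_loss(batch: Dict[str, Any]) -> Dict[str, Any]:
    # regular forward_train + loss computation
    return {"total_loss": 0.0}

def update_from_episodes_sketch(episodes: List[Any]) -> Dict[str, Any]:
    episodes = preprocess_train_data(episodes)          # 1) preprocessing
    batch = learner_connector(episodes)                 # 2) connector pass
    batch["vf_preds"] = compute_value_estimates(batch)  # 3) pre-forward pass
    return forward_train_and_loss(batch)                # 4) train + loss
```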
@DeveloperAPI
class ShardEpisodesIterator:
Can we have a unit test for this?
done
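For reference, a unit test along these lines could assert that every episode ends up in exactly one shard and that the shard sizes stay roughly balanced. The sketch below tests a toy greedy sharding helper rather than the actual `ShardEpisodesIterator`, whose exact constructor and semantics are not shown here:

```python
import unittest


class _DummyEpisode:
    """Minimal stand-in for an RLlib episode, just carrying a length."""

    def __init__(self, length: int):
        self.length = length

    def __len__(self) -> int:
        return self.length


def _shard_by_timesteps(episodes, num_shards):
    """Toy sharding helper (illustration only): greedily assign each episode
    to the currently smallest shard."""
    shards = [[] for _ in range(num_shards)]
    sizes = [0] * num_shards
    for ep in sorted(episodes, key=len, reverse=True):
        i = sizes.index(min(sizes))
        shards[i].append(ep)
        sizes[i] += len(ep)
    return shards


class TestShardingSketch(unittest.TestCase):
    def test_shards_are_roughly_balanced(self):
        episodes = [_DummyEpisode(n) for n in (10, 9, 8, 3, 2, 1)]
        shards = _shard_by_timesteps(episodes, num_shards=3)
        # Every episode lands in exactly one shard ...
        self.assertEqual(sum(len(s) for s in shards), len(episodes))
        # ... and total timesteps per shard stay close to each other.
        totals = [sum(len(ep) for ep in s) for s in shards]
        self.assertLessEqual(
            max(totals) - min(totals), max(len(e) for e in episodes)
        )


if __name__ == "__main__":
    unittest.main()
```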
…Group API to allow updating from Episodes. (ray-project#41235)
This PR introduces some major changes to the Learner and LearnerGroup APIs' `update()` methods. This is in preparation for the rollout of the new ConnectorV2 API that will be run from within EnvRunners (env-to-module and module-to-env connector pipelines) as well as Learners (learner connector pipelines).

Changes:
- Split the `update()` methods into `update_from_batch()` vs `update_from_episodes()`, superseding the previous combined `Learner/LearnerGroup.update()` methods. `update_from_episodes()` will allow the final train batch to be computed/compiled by the Learner (vs the EnvRunner or `Algorithm.training_step()`), further offloading things to where they belong and reducing the mental load on the users (e.g. they should not be concerned with gathering data in the EnvRunners that is only later needed for training).
- A `self._postprocess_episodes()` method is called by the Learner during `update_from_...()` and can be overridden by specific algorithms. By default, this is simply a no-op. This will allow us to pave the way for the Learner ConnectorV2 pipeline to perform possible preprocessing of the train data (batch or episodes).

Why are these changes needed?
Deeper purpose:
This will enable PPO - once ConnectorV2 pipelines are used by EnvRunner and Learner - to fully separate value function predictions from the sampling phase. Instead, value predictions AND bootstrap value predictions will be performed solely on the Learner side (in distributed fashion, on GPU where available). PPO RLModule's `forward_exploration/inference` methods will then no longer be required to "think about" what the Learner might need and can focus solely on action computations. Other algos will equally benefit from this separation of concerns.

As algos will be able to determine what special data they need from the sampled episodes (e.g. PG-style algos always need vf predictions), users will also be able to write custom (env-runner AND learner) connectors that transform the raw episode data into a data dict that is acceptable to their custom RLModule. For example, a user might provide an RLModule (independent of the algo used) that always requires the previous rewards. The user would then write an env-to-module as well as a learner connector that extracts this data from the episodes and builds it into the resulting action-computing or train batch.
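As a rough illustration of that last example, such a connector piece could be a small callable that reads previous rewards from the episodes and writes them into the batch. The call signature and the `get_rewards()` accessor are simplifying assumptions, not the final ConnectorV2 interface:

```python
from typing import Any, Dict, List


class _DummyEpisode:
    """Stand-in episode, only needed to make this sketch self-contained."""

    def __init__(self, rewards: List[float]):
        self._rewards = rewards

    def get_rewards(self) -> List[float]:
        return self._rewards


class PrevRewardConnectorSketch:
    """Simplified stand-in for an env-to-module / learner connector piece
    that adds the previous reward to the batch."""

    def __call__(self, *, batch: Dict[str, Any], episodes: List[Any], **kwargs):
        prev_rewards: List[float] = []
        for episode in episodes:
            rewards = episode.get_rewards()
            # By convention here, the "previous reward" at t=0 is 0.0.
            prev_rewards.extend([0.0] + list(rewards[:-1]))
        batch["prev_rewards"] = prev_rewards
        return batch


# Tiny usage example:
connector = PrevRewardConnectorSketch()
out = connector(batch={}, episodes=[_DummyEpisode([1.0, 2.0, 3.0])])
assert out["prev_rewards"] == [0.0, 1.0, 2.0]
```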
Related issue number

Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.