[RLlib] RLModule API change: If "actions" key returned from forward_inference|exploration, use actions as-is. #36067
@@ -16,7 +16,7 @@ RL Modules (Alpha)
.. note::
    This is an experimental module that serves as a general replacement for ModelV2, and is subject to change. It will eventually match the functionality of the previous stack. If you only use high-level RLlib APIs such as :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` you should not experience significant changes, except for a few new parameters to the configuration object. If you've used custom models or policies before, you'll need to migrate them to the new modules. Check the Migration guide for more information.
The table below shows the list of migrated algorithms and their currently supported features; it will be updated as we progress.
@@ -33,19 +33,19 @@ RL Modules (Alpha)
   * - **PPO**
     - |pytorch| |tensorflow|
     - |pytorch| |tensorflow|
     - |pytorch|
     - |pytorch| |tensorflow|
     -
     - |pytorch|
   * - **Impala**
     - |pytorch| |tensorflow|
     - |pytorch| |tensorflow|
     - |pytorch|
     - |pytorch| |tensorflow|
     -
     - |pytorch|
   * - **APPO**
     - |tensorflow|
     - |tensorflow|
     -
     - |pytorch| |tensorflow|
     - |pytorch| |tensorflow|
     - |pytorch| |tensorflow|
     -
     -
@@ -426,7 +426,26 @@ What your customization could have looked like before:
        return None, None, None
All of the ``Policy.compute_***`` functions expect that :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_exploration` and :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_inference` return a dictionary that contains the key "actions" and/or the key "action_dist_inputs".
If you return the "actions" key:

* RLlib will use the provided actions directly and as-is (no further sampling step).
* If you also return the "action_dist_inputs" key, RLlib will additionally create a ``ray.rllib.models.distributions.Distribution`` object from the distribution parameters under that key and - in the case of ``forward_exploration()`` - compute action probs and logp values for the given actions automatically.
If you do not return the "actions" key:

* You must return the "action_dist_inputs" key instead from your ``forward_inference()`` and ``forward_exploration()`` methods.
* RLlib will create a ``ray.rllib.models.distributions.Distribution`` object from the distribution parameters under that key and sample actions from the resulting distribution.
* In the case of ``forward_exploration()``, RLlib will also compute action probs and logp values for the sampled actions automatically (see the sketch below this list).
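The following is a rough, hypothetical sketch of these two code paths; it is not RLlib's actual implementation. A plain ``torch.distributions.Categorical`` stands in here for an RLlib action distribution built from the "action_dist_inputs" parameters (logits).

.. code-block:: python

    import torch

    def handle_exploration_output(fwd_out: dict):
        # Build a stand-in distribution if distribution parameters were returned.
        dist = None
        if "action_dist_inputs" in fwd_out:
            dist = torch.distributions.Categorical(
                logits=fwd_out["action_dist_inputs"]
            )

        if "actions" in fwd_out:
            # Case 1: Use the returned actions directly and as-is (no sampling).
            actions = fwd_out["actions"]
        else:
            # Case 2: "action_dist_inputs" is required; sample from the distribution.
            assert dist is not None, "need 'action_dist_inputs' if 'actions' is missing"
            actions = dist.sample()

        # For forward_exploration(), logp values are computed automatically
        # whenever a distribution is available.
        logp = dist.log_prob(actions) if dist is not None else None
        return actions, logp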
Note that in the case of ``forward_inference()``, the generated distributions (from the returned key "action_dist_inputs") will always be made deterministic via the ``ray.rllib.models.distributions.Distribution.to_deterministic`` utility before a possible action sampling step. Thus, for example, sampling from a Categorical distribution reduces to simply selecting the argmax actions from the distribution's logits/probs.
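As a plain-torch illustration of this argmax reduction (using ``torch`` directly rather than RLlib's ``Distribution`` classes):

.. code-block:: python

    import torch

    logits = torch.tensor([[0.1, 2.5, -0.3]])  # example "action_dist_inputs"

    # Stochastic sampling, as in forward_exploration():
    stochastic_action = torch.distributions.Categorical(logits=logits).sample()

    # Deterministic behavior, as in forward_inference(): effectively the argmax.
    deterministic_action = torch.argmax(logits, dim=-1)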
Commonly used distribution implementations can be found under ``ray.rllib.models.tf.tf_distributions`` for TensorFlow and ``ray.rllib.models.torch.torch_distributions`` for PyTorch. You can choose to return deterministic actions by creating a deterministic distribution instance. See `Writing Custom Single Agent RL Modules`_ for more details on how to implement your own custom RL Module.
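For instance, a deterministic counterpart of a stochastic distribution can be obtained via ``to_deterministic``. The class name ``TorchCategorical`` and its ``logits`` constructor argument below are assumptions made for illustration; check ``ray.rllib.models.torch.torch_distributions`` for the exact classes and signatures available in your Ray version.

.. code-block:: python

    import torch
    # Assumed import path and class name; verify against your Ray version.
    from ray.rllib.models.torch.torch_distributions import TorchCategorical

    logits = torch.tensor([[0.1, 2.5, -0.3]])

    # A stochastic distribution built from "action_dist_inputs".
    dist = TorchCategorical(logits=logits)

    # Its deterministic counterpart always "samples" the argmax actions.
    deterministic_actions = dist.to_deterministic().sample()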
.. tab-set::
@@ -454,6 +473,63 @@ All of the ``Policy.compute_***`` functions expect that `~ray.rllib.core.rl_modu
            ...
    .. tab-item:: Returning "actions"

        .. code-block:: python

            """
            An RLModule whose forward_exploration/inference methods return the
            "actions" key.
            """

            class MyRLModule(TorchRLModule):
                ...

                def _forward_inference(self, batch):
                    ...
                    return {
                        "actions": ...,  # actions will be used as-is
                        # "action_dist_inputs": ...  # this is optional
                    }

                def _forward_exploration(self, batch):
                    ...
                    return {
                        "actions": ...,  # actions will be used as-is (no sampling step!)
                        "action_dist_inputs": ...,  # optional: if provided, will be used to compute action probs and logp
                    }
    .. tab-item:: Not returning "actions"

        .. code-block:: python

            """
            An RLModule whose forward_exploration/inference methods do NOT return the
            "actions" key.
            """

            class MyRLModule(TorchRLModule):
                ...

                def _forward_inference(self, batch):
                    ...
                    return {
                        # - Generate distribution from these parameters.
                        # - Convert distribution to a deterministic equivalent.
                        # - "Sample" from the deterministic distribution.
                        "action_dist_inputs": ...
                    }

                def _forward_exploration(self, batch):
                    ...
                    return {
                        # - Generate distribution from these parameters.
                        # - Sample from the (stochastic) distribution.
                        # - Compute action probs and logp values automatically using the
                        #   sampled actions and the generated distribution object.
                        "action_dist_inputs": ...
                    }
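To make these skeletons more concrete, here is a minimal, hypothetical sketch of the kind of logic the two forward methods could contain for a small discrete-action policy. It is intentionally simplified (a plain ``torch.nn.Module`` rather than a full ``TorchRLModule``), and the "obs" batch key is an assumption made for illustration.

.. code-block:: python

    import torch
    import torch.nn as nn


    class TinyPolicyNet(nn.Module):
        """Simplified stand-in for the policy part of an RLModule."""

        def __init__(self, obs_dim: int = 4, num_actions: int = 2):
            super().__init__()
            self.pi = nn.Sequential(
                nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, num_actions)
            )

        def forward_inference(self, batch: dict) -> dict:
            logits = self.pi(batch["obs"])
            # Return "actions" directly -> they would be used as-is (greedy/argmax).
            return {"actions": torch.argmax(logits, dim=-1)}

        def forward_exploration(self, batch: dict) -> dict:
            logits = self.pi(batch["obs"])
            # Return only "action_dist_inputs" -> a distribution would be built,
            # actions sampled, and logp values computed automatically.
            return {"action_dist_inputs": logits}


    net = TinyPolicyNet()
    batch = {"obs": torch.randn(3, 4)}
    print(net.forward_inference(batch)["actions"])                     # e.g., tensor([1, 0, 1])
    print(net.forward_exploration(batch)["action_dist_inputs"].shape)  # torch.Size([3, 2])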
Notable TODOs
-------------
Review comment:
This added documentation is not specific to those who are migrating from the Policy API. We should put it under the right section. I think adding it somewhere close to "Writing Custom Single Agent RL Modules" would be the way to go (it requires getting rid of the Policy-specific nomenclature). Maybe we can consolidate your paragraph with the description of what needs to be implemented for each forward method shown here?

Reply:
Yup, I moved this up into the suggested section and created a new table to explain the difference between returning the "actions" key and NOT returning the "actions" key.