[RLlib] RLModule API change: If "actions" key returned from forward_inference|exploration, use actions as-is. #36067
@@ -16,7 +16,7 @@ RL Modules (Alpha)
.. note::
    This is an experimental module that serves as a general replacement for ModelV2, and is subject to change. It will eventually match the functionality of the previous stack. If you only use high-level RLlib APIs such as :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` you should not experience significant changes, except for a few new parameters to the configuration object. If you've used custom models or policies before, you'll need to migrate them to the new modules. Check the Migration guide for more information.
The table below shows the list of migrated algorithms and their currently supported features; it will be updated as we progress.
@@ -33,19 +33,19 @@ RL Modules (Alpha)
   * - **PPO**
     - |pytorch| |tensorflow|
     - |pytorch| |tensorflow|
     - |pytorch|
     - |pytorch| |tensorflow|
     -
     - |pytorch|
   * - **Impala**
     - |pytorch| |tensorflow|
     - |pytorch| |tensorflow|
     - |pytorch|
     - |pytorch| |tensorflow|
     -
     - |pytorch|
   * - **APPO**
     - |tensorflow|
     - |tensorflow|
     -
     - |pytorch| |tensorflow|
     - |pytorch| |tensorflow|
     - |pytorch| |tensorflow|
     -
     -
@@ -426,7 +426,26 @@ What your customization could have looked like before:
        return None, None, None
All of the ``Policy.compute_***`` functions expect that :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_exploration` and :py:meth:`~ray.rllib.core.rl_module.rl_module.RLModule.forward_inference` return a dictionary that contains the key "actions" and/or the key "action_dist_inputs".
If you return the "actions" key:

* RLlib will use the provided actions directly and as-is (no further sampling step).
* If you also return the "action_dist_inputs" key, RLlib will additionally create a ``ray.rllib.models.distributions.Distribution`` object from the distribution parameters under that key and - in the case of ``forward_exploration()`` - compute action probs and logp values for the given actions automatically.
If you do not return the "actions" key:

* You must return the "action_dist_inputs" key instead from your ``forward_inference()`` and ``forward_exploration()`` methods.
* RLlib will create a ``ray.rllib.models.distributions.Distribution`` object from the distribution parameters under that key and sample actions from the resulting distribution.
* In the case of ``forward_exploration()``, RLlib will also compute action probs and logp values for the sampled actions automatically (see the sketch below this list).
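The following is a rough, hypothetical sketch of these two code paths; it is not RLlib's actual implementation. A plain ``torch.distributions.Categorical`` stands in here for an RLlib action distribution built from the "action_dist_inputs" parameters (logits).

.. code-block:: python

    import torch

    def handle_exploration_output(fwd_out: dict):
        # Build a stand-in distribution if distribution parameters were returned.
        dist = None
        if "action_dist_inputs" in fwd_out:
            dist = torch.distributions.Categorical(
                logits=fwd_out["action_dist_inputs"]
            )

        if "actions" in fwd_out:
            # Case 1: Use the returned actions directly and as-is (no sampling).
            actions = fwd_out["actions"]
        else:
            # Case 2: "action_dist_inputs" is required; sample from the distribution.
            assert dist is not None, "need 'action_dist_inputs' if 'actions' is missing"
            actions = dist.sample()

        # For forward_exploration(), logp values are computed automatically
        # whenever a distribution is available.
        logp = dist.log_prob(actions) if dist is not None else None
        return actions, logp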
Note that in the case of ``forward_inference()``, the generated distributions (from the returned key "action_dist_inputs") will always be made deterministic via the ``ray.rllib.models.distributions.Distribution.to_deterministic`` utility before a possible action sampling step. Thus, for example, sampling from a Categorical distribution reduces to simply selecting the argmax actions from the distribution's logits/probs.
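As a plain-torch illustration of this argmax reduction (using ``torch`` directly rather than RLlib's ``Distribution`` classes):

.. code-block:: python

    import torch

    logits = torch.tensor([[0.1, 2.5, -0.3]])  # example "action_dist_inputs"

    # Stochastic sampling, as in forward_exploration():
    stochastic_action = torch.distributions.Categorical(logits=logits).sample()

    # Deterministic behavior, as in forward_inference(): effectively the argmax.
    deterministic_action = torch.argmax(logits, dim=-1)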
Commonly used distribution implementations can be found under ``ray.rllib.models.tf.tf_distributions`` for TensorFlow and ``ray.rllib.models.torch.torch_distributions`` for PyTorch. You can choose to return deterministic actions by creating a deterministic distribution instance. See `Writing Custom Single Agent RL Modules`_ for more details on how to implement your own custom RL Module.
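For instance, a deterministic counterpart of a stochastic distribution can be obtained via ``to_deterministic``. The class name ``TorchCategorical`` and its ``logits`` constructor argument below are assumptions made for illustration; check ``ray.rllib.models.torch.torch_distributions`` for the exact classes and signatures available in your Ray version.

.. code-block:: python

    import torch
    # Assumed import path and class name; verify against your Ray version.
    from ray.rllib.models.torch.torch_distributions import TorchCategorical

    logits = torch.tensor([[0.1, 2.5, -0.3]])

    # A stochastic distribution built from "action_dist_inputs".
    dist = TorchCategorical(logits=logits)

    # Its deterministic counterpart always "samples" the argmax actions.
    deterministic_actions = dist.to_deterministic().sample()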
.. tab-set::
@@ -454,6 +473,63 @@ All of the ``Policy.compute_***`` functions expect that `~ray.rllib.core.rl_modu
            ...
    .. tab-item:: Returning "actions"

        .. code-block:: python

            """
            An RLModule whose forward_exploration/inference methods return the
            "actions" key.
            """

            class MyRLModule(TorchRLModule):
                ...

                def _forward_inference(self, batch):
                    ...
                    return {
                        "actions": ...,  # actions will be used as-is
                        # "action_dist_inputs": ...  # this is optional
                    }

                def _forward_exploration(self, batch):
                    ...
                    return {
                        "actions": ...,  # actions will be used as-is (no sampling step!)
                        "action_dist_inputs": ...,  # optional: if provided, will be used to compute action probs and logp
                    }
    .. tab-item:: Not returning "actions"

        .. code-block:: python

            """
            An RLModule whose forward_exploration/inference methods do NOT return the
            "actions" key.
            """

            class MyRLModule(TorchRLModule):
                ...

                def _forward_inference(self, batch):
                    ...
                    return {
                        # - Generate distribution from these parameters.
                        # - Convert distribution to a deterministic equivalent.
                        # - "Sample" from the deterministic distribution.
                        "action_dist_inputs": ...
                    }

                def _forward_exploration(self, batch):
                    ...
                    return {
                        # - Generate distribution from these parameters.
                        # - Sample from the (stochastic) distribution.
                        # - Compute action probs and logp values automatically using the
                        #   sampled actions and the generated distribution object.
                        "action_dist_inputs": ...
                    }
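To make these skeletons more concrete, here is a minimal, hypothetical sketch of the kind of logic the two forward methods could contain for a small discrete-action policy. It is intentionally simplified (a plain ``torch.nn.Module`` rather than a full ``TorchRLModule``), and the "obs" batch key is an assumption made for illustration.

.. code-block:: python

    import torch
    import torch.nn as nn


    class TinyPolicyNet(nn.Module):
        """Simplified stand-in for the policy part of an RLModule."""

        def __init__(self, obs_dim: int = 4, num_actions: int = 2):
            super().__init__()
            self.pi = nn.Sequential(
                nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, num_actions)
            )

        def forward_inference(self, batch: dict) -> dict:
            logits = self.pi(batch["obs"])
            # Return "actions" directly -> they would be used as-is (greedy/argmax).
            return {"actions": torch.argmax(logits, dim=-1)}

        def forward_exploration(self, batch: dict) -> dict:
            logits = self.pi(batch["obs"])
            # Return only "action_dist_inputs" -> a distribution would be built,
            # actions sampled, and logp values computed automatically.
            return {"action_dist_inputs": logits}


    net = TinyPolicyNet()
    batch = {"obs": torch.randn(3, 4)}
    print(net.forward_inference(batch)["actions"])                     # e.g., tensor([1, 0, 1])
    print(net.forward_exploration(batch)["action_dist_inputs"].shape)  # torch.Size([3, 2])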
Notable TODOs
-------------
Review comment:
This added documentation is not specific to those who are migrating from the Policy API. We should put it under the right section. I think adding it somewhere close to "Writing Custom Single Agent RL Modules" would be the way to go (it requires getting rid of the Policy-specific nomenclature). Maybe we can consolidate your paragraph with the description of what needs to be implemented for each forward method shown here?

Reply:
Yup, I moved this up into the suggested section and created a new table to explain the difference between returning the "actions" key and NOT returning the "actions" key.