[RLlib] New ConnectorV2 API #04: Changes to Learner/LearnerGroup API to allow updating from Episodes. #41235

Conversation

sven1977 (Contributor) commented Nov 17, 2023

This PR introduces some major changes to the Learner and LearnerGroup APIs' update() methods.
This is in preparation for the rollout of the new ConnectorV2 API that will be run from within EnvRunners (env-to-module and module-to-env connector pipelines) as well as Learners (learner connector pipelines).

Changes:

  • Split the existing update() method into update_from_batch() and update_from_episodes() (see the rough API sketch right after this list).
  • update_from_episodes() lets the Learner itself compute/compile the final train batch (instead of the EnvRunner or Algorithm.training_step()), further offloading work to where it belongs and reducing the mental load on users (e.g. they should not have to gather data in the EnvRunners that is only needed later for training).
  • Soft-deprecate (with a deprecation warning) the Learner/LearnerGroup.update() methods.
  • An additional self._preprocess_train_data() method is called by the Learner during the update_from_...() calls and can be overridden by specific algorithms. By default, it is simply a no-op. This paves the way for the Learner ConnectorV2 pipeline to perform any necessary preprocessing of the train data (batch or episodes).
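A rough sketch of the resulting split at the Learner level (signatures simplified; the exact argument names and defaults are those in the diff of this PR, not guaranteed by this sketch):

```python
from typing import Any, Optional

# Simplified sketch only; see the actual diff in this PR for the full signatures.
class Learner:
    def update_from_batch(
        self,
        batch,  # a ready-made MultiAgentBatch
        *,
        minibatch_size: Optional[int] = None,
        num_iters: int = 1,
    ) -> Any:
        """Updates the RLModule(s) from an already compiled train batch."""
        ...

    def update_from_episodes(
        self,
        episodes,  # a list of sampled episodes
        *,
        minibatch_size: Optional[int] = None,
        num_iters: int = 1,
    ) -> Any:
        """Updates the RLModule(s) directly from sampled episodes.

        The final train batch is compiled inside the Learner (eventually via
        the learner ConnectorV2 pipeline), not by the EnvRunner or by
        Algorithm.training_step().
        """
        ...
```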

Why are these changes needed?

Deeper purpose:
This will enable PPO - once ConnectorV2 pipelines are used by EnvRunner and Learner - to fully separate value function predictions from the sampling phase. Value predictions AND bootstrap value predictions will instead be performed solely on the Learner side (in a distributed fashion, on GPU where available). The PPO RLModule's forward_exploration/inference methods will then no longer have to "think about" what the Learner might need and can focus solely on action computation. Other algorithms will benefit equally from this separation of concerns.

As algorithms will be able to determine what special data they need from the sampled episodes (e.g. PG-style algorithms always need value function predictions), users will also be able to write custom (env-runner AND learner) connectors that transform the raw episode data into a data dict accepted by their custom RLModule. For example, a user might provide an RLModule (independent of the algorithm used) that always requires the previous rewards. The user would then write an env-to-module connector as well as a learner connector that extract this data from the episodes and build it into the resulting action-computation or train batch.
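As a purely hypothetical illustration of such a user-written connector piece (the class below is NOT the actual ConnectorV2 interface from the follow-up PRs, and the episode accessor is an assumption as well):

```python
# Hypothetical sketch only; neither the call signature nor `get_rewards()` are
# guaranteed to match the actual ConnectorV2/Episode APIs.
class AddPrevRewardsToBatch:
    """Adds each timestep's previous reward to the batch under "prev_rewards"."""

    def __call__(self, *, episodes, batch):
        prev_rewards = []
        for episode in episodes:
            rewards = list(episode.get_rewards())
            if rewards:
                # Define the "previous" reward at t=0 as 0.0.
                prev_rewards.extend([0.0] + rewards[:-1])
        batch["prev_rewards"] = prev_rewards
        return batch
```

Plugging such a piece into both the env-to-module and the learner pipeline would make the same "prev_rewards" column available for action computation and for training.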

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

*,
minibatch_size: Optional[int] = None,
num_iters: int = 1,
batch: Optional[MultiAgentBatch] = None,
Contributor Author (sven1977):

Happy to discuss the alternative of providing two different (mutually exclusive?) methods that the user/algo can decide to call: update_from_batch (for algos that do NOT require episode processing, such as DQN) or update_from_episodes (for algos that require a view on the sampled episodes, e.g. for vf-bootstrapping, v-trace, etc.).

Contributor:

I like it if the two methods are separated. I don't think there would be a case where a specific algorithm's learner would have both methods implemented, i.e. DQN would only implement update_from_batch, and PPO would only implement update_from_episodes. This is much, much cleaner than mixing both into one function. Users will have to deal with less cognitive load if the two are separated.

Contributor Author (sven1977):

I separated them in the LearnerGroup and Learner APIs:

  • update_from_batch(async_update=False|True)
  • update_from_episodes(async_update=False|True)

Contributor Author (sven1977):

Also, I think it's nicer to have the async_update bool option as an extra argument (instead of a separate method) for better consistency and less code bloat.
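For illustration, a call-site sketch of this flag-based design (the argument name async_update is taken from this discussion; the exact merged signatures may differ):

```python
# Illustrative only; `learner_group`, `train_batch`, and `episodes` are assumed
# to be provided by the surrounding code (e.g. Algorithm.training_step()).
def training_step_sketch(learner_group, train_batch, episodes):
    # Blocking (synchronous) update from an already compiled train batch ...
    results = learner_group.update_from_batch(batch=train_batch, async_update=False)
    # ... vs. a non-blocking update straight from the sampled episodes, where
    # results of previously queued async requests are returned when available.
    async_results = learner_group.update_from_episodes(
        episodes=episodes, async_update=True
    )
    return results, async_results
```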

raise ValueError(
"Batch contains module ids that are not in the learner: "
f"{missing_module_ids}"
if batch is not None:
Contributor Author (sven1977):

In the alternative design (two update methods), we could then avoid these rather ugly if-blocks.

@@ -1309,12 +1319,12 @@ def update(
metrics_per_module=defaultdict(dict, **metrics_per_module),
)
self._check_result(result)
# TODO (sven): Figure out whether `compile_metrics` should be forced
# TODO (sven): Figure out whether `compile_results` should be forced
Contributor Author (sven1977):

typo

# to return all numpy/python data, then we can skip this conversion
# step here.
results.append(convert_to_numpy(result))

batch = self._set_slicing_by_batch_id(batch, value=False)
self._set_slicing_by_batch_id(batch, value=False)
Contributor Author (sven1977):

batch never used.

@@ -1330,6 +1340,34 @@ def update(
# dict.
return reduce_fn(results)

@OverrideToImplementCustomLogic
def _preprocess_train_data(self, *, batch, episodes) -> Tuple[Any, Any]:
Contributor Author (sven1977):

Not sure whether this should be private or public?

*,
minibatch_size: Optional[int] = None,
num_iters: int = 1,
batch: Optional[MultiAgentBatch] = None,
Contributor Author (sven1977):

same discussion as above.

]
)
)
# TODO (sven): Implement the case in which both batch and episodes might
Contributor Author (sven1977):

If we have mutually exclusive methods, update_from_batch and update_from_episodes, this case (both batch AND episodes provided by the user) would not exist anyway.

for module_id in rl_module_ckpt_dirs.keys()
):
raise ValueError(
f"module_id {module_id} was specified in both "
Contributor Author (sven1977):

Bug: This error message will NOT contain the correct (offending) module_id.
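A hedged sketch of one possible fix (the surrounding argument names, e.g. modules_to_load, are assumed from the visible diff context, not verified): since the f-string sits outside the generator expression that iterates over the module ids, raising inside an explicit loop guarantees the message names the module_id that actually triggered the error.

```python
def check_no_duplicate_module_specification(modules_to_load, rl_module_ckpt_dirs):
    # Hypothetical helper (name and arguments assumed for illustration only).
    for module_id in rl_module_ckpt_dirs.keys():
        if module_id in modules_to_load:
            # `module_id` is now guaranteed to be the offending entry.
            raise ValueError(
                f"module_id {module_id} was specified in both "
                "`modules_to_load` and `rl_module_ckpt_dirs`!"
            )
```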

@sven1977 sven1977 requested a review from a team as a code owner December 21, 2023 14:58
@@ -1201,6 +1223,39 @@ def update(
# dict.
return reduce_fn(results)

@OverrideToImplementCustomLogic
Contributor:

If there is any neural network inference, does it happen here or in the connector?

Contributor Author (sven1977):

Good question! The answer is: sometimes both.

For example: if your training data (no matter whether episodes or batches) needs preprocessing (e.g. clipping rewards, or extending episodes by one artificial timestep for v-trace or GAE), you might then also want to perform a pre-forward pass through your network (e.g. to get the value estimates). For that pre-forward pass, you first need to call your connector to make sure the batch is in all custom-required data formats (e.g. LSTM zero-padding). Only after all of these preprocessing steps can you continue with the regular forward_train + loss + ... procedure.
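A rough, illustrative sketch of what such an override could look like (the hook name _preprocess_train_data comes from the diff above; the class, the helper, and the commented-out connector/value-prediction calls are assumptions, not actual RLlib APIs):

```python
# Illustrative sketch only; not the actual PPO implementation from this PR.
class MyAlgoLearner:
    def _preprocess_train_data(self, *, batch, episodes):
        if episodes is not None:
            # 1) Manipulate the raw episodes, e.g. clip rewards and/or append
            #    an artificial timestep for value bootstrapping (GAE/v-trace).
            episodes = [self._clip_rewards(episode) for episode in episodes]
            # 2) In the envisioned design, the learner connector would now be
            #    called to build a module-ready batch (LSTM zero-padding, etc.),
            #    followed by a pre-forward pass to get value estimates, e.g.:
            #    batch = self._learner_connector(episodes=episodes)
            #    batch["vf_preds"] = self.module.compute_values(batch)
        # Only then does the regular forward_train + loss procedure continue.
        return batch, episodes

    @staticmethod
    def _clip_rewards(episode, limit=10.0):
        # Hypothetical helper; assumes the episode exposes a `rewards` list.
        episode.rewards = [max(-limit, min(limit, r)) for r in episode.rewards]
        return episode
```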

@DeveloperAPI
class ShardEpisodesIterator:
Contributor:

Can we have a unit test for this?

Contributor Author (sven1977):

done

@sven1977 sven1977 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jan 10, 2024
@sven1977 sven1977 merged commit 806701e into ray-project:master Jan 10, 2024
9 checks passed
vickytsang pushed a commit to ROCm/ray that referenced this pull request Jan 12, 2024