
[RLlib] Compile update logic on learner and use cudagraphs #35759

Merged

Conversation


@ArturNiederfahrenhorst commented May 25, 2023

Why are these changes needed?

In the first attempt to leverage torch.compile, we (1) did not introduce a compiled update method on the learner side, and (2) had little success compiling on the rollout worker side because weight updates would effectively not take effect once we compiled.

For (1): This PR compiles the update on the learner side, akin to what we do for eager tracing, by introducing a possibly_compiled_update() method on the TorchLearner.

For (2): We get around the issue of not being able to set weights by using cudagraphs as the torch dynamo backend.
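
A minimal, self-contained sketch of the pattern (function and variable names here are illustrative stand-ins, not RLlib's actual API): the whole update is wrapped in one callable and compiled, using cudagraphs as the dynamo backend when a GPU is available.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# cudagraphs requires a CUDA device; fall back to a CPU-friendly backend otherwise.
backend = "cudagraphs" if device == "cuda" else "aot_eager"

# Hypothetical stand-ins for the learner's module, optimizer, and loss.
model = nn.Linear(4, 2).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def _uncompiled_update(batch: torch.Tensor) -> torch.Tensor:
    # forward_train + loss + optimizer step fused into a single callable so
    # that torch.compile can capture the whole update.
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

possibly_compiled_update = torch.compile(_uncompiled_update, backend=backend)
loss = possibly_compiled_update(torch.randn(32, 4, device=device))
```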

ArturNiederfahrenhorst and others added 2 commits May 24, 2023 21:30
@ArturNiederfahrenhorst marked this pull request as ready for review May 26, 2023 21:13
@ArturNiederfahrenhorst changed the title from "[RLlib] Second iteration of torch.compile() changes" to "[RLlib] Compile update logic on learner and use cudagraphs" May 26, 2023

ArturNiederfahrenhorst commented May 26, 2023

Related TensorBoard that shows speedups on the rollout worker side:

ArturNiederfahrenhorst and others added 6 commits May 30, 2023 15:42
@@ -123,8 +107,6 @@ def get_state(self) -> Mapping[str, Any]:
@override(RLModule)
def set_state(self, state_dict: Mapping[str, Any]) -> None:
    self.load_state_dict(state_dict)
    if self._retrace_on_set_weights:
        torch._dynamo.reset()
Contributor Author

We don't need this with cudagraphs.

compile_config.compile_forward_train
or compile_config.compile_forward_inference
or compile_config.compile_forward_exploration
)
Contributor Author

We can just compile all the forward methods. Ones that are not called will not be traced anyway.
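
For context, a minimal sketch of why compiling unused methods is essentially free: torch.compile wraps lazily, and tracing only happens on the first call.

```python
import torch

def forward_inference(x):
    return x * 2

def forward_exploration(x):
    return x + 1

# Wrapping does not trace anything yet; tracing happens on the first call.
compiled_inference = torch.compile(forward_inference)
compiled_exploration = torch.compile(forward_exploration)

# Only forward_inference is traced here; forward_exploration is never called
# and therefore never compiled.
compiled_inference(torch.ones(3))
```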

torch_dynamo_backend: str = "aot_eager" if sys.platform == "darwin" else "inductor"
torch_dynamo_mode: str = "reduce-overhead"
torch_dynamo_backend: str = (
"aot_eager" if sys.platform == "darwin" else "cudagraphs"
Contributor Author

Makes it so that weight updates actually take effect.

)
self.torch_compile_worker_dynamo_mode = "reduce-overhead"
self.torch_compile_worker_dynamo_mode = None
Contributor Author

None will make it so that, for any chosen backend, we use that backend's default mode.
cudagraphs does not have a "reduce-overhead" mode, so we need to choose None here.
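
As a hedged illustration (the backend choice is guarded for machines without a GPU), this is the usage pattern implied above: "reduce-overhead" is an inductor-specific mode, so with the cudagraphs backend the mode is simply left as None.

```python
import torch

def fwd(x):
    return torch.relu(x) * 2

device = "cuda" if torch.cuda.is_available() else "cpu"
# "cudagraphs" only makes sense on GPU; fall back to "aot_eager" on CPU.
backend = "cudagraphs" if device == "cuda" else "aot_eager"

# mode=None lets the chosen backend use its default behaviour.
compiled_fwd = torch.compile(fwd, backend=backend, mode=None)
compiled_fwd(torch.randn(8, device=device))
```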

settings.
"complete_update" promises the highest performance gains, but may work
in some settings. By compiling only forward_train, you may already get
some speedups and avoid issues that arise from compiling the entire update.
Contributor Author

In some cases, there are slight performance differences between compiling forward_train and compiling the complete update.
Until we have explored this and know whether we can eliminate one of the two options, we can use this switch to choose.
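
A hedged sketch of what that switch could look like on the learner side (function and argument names are illustrative, not the PR's exact code):

```python
import torch

def make_possibly_compiled_update(update_fn, rl_module, what_to_compile, **compile_kwargs):
    # Illustrative switch: either compile the complete update (forward_train +
    # loss + optimizer step), or only the module's forward_train.
    if what_to_compile == "complete_update":
        return torch.compile(update_fn, **compile_kwargs)
    elif what_to_compile == "forward_train":
        rl_module.forward_train = torch.compile(rl_module.forward_train, **compile_kwargs)
        return update_fn
    raise ValueError(f"Unknown what_to_compile value: {what_to_compile!r}")
```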

backend=torch_compile_cfg.torch_dynamo_backend,
mode=torch_compile_cfg.torch_dynamo_mode,
**torch_compile_cfg.kwargs,
)
Contributor Author

When compiling the update, we need to reset and recompile the whole thing every time we add/remove a module.
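
A minimal sketch of that behaviour (class and method names are hypothetical, not the PR's exact code):

```python
import torch
import torch._dynamo

class CompiledUpdateHolder:
    """Illustrative holder for a compiled update that recompiles on module changes."""

    def __init__(self, uncompiled_update, **compile_kwargs):
        self._uncompiled_update = uncompiled_update
        self._compile_kwargs = compile_kwargs
        self._recompile()

    def _recompile(self):
        self.possibly_compiled_update = torch.compile(
            self._uncompiled_update, **self._compile_kwargs
        )

    def add_module(self, module_id, module):
        # ...register the new module on the MultiAgentRLModule here...
        # Adding or removing a module changes the traced graph, so drop all
        # cached compiled code and recompile the whole update.
        torch._dynamo.reset()
        self._recompile()
```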

Contributor

add this comment to the code plz.

torch_compile_cfg: Optional["TorchCompileConfig"] = None

def validate(self):
Contributor

You need to expose these parameters on the top-level AlgorithmConfig. Right now, what_to_compile is not surfaced in the algorithm config.
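
For illustration, surfacing it could look roughly like this on the user side (the argument names mirror the config fields shown in the diffs above; treat them as an assumption rather than the merged API):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .framework(
        "torch",
        torch_compile_learner=True,
        # Assumed flag name mirroring the learner config field discussed above.
        torch_compile_learner_what_to_compile="forward_train",
    )
)
```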

Comment on lines 18 to 26
def _get_learner(learning_rate: float = 1e-3) -> Learner:
    env = gym.make("CartPole-v1")
    # adding learning rate as a configurable parameter to avoid hardcoding it
    # and information leakage across tests that rely on knowing the LR value
    # that is used in the learner.
    learner = get_learner("torch", env, learning_rate=learning_rate)
    learner.build()

    return learner
Contributor

modify get_learner in rllib.core.testing.utils and use it here?

spec = get_module_spec(
    framework="torch", env=env, is_multi_agent=is_multi_agent
)
learner = BCTorchLearner(
Contributor

why are you not using get_learner()?

Comment on lines +76 to +79
batch = MultiAgentBatch(
    {"another_module": reader.next(), "default_policy": reader.next()},
    0,
)
Contributor

do you really want to call reader.next() twice per iteration? is this intentional? you can obtain the batch once and use it in two places.
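
One possible rewrite under that suggestion (reusing the names from the snippet above; whether sharing the exact same batch object between both module ids is acceptable depends on the test's intent):

```python
from ray.rllib.policy.sample_batch import MultiAgentBatch

# Read once and reuse the same batch for both module ids.
module_batch = reader.next()
batch = MultiAgentBatch(
    {"another_module": module_batch, "default_policy": module_batch},
    0,
)
```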

Comment on lines 100 to 103
learner = BCTorchLearner(
    module_spec=spec,
    framework_hyperparameters=framework_hps,
)
Contributor

use get_learner

self._framework_hyperparameters.torch_compile_cfg
)
else:
assert isinstance(self._module, MultiAgentRLModule)
Contributor

please add a descriptive error upon failure. e.g. expected type blah got type blah
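
For example, a descriptive version of the assertion from the snippet above could read:

```python
assert isinstance(self._module, MultiAgentRLModule), (
    f"Expected self._module to be a MultiAgentRLModule, "
    f"but got {type(self._module).__name__}."
)
```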

else:
    assert isinstance(self._module, MultiAgentRLModule)
    for module in self._module._rl_modules.values():
        module.compile(
Contributor

you need to skip the module if it's not a TorchRLModule (e.g. it could be a RandomRLModule, neither torch nor TF)

Contributor

in other words, only compile those that are TorchRLModule
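
An illustrative version of the loop above that only compiles TorchRLModules (the import path is an assumption and may differ):

```python
from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule

for module in self._module._rl_modules.values():
    # Skip modules that are not torch modules, e.g. a RandomRLModule.
    if not isinstance(module, TorchRLModule):
        continue
    module.compile(
        backend=torch_compile_cfg.torch_dynamo_backend,
        mode=torch_compile_cfg.torch_dynamo_mode,
        **torch_compile_cfg.kwargs,
    )
```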

backend=torch_compile_cfg.torch_dynamo_backend,
mode=torch_compile_cfg.torch_dynamo_mode,
**torch_compile_cfg.kwargs,
)
Contributor

add this comment to the code plz.

Contributor

make _dynamo_is_available() a util under torch_utils.py?
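
A minimal sketch of what such a util in torch_utils.py could look like (hypothetical implementation):

```python
def _dynamo_is_available() -> bool:
    # torch._dynamo is only importable on sufficiently new torch versions.
    try:
        import torch._dynamo  # noqa: F401
        return True
    except ImportError:
        return False
```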

@@ -272,7 +272,9 @@ def compile_me(input_dict):

import torch._dynamo as dynamo

# This is a helper method of dynamo to analyze where breaks occur.
dynamo_explanation = dynamo.explain(compile_me, {"in": torch.Tensor([[1]])})
Contributor

same comments above apply here :)

@ArturNiederfahrenhorst
Contributor Author

@kouroshHakha I've also added a configuration enumerator instead of relying on two long strings "complete_update" and "forward_train".

Contributor
@kouroshHakha left a comment

@@ -282,6 +283,9 @@ def __init__(self, algo_class=None):
}
# Torch compile settings
self.torch_compile_learner = False
self.torch_compile_learner_what_to_compile = (
TorchCompileWhatToCompile.forward_train
Contributor

Enums should usually be all CAPS. e.g. FORWARD_TRAIN
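
In other words, something along these lines (member names are illustrative):

```python
import enum

class TorchCompileWhatToCompile(str, enum.Enum):
    # All-caps members, per the review comment above.
    FORWARD_TRAIN = "forward_train"
    COMPLETE_UPDATE = "complete_update"
```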

torch_compile_learner_what_to_compile: A string specifying what to
compile on the learner side if torch_compile_learner is True.
This can be one of the following:
- TorchCompileWhatToCompile.complete_update: Compile the
Contributor

Just say "see TorchCompileWhatToCompile for available options". This way you don't have to duplicate the docstring if things change later down the line.

forward_train method, the loss calculation and the optimizer step
together on the TorchLearner.
- TorchCompileWhatToCompile.forward_train: Compile only forward train.
Note:
Contributor

Use .. note:: directives and check the rendered documentation to see whether the formatting renders correctly.

Contributor

fair

@kouroshHakha merged commit 2a12cf5 into ray-project:master Jun 21, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023