[RLlib] Change `config.fault_tolerance` default behavior (from `recreate_failed_env_runners=False` to `True`). #48286

sven1977 · 2024-10-27T20:37:51Z

Change `config.fault_tolerance` default behavior (from `recreate_failed_env_runners=False` to `True`).

The new logic is to always try to restart failed EnvRunners by default.

Also renamed `recreate_failed_env_runners` to `restart_failed_env_runners` to match other similar settings and Ray's main `max_num_restarts` vocabulary.
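
To illustrate the rename and the new default, here is a minimal, self-contained sketch of how a caller-side helper could map the old keyword onto the new one. The helper `resolve_restart_flag` is hypothetical and is not part of RLlib; whether RLlib itself accepts the old name after this PR is not stated here, so treat the fallback as an assumption.

```python
import warnings


def resolve_restart_flag(**settings) -> bool:
    """Hypothetical helper (not an RLlib API): return the effective restart flag.

    Accepts the new name `restart_failed_env_runners` and, as an assumed
    convenience, the old name `recreate_failed_env_runners`.
    """
    new_name = "restart_failed_env_runners"
    old_name = "recreate_failed_env_runners"
    if old_name in settings:
        warnings.warn(
            f"`{old_name}` was renamed to `{new_name}`.",
            DeprecationWarning,
            stacklevel=2,
        )
        # The new name wins if both are given.
        settings.setdefault(new_name, settings[old_name])
    # Default is True, matching the new default described in this PR.
    return bool(settings.get(new_name, True))
```

Calling `resolve_restart_flag()` with no arguments yields the new default, while passing the old keyword still works but emits a `DeprecationWarning`.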

Why are these changes needed?

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: sven1977 <svenmika1977@gmail.com>

simonsays1980

LGTM.

simonsays1980 · 2024-10-28T08:47:17Z

doc/source/rllib/rllib-training.rst

-``recreate_failed_env_runners``: When set to True, your Algorithm will attempt to replace/recreate any failed worker(s) with newly created one(s). This way, your number of workers will never decrease, even if some of them fail from time to time.
-``restart_failed_sub_environments``: When set to True and there is a failure in one of the vectorized sub-environments in one of your workers, the worker will try to recreate only the failed sub-environment and re-integrate the newly created one into your vectorized env stack on that worker.
+``restart_failed_env_runners``: When set to True (default), your Algorithm will attempt to restart any failed EnvRunner and replace it with a newly created one. This way, your number of workers will never decrease, even if some of them fail from time to time.
+``ignore_env_runner_failures``: When set to True, your Algorithm will not crash due to an EnvRunner error, but continue for as long as there is at least one functional worker remaining. This setting is ignored when ``restart_failed_env_runners=True``.


Why do we want to ignore this setting when `restart_failed_env_runners=True`? Going by the semantics of the names, a user would understand that if failed EnvRunners are restarted and the algorithm does not ignore EnvRunner failures, the setting has no meaning.

If `restart_failed_env_runners=True`, then RLlib has no choice between a) ignoring the failure and b) crashing, because it has to restart the failed EnvRunner, so this setting (`ignore_env_runner_failures`) becomes irrelevant.

See also: https://docs.ray.io/en/latest/rllib/rllib-training.html#rllib-scaling-guide
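
The precedence described in this reply can be sketched as a small decision function. This is a simplified model of the behavior discussed in this thread, not RLlib's actual implementation; `Action` and `on_env_runner_failure` are hypothetical names, and the "at least one functional worker" condition is taken from the doc text in the diff above.

```python
from enum import Enum


class Action(Enum):
    RESTART = "restart"
    IGNORE = "ignore"
    CRASH = "crash"


def on_env_runner_failure(
    restart_failed_env_runners: bool = True,
    ignore_env_runner_failures: bool = False,
    num_healthy_remaining: int = 1,
) -> Action:
    """Hypothetical model of what happens when one EnvRunner fails."""
    # Restarting takes precedence, so the ignore flag is never consulted.
    if restart_failed_env_runners:
        return Action.RESTART
    # Without restarts, ignoring only works while some EnvRunner is still functional.
    if ignore_env_runner_failures and num_healthy_remaining >= 1:
        return Action.IGNORE
    return Action.CRASH
```

With the new defaults, `on_env_runner_failure()` returns `Action.RESTART` regardless of how `ignore_env_runner_failures` is set.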

simonsays1980 · 2024-10-28T08:48:03Z

doc/source/rllib/rllib-training.rst


-Note that only one of ``ignore_env_runner_failures`` or ``recreate_failed_env_runners`` may be set to True (they are mutually exclusive settings). However,
+Note that only one of ``ignore_env_runner_failures`` or ``restart_failed_env_runners`` should be set to True (they are mutually exclusive settings). However,


This might need further explanation.
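
One way to spell out the "mutually exclusive" note is to enumerate all four flag combinations. The sketch below is an illustrative model built only from the descriptions in this PR; `effective_behavior` is a hypothetical name, and the behavior strings are paraphrases rather than RLlib output.

```python
from itertools import product


def effective_behavior(restart: bool, ignore: bool) -> str:
    """Hypothetical summary of the behavior for one flag combination."""
    if restart:
        # Per the doc text, the ignore flag is disregarded when restarts are on.
        return "restart failed EnvRunner" + (" (ignore flag has no effect)" if ignore else "")
    return "continue with remaining EnvRunners" if ignore else "raise the error"


# Build and print the full table of (restart, ignore) combinations.
table = {(r, i): effective_behavior(r, i) for r, i in product([True, False], repeat=2)}
for (r, i), behavior in table.items():
    print(f"restart={r!s:<5} ignore={i!s:<5} -> {behavior}")
```

Setting both flags to True is not an error in this model; the second flag is simply redundant, which is why only one of them should be set.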

simonsays1980 · 2024-10-28T08:50:35Z

rllib/algorithms/algorithm_config.py

@@ -3260,11 +3257,11 @@ def fault_tolerance(
                setting is ignored.
            ignore_env_runner_failures: Whether to ignore any EnvRunner failures


Ah, here it becomes clear. Maybe we should refer to this section, or maybe I overlooked it above.

wip

f4a4412

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 requested review from maxpumperla, simonsays1980 and a team as code owners October 27, 2024 20:37

sven1977 assigned simonsays1980 Oct 27, 2024

simonsays1980 approved these changes Oct 28, 2024

sven1977 added 2 commits October 28, 2024 10:31

Merge branch 'master' of https://github.com/ray-project/ray into chan…

5229ffb

…ge_fault_tolerance_default_settings

wip

28bfd12

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 enabled auto-merge (squash) October 28, 2024 10:03

github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) Oct 28, 2024

sven1977 disabled auto-merge October 28, 2024 11:42

sven1977 merged commit 44237c6 into ray-project:master Oct 29, 2024
6 of 7 checks passed

sven1977 deleted the change_fault_tolerance_default_settings branch November 3, 2024 23:14

Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024

[RLlib] Change config.fault_tolerance default behavior (from `recre…

575c48a

…ate_failed_env_runners=False` to `True`). (ray-project#48286)

JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024

[RLlib] Change config.fault_tolerance default behavior (from `recre…

2a8b42d

…ate_failed_env_runners=False` to `True`). (ray-project#48286) Signed-off-by: JP-sDEV <jon.pablo80@gmail.com>

mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024

[RLlib] Change config.fault_tolerance default behavior (from `recre…

e908949

…ate_failed_env_runners=False` to `True`). (ray-project#48286) Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
