[RLlib] Change config.fault_tolerance default behavior (from recreate_failed_env_runners=False to True). #48286

Merged

Conversation

sven1977
Contributor

Change config.fault_tolerance default behavior (from recreate_failed_env_runners=False to True).

The new logic is to always try to restart failed EnvRunners by default.

Also renamed recreate_failed_env_runners to restart_failed_env_runners to match other similar settings and Ray's main max_num_restarts vocabulary.
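
For code that still holds settings under the old name, the rename and the new default could be handled along these lines. This is only a sketch: `migrate_fault_tolerance_kwargs` is a hypothetical helper and not part of RLlib, and it assumes the settings live in a plain dict that is later passed on to `config.fault_tolerance(...)`.

```python
import warnings

# Old and new names of the setting, as described in this PR.
OLD_NAME = "recreate_failed_env_runners"
NEW_NAME = "restart_failed_env_runners"


def migrate_fault_tolerance_kwargs(kwargs):
    """Return a copy of `kwargs` that uses the renamed setting.

    Hypothetical helper for illustration only; RLlib does not ship it.
    """
    migrated = dict(kwargs)
    if OLD_NAME in migrated:
        if NEW_NAME in migrated:
            raise ValueError(f"Pass only `{NEW_NAME}`, not both names.")
        warnings.warn(
            f"`{OLD_NAME}` was renamed to `{NEW_NAME}`.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Carry the user's explicit choice over to the new name.
        migrated[NEW_NAME] = migrated.pop(OLD_NAME)
    # New default per this PR: restart failed EnvRunners unless told otherwise.
    migrated.setdefault(NEW_NAME, True)
    return migrated


if __name__ == "__main__":
    print(migrate_fault_tolerance_kwargs({"recreate_failed_env_runners": False}))
    print(migrate_fault_tolerance_kwargs({}))
```

An explicit `False` under the old name is preserved, so users who relied on the previous default can keep that behavior by setting it explicitly.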

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <svenmika1977@gmail.com>
Collaborator

@simonsays1980 simonsays1980 left a comment

LGTM.

``recreate_failed_env_runners``: When set to True, your Algorithm will attempt to replace/recreate any failed worker(s) with newly created one(s). This way, your number of workers will never decrease, even if some of them fail from time to time.
``restart_failed_sub_environments``: When set to True and there is a failure in one of the vectorized sub-environments in one of your workers, the worker will try to recreate only the failed sub-environment and re-integrate the newly created one into your vectorized env stack on that worker.
``restart_failed_env_runners``: When set to True (default), your Algorithm will attempt to restart any failed EnvRunner and replace it with a newly created one. This way, your number of workers will never decrease, even if some of them fail from time to time.
``ignore_env_runner_failures``: When set to True, your Algorithm will not crash due to an EnvRunner error, but continue for as long as there is at least one functional worker remaining. This setting is ignored when ``restart_failed_env_runners=True``.
Collaborator

Why do we want to ignore this setting when restart_failed_env_runners=True? Following the semantics, a user would infer that if failed EnvRunners are restarted and the algorithm does not ignore failed EnvRunners, the setting has no meaning.

Contributor Author

If restart_failed_env_runners=True, RLlib has no choice between a) ignoring the failure and b) crashing, because it has to restart the failed EnvRunner. The ignore_env_runner_failures setting therefore becomes irrelevant.

See also: https://docs.ray.io/en/latest/rllib/rllib-training.html#rllib-scaling-guide
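
The precedence described in this thread can be modeled as a small decision function. This is an illustrative sketch only, not RLlib's actual implementation: the function name, the returned strings, and the `num_healthy_remaining` parameter are assumptions made for the example.

```python
def on_env_runner_failure(
    restart_failed_env_runners=True,
    ignore_env_runner_failures=False,
    num_healthy_remaining=0,
):
    """Model what happens when one EnvRunner fails (illustrative only)."""
    # Restarting takes precedence, so the ignore flag is never consulted.
    if restart_failed_env_runners:
        return "restart"
    # Without restarts, ignoring only works while a functional worker remains.
    if ignore_env_runner_failures and num_healthy_remaining > 0:
        return "ignore"
    # Neither restart nor ignore applies: the Algorithm fails.
    return "crash"


if __name__ == "__main__":
    for restart in (True, False):
        for ignore in (True, False):
            action = on_env_runner_failure(restart, ignore, num_healthy_remaining=1)
            print(f"restart={restart}, ignore={ignore}: {action}")
```

In this model, changing `ignore_env_runner_failures` has no effect whenever `restart_failed_env_runners=True`, which is the sense in which the setting is ignored.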


- Note that only one of ``ignore_env_runner_failures`` or ``recreate_failed_env_runners`` may be set to True (they are mutually exclusive settings). However,
+ Note that only one of ``ignore_env_runner_failures`` or ``restart_failed_env_runners`` should be set to True (they are mutually exclusive settings). However,
Collaborator

This might need further explanation.

@@ -3260,11 +3257,11 @@ def fault_tolerance(
setting is ignored.
ignore_env_runner_failures: Whether to ignore any EnvRunner failures
Collaborator

Ah, here it becomes clear. Maybe we should refer to this section, or maybe I overlooked it above.

Signed-off-by: sven1977 <svenmika1977@gmail.com>
@sven1977 sven1977 enabled auto-merge (squash) October 28, 2024 10:03
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Oct 28, 2024
@sven1977 sven1977 merged commit 44237c6 into ray-project:master Oct 29, 2024
6 of 7 checks passed
@sven1977 sven1977 deleted the change_fault_tolerance_default_settings branch November 3, 2024 23:14
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
…ate_failed_env_runners=False` to `True`). (ray-project#48286)

Signed-off-by: JP-sDEV <jon.pablo80@gmail.com>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…ate_failed_env_runners=False` to `True`). (ray-project#48286)

Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
Labels
go add ONLY when ready to merge, run all tests
2 participants