[RLlib] Change `config.fault_tolerance` default behavior (from `recreate_failed_env_runners=False` to `True`). #48286
LGTM.
``recreate_failed_env_runners``: When set to True, your Algorithm will attempt to replace/recreate any failed worker(s) with newly created one(s). This way, your number of workers will never decrease, even if some of them fail from time to time.
``restart_failed_sub_environments``: When set to True and there is a failure in one of the vectorized sub-environments in one of your workers, the worker will try to recreate only the failed sub-environment and re-integrate the newly created one into your vectorized env stack on that worker.
``restart_failed_env_runners``: When set to True (default), your Algorithm will attempt to restart any failed EnvRunner and replace it with a newly created one. This way, your number of workers will never decrease, even if some of them fail from time to time.
``ignore_env_runner_failures``: When set to True, your Algorithm will not crash due to an EnvRunner error, but continue for as long as there is at least one functional worker remaining. This setting is ignored when ``restart_failed_env_runners=True``.
Why do we want to ignore this setting when `restart_failed_env_runners=True`? Following the semantics, a user would understand that if failed env runners are restarted and the algorithm does not ignore failed env runners, the setting has no meaning.
If `restart_failed_env_runners=True`, then RLlib doesn't have a choice to a) ignore or b) crash, b/c it has to restart the failed EnvRunner, so this setting (`ignore_env_runner_failures`) becomes irrelevant.
See also: https://docs.ray.io/en/latest/rllib/rllib-training.html#rllib-scaling-guide
Note that only one of ``ignore_env_runner_failures`` or ``recreate_failed_env_runners`` may be set to True (they are mutually exclusive settings). However,
Note that only one of ``ignore_env_runner_failures`` or ``restart_failed_env_runners`` should be set to True (they are mutually exclusive settings). However,
This might need further explanation.
@@ -3260,11 +3257,11 @@ def fault_tolerance(
setting is ignored.
ignore_env_runner_failures: Whether to ignore any EnvRunner failures
Ah, here it becomes clear. Maybe we refer to this section or maybe I overread it above.
…ge_fault_tolerance_default_settings
…ate_failed_env_runners=False` to `True`). (ray-project#48286)
…ate_failed_env_runners=False` to `True`). (ray-project#48286) Signed-off-by: JP-sDEV <jon.pablo80@gmail.com>
…ate_failed_env_runners=False` to `True`). (ray-project#48286) Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
Why are these changes needed?

Change `config.fault_tolerance` default behavior (from `recreate_failed_env_runners=False` to `True`). The new logic is to always try and restart failed `EnvRunners` by default.

Also renamed `recreate_failed_env_runners` to `restart_failed_env_runners` to match other similar settings and ray's main `max_num_restarts` vocabulary.