[RLlib] Change config.fault_tolerance default behavior (from recreate_failed_env_runners=False to True). #48286

Merged: sven1977 merged 3 commits into ray-project:master from sven1977:change_fault_tolerance_default_settings on Oct 29, 2024.
@@ -412,16 +412,16 @@ inference. Make sure to set ``num_gpus: 1`` if you want to use a GPU. If the lea
  4. Finally, if both model and environment are compute intensive, then enable `remote worker envs <rllib-env.html#vectorized>`__ with `async batching <rllib-env.html#vectorized>`__ by setting ``remote_worker_envs: True`` and optionally ``remote_env_batch_wait_ms``. This batches inference on GPUs in the rollout workers while letting envs run asynchronously in separate actors, similar to the `SEED <https://ai.googleblog.com/2020/03/massively-scaling-reinforcement.html>`__ architecture. The number of workers and number of envs per worker should be tuned to maximize GPU utilization.

  In case you are using lots of workers (``num_env_runners >> 10``) and you observe worker failures for whatever reasons, which normally interrupt your RLlib training runs, consider using
- the config settings ``ignore_env_runner_failures=True``, ``recreate_failed_env_runners=True``, or ``restart_failed_sub_environments=True``:
+ the config settings ``ignore_env_runner_failures=True``, ``restart_failed_env_runners=True``, or ``restart_failed_sub_environments=True``:

- ``ignore_env_runner_failures``: When set to True, your Algorithm will not crash due to a single worker error but continue for as long as there is at least one functional worker remaining.
- ``recreate_failed_env_runners``: When set to True, your Algorithm will attempt to replace/recreate any failed worker(s) with newly created one(s). This way, your number of workers will never decrease, even if some of them fail from time to time.
- ``restart_failed_sub_environments``: When set to True and there is a failure in one of the vectorized sub-environments in one of your workers, the worker will try to recreate only the failed sub-environment and re-integrate the newly created one into your vectorized env stack on that worker.
+ ``restart_failed_env_runners``: When set to True (default), your Algorithm will attempt to restart any failed EnvRunner and replace it with a newly created one. This way, your number of workers will never decrease, even if some of them fail from time to time.
+ ``ignore_env_runner_failures``: When set to True, your Algorithm will not crash due to an EnvRunner error, but continue for as long as there is at least one functional worker remaining. This setting is ignored when ``restart_failed_env_runners=True``.
+ ``restart_failed_sub_environments``: When set to True and there is a failure in one of the vectorized sub-environments in one of your EnvRunners, RLlib tries to recreate only the failed sub-environment and re-integrate the newly created one into your vectorized env stack on that EnvRunner.

- Note that only one of ``ignore_env_runner_failures`` or ``recreate_failed_env_runners`` may be set to True (they are mutually exclusive settings). However,
+ Note that only one of ``ignore_env_runner_failures`` or ``restart_failed_env_runners`` should be set to True (they are mutually exclusive settings). However,

Review comment: This might need further explanation.

  you can combine each of these with the ``restart_failed_sub_environments=True`` setting.
- Using these options will make your training runs much more stable and more robust against occasional OOM or other similar "once in a while" errors on your workers
- themselves or inside your environments.
+ Using these options will make your training runs much more stable and more robust against occasional OOM or other similar "once in a while" errors on the EnvRunners
+ themselves or inside your custom environments.
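The precedence among these flags can be summarized in a short sketch. This is an illustrative model of the documented semantics, not RLlib source code: restarting takes precedence, ignoring is a fallback, and with both flags off a failure crashes the Algorithm (the pre-PR default).

```python
def on_env_runner_failure(restart_failed_env_runners: bool,
                          ignore_env_runner_failures: bool) -> str:
    """Illustrative sketch of the documented fault-tolerance precedence."""
    # Restarting takes precedence: when restart_failed_env_runners=True,
    # ignore_env_runner_failures is irrelevant (see the review thread below).
    if restart_failed_env_runners:
        return "restart"
    # Otherwise, optionally tolerate the loss and continue with the
    # remaining EnvRunners.
    if ignore_env_runner_failures:
        return "ignore"
    # With both flags off, the Algorithm crashes on a worker failure.
    return "crash"
```

With the new default (`restart_failed_env_runners=True`), the first branch applies unless the user explicitly opts out.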
  Debugging RLlib Experiments
@@ -528,26 +528,22 @@ def __init__(self, algo_class: Optional[type] = None):
          self._evaluation_parallel_to_training_wo_thread = False

          # `self.fault_tolerance()`
-         # TODO (sven): Rename to `restart_..` to match other attributes AND ray's
-         # `ray.remote(max_num_restarts=..)([class])` setting.
-         self.recreate_failed_env_runners = False
+         self.restart_failed_env_runners = True
          self.ignore_env_runner_failures = False
          # By default, restart failed worker a thousand times.
          # This should be enough to handle normal transient failures.
-         # This also prevents infinite number of restarts in case
-         # the worker or env has a bug.
+         # This also prevents infinite number of restarts in case the worker or env has
+         # a bug.
          self.max_num_env_runner_restarts = 1000
-         # Small delay between worker restarts. In case rollout or
-         # evaluation workers have remote dependencies, this delay can be
-         # adjusted to make sure we don't flood them with re-connection
-         # requests, and allow them enough time to recover.
-         # This delay also gives Ray time to stream back error logging
-         # and exceptions.
+         # Small delay between worker restarts. In case EnvRunners or eval EnvRunners
+         # have remote dependencies, this delay can be adjusted to make sure we don't
+         # flood them with re-connection requests, and allow them enough time to recover.
+         # This delay also gives Ray time to stream back error logging and exceptions.
          self.delay_between_env_runner_restarts_s = 60.0
          self.restart_failed_sub_environments = False
          self.num_consecutive_env_runner_failures_tolerance = 100
-         self.env_runner_health_probe_timeout_s = 30
-         self.env_runner_restore_timeout_s = 1800
+         self.env_runner_health_probe_timeout_s = 30.0
+         self.env_runner_restore_timeout_s = 1800.0

          # `self.rl_module()`
          self._model_config = {}
@@ -3230,7 +3226,7 @@ def debugging(
      def fault_tolerance(
          self,
          *,
-         recreate_failed_env_runners: Optional[bool] = NotProvided,
+         restart_failed_env_runners: Optional[bool] = NotProvided,
          ignore_env_runner_failures: Optional[bool] = NotProvided,
          max_num_env_runner_restarts: Optional[int] = NotProvided,
          delay_between_env_runner_restarts_s: Optional[float] = NotProvided,
@@ -3239,6 +3235,7 @@ def fault_tolerance(
          env_runner_health_probe_timeout_s: Optional[float] = NotProvided,
          env_runner_restore_timeout_s: Optional[float] = NotProvided,
          # Deprecated args.
+         recreate_failed_env_runners=DEPRECATED_VALUE,
          ignore_worker_failures=DEPRECATED_VALUE,
          recreate_failed_workers=DEPRECATED_VALUE,
          max_num_worker_restarts=DEPRECATED_VALUE,
@@ -3250,8 +3247,8 @@ def fault_tolerance(
          """Sets the config's fault tolerance settings.

          Args:
-             recreate_failed_env_runners: Whether - upon an EnvRunner failure - RLlib
-                 tries to recreate the lost EnvRunner(s) as an identical copy of the
+             restart_failed_env_runners: Whether - upon an EnvRunner failure - RLlib
+                 tries to restart the lost EnvRunner(s) as an identical copy of the
                  failed one(s). You should set this to True when training on SPOT
                  instances that may preempt any time. The new, recreated EnvRunner(s)
                  only differ from the failed one in their `self.recreated_worker=True`
@@ -3260,11 +3257,11 @@ def fault_tolerance(
                  setting is ignored.
              ignore_env_runner_failures: Whether to ignore any EnvRunner failures
                  and continue running with the remaining EnvRunners. This setting is
-                 ignored, if `recreate_failed_env_runners=True`.
+                 ignored, if `restart_failed_env_runners=True`.
              max_num_env_runner_restarts: The maximum number of times any EnvRunner
-                 is allowed to be restarted (if `recreate_failed_env_runners` is True).
+                 is allowed to be restarted (if `restart_failed_env_runners` is True).
              delay_between_env_runner_restarts_s: The delay (in seconds) between two
-                 consecutive EnvRunner restarts (if `recreate_failed_env_runners` is
+                 consecutive EnvRunner restarts (if `restart_failed_env_runners` is
                  True).
              restart_failed_sub_environments: If True and any sub-environment (within
                  a vectorized env) throws any error during env stepping, the

Review comment (on the `ignore_env_runner_failures` docstring): Ah, here it becomes clear. Maybe we refer to this section, or maybe I overread it above.
@@ -3274,7 +3271,7 @@ def fault_tolerance(
              num_consecutive_env_runner_failures_tolerance: The number of consecutive
                  times an EnvRunner failure (also for evaluation) is tolerated before
                  finally crashing the Algorithm. Only useful if either
-                 `ignore_env_runner_failures` or `recreate_failed_env_runners` is True.
+                 `ignore_env_runner_failures` or `restart_failed_env_runners` is True.
                  Note that for `restart_failed_sub_environments` and sub-environment
                  failures, the EnvRunner itself is NOT affected and won't throw any
                  errors as the flawed sub-environment is silently restarted under the
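The `num_consecutive_env_runner_failures_tolerance` semantics can be sketched as a simple counter that crashes only after an unbroken streak of failures, with any success resetting the streak. This is a hypothetical minimal model, not RLlib's actual tracking class:

```python
class ConsecutiveFailureTracker:
    """Illustrative sketch of a consecutive-failure tolerance counter."""

    def __init__(self, tolerance: int = 100):
        self.tolerance = tolerance
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        # One more failure in the current streak; crash once the streak
        # reaches the configured tolerance.
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.tolerance:
            raise RuntimeError(
                "Too many consecutive EnvRunner failures; crashing Algorithm."
            )

    def record_success(self) -> None:
        # A healthy result resets the streak, so only truly consecutive
        # failures count toward the tolerance.
        self.consecutive_failures = 0
```

Because only consecutive failures count, occasional transient errors spread across a long run never trip the threshold.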
@@ -3290,6 +3287,12 @@ def fault_tolerance(
          Returns:
              This updated AlgorithmConfig object.
          """
+         if recreate_failed_env_runners != DEPRECATED_VALUE:
+             deprecation_warning(
+                 old="AlgorithmConfig.fault_tolerance(recreate_failed_env_runners)",
+                 new="AlgorithmConfig.fault_tolerance(restart_failed_env_runners)",
+                 error=True,
+             )
          if ignore_worker_failures != DEPRECATED_VALUE:
              deprecation_warning(
                  old="AlgorithmConfig.fault_tolerance(ignore_worker_failures)",
@@ -3299,7 +3302,7 @@ def fault_tolerance(
          if recreate_failed_workers != DEPRECATED_VALUE:
              deprecation_warning(
                  old="AlgorithmConfig.fault_tolerance(recreate_failed_workers)",
-                 new="AlgorithmConfig.fault_tolerance(recreate_failed_env_runners)",
+                 new="AlgorithmConfig.fault_tolerance(restart_failed_env_runners)",
                  error=True,
              )
          if max_num_worker_restarts != DEPRECATED_VALUE:
@@ -3339,8 +3342,8 @@ def fault_tolerance(

          if ignore_env_runner_failures is not NotProvided:
              self.ignore_env_runner_failures = ignore_env_runner_failures
-         if recreate_failed_env_runners is not NotProvided:
-             self.recreate_failed_env_runners = recreate_failed_env_runners
+         if restart_failed_env_runners is not NotProvided:
+             self.restart_failed_env_runners = restart_failed_env_runners
          if max_num_env_runner_restarts is not NotProvided:
              self.max_num_env_runner_restarts = max_num_env_runner_restarts
          if delay_between_env_runner_restarts_s is not NotProvided:
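The `is not NotProvided` checks above rely on a sentinel-object pattern: each keyword defaults to a unique sentinel so the setter only overwrites attributes the caller explicitly passed (including explicit `None` or `False`). A minimal standalone sketch of that pattern, using hypothetical names rather than RLlib's actual classes:

```python
class _NotProvidedType:
    """Unique sentinel type; distinct from None, False, and 0."""

_NOT_PROVIDED = _NotProvidedType()

class FaultToleranceConfigSketch:
    """Hypothetical minimal config class demonstrating the sentinel pattern."""

    def __init__(self):
        self.restart_failed_env_runners = True
        self.ignore_env_runner_failures = False

    def fault_tolerance(
        self,
        *,
        restart_failed_env_runners=_NOT_PROVIDED,
        ignore_env_runner_failures=_NOT_PROVIDED,
    ):
        # Only overwrite attributes the caller explicitly provided; identity
        # comparison against the sentinel distinguishes "omitted" from any
        # real value the user might pass.
        if restart_failed_env_runners is not _NOT_PROVIDED:
            self.restart_failed_env_runners = restart_failed_env_runners
        if ignore_env_runner_failures is not _NOT_PROVIDED:
            self.ignore_env_runner_failures = ignore_env_runner_failures
        # Returning self enables the builder-style chaining RLlib configs use.
        return self
```

Passing only one keyword leaves the other attribute at its previous value, which is what allows chained, partial config updates.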
@@ -5150,6 +5153,22 @@ def rollouts(self, *args, **kwargs):
      def exploration(self, *args, **kwargs):
          return self.env_runners(*args, **kwargs)

+     @property
+     @Deprecated(
+         new="AlgorithmConfig.fault_tolerance(restart_failed_env_runners=..)",
+         error=False,
+     )
+     def recreate_failed_env_runners(self):
+         return self.restart_failed_env_runners
+
+     @recreate_failed_env_runners.setter
+     def recreate_failed_env_runners(self, value):
+         deprecation_warning(
+             old="AlgorithmConfig.recreate_failed_env_runners",
+             new="AlgorithmConfig.restart_failed_env_runners",
+             error=True,
+         )

      @property
      @Deprecated(new="AlgorithmConfig._enable_new_api_stack", error=False)
      def _enable_new_api_stack(self):
@@ -5225,18 +5244,18 @@ def ignore_worker_failures(self, value):
          self.ignore_env_runner_failures = value

      @property
-     @Deprecated(new="AlgorithmConfig.recreate_failed_env_runners", error=False)
+     @Deprecated(new="AlgorithmConfig.restart_failed_env_runners", error=False)
      def recreate_failed_workers(self):
-         return self.recreate_failed_env_runners
+         return self.restart_failed_env_runners

      @recreate_failed_workers.setter
      def recreate_failed_workers(self, value):
          deprecation_warning(
              old="AlgorithmConfig.recreate_failed_workers",
-             new="AlgorithmConfig.recreate_failed_env_runners",
+             new="AlgorithmConfig.restart_failed_env_runners",
              error=False,
          )
-         self.recreate_failed_env_runners = value
+         self.restart_failed_env_runners = value
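The backward-compatibility technique in these hunks is a property pair: the old attribute name becomes a property that forwards reads to the new attribute (with a warning), while writes to the renamed attribute raise a hard error. A self-contained sketch of that pattern, using the standard library's `warnings` module instead of RLlib's `@Deprecated`/`deprecation_warning` helpers:

```python
import warnings

class AlgoConfigSketch:
    """Hypothetical minimal class showing the deprecated-alias property pattern."""

    def __init__(self):
        # The new, canonical attribute (default flipped to True by this PR).
        self.restart_failed_env_runners = True

    @property
    def recreate_failed_env_runners(self):
        # Reads through the old name still work, but emit a deprecation warning
        # and forward to the new attribute.
        warnings.warn(
            "recreate_failed_env_runners is deprecated; "
            "use restart_failed_env_runners instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.restart_failed_env_runners

    @recreate_failed_env_runners.setter
    def recreate_failed_env_runners(self, value):
        # Writes through the old name are a hard error (mirroring error=True
        # in the diff), forcing callers to migrate.
        raise AttributeError(
            "recreate_failed_env_runners was renamed; "
            "set restart_failed_env_runners instead."
        )
```

Because the property is a data descriptor, assignment through the old name reliably hits the erroring setter rather than silently shadowing the property with an instance attribute.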

      @property
      @Deprecated(new="AlgorithmConfig.max_num_env_runner_restarts", error=False)
Review comment: Why do we want to ignore this setting when restart_failed_env_runners=True? Following the semantics, a user would understand that if failed env runners are restarted and the algorithm does not ignore failed env runners, the setting has no meaning.

Reply: If restart_failed_env_runners=True, then RLlib doesn't have a choice to a) ignore or b) crash, b/c it has to restart the failed EnvRunner, so this setting (ignore_env_runner_failures) becomes irrelevant.

See also: https://docs.ray.io/en/latest/rllib/rllib-training.html#rllib-scaling-guide