[tune] Output insufficient resources warning msg when trials are pending for an extended amount of time. #17533
Conversation
Putting up an initial draft to collect some feedback.
…not running. Ray Tune currently does not receive a definitive signal from resource management about whether a given request is unfulfilled because of other competing requests or can never be fulfilled due to resource limitations. As a result, users complain about trials sitting in PENDING without making any progress. This implementation is, at best, a calculated investment to collect some low-hanging fruit. A proper fix should involve API changes in resource management in the future.
Generally looks good - let's just see about recording the last time we raised the warning.
Also, it would be good to test this somehow, specifically the case where we already have finished trials but no new trials can be scheduled (because they might have different resource requirements). I think we can use a unit test here; we don't need an end-to-end test.
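For reference, here is a minimal sketch of the timing logic being discussed, including the "record the last time we raised the warning" idea. The class name, env var, threshold, and message below are illustrative stand-ins, not the actual trial-executor code:

```python
import logging
import os
import time

logger = logging.getLogger(__name__)

# Hypothetical default; the real threshold and env var name may differ.
WARN_THRESHOLD_S = int(os.environ.get(
    "TUNE_WARN_INSUFFICIENT_RESOURCE_THRESHOLD_S", "60"))


class InsufficientResourcesDetector:
    """Tracks how long we have gone without any RUNNING trial."""

    def __init__(self):
        self._no_running_trials_since = -1.0
        self._last_warn_time = -1.0

    def may_warn(self, all_trials):
        if any(t.status == "RUNNING" for t in all_trials):
            # Progress is being made; reset the clock.
            self._no_running_trials_since = -1.0
            return
        now = time.monotonic()
        if self._no_running_trials_since < 0:
            # Start the clock the first time we see no running trials.
            self._no_running_trials_since = now
            return
        stuck_for = now - self._no_running_trials_since
        # Warn only after the threshold, and record the last time the
        # warning was raised so it is not repeated on every step.
        if (stuck_for > WARN_THRESHOLD_S
                and now - self._last_warn_time > WARN_THRESHOLD_S):
            logger.warning(
                "No trial is running and no new trial has been started for "
                "at least %s seconds. This could mean the cluster does not "
                "have enough resources available for the pending trials.",
                WARN_THRESHOLD_S)
            self._last_warn_time = now
```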
Good point! Yes, my current logic only considers the case of the first trial getting stuck. It should be trivial to cover the case you mentioned here, and +100 to unit tests. Do we have an existing test suite that I can piggyback this test on?
Actually, how about removing is_ray_cluster() altogether? Or using a different x-seconds threshold for the autoscaler case? Basically, what kind of setup should I run to say with reasonable confidence that if none of the PENDING trials proceed to RUNNING after x seconds with the autoscaler, it's reasonable to output a similar warning msg as well?
I think we should definitely use a different threshold for the autoscaler case. BTW, the autoscaler also has output when it's scaling up, so it'll be important to design the UX holistically.
Yes, I do see those log outputs. How about we say "ignore the following msg if you are still seeing xxx in the logs", or is that too cumbersome?
I think it's a bit cumbersome/adds to the noise.
Also extended the change to cover the cases with and without the autoscaler, and added unit tests.
"cpu": 1, | ||
"gpu": 1, | ||
}) | ||
msg = "Autoscaler is disabled. Resource is not ready after " \ |
Any suggestions for better formatting? I vaguely remember the `\` line continuation is not recommended.
msg = ("Autoscaler is disabled. Resource is not ready after "
"extended amount of time without any trials running - "
"please consider if the allocated resource is not enough.")
Updated the msg, PTAL.
pass

with self.assertLogs(logger="ray.tune.trial_executor") as ctx:
    out = tune.run(
The setup in this file either drives the tests from tune.run()
or works at a more granular level by interacting (start/stop/pause/resume) with individual trials.
For me, the added logic is more or less hooked into step() and thus driven by tune.run(). Is there a good way to drive heterogeneous trials, so that we can exercise the logic where:
- one trial runs and terminates successfully, and
- another trial requires more resources than the first trial, so we print the warning msg?
I am open to ideas to add more test coverage.
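In case it helps with coverage for the "finished trial + unschedulable trial" case without driving a full tune.run(), here is a rough unit-test sketch against the hypothetical detector from the earlier sketch; the module name insufficient_resources_detector is made up, and time is patched instead of actually sleeping:

```python
import time
import unittest
from unittest import mock

# Made-up module path for the detector sketched earlier in this thread.
from insufficient_resources_detector import InsufficientResourcesDetector


def _mock_trial(status):
    trial = mock.MagicMock()
    trial.status = status
    return trial


class InsufficientResourcesWarningTest(unittest.TestCase):
    def test_warns_with_finished_and_stuck_pending_trials(self):
        detector = InsufficientResourcesDetector()
        # One trial already finished, another stuck in PENDING because it
        # requests more resources than the cluster has.
        trials = [_mock_trial("TERMINATED"), _mock_trial("PENDING")]

        # First call only records the timestamp; no warning expected yet.
        detector.may_warn(trials)

        # Pretend the threshold has elapsed instead of actually sleeping.
        far_future = time.monotonic() + 10_000
        with mock.patch("time.monotonic", return_value=far_future):
            with self.assertLogs(
                    "insufficient_resources_detector",
                    level="WARNING") as ctx:
                detector.may_warn(trials)

        self.assertTrue(any("resources" in line for line in ctx.output))


if __name__ == "__main__":
    unittest.main()
```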
cc @krfricke to see if you have any input on the unit tests.
Added tests and left a question about how to run heterogeneous trials through tune.run(), so that the specific case you mentioned can also be tested.
Additional questions before I forget:
Done! PTAL.
Awesome, thanks! Some minor nits and we are good to go.
Looks like there is some shady coupling between test suites.
python/ray/tune/trial_executor.py
Outdated
@@ -12,6 +15,18 @@
logger = logging.getLogger(__name__)

# Accessing environment variable could be slow.
Wait, really? Do you have a reference for this?
It should be as fast as accessing any other dict, right? os should load it only once.
I was under the impression that accessing environment variables incurs a penalty in some scripting languages. But looking closely at os.py, os.environ is just a normal wrapped dictionary living in the process, so maybe not so much in this case.
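For reference, a tiny sketch of the point above: os.environ is built once when the os module is imported and behaves like an ordinary in-process mapping afterwards, so reads are cheap; if anything, it is the per-call parsing that is worth hoisting to import time. The env var name below is illustrative:

```python
import os

# Read and parse once at import time; subsequent uses are plain module-level
# attribute lookups.
_WARN_THRESHOLD_S = int(os.environ.get(
    "TUNE_WARN_INSUFFICIENT_RESOURCE_THRESHOLD_S", "60"))


def threshold_read_per_call() -> int:
    # Functionally equivalent, but re-reads os.environ (a dict lookup) and
    # re-runs int() on every call.
    return int(os.environ.get(
        "TUNE_WARN_INSUFFICIENT_RESOURCE_THRESHOLD_S", "60"))
```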
python/ray/tune/trial_executor.py
Outdated
f"This could be due to the cluster not having enough " | ||
f"resources available to start the next trial. Please " | ||
f"check if the requested resources can be fulfilled by " | ||
f"your cluster, or will be fulfilled eventually (when " | ||
f"using the Ray autoscaler).") |
A couple of comments here:
- IMO users that aren't using the Ray autoscaler should not see "Ray autoscaler".
- This doesn't actually provide any action for the user to take. For example, the user may not know what "requested resources" means, or even "cluster".
Instead, it would be good to say:
- which resource is not available, and how much is being requested
- what the total amount of those resources available on the cluster is
Also, one suggestion would be to say that they should stop their tuning job and reconfigure their resource request.
Does that make sense? In principle, we should provide 1. what went wrong on the Ray side (in terms the end user understands), 2. what the user did wrong (if possible), and 3. what they should do instead :)
This is good feedback.
Practically, there is no API that exposes that information. Left a TODO and filed #17799 to follow up.
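Purely to illustrate the shape of the message the review above asks for, here is a hypothetical builder, assuming accessors for the requested and total cluster resources existed (they do not today, hence #17799); every name below is made up:

```python
def build_insufficient_resources_msg(requested: dict, total: dict) -> str:
    """Hypothetical message builder; `requested` and `total` would have to
    come from resource-management APIs that do not exist yet (see #17799)."""
    # Resources where the request exceeds what the cluster can ever provide.
    missing = {
        name: amount
        for name, amount in requested.items()
        if amount > total.get(name, 0)
    }
    return (
        "No trial is currently running and the next trial cannot be "
        f"scheduled. It requests {requested}, but the cluster only has "
        f"{total} in total (short on: {missing}). Stop the tuning job and "
        "lower `resources_per_trial`, or add nodes that provide the "
        "missing resources.")


# Example:
# build_insufficient_resources_msg(
#     {"CPU": 4, "GPU": 1}, {"CPU": 8, "GPU": 0})
# -> points out that 1 GPU is requested but 0 are available.
```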
Updated to address the requested changes :)
…nding for extended amount of time. (ray-project#17533)
Output insufficient resource warning msg when autoscaler is not running.
Why are these changes needed?
Ray Tune currently does not receive a definitive signal from resource management
about whether a given request is unfulfilled because of other competing
requests or can never be fulfilled due to resource limitations. As a result,
users complain about trials sitting in PENDING without making any progress.
This implementation is, at best, a calculated investment to collect some
low-hanging fruit.
A proper fix should involve API changes in resource management in the future.
Related issue number
Closes #16425
Checks
I've run scripts/format.sh to lint the changes in this PR.