[tune] Output insufficient resources warning msg when trials are pending for an extended amount of time. #17533
Conversation
Putting up an initial draft to collect some feedback.
…not running. Ray Tune currently does not receive a definitive signal from resource management about whether a given request is unfulfilled because of other competing requests or can never be fulfilled due to resource limitations. As a result, users complain about trials sitting in PENDING without making any progress. This implementation is, at best, a calculated investment to collect some low-hanging fruit. A proper fix should involve API changes in resource management in the future.
Generally looks good - let's just see about recording the last time we raised the warning.
Also, it would be good to test this somehow, specifically the case where we already have finished trials but no new trials can be scheduled (because they might have different resource requirements). I think we can use a unit test here; we don't need an end-to-end test.
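For reference, here is a minimal sketch of the timing logic being discussed, including the "record the last time we raised the warning" idea. The class name, env var, threshold, and message below are illustrative stand-ins, not the actual trial-executor code:

```python
import logging
import os
import time

logger = logging.getLogger(__name__)

# Hypothetical default; the real threshold and env var name may differ.
WARN_THRESHOLD_S = int(os.environ.get(
    "TUNE_WARN_INSUFFICIENT_RESOURCE_THRESHOLD_S", "60"))


class InsufficientResourcesDetector:
    """Tracks how long we have gone without any RUNNING trial."""

    def __init__(self):
        self._no_running_trials_since = -1.0
        self._last_warn_time = -1.0

    def may_warn(self, all_trials):
        if any(t.status == "RUNNING" for t in all_trials):
            # Progress is being made; reset the clock.
            self._no_running_trials_since = -1.0
            return
        now = time.monotonic()
        if self._no_running_trials_since < 0:
            # Start the clock the first time we see no running trials.
            self._no_running_trials_since = now
            return
        stuck_for = now - self._no_running_trials_since
        # Warn only after the threshold, and record the last time the
        # warning was raised so it is not repeated on every step.
        if (stuck_for > WARN_THRESHOLD_S
                and now - self._last_warn_time > WARN_THRESHOLD_S):
            logger.warning(
                "No trial is running and no new trial has been started for "
                "at least %s seconds. This could mean the cluster does not "
                "have enough resources available for the pending trials.",
                WARN_THRESHOLD_S)
            self._last_warn_time = now
```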
Good point! Yes, my current logic only considers the case of the first trial getting stuck. It should be trivial to cover the case you mentioned here, and +100 to unit tests. Do we have an existing test suite that I can piggyback this test on?
Actually, how about removing is_ray_cluster() altogether? Or using a different x-seconds threshold for the autoscaler case? Basically, what kind of setup should I run to say with reasonable confidence that if none of the PENDING trials proceed to RUNNING after x seconds with the autoscaler, it's reasonable to output a similar warning msg as well?
I think we should definitely use a different threshold for the autoscaler case. BTW, the autoscaler also has output when it's scaling up, so it'll be important to design the UX holistically.
Yes, I do see those log outputs. How about we say "ignore the following msg if you are still seeing xxx in the logs", or is that too cumbersome?
I think it's a bit cumbersome/adds to the noise.
Also extended the change to cover the cases with and without the autoscaler, and added unit tests.
"cpu": 1, | ||
"gpu": 1, | ||
}) | ||
msg = "Autoscaler is disabled. Resource is not ready after " \ |
Any suggestions for better formatting? I vaguely remember the `\` line continuation is not recommended.
msg = ("Autoscaler is disabled. Resource is not ready after "
"extended amount of time without any trials running - "
"please consider if the allocated resource is not enough.")
Updated the msg, PTAL.
pass

with self.assertLogs(logger="ray.tune.trial_executor") as ctx:
    out = tune.run(
The setup in this file either drives the tests from tune.run()
or works at a more granular level by interacting (start/stop/pause/resume) with individual trials.
For me, the added logic is more or less hooked into step() and thus driven by tune.run(). Is there a good way to drive heterogeneous trials, so that we can exercise the logic where:
- one trial runs and terminates successfully, and
- another trial requires more resources than the first trial, so we print the warning msg?
I am open to ideas to add more test coverage.
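In case it helps with coverage for the "finished trial + unschedulable trial" case without driving a full tune.run(), here is a rough unit-test sketch against the hypothetical detector from the earlier sketch; the module name insufficient_resources_detector is made up, and time is patched instead of actually sleeping:

```python
import time
import unittest
from unittest import mock

# Made-up module path for the detector sketched earlier in this thread.
from insufficient_resources_detector import InsufficientResourcesDetector


def _mock_trial(status):
    trial = mock.MagicMock()
    trial.status = status
    return trial


class InsufficientResourcesWarningTest(unittest.TestCase):
    def test_warns_with_finished_and_stuck_pending_trials(self):
        detector = InsufficientResourcesDetector()
        # One trial already finished, another stuck in PENDING because it
        # requests more resources than the cluster has.
        trials = [_mock_trial("TERMINATED"), _mock_trial("PENDING")]

        # First call only records the timestamp; no warning expected yet.
        detector.may_warn(trials)

        # Pretend the threshold has elapsed instead of actually sleeping.
        far_future = time.monotonic() + 10_000
        with mock.patch("time.monotonic", return_value=far_future):
            with self.assertLogs(
                    "insufficient_resources_detector",
                    level="WARNING") as ctx:
                detector.may_warn(trials)

        self.assertTrue(any("resources" in line for line in ctx.output))


if __name__ == "__main__":
    unittest.main()
```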
cc @krfricke to see if you have any input on the unit tests.
Added tests and left a question about how to run heterogeneous trials through tune.run(), so that the specific case you mentioned can also be tested.
Additional questions before I forget:
Done! PTAL.
Awesome, thanks! Some minor nits and we are good to go.
Looks like there is some shady coupling between test suites.
python/ray/tune/trial_executor.py
Outdated
@@ -12,6 +15,18 @@
logger = logging.getLogger(__name__)

# Accessing environment variable could be slow.
Wait, really? Do you have a reference for this?
It should be as fast as accessing any other dict, right? os should load it only once.
I was under the impression that accessing environment variables incurs a penalty in some scripting languages. But looking closely at os.py, os.environ is just a normal wrapped dictionary living in the process, so maybe not so much in this case.
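For reference, a tiny sketch of the point above: os.environ is built once when the os module is imported and behaves like an ordinary in-process mapping afterwards, so reads are cheap; if anything, it is the per-call parsing that is worth hoisting to import time. The env var name below is illustrative:

```python
import os

# Read and parse once at import time; subsequent uses are plain module-level
# attribute lookups.
_WARN_THRESHOLD_S = int(os.environ.get(
    "TUNE_WARN_INSUFFICIENT_RESOURCE_THRESHOLD_S", "60"))


def threshold_read_per_call() -> int:
    # Functionally equivalent, but re-reads os.environ (a dict lookup) and
    # re-runs int() on every call.
    return int(os.environ.get(
        "TUNE_WARN_INSUFFICIENT_RESOURCE_THRESHOLD_S", "60"))
```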
python/ray/tune/trial_executor.py
Outdated
f"This could be due to the cluster not having enough " | ||
f"resources available to start the next trial. Please " | ||
f"check if the requested resources can be fulfilled by " | ||
f"your cluster, or will be fulfilled eventually (when " | ||
f"using the Ray autoscaler).") |
A couple of comments here:
- IMO users that aren't using the Ray autoscaler should not see "Ray autoscaler".
- This doesn't actually provide any action for the user to take. For example, the user may not know what "requested resources" means, or even "cluster".
Instead, it would be good to say:
- which resource is not available, and how much is being requested
- what the total amount of those resources available on the cluster is
Also, one suggestion would be to say that they should stop their tuning job and reconfigure their resource request.
Does that make sense? In principle, we should provide 1. what went wrong on the Ray side (in terms the end user understands), 2. what the user did wrong (if possible), and 3. what they should do instead :)
This is good feedback.
Practically, there is no API that exposes that information. Left a TODO and filed #17799 to follow up.
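Purely to illustrate the shape of the message the review above asks for, here is a hypothetical builder, assuming accessors for the requested and total cluster resources existed (they do not today, hence #17799); every name below is made up:

```python
def build_insufficient_resources_msg(requested: dict, total: dict) -> str:
    """Hypothetical message builder; `requested` and `total` would have to
    come from resource-management APIs that do not exist yet (see #17799)."""
    # Resources where the request exceeds what the cluster can ever provide.
    missing = {
        name: amount
        for name, amount in requested.items()
        if amount > total.get(name, 0)
    }
    return (
        "No trial is currently running and the next trial cannot be "
        f"scheduled. It requests {requested}, but the cluster only has "
        f"{total} in total (short on: {missing}). Stop the tuning job and "
        "lower `resources_per_trial`, or add nodes that provide the "
        "missing resources.")


# Example:
# build_insufficient_resources_msg(
#     {"CPU": 4, "GPU": 1}, {"CPU": 8, "GPU": 0})
# -> points out that 1 GPU is requested but 0 are available.
```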
Updated to address the requested changes :)
…nding for extended amount of time. (ray-project#17533)
Output insufficient resource warning msg when autoscaler is not running.
Why are these changes needed?
Ray Tune currently does not receive a definitive signal from resource management
about whether a given request is unfulfilled because of other competing
requests or can never be fulfilled due to resource limitations. As a result,
users complain about trials sitting in PENDING without making any progress.
This implementation is, at best, a calculated investment to collect some
low-hanging fruit.
A proper fix should involve API changes in resource management in the future.
Related issue number
Closes #16425
Checks
I've run scripts/format.sh to lint the changes in this PR.