
[tune] Output insufficient resources warning msg when trials are in pending for extended amount of time. #17533

Merged: 13 commits merged into ray-project:master on Aug 13, 2021

Conversation

@xwjiang2010 (Contributor) commented Aug 3, 2021

Output insufficient resource warning msg when autoscaler is not running.

Why are these changes needed?

RayTune currently does not receive a definitive signal from resource management
about whether a request is unfulfilled because of other competing requests or
because it can never be fulfilled due to resource limitations. As a result,
users complain about trials repeatedly sitting in PENDING status without making
any progress.

This implementation is at best a calculated investment to collect some
low-hanging fruit.

A proper fix should involve API changes in resource management in the future.
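To make the approach above concrete, here is a minimal, self-contained sketch of the idea (an illustration only, not the code in this PR; the class name, env variable, and default threshold are assumptions): warn when trials sit PENDING with nothing running for longer than a configurable threshold.

```python
import logging
import os
import time

logger = logging.getLogger(__name__)

# Hypothetical env variable and default; the actual PR may use different names.
WARN_THRESHOLD_S = float(os.environ.get("TUNE_WARN_INSUFFICIENT_RESOURCES_S", "60"))


class InsufficientResourcesDetector:
    """Warns when trials stay PENDING with nothing running for too long."""

    def __init__(self):
        self._no_running_since = None  # monotonic timestamp, or None

    def on_step(self, has_running_trials: bool, has_pending_trials: bool) -> None:
        if has_running_trials or not has_pending_trials:
            # Progress is being made (or nothing is waiting), so reset the clock.
            self._no_running_since = None
            return
        now = time.monotonic()
        if self._no_running_since is None:
            self._no_running_since = now
        elif now - self._no_running_since > WARN_THRESHOLD_S:
            logger.warning(
                "No trial is running and trials have been PENDING for more "
                "than %.0f seconds. The cluster may not have enough resources "
                "to satisfy the trials' resource requests.", WARN_THRESHOLD_S)
```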

Related issue number

Closes #16425

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@xwjiang2010 xwjiang2010 changed the title [RayTune] Output insufficient resource warning msg when autoscaler is … [RayTune] Output insufficient resource warning msg when autoscaler is not running. Aug 3, 2021
@xwjiang2010 xwjiang2010 requested a review from richardliaw August 3, 2021 05:11
@xwjiang2010 (Contributor Author)

Putting up an initial draft to collect some feedback.

@xwjiang2010 xwjiang2010 requested review from Yard1 and krfricke August 3, 2021 05:12
@krfricke krfricke self-assigned this Aug 3, 2021
@krfricke (Contributor) left a comment

Generally looks good - let's just see about recording the last time we raised the warning.

Also, it would be good to test this somehow, specifically with the case when we have finished trials already but no new trials can be scheduled (because they might have different resource requirements). I think we can use a unit test here, we don't need an end to end test.
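One way to record the last time the warning was raised, so it is not re-emitted on every step (a hedged sketch; the class and period constant are assumptions rather than the PR's actual code):

```python
import time

REWARN_PERIOD_S = 60.0  # assumed minimum gap between repeated warnings


class WarningThrottler:
    """Remembers when the warning was last raised so it is not spammed."""

    def __init__(self):
        self._last_warned_at = None

    def should_warn(self) -> bool:
        now = time.monotonic()
        if self._last_warned_at is None or now - self._last_warned_at >= REWARN_PERIOD_S:
            self._last_warned_at = now
            return True
        return False
```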

Review threads (outdated, resolved):
  • python/ray/tune/trial_executor.py (2)
@Yard1 Yard1 changed the title [RayTune] Output insufficient resource warning msg when autoscaler is not running. [tune] Output insufficient resource warning msg when autoscaler is not running. Aug 3, 2021
@xwjiang2010 (Contributor Author)

> Generally looks good - let's just see about recording the last time we raised the warning.
>
> Also, it would be good to test this somehow, specifically with the case when we have finished trials already but no new trials can be scheduled (because they might have different resource requirements). I think we can use a unit test here, we don't need an end to end test.

Good point! Yes, my current logic only considers the case of the first trial getting stuck. It should be trivial to cover the case you mentioned here. And +100 to unit tests. Do we have an existing test suite that I can piggyback this test on?

@xwjiang2010 (Contributor Author)

> Generally looks good - let's just see about recording the last time we raised the warning.
> Also, it would be good to test this somehow, specifically with the case when we have finished trials already but no new trials can be scheduled (because they might have different resource requirements). I think we can use a unit test here, we don't need an end to end test.
>
> Good point! Yes, my current logic only considers the case of the first trial getting stuck. It should be trivial to cover the case you mentioned here. And +100 to unit tests. Do we have an existing test suite that I can piggyback this test on?

Actually, how about removing is_ray_cluster() altogether? Or using a different x-second threshold for the case with the autoscaler? Basically, what kind of setup should I run to say with reasonable confidence that if none of the PENDING trials has proceeded to RUNNING after x seconds with the autoscaler, it is reasonable to output a similar warning msg as well?

@richardliaw (Contributor)

> Actually, how about removing is_ray_cluster() altogether? Or using a different x-second threshold for the case with the autoscaler? Basically, what kind of setup should I run to say with reasonable confidence that if none of the PENDING trials has proceeded to RUNNING after x seconds with the autoscaler, it is reasonable to output a similar warning msg as well?

I think we should definitely use a different case for is_ray_cluster(), but still use it as a decision criterion. Time-based warnings sound good to me.

BTW, the autoscaler also has output when it's scaling up, so it'll be important to design the UX holistically.
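A sketch of how the two ideas could combine: keep is_ray_cluster() as the decision criterion, but use a longer time threshold when an autoscaler may still be adding nodes. The env variable names and defaults here are assumptions, not what this PR ships.

```python
import os


def pending_warn_threshold_s(running_on_cluster: bool) -> float:
    """How long trials may sit PENDING before warning.

    `running_on_cluster` would come from the existing is_ray_cluster()
    helper mentioned above.
    """
    if running_on_cluster:
        # With an autoscaler, new nodes may still be coming up, so allow
        # considerably more time before warning.
        return float(os.environ.get("TUNE_WARN_THRESHOLD_CLUSTER_S", "1000"))
    return float(os.environ.get("TUNE_WARN_THRESHOLD_LOCAL_S", "60"))
```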

@xwjiang2010 (Contributor Author)

> BTW, the autoscaler also has output when it's scaling up, so it'll be important to design the UX holistically.

Yes, I do see those log outputs. How about we say "ignore the following msg if you are still seeing xxx in logs", or is that too cumbersome?

@richardliaw (Contributor)

> How about we say "ignore the following msg if you are still seeing xxx in logs", or is that too cumbersome?

I think it's a bit cumbersome/adds to the noise.

Also extended to cover w/ and w/o autoscaler and add unit tests.
"cpu": 1,
"gpu": 1,
})
msg = "Autoscaler is disabled. Resource is not ready after " \
@xwjiang2010 (Contributor Author)

Any suggestions for better formatting? I vaguely remember '\' (the backslash line continuation) is not recommended.

@xwjiang2010 (Contributor Author)

'/'

@Yard1 (Member) commented Aug 4, 2021

msg = ("Autoscaler is disabled. Resource is not ready after "
       "extended amount of time without any trials running - "
       "please consider if the allocated resource is not enough.")

@xwjiang2010 (Contributor Author)

> How about we say "ignore the following msg if you are still seeing xxx in logs", or is that too cumbersome?
>
> I think it's a bit cumbersome/adds to the noise.

Updated the msg, PTAL.

pass

with self.assertLogs(logger="ray.tune.trial_executor") as ctx:
out = tune.run(
@xwjiang2010 (Contributor Author)

The setup in this file either drives the tests from tune.run() or works at a granular level by interacting (start/stop/pause/resume) with individual trials.

For me, the added logic is more or less hooked up with step(), and thus driven by tune.run(). Is there a good way to drive heterogeneous trials, so that we can exercise the logic of:

  • one trial runs and terminates successfully
  • another trial requires more resources than the first trial, and thus we print the warning msg.

I am open to ideas to add more test coverage.
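In the spirit of keeping this at the unit-test level, here is a self-contained sketch that exercises a stand-in for the warning path with a mocked clock and assertLogs, instead of driving heterogeneous trials through tune.run() (which could block waiting for resources). The maybe_warn helper and threshold are hypothetical, not the code in this PR.

```python
import logging
import time
import unittest
from unittest import mock

logger = logging.getLogger("ray.tune.trial_executor")

WARN_THRESHOLD_S = 60.0  # assumed threshold


def maybe_warn(no_running_since: float) -> None:
    """Stand-in for the executor's check: warn once the threshold is exceeded."""
    if time.monotonic() - no_running_since > WARN_THRESHOLD_S:
        logger.warning(
            "Insufficient resources: trials have been PENDING for more than "
            "%.0f seconds.", WARN_THRESHOLD_S)


class InsufficientResourcesWarningTest(unittest.TestCase):
    def test_warning_logged_after_threshold(self):
        # Pretend a long time has passed since anything last ran.
        with mock.patch("time.monotonic", return_value=1000.0):
            with self.assertLogs("ray.tune.trial_executor", level="WARNING") as ctx:
                maybe_warn(no_running_since=0.0)
        self.assertTrue(
            any("Insufficient resources" in line for line in ctx.output))


if __name__ == "__main__":
    unittest.main()
```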

@xwjiang2010 (Contributor Author)

cc @krfricke to see if there is any input for the unit tests.

@xwjiang2010 (Contributor Author)

> Generally looks good - let's just see about recording the last time we raised the warning.
>
> Also, it would be good to test this somehow, specifically with the case when we have finished trials already but no new trials can be scheduled (because they might have different resource requirements). I think we can use a unit test here, we don't need an end to end test.

Added tests and left a question about how to run heterogeneous trials through tune.run(), so that this specific case you mentioned can also be tested.

@xwjiang2010 (Contributor Author)

Additional questions before I forget:

  • I added some new dependencies; do I need to update our dependencies somewhere, like requirements.txt?
  • I added two env variables; should I update the user-facing docs somewhere?

@Yard1 (Member) commented Aug 4, 2021

@xwjiang2010

  • if it's a Tune test only dependency, add it to python/requirements/tune/requirements_tune.txt
  • doc/source/tune/user-guide.rst, "Environment variables" section

@xwjiang2010 xwjiang2010 changed the title [tune] Output insufficient resource warning msg when autoscaler is not running. [tune] Output insufficient resource warning msg when trials are in pending for extended amount of time. Aug 6, 2021
@xwjiang2010 (Contributor Author)

> @xwjiang2010
>
>   • if it's a Tune test only dependency, add it to python/requirements/tune/requirements_tune.txt
>   • doc/source/tune/user-guide.rst, "Environment variables" section

Done! PTAL.

@Yard1 (Member) commented Aug 6, 2021

Awesome, thanks! Some minor nits and we are good to go

Looks like there is some shady coupling between the test suites.
Review threads (outdated, resolved):
  • doc/source/tune/user-guide.rst
  • python/ray/tune/trial_executor.py (2)
  • python/requirements/tune/requirements_tune.txt
@krfricke krfricke linked an issue Aug 11, 2021 that may be closed by this pull request
@xwjiang2010 xwjiang2010 changed the title [tune] Output insufficient resource warning msg when trials are in pending for extended amount of time. [tune] Output insufficient resources warning msg when trials are in pending for extended amount of time. Aug 11, 2021
@@ -12,6 +15,18 @@
logger = logging.getLogger(__name__)


# Accessing environment variable could be slow.
Contributor

wait really? do you have some reference for this?

Member

it should be as fast as accessing any other dict, right? os should load it only once

@xwjiang2010 (Contributor Author)

I was under the impression that accessing environment variables incurs a penalty in some scripting languages. But looking closely at os.py, it seems to be just a normal wrapped in-process dictionary. So maybe not so much in this case.
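For reference, a quick sketch of the two access patterns being discussed. os.environ is populated once at interpreter startup and behaves like an in-process mapping, so caching mainly saves re-parsing the value on every call. The env variable name here is illustrative.

```python
import os

# Pattern 1: read and parse once at import time (what the cached constant in
# the diff is doing).
_WARN_THRESHOLD_S = float(os.environ.get("TUNE_WARN_THRESHOLD_S", "60"))


def threshold_cached() -> float:
    return _WARN_THRESHOLD_S


# Pattern 2: read (and parse) on every call. Still just a dict lookup plus a
# float() parse, but it does pick up changes made to os.environ after import.
def threshold_fresh() -> float:
    return float(os.environ.get("TUNE_WARN_THRESHOLD_S", "60"))
```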

Comment on lines 234 to 238
f"This could be due to the cluster not having enough "
f"resources available to start the next trial. Please "
f"check if the requested resources can be fulfilled by "
f"your cluster, or will be fulfilled eventually (when "
f"using the Ray autoscaler).")
Contributor

A couple of comments here:

  1. IMO, users that aren't using the Ray autoscaler should not see "Ray autoscaler".
  2. This doesn't actually provide any action for the user to take. For example, the user may not know what "requested resources" means, or even "cluster".

Instead, it would be good to say:

  1. which resource is not available, and how much is being requested
  2. what the total amount of that resource available on the cluster is

Also, one suggestion would be to say that they should stop their tuning job and reconfigure their resource request.

Does that make sense? In principle, we should provide 1. what went wrong on the Ray side (in terms that the end user understands), 2. what the user did wrong (if possible), and 3. what they should do instead :)

@xwjiang2010 (Contributor Author)

This is good feedback.
Practically, there is no API that exposes that information. Left a TODO and filed #17799 to follow up.
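As a rough illustration of the direction suggested above (and only an approximation of what the follow-up issue asks for), Ray's cluster-level APIs ray.cluster_resources() and ray.available_resources() could at least put concrete numbers into the message. The `requested` dict and the wording are assumptions; this is not the code added in this PR.

```python
import ray


def insufficient_resources_msg(requested: dict) -> str:
    """Build a warning that names the requested vs. available resources.

    Assumes ray.init() has already been called. `requested` is a hypothetical
    per-trial resource request, e.g. {"CPU": 4, "GPU": 1}.
    """
    total = ray.cluster_resources()        # aggregate resources in the cluster
    available = ray.available_resources()  # resources currently free
    return (
        f"A trial requested {requested}, but the cluster only has {total} in "
        f"total ({available} currently available). Consider stopping the "
        f"tuning job and lowering the trial's resource request, or adding "
        f"more nodes to the cluster.")
```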

@richardliaw richardliaw dismissed krfricke’s stale review August 13, 2021 08:37

updated to address requested changes :)

@richardliaw richardliaw merged commit 0be9f06 into ray-project:master Aug 13, 2021
Bam4d pushed a commit to Bam4d/ray that referenced this pull request Aug 13, 2021
@xwjiang2010 xwjiang2010 deleted the usability branch July 26, 2023 19:55

Successfully merging this pull request may close these issues:

  • [tune] Better error message for all pending trials
  • [rllib][tune] Training stuck in "Pending" status