Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle node allocation errors gracefully, log details, and exit on failure #264

Merged
merged 3 commits into from
Oct 15, 2024

Conversation

TaekyungHeo
Copy link
Member

@TaekyungHeo TaekyungHeo commented Oct 15, 2024

Summary

This is a bug fix for https://redmine.mellanox.com/issues/4100971.

  • Handle node allocation errors gracefully, log details, and exit on failure
  • Enhance exception message in allocate_nodes

Test Plan

CI passes.

@TaekyungHeo TaekyungHeo added bug Something isn't working enhancement New feature or request Oct24 Oct'24 release feature labels Oct 15, 2024
@TaekyungHeo TaekyungHeo marked this pull request as ready for review October 15, 2024 12:16
@TaekyungHeo TaekyungHeo changed the title Max bug Handle node allocation errors gracefully, log details, and exit on failure Oct 15, 2024
src/cloudai/systems/slurm/slurm_system.py Outdated Show resolved Hide resolved
src/cloudai/systems/slurm/slurm_system.py Outdated Show resolved Hide resolved
src/cloudai/systems/slurm/slurm_system.py Outdated Show resolved Hide resolved
@TaekyungHeo
Copy link
Member Author

Please find the updated code, @amaslenn .

Copy link
Contributor

@amaslenn amaslenn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain how this scenario will work:

  1. test1 has enough nodes
  2. test2 doesn't have enough nodes
  3. test3 has enough nodes
    ?

What will user see as the result? Which test will run?

@TaekyungHeo
Copy link
Member Author

@amaslenn CloudAI can run test 1, but when it tries to run test 2, it will fail and raise an exception. Test 3 will not run as CloudAI has failed.

@TaekyungHeo
Copy link
Member Author

@amaslenn I recall that we had a similar discussion with Jeff and Srivatsan earlier. This is an important topic, and we should make a decision, but it is outside the scope of this PR.

@amaslenn
Copy link
Contributor

..., but it is outside the scope of this PR.

Right, I asked this because we do not raise an exception now and do not propagate it, so I wasn't sure that mentioned behaviour remains. But if you confirm that, no issues, we are good.

@TaekyungHeo TaekyungHeo merged commit 622141f into NVIDIA:main Oct 15, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request Oct24 Oct'24 release feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants