Handle node allocation errors gracefully, log details, and exit on failure #264

TaekyungHeo · 2024-10-15T12:15:01Z

Summary

This is a bug fix for https://redmine.mellanox.com/issues/4100971.

Handle node allocation errors gracefully, log details, and exit on failure
Enhance exception message in allocate_nodes
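
For readers without the diff at hand, here is a minimal sketch of the pattern the two items above describe. It is not the actual CloudAI code: `NodeAllocationError`, the `allocate_nodes` signature shown here, and `allocate_or_exit` are hypothetical names chosen for illustration.

```python
import logging
import sys


class NodeAllocationError(RuntimeError):
    """Hypothetical exception type; the real code may use a different one."""


def allocate_nodes(available, requested, group):
    """Return `requested` node names from `available`, or raise with details."""
    if requested > len(available):
        # The message carries enough context to diagnose the failure.
        raise NodeAllocationError(
            f"Requested {requested} nodes from group '{group}', "
            f"but only {len(available)} are available: {sorted(available)}"
        )
    return sorted(available)[:requested]


def allocate_or_exit(available, requested, group="default"):
    """Allocate nodes; on failure, log the details and exit with a non-zero code."""
    try:
        return allocate_nodes(available, requested, group)
    except NodeAllocationError as exc:
        logging.error("Node allocation failed: %s", exc)
        sys.exit(1)
```

Under these assumptions, the caller either gets a node list back or the process ends with exit code 1 after the error has been logged.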

Test Plan

CI passes.


src/cloudai/systems/slurm/slurm_system.py

TaekyungHeo · 2024-10-15T12:59:51Z

Please find the updated code, @amaslenn.

amaslenn

Could you please explain how this scenario will work?

test1 has enough nodes
test2 doesn't have enough nodes
test3 has enough nodes

What will the user see as the result? Which tests will run?

TaekyungHeo · 2024-10-15T13:14:24Z

@amaslenn CloudAI can run test 1, but when it tries to run test 2, it will fail and raise an exception. Test 3 will not run as CloudAI has failed.
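
The scenario can be simulated with a toy script, assuming tests run sequentially and an allocation failure propagates as an exception. `run_tests` and the node counts are made up for illustration and are not part of CloudAI.

```python
def run_tests(tests, free_nodes, completed):
    """Run tests in order; stop at the first one that cannot get its nodes."""
    for name, needed in tests:
        if needed > free_nodes:
            raise RuntimeError(
                f"{name} needs {needed} nodes, only {free_nodes} available"
            )
        completed.append(name)


# test1 and test3 fit on 4 free nodes, test2 does not.
tests = [("test1", 2), ("test2", 8), ("test3", 2)]
completed = []
try:
    run_tests(tests, free_nodes=4, completed=completed)
except RuntimeError as exc:
    print(f"Run aborted: {exc}")

print(f"Completed tests: {completed}")
```

In this sketch only test1 completes: the exception raised for test2 ends the loop before test3 is reached.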

TaekyungHeo · 2024-10-15T13:15:37Z

@amaslenn I recall that we had a similar discussion with Jeff and Srivatsan earlier. This is an important topic, and we should make a decision, but it is outside the scope of this PR.

amaslenn · 2024-10-15T14:14:46Z

..., but it is outside the scope of this PR.

Right, I asked because we currently neither raise an exception nor propagate one, so I wasn't sure the behaviour you described would remain. If you confirm that it does, no issues, we are good.

TaekyungHeo added 2 commits October 15, 2024 06:55

Enhance exception message in allocate_nodes

635f666

Handle node allocation errors gracefully, log details, and exit on failure

06d471e

TaekyungHeo requested review from amaslenn, artemry-nv, srivatsankrishnan and srinivas212 October 15, 2024 12:15

TaekyungHeo added the labels bug (Something isn't working), enhancement (New feature or request), and Oct24 (Oct'24 release feature) Oct 15, 2024

TaekyungHeo marked this pull request as ready for review October 15, 2024 12:16

TaekyungHeo changed the title from "Max bug" to "Handle node allocation errors gracefully, log details, and exit on failure" Oct 15, 2024

amaslenn reviewed Oct 15, 2024


src/cloudai/systems/slurm/slurm_system.py (three review threads, outdated and resolved)

Reflect Andrei's comments

febf0f6

amaslenn reviewed Oct 15, 2024


amaslenn approved these changes Oct 15, 2024


TaekyungHeo merged commit 622141f into NVIDIA:main Oct 15, 2024
2 checks passed
