Handle node allocation errors gracefully, log details, and exit on failure #264
Please find the updated code, @amaslenn.
Could you please explain how this scenario will work?
- test1 has enough nodes
- test2 doesn't have enough nodes
- test3 has enough nodes
What will the user see as the result? Which tests will run?
@amaslenn CloudAI will run test1, but when it tries to run test2, it will fail and raise an exception. test3 will not run because CloudAI has already failed at that point.
@amaslenn I recall that we had a similar discussion with Jeff and Srivatsan earlier. This is an important topic, and we should make a decision, but it is outside the scope of this PR.
Right, I asked this because we do not raise an exception now and do not propagate it, so I wasn't sure that the mentioned behaviour remains. But if you confirm that, no issues, we are good.
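For reference, the three-test scenario discussed in this thread can be sketched roughly as below. This is only an illustration of the described behaviour (run test1, fail on test2 with an exception, never reach test3), not the code in this PR: `PlannedTest`, `NodeAllocationError`, `allocate_nodes`, and `run_all` are hypothetical names, and the node counts are made up.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("allocation_sketch")


class NodeAllocationError(RuntimeError):
    """Hypothetical error: a test asked for more nodes than are available."""


@dataclass
class PlannedTest:
    # Hypothetical stand-in for a test entry; not a CloudAI class.
    name: str
    required_nodes: int


def allocate_nodes(test: PlannedTest, available_nodes: int) -> None:
    """Raise if the test cannot be given enough nodes (assumed check)."""
    if test.required_nodes > available_nodes:
        raise NodeAllocationError(
            f"{test.name}: requested {test.required_nodes} nodes, "
            f"only {available_nodes} available"
        )


def run_all(tests: list[PlannedTest], available_nodes: int) -> tuple[int, list[str]]:
    """Run tests in order; log details and stop at the first allocation failure.

    Returns (exit_code, names_of_tests_that_ran).
    """
    executed: list[str] = []
    try:
        for test in tests:
            allocate_nodes(test, available_nodes)
            executed.append(test.name)  # the real run would happen here
    except NodeAllocationError as exc:
        logger.error("Node allocation failed, exiting: %s", exc)
        return 1, executed
    return 0, executed


scenario = [
    PlannedTest("test1", required_nodes=2),
    PlannedTest("test2", required_nodes=8),  # more than the 4 available
    PlannedTest("test3", required_nodes=2),
]
exit_code, executed = run_all(scenario, available_nodes=4)
```

With these made-up numbers, only test1 ends up in `executed` and the exit code is non-zero; test3 is skipped even though it would have fit. Whether later tests should still run after such a failure is the open question noted above as out of scope for this PR.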
Summary
This is a bug fix for https://redmine.mellanox.com/issues/4100971.
Test Plan
CI passes.