chore: improve logs displayed when scheduling fails #317
New sample error messages:
Pull Request Test Coverage Report for Build 4917124948
💛 - Coveralls
Negligible difference in scheduling performance:
Extremely cool
```go
}
if r.requirementsAndOffering {
	return "no instance type which met the scheduling requirements and the required offering had the required resources"
}
```
I think this evolution of log messaging is definitely an improvement, and it gives more visibility into the decisions karpenter + the k8s scheduler are making.
Would it be more readable if we simply returned a more transparent log message w/ the three dimensions and their true/false values? I like the human-centric structure of a sentence, but phrasing every combination of failures here as a discrete sentence is hard to get right from a clarity-of-language point of view.
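A minimal sketch of the suggested alternative, assuming the scheduler already tracks one bool per dimension. The `filterResults` type, its field names, and the message text are hypothetical and chosen for illustration; they are not Karpenter's actual types.

```go
package main

import "fmt"

// filterResults is a hypothetical stand-in for the scheduler's filtering outcome
// for one pod; each field is true if at least one instance type passed that check.
type filterResults struct {
	requirementsMet bool // some instance type met the pod's scheduling requirements
	fits            bool // some instance type had enough resources for the pod
	hasOffering     bool // some instance type had a required offering
}

// String reports all three dimensions as key=value pairs instead of choosing
// a hand-written sentence for each combination of failures.
func (r filterResults) String() string {
	return fmt.Sprintf("no instance type satisfied all constraints (requirements=%t, resources=%t, offering=%t)",
		r.requirementsMet, r.fits, r.hasOffering)
}

func main() {
	r := filterResults{requirementsMet: true, fits: false, hasOffering: true}
	fmt.Println(r)
}
```

The trade-off is that the message stays the same shape for every combination and is easy to grep, but the reader has to interpret the flags rather than read a sentence.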
Sorry, I merged it before I saw your comment. I did it this way to try to be more informative. With three bools, you would be reporting these 7 possible errors (R = scheduling requirements, F = resources, O = offering; 1 means no instance type passed that check):

| R | F | O | Message |
|---|---|---|---|
| 0 | 0 | 0 | success |
| 0 | 0 | 1 | "no instance type has the required offering" |
| 0 | 1 | 0 | "no instance type has enough resources" |
| 0 | 1 | 1 | "no instance type had enough resources or had a required offering" |
| 1 | 0 | 0 | "no instance type met all requirements" |
| 1 | 0 | 1 | "no instance type met the scheduling requirements or had a required offering" |
| 1 | 1 | 0 | "no instance type met the scheduling requirements or had enough resources" |
| 1 | 1 | 1 | "no instance type met the scheduling requirements or had enough resources or had a required offering" |
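The multi-failure rows in the table follow a regular pattern, so they can be composed from one clause per failed check. The sketch below is hypothetical (`failureMessage` is not a Karpenter function). It reproduces the two- and three-failure messages above exactly, while its single-failure wording differs slightly from the table (e.g. "had" rather than "has").

```go
package main

import (
	"fmt"
	"strings"
)

// failureMessage builds a message from three failure flags, where each flag is
// true when no instance type passed that check. Hypothetical sketch only.
func failureMessage(requirements, resources, offering bool) string {
	var clauses []string
	if requirements {
		clauses = append(clauses, "met the scheduling requirements")
	}
	if resources {
		clauses = append(clauses, "had enough resources")
	}
	if offering {
		clauses = append(clauses, "had a required offering")
	}
	if len(clauses) == 0 {
		return "" // 0 0 0: nothing failed, so there is no error to report
	}
	return "no instance type " + strings.Join(clauses, " or ")
}

func main() {
	// Enumerate all eight R/F/O combinations in the same order as the table.
	for i := 0; i < 8; i++ {
		r, f, o := i&4 != 0, i&2 != 0, i&1 != 0
		fmt.Printf("%d %d %d - %q\n", i>>2&1, i>>1&1, i&1, failureMessage(r, f, o))
	}
}
```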
It currently reports four additional errors:
- "no instance type which met the scheduling requirements and had enough resources, had a required offering"
- "no instance type which had enough resources and the required offering met the scheduling requirements"
- "no instance type which met the scheduling requirements and the required offering had the required resources"
- "no instance type met the requirements/resources/offering tuple"
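Reading the four messages above, they appear to cover cases the three per-check bools cannot express: every check passes for some instance type, but no single instance type passes a particular combination of checks. A minimal sketch of the requirements + resources case follows; `instanceType`, `summarize`, and their fields are hypothetical and only illustrate the idea.

```go
package main

import "fmt"

// instanceType is a hypothetical, simplified view of one candidate instance type
// after each check has been evaluated against a pod.
type instanceType struct {
	name              string
	meetsRequirements bool
	fits              bool
	hasOffering       bool
}

// summarize reports whether any instance type passed each check on its own,
// and whether any single instance type passed requirements and resources together.
func summarize(candidates []instanceType) (anyReq, anyFits, anyReqAndFits bool) {
	for _, it := range candidates {
		anyReq = anyReq || it.meetsRequirements
		anyFits = anyFits || it.fits
		anyReqAndFits = anyReqAndFits || (it.meetsRequirements && it.fits)
	}
	return anyReq, anyFits, anyReqAndFits
}

func main() {
	candidates := []instanceType{
		{name: "a", meetsRequirements: true, fits: false, hasOffering: true},
		{name: "b", meetsRequirements: false, fits: true, hasOffering: true},
	}
	// Each check passes for some instance type, but no single type passes both
	// requirements and resources, so only the pairwise flag explains the failure.
	fmt.Println(summarize(candidates))
}
```

Here the individual flags are both true, so a report built only from them would show nothing wrong; the pairwise flag is what identifies the failure.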
I was trying to err on the side of making it easier to figure out why your pod won't schedule, which is currently challenging :)
(I see this just merged, but I'll keep asking questions :)) Are these scheduling failures evaluated during scale-out? Or are they orthogonal to scale-out/in and evaluated against the nodes (and their capabilities) actually running in the cluster at the time?
Happy to review a follow-on!
They occur during both provisioning and consolidation, but they're not logged during consolidation. The same scheduling code is used for both; consolidation is essentially just:
Fixes #
Description
improve logs displayed when scheduling fails
How was this change tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.