Add backoff data to cluster-autoscaler-status and make the status easier for parsing #6318
I have a few comments on the proposed format:
Good points, thanks Maciek!
I agree, I'll remove nodeCounts from scaleUp.
I am not sure about this one. I don't know why it is named cloudProviderTarget rather than clusterAutoscalerTarget. In any case, minSize and maxSize are the min and max size of the autoscaler, not of the node group. For example, if the autoscaler's min size is 3, but the user manually removes nodes until the count is 2 and utilization is still low, the autoscaler won't scale up to 3, so the size can end up smaller than the autoscaler's min size (see the sketch below).
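A minimal sketch of that scenario, assuming the per-node-group fields proposed in this thread (the group name and values are made up):

```yaml
exampleNodeGroup:          # hypothetical node group
  cloudProviderTarget: 2   # user manually removed a node; the autoscaler won't scale back up
  minSize: 3               # the autoscaler's configured minimum, not a hard floor on actual size
  maxSize: 10
```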
I agree, I'll do that.
I still don't think we should connect cloudProviderTarget and min/max any more than we connect nodeCounts and min/max. This is undocumented, so there is no way to prove it, but even with the old configmap I think the min/max in parentheses was meant to be read as a comment on the entire line that is now represented as nodeCounts, not as something specifically related to cloudProviderTarget. It just so happened that cloudProviderTarget was the last field (see the sketch of the old format below). Regardless of the above:
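For reference, the old single-line format looked roughly like this (reconstructed from memory, so treat the exact field list as an approximation; note the min/max in parentheses trailing the whole line):

```
Health: Healthy (ready=7 unready=2 notStarted=1 registered=10 longUnregistered=0 cloudProviderTarget=10 (minSize=3, maxSize=10))
```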
Actually this discussion makes me think about whether we should nest resourceUnready under unready, or even nest a bunch of stuff under registered. Technically the correct representation would be:
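A plausible sketch of that nesting, inferred from the aggregation rules described in the next paragraph (field names and values are illustrative, not the actual proposal):

```yaml
nodeCounts:
  registered:
    total: 10              # registered = ready + unready + notStarted
    ready: 7
    unready:
      total: 2
      resourceUnready: 1   # already a subset of unready, so not added to the total again
    notStarted: 1
  longUnregistered: 0      # tracked outside registered
```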
This actually illustrates how the different states aggregate. Registered should equal ready + unready + notStarted. ResourceUnready nodes are already counted as unready, and including them in the calculation would lead to double counting (and longUnregistered nodes are obviously not part of registered). @x13n @towca @BigDarkClown opinions?
Thanks a lot @MaciekPytel, this is very helpful to know!
The reason why
@MaciekPytel @x13n @towca I found that "deleted" is counted towards the registered count. WDYT of adding it as well? It was never printed before, but I think it can be useful.
Also, WDYT of adding "unregistered", which wasn't printed before either?
/assign @walidghallab |
+1 to printing both unregistered and deleted. Especially deleted; no idea why we haven't been printing it already. I assume it was just an omission and not an intentional decision.
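A hypothetical extension of the nesting sketch above showing where the two new counts could sit; the naming and placement are assumptions, not a confirmed schema:

```yaml
nodeCounts:
  registered:
    total: 10
    ready: 7
    unready:
      total: 2
      resourceUnready: 1
    notStarted: 1
    deleted: 0         # counted towards registered, per the finding above
  longUnregistered: 0
  unregistered: 0      # presumably nodes the cloud provider reports that have not registered yet
```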
Thanks @MaciekPytel |
Which component are you using?:
cluster-autoscaler
The problem this feature should solve:
The cluster-autoscaler-status config map shows which node groups are in backoff, but it doesn't show the reasons behind the backoff, which makes debugging harder, especially for large clusters.
Also, the status field of cluster-autoscaler doesn't conform to a format that is easy to parse.
Describe the solution you'd like:
Add the backoff reason (errorCode & errorMessage) for each node group in backoff to the status field of the cluster-autoscaler-status config map.
Also, change the status field from the current free-form text to YAML, which is both human-readable and easier to parse.
Example of the status field:
Current state (human-readable string, without backoff data):
Target state (YAML, with backoff data):
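A rough, non-authoritative sketch of what the proposed YAML could look like; the node group name, error code, and message are invented for illustration:

```yaml
time: "2023-11-28 10:00:00 +0000 UTC"
nodeGroups:
  - name: example-node-group            # hypothetical
    health:
      status: Healthy
      nodeCounts:
        registered:
          total: 3
          ready: 3
          notStarted: 0
        longUnregistered: 0
      cloudProviderTarget: 3
      minSize: 1
      maxSize: 10
    scaleUp:
      status: Backoff
      backoffInfo:
        errorCode: "QUOTA_EXCEEDED"     # hypothetical error code from the cloud provider
        errorMessage: "Instance quota exceeded in the zone"   # hypothetical message
```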