Convert status in cluster-autoscaler-status to yaml and add error info for scale-up backoff #6375
Conversation
/uncc @feiskyer
force-pushed from 13faa7b to 802c17b
I'm not going to block review on this, since this is mostly the preexisting state. But the unit tests for configmap are very old and, frankly, pretty terrible. Would you mind rewriting them as table-based tests? Your changes completely rework this part of the code, so it's a good opportunity to refresh the tests. Good test coverage would also help make sure we're not introducing any regressions.
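For what it's worth, a minimal sketch of the table-based style (the fakeStatus type, the expected substrings, and the use of sigs.k8s.io/yaml are illustrative stand-ins here, not the actual configmap test code):

```go
package config

import (
	"strings"
	"testing"

	"sigs.k8s.io/yaml"
)

// fakeStatus is a stand-in for the real status type; fields are assumed.
type fakeStatus struct {
	AutoscalerStatus string `json:"autoscalerStatus"`
	ErrorCode        string `json:"errorCode,omitempty"`
}

func TestStatusYAML(t *testing.T) {
	testCases := []struct {
		name       string
		status     fakeStatus
		wantSubstr string
	}{
		{
			name:       "running status is serialized",
			status:     fakeStatus{AutoscalerStatus: "Running"},
			wantSubstr: "autoscalerStatus: Running",
		},
		{
			name:       "backoff error code surfaces in yaml",
			status:     fakeStatus{AutoscalerStatus: "Running", ErrorCode: "QUOTA_EXCEEDED"},
			wantSubstr: "errorCode: QUOTA_EXCEEDED",
		},
	}
	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			out, err := yaml.Marshal(tc.status)
			if err != nil {
				t.Fatalf("yaml.Marshal: %v", err)
			}
			if !strings.Contains(string(out), tc.wantSubstr) {
				t.Errorf("yaml output missing %q:\n%s", tc.wantSubstr, out)
			}
		})
	}
}
```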
Sorry, one more comment - would you mind posting an example yaml obtained via
force-pushed from d2603e5 to 4c82384
Thanks a lot @MaciekPytel for the review!
/assign @MaciekPytel

Output for:

apiVersion: v1
data:
  status: |
    time: 2023-12-20 11:05:35.373518676 +0000 UTC
    autoscalerStatus: Running
    clusterWide:
      health:
        status: Healthy
        nodeCounts:
          registered:
            total: 7
            ready: 7
            notStarted: 0
          longUnregistered: 0
          unregistered: 0
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
        lastTransitionTime: "2023-12-20T10:46:22.549253146Z"
      scaleUp:
        status: NoActivity
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
        lastTransitionTime: "2023-12-20T11:03:21.319836881Z"
      scaleDown:
        status: NoCandidates
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
    nodeGroups:
    - name: https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp
      health:
        status: Healthy
        nodeCounts:
          registered:
            total: 7
            ready: 7
            notStarted: 0
          longUnregistered: 0
          unregistered: 0
        cloudProviderTarget: 7
        minSize: 4
        maxSize: 10
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
        lastTransitionTime: "2023-12-20T10:46:22.549253146Z"
      scaleUp:
        status: Backoff
        backoffInfo:
          errorCode: QUOTA_EXCEEDED
          errorMessage: 'Instance ''gke-sample-cluster-default-pool-40ce0341-b82s'' creation
            failed: Quota ''CPUS'' exceeded. Limit: 57.0 in region us-central1., Instance
            ''gke-sample-cluster-default-pool-40ce0341-n8d6'' creation failed: Quota ''CPUS''
            exceeded. Limit: 57.0 in region us-central1., Instance ''gke-sample-cluster-default-pool-40ce0341-s5tz''
            creation failed: Quota ''CPUS'' exceeded. Limit: 57.0 in region us-central1.'
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
        lastTransitionTime: "2023-12-20T11:03:21.319836881Z"
      scaleDown:
        status: NoCandidates
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
kind: ConfigMap
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/last-updated: 2023-12-20 11:05:35.373518676 +0000
      UTC
  creationTimestamp: "2023-12-20T10:46:10Z"
  name: cluster-autoscaler-status
  namespace: kube-system
  resourceVersion: "22056994"
  uid: 2be11b59-cf5c-4bd5-bb31-b7baf133bcbc

I changed the project name in the messages above. I will refactor the config map tests in a subsequent PR, since they are preexisting state and out of scope for this PR.
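Since the status is now plain YAML, it can also be consumed programmatically. A minimal sketch, assuming client-go and sigs.k8s.io/yaml (the struct below mirrors only a small subset of the fields in the example above and is not the actual cluster-autoscaler API type):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

// statusSubset mirrors a few fields of the YAML status for illustration.
type statusSubset struct {
	AutoscalerStatus string `json:"autoscalerStatus"`
	ClusterWide      struct {
		Health struct {
			Status string `json:"status"`
		} `json:"health"`
	} `json:"clusterWide"`
}

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Fetch the status ConfigMap and unmarshal its "status" key.
	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(
		context.TODO(), "cluster-autoscaler-status", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	var s statusSubset
	if err := yaml.Unmarshal([]byte(cm.Data["status"]), &s); err != nil {
		panic(err)
	}
	fmt.Println(s.AutoscalerStatus, s.ClusterWide.Health.Status)
}
```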
/hold
Holding submission to make sure this comment is addressed before the bot submits automatically.
/unhold
The comment has been addressed.
/unhold
/lgtm
/approve
I left a bunch of non-blocking comments for potential discussion / follow-up. Let's not fix them in this PR.
/hold
For the one comment that I see as blocking - the one about using "Initializing" instead of a const in actionable_cluster_processor. Please feel free to remove the hold as soon as you address it.
@@ -839,130 +846,114 @@ func buildScaleUpStatusNodeGroup(isScaleUpInProgress bool, scaleUpSafety NodeGro
		condition.Status = api.ClusterAutoscalerUnhealthy
	} else if !scaleUpSafety.SafeToScale {
		condition.Status = api.ClusterAutoscalerBackoff
		condition.BackoffInfo = api.BackoffInfo{
			ErrorCode:    scaleUpSafety.BackoffStatus.ErrorInfo.ErrorCode,
			ErrorMessage: scaleUpSafety.BackoffStatus.ErrorInfo.ErrorMessage,
I wonder if we should maybe trim this to some maximum length? The max size of a configmap is 1MB, and that's not much if you have a few hundred nodegroups.
A few hundred characters would likely be a reasonable limit for an error message?
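(Rough arithmetic: at, say, 300 node groups, the 1MB ConfigMap budget works out to roughly 3KB of status per group, so a single unbounded multi-instance error string like the quota message above could consume a large share of that budget.)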
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually don't print all the error messages - only the first three error messages in each node group are printed.
They are also the same error messages used in events.
I added truncation logic for extra safety, PTAL.
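For reference, a minimal sketch of what such truncation can look like (the 500-character cap and the helper name are illustrative, not necessarily the exact logic in this PR):

```go
// Illustrative cap: keeps one node group's status entry small even when the
// cloud provider returns verbose, concatenated error messages.
const maxErrorMessageLen = 500

// truncateErrorMessage returns msg unchanged if it fits within the cap,
// otherwise cuts it off and appends an ellipsis marker.
func truncateErrorMessage(msg string) string {
	if len(msg) <= maxErrorMessageLen {
		return msg
	}
	return msg[:maxErrorMessageLen] + " ..."
}
```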
cluster-autoscaler/processors/actionablecluster/actionable_cluster_processor.go
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: MaciekPytel, walidghallab
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
…o for backoff and more node counts. Change-Id: Ic68e0d67b7ce9912b605b6c0a3356b4d0e177911
force-pushed from 447eced to 6fe5a73
…roup. Max size of configmap is 1MB. Change-Id: I615d25781e4f8dafb6a08f752c085544bcd49e5a
/lgtm
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR changes the status field in the data of the cluster-autoscaler-status config map to YAML, making it easier to parse.
It also adds error info for scale-up backoff, so users can see the reason for the backoff, which makes debugging easier.
Which issue(s) this PR fixes:
Fixes #6318
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: