Convert status in cluster-autoscaler-status to yaml and add error info for scale-up backoff #6375

Merged
merged 2 commits into kubernetes:master on Dec 29, 2023

Conversation


@walidghallab commented Dec 15, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR changes the status field in the data of the cluster-autoscaler-status config map to YAML, to make it easier to parse.
It also adds error info for scale-up backoff, so that users can see the reason for the backoff, making it easier to debug.
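
For illustration only (not part of this PR): a minimal sketch of how a client could consume the new YAML status, assuming the field names shown in the example output later in this conversation. The struct below is a hand-written subset for the example, not the actual types from cluster-autoscaler/clusterstate/api.

```go
package main

import (
	"fmt"
	"os/exec"

	"sigs.k8s.io/yaml"
)

// statusSubset is an illustrative subset of the fields shown in the example
// output below; the real types live in cluster-autoscaler/clusterstate/api.
type statusSubset struct {
	AutoscalerStatus string `json:"autoscalerStatus"`
	NodeGroups       []struct {
		Name    string `json:"name"`
		ScaleUp struct {
			Status      string `json:"status"`
			BackoffInfo struct {
				ErrorCode    string `json:"errorCode"`
				ErrorMessage string `json:"errorMessage"`
			} `json:"backoffInfo"`
		} `json:"scaleUp"`
	} `json:"nodeGroups"`
}

func main() {
	// Read the raw YAML status out of the configmap.
	raw, err := exec.Command("kubectl", "get", "configmap", "cluster-autoscaler-status",
		"--namespace=kube-system", "-o", "jsonpath={.data.status}").Output()
	if err != nil {
		panic(err)
	}
	var s statusSubset
	if err := yaml.Unmarshal(raw, &s); err != nil {
		panic(err)
	}
	fmt.Println("autoscaler status:", s.AutoscalerStatus)
	for _, ng := range s.NodeGroups {
		if ng.ScaleUp.Status == "Backoff" {
			fmt.Printf("%s is backing off: %s: %s\n",
				ng.Name, ng.ScaleUp.BackoffInfo.ErrorCode, ng.ScaleUp.BackoffInfo.ErrorMessage)
		}
	}
}
```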

Which issue(s) this PR fixes:

Fixes #6318

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Changed the status value in the data of the cluster-autoscaler-status config map to YAML format for easier parsing.

Added error information for backoff status in scale-up.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the kind/feature, cncf-cla: yes, size/XL, and area/cluster-autoscaler labels Dec 15, 2023
@walidghallab
Contributor Author

/uncc @feiskyer
/uncc @BigDarkClown

@walidghallab marked this pull request as draft December 15, 2023 00:04
@k8s-ci-robot added the do-not-merge/work-in-progress label Dec 15, 2023
@walidghallab
Contributor Author

/cc @towca
/assign @towca

@walidghallab marked this pull request as ready for review December 15, 2023 12:25
@k8s-ci-robot removed the do-not-merge/work-in-progress label Dec 15, 2023

@MaciekPytel left a comment


I'm not going to block review on this, since this is mostly pre-existing state. But the unit tests for the configmap are very old and, frankly, pretty terrible. Would you mind rewriting them as table-based tests? Your changes completely rework this part of the code, so it's a good opportunity to refresh the tests. Good test coverage would also help make sure we're not introducing any regressions.
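
For reference, a table-driven test in Go generally follows the skeleton below. The function under test here is a stand-in defined in the snippet so the example is self-contained; it is not the actual configmap test code.

```go
package clusterstate_test

import (
	"fmt"
	"testing"
)

// formatStatus is a stand-in for whatever function renders the status string;
// it exists only so this table-driven skeleton compiles and runs on its own.
func formatStatus(status string, nodeGroups int) string {
	return fmt.Sprintf("%s (%d node groups)", status, nodeGroups)
}

func TestFormatStatus(t *testing.T) {
	testCases := []struct {
		name       string
		status     string
		nodeGroups int
		want       string
	}{
		{name: "running with no node groups", status: "Running", nodeGroups: 0, want: "Running (0 node groups)"},
		{name: "initializing", status: "Initializing", nodeGroups: 2, want: "Initializing (2 node groups)"},
	}
	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			if got := formatStatus(tc.status, tc.nodeGroups); got != tc.want {
				t.Errorf("formatStatus(%q, %d) = %q, want %q", tc.status, tc.nodeGroups, got, tc.want)
			}
		})
	}
}
```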

@MaciekPytel
Contributor

Sorry, one more comment - would you mind posting an example of the yaml obtained via kubectl describe configmap or similar? Given that the primary use case is, I think, manual debugging, it's important to make sure the yaml is pretty-printed. Since your unit test does a YAML comparison, I don't actually know whether the real output is formatted in a similar way to your example file.

@walidghallab
Contributor Author

walidghallab commented Dec 20, 2023

Thanks a lot @MaciekPytel for the review!

/assign @MaciekPytel
/cc @MaciekPytel

Output for kubectl describe configmap cluster-autoscaler-status --namespace=kube-system

Name:         cluster-autoscaler-status
Namespace:    kube-system
Labels:       <none>
Annotations:  cluster-autoscaler.kubernetes.io/last-updated: 2023-12-20 11:04:54.684568722 +0000 UTC

Data
====
status:
----
time: 2023-12-20 11:04:54.684568722 +0000 UTC
autoscalerStatus: Running
clusterWide:
  health:
    status: Healthy
    nodeCounts:
      registered:
        total: 7
        ready: 7
        notStarted: 0
      longUnregistered: 0
      unregistered: 0
    lastProbeTime: "2023-12-20T11:04:54.684568722Z"
    lastTransitionTime: "2023-12-20T10:46:22.549253146Z"
  scaleUp:
    status: NoActivity
    lastProbeTime: "2023-12-20T11:04:54.684568722Z"
    lastTransitionTime: "2023-12-20T11:03:21.319836881Z"
  scaleDown:
    status: NoCandidates
    lastProbeTime: "2023-12-20T11:04:54.684568722Z"
nodeGroups:
- name: https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp
  health:
    status: Healthy
    nodeCounts:
      registered:
        total: 7
        ready: 7
        notStarted: 0
      longUnregistered: 0
      unregistered: 0
    cloudProviderTarget: 7
    minSize: 4
    maxSize: 10
    lastProbeTime: "2023-12-20T11:04:54.684568722Z"
    lastTransitionTime: "2023-12-20T10:46:22.549253146Z"
  scaleUp:
    status: Backoff
    backoffInfo:
      errorCode: QUOTA_EXCEEDED
      errorMessage: 'Instance ''gke-sample-cluster-default-pool-40ce0341-b82s'' creation
        failed: Quota ''CPUS'' exceeded.  Limit: 57.0 in region us-central1., Instance
        ''gke-sample-cluster-default-pool-40ce0341-n8d6'' creation failed: Quota ''CPUS''
        exceeded.  Limit: 57.0 in region us-central1., Instance ''gke-sample-cluster-default-pool-40ce0341-s5tz''
        creation failed: Quota ''CPUS'' exceeded.  Limit: 57.0 in region us-central1.'
    lastProbeTime: "2023-12-20T11:04:54.684568722Z"
    lastTransitionTime: "2023-12-20T11:03:21.319836881Z"
  scaleDown:
    status: NoCandidates
    lastProbeTime: "2023-12-20T11:04:54.684568722Z"


BinaryData
====

Events:
  Type     Reason         Age                 From                Message
  ----     ------         ----                ----                -------
  Warning  ScaleUpFailed  18m                 cluster-autoscaler  Failed adding 1 nodes to group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp due to OutOfResource.QUOTA_EXCEEDED; source errors: Instance 'gke-sample-cluster-default-pool-40ce0341-b9kr' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1.
  Warning  ScaleUpFailed  17m                 cluster-autoscaler  Failed adding 2 nodes to group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp due to OutOfResource.QUOTA_EXCEEDED; source errors: Instance 'gke-sample-cluster-default-pool-40ce0341-dh9k' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-swdz' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1.
  Warning  ScaleUpFailed  12m                 cluster-autoscaler  Failed adding 1 nodes to group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp due to OutOfResource.QUOTA_EXCEEDED; source errors: Instance 'gke-sample-cluster-default-pool-40ce0341-xm2m' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1.
  Warning  ScaleUpFailed  12m                 cluster-autoscaler  Failed adding 1 nodes to group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp due to OutOfResource.QUOTA_EXCEEDED; source errors: Instance 'gke-sample-cluster-default-pool-40ce0341-vgbp' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1.
  Warning  ScaleUpFailed  12m                 cluster-autoscaler  Failed adding 1 nodes to group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp due to OutOfResource.QUOTA_EXCEEDED; source errors: Instance 'gke-sample-cluster-default-pool-40ce0341-xf5d' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1.
  Normal   ScaledUpGroup  2m8s (x3 over 18m)  cluster-autoscaler  Scale-up: setting group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp size to 10 instead of 7 (max: 10)
  Normal   ScaledUpGroup  2m6s (x3 over 18m)  cluster-autoscaler  Scale-up: group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp size set to 10 instead of 7 (max: 10)
  Warning  ScaleUpFailed  95s                 cluster-autoscaler  Failed adding 3 nodes to group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp due to OutOfResource.QUOTA_EXCEEDED; source errors: Instance 'gke-sample-cluster-default-pool-40ce0341-b82s' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-n8d6' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-s5tz' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1.

Output for kubectl get configmap cluster-autoscaler-status --namespace=kube-system -o yaml

apiVersion: v1
data:
  status: |
    time: 2023-12-20 11:05:35.373518676 +0000 UTC
    autoscalerStatus: Running
    clusterWide:
      health:
        status: Healthy
        nodeCounts:
          registered:
            total: 7
            ready: 7
            notStarted: 0
          longUnregistered: 0
          unregistered: 0
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
        lastTransitionTime: "2023-12-20T10:46:22.549253146Z"
      scaleUp:
        status: NoActivity
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
        lastTransitionTime: "2023-12-20T11:03:21.319836881Z"
      scaleDown:
        status: NoCandidates
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
    nodeGroups:
    - name: https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp
      health:
        status: Healthy
        nodeCounts:
          registered:
            total: 7
            ready: 7
            notStarted: 0
          longUnregistered: 0
          unregistered: 0
        cloudProviderTarget: 7
        minSize: 4
        maxSize: 10
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
        lastTransitionTime: "2023-12-20T10:46:22.549253146Z"
      scaleUp:
        status: Backoff
        backoffInfo:
          errorCode: QUOTA_EXCEEDED
          errorMessage: 'Instance ''gke-sample-cluster-default-pool-40ce0341-b82s'' creation
            failed: Quota ''CPUS'' exceeded.  Limit: 57.0 in region us-central1., Instance
            ''gke-sample-cluster-default-pool-40ce0341-n8d6'' creation failed: Quota ''CPUS''
            exceeded.  Limit: 57.0 in region us-central1., Instance ''gke-sample-cluster-default-pool-40ce0341-s5tz''
            creation failed: Quota ''CPUS'' exceeded.  Limit: 57.0 in region us-central1.'
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
        lastTransitionTime: "2023-12-20T11:03:21.319836881Z"
      scaleDown:
        status: NoCandidates
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
kind: ConfigMap
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/last-updated: 2023-12-20 11:05:35.373518676 +0000
      UTC
  creationTimestamp: "2023-12-20T10:46:10Z"
  name: cluster-autoscaler-status
  namespace: kube-system
  resourceVersion: "22056994"
  uid: 2be11b59-cf5c-4bd5-bb31-b7baf133bcbc

I changed the project name in the messages above.

I will refactor the config map tests in a subsequent PR, since they are pre-existing and out of scope for this PR.

@walidghallab
Contributor Author

/hold Holding submission to make sure this comment is addressed before the bot submits automatically.

@k8s-ci-robot added the do-not-merge/hold label Dec 28, 2023
@walidghallab
Contributor Author

walidghallab commented Dec 28, 2023

/unhold The comment has been addressed.
I've added a new field beingDeleted with the omitDefault tag, so that it is not displayed when it is 0 (which is the case most of the time).
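
A sketch of the idea, not the exact code from the PR: the comment above mentions an omitDefault tag; the snippet below illustrates the same omit-when-zero behaviour using the standard omitempty json tag, which sigs.k8s.io/yaml honours. Field names follow the example output earlier in this thread; the real definitions live in cluster-autoscaler/clusterstate/api/types.go.

```go
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

// registeredNodeCount is illustrative only. sigs.k8s.io/yaml marshals via the
// json tags, so omitempty drops BeingDeleted from the output when it is 0.
type registeredNodeCount struct {
	Total        int `json:"total"`
	Ready        int `json:"ready"`
	NotStarted   int `json:"notStarted"`
	BeingDeleted int `json:"beingDeleted,omitempty"`
}

func main() {
	out, err := yaml.Marshal(registeredNodeCount{Total: 7, Ready: 7})
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out)) // beingDeleted is absent because it is 0
}
```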

@walidghallab
Contributor Author

/unhold

@k8s-ci-robot removed the do-not-merge/hold label Dec 28, 2023

@MaciekPytel left a comment


/lgtm
/approve
I left a bunch of non-blocking comments for a potential discussion / follow-up. Let's not fix them on this PR.

/hold
For fixing one comment that I see as blocking - the one about using the "Initializing" literal instead of a const in actionable_cluster_processor. Please feel free to remove the hold as soon as you address this one.

@@ -839,130 +846,114 @@ func buildScaleUpStatusNodeGroup(isScaleUpInProgress bool, scaleUpSafety NodeGro
		condition.Status = api.ClusterAutoscalerUnhealthy
	} else if !scaleUpSafety.SafeToScale {
		condition.Status = api.ClusterAutoscalerBackoff
		condition.BackoffInfo = api.BackoffInfo{
			ErrorCode:    scaleUpSafety.BackoffStatus.ErrorInfo.ErrorCode,
			ErrorMessage: scaleUpSafety.BackoffStatus.ErrorInfo.ErrorMessage,
Contributor


I wonder if we should maybe trim this to some maximum length? The max size of a configmap is 1MB, and that's not much if you have a few hundred node groups.
A few hundred characters would likely be a reasonable limit for an error message?

Contributor Author


I actually don't print all the error messages. Only the first three error messages in each node group are printed.
They are also the same error messages used in events.

I added truncation logic for extra safety, PTAL.
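
An illustrative sketch of the kind of truncation described here; the function name, the limit, and the "<truncated>" marker are made up for this example and are not the exact values in the PR.

```go
package main

import "fmt"

// truncateMessage caps an error message at maxLen bytes, replacing the tail
// with a marker. Slicing is byte-based, which is fine for the ASCII error
// messages shown in this thread.
func truncateMessage(msg string, maxLen int) string {
	const marker = " <truncated>"
	if len(msg) <= maxLen {
		return msg
	}
	if maxLen <= len(marker) {
		return msg[:maxLen]
	}
	return msg[:maxLen-len(marker)] + marker
}

func main() {
	msg := "Instance creation failed: Quota 'CPUS' exceeded. Limit: 57.0 in region us-central1."
	fmt.Println(truncateMessage(msg, 60))
}
```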

@k8s-ci-robot added the do-not-merge/hold and lgtm labels Dec 28, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel, walidghallab

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Dec 28, 2023
@k8s-ci-robot added the size/XXL label and removed the lgtm and size/XL labels Dec 28, 2023
…o for backoff and more node counts.

Change-Id: Ic68e0d67b7ce9912b605b6c0a3356b4d0e177911
…roup.

Max size of configmap is 1MB.

Change-Id: I615d25781e4f8dafb6a08f752c085544bcd49e5a
@MaciekPytel
Contributor

/lgtm
/hold cancel

@k8s-ci-robot added the lgtm label and removed the do-not-merge/hold label Dec 29, 2023
@k8s-ci-robot merged commit c7ad47b into kubernetes:master Dec 29, 2023
6 checks passed