Convert status in cluster-autoscaler-status to yaml and add error info for scale-up backoff #6375

Merged
merged 2 commits into kubernetes:master on Dec 29, 2023

Conversation


@walidghallab commented Dec 15, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR changes the status field in the data of the cluster-autoscaler-status config map to YAML, to make it easier to parse.
It also adds error info for scale-up backoff, so that users can see the reason for the backoff, making it easier to debug.
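
For illustration only (not part of this PR): a minimal sketch of how a client could consume the new YAML status, assuming the field names shown in the example output later in this conversation. The struct below is a hand-written subset for the example, not the actual types from cluster-autoscaler/clusterstate/api.

```go
package main

import (
	"fmt"
	"os/exec"

	"sigs.k8s.io/yaml"
)

// statusSubset is an illustrative subset of the fields shown in the example
// output below; the real types live in cluster-autoscaler/clusterstate/api.
type statusSubset struct {
	AutoscalerStatus string `json:"autoscalerStatus"`
	NodeGroups       []struct {
		Name    string `json:"name"`
		ScaleUp struct {
			Status      string `json:"status"`
			BackoffInfo struct {
				ErrorCode    string `json:"errorCode"`
				ErrorMessage string `json:"errorMessage"`
			} `json:"backoffInfo"`
		} `json:"scaleUp"`
	} `json:"nodeGroups"`
}

func main() {
	// Read the raw YAML status out of the configmap.
	raw, err := exec.Command("kubectl", "get", "configmap", "cluster-autoscaler-status",
		"--namespace=kube-system", "-o", "jsonpath={.data.status}").Output()
	if err != nil {
		panic(err)
	}
	var s statusSubset
	if err := yaml.Unmarshal(raw, &s); err != nil {
		panic(err)
	}
	fmt.Println("autoscaler status:", s.AutoscalerStatus)
	for _, ng := range s.NodeGroups {
		if ng.ScaleUp.Status == "Backoff" {
			fmt.Printf("%s is backing off: %s: %s\n",
				ng.Name, ng.ScaleUp.BackoffInfo.ErrorCode, ng.ScaleUp.BackoffInfo.ErrorMessage)
		}
	}
}
```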

Which issue(s) this PR fixes:

Fixes #6318

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Changed the status value in the data of the cluster-autoscaler-status config map to YAML format for easier parsing.

Added error information for backoff status in scale-up.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the kind/feature, cncf-cla: yes, size/XL, and area/cluster-autoscaler labels Dec 15, 2023
@walidghallab
Contributor Author

/uncc @feiskyer
/uncc @BigDarkClown

@walidghallab marked this pull request as draft December 15, 2023 00:04
@k8s-ci-robot added the do-not-merge/work-in-progress label Dec 15, 2023
@walidghallab
Contributor Author

/cc @towca
/assign @towca

@walidghallab marked this pull request as ready for review December 15, 2023 12:25
@k8s-ci-robot removed the do-not-merge/work-in-progress label Dec 15, 2023

@MaciekPytel left a comment


I'm not going to block review on this, since this is mostly pre-existing state. But the unit tests for the configmap are very old and, frankly, pretty terrible. Would you mind rewriting them as table-based tests? Your changes completely rework this part of the code, so it's a good opportunity to refresh the tests. Good test coverage would also help make sure we're not introducing any regressions.
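
For reference, a table-driven test in Go generally follows the skeleton below. The function under test here is a stand-in defined in the snippet so the example is self-contained; it is not the actual configmap test code.

```go
package clusterstate_test

import (
	"fmt"
	"testing"
)

// formatStatus is a stand-in for whatever function renders the status string;
// it exists only so this table-driven skeleton compiles and runs on its own.
func formatStatus(status string, nodeGroups int) string {
	return fmt.Sprintf("%s (%d node groups)", status, nodeGroups)
}

func TestFormatStatus(t *testing.T) {
	testCases := []struct {
		name       string
		status     string
		nodeGroups int
		want       string
	}{
		{name: "running with no node groups", status: "Running", nodeGroups: 0, want: "Running (0 node groups)"},
		{name: "initializing", status: "Initializing", nodeGroups: 2, want: "Initializing (2 node groups)"},
	}
	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			if got := formatStatus(tc.status, tc.nodeGroups); got != tc.want {
				t.Errorf("formatStatus(%q, %d) = %q, want %q", tc.status, tc.nodeGroups, got, tc.want)
			}
		})
	}
}
```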

@MaciekPytel
Contributor

Sorry, one more comment - would you mind posting an example of the yaml obtained via kubectl describe configmap or similar? Given that the primary use case is, I think, manual debugging, it's important to make sure the yaml is pretty-printed. Since your unit test does a YAML comparison, I don't actually know whether the real output is formatted in a similar way to your example file.

@walidghallab
Contributor Author

walidghallab commented Dec 20, 2023

Thanks a lot @MaciekPytel for the review!

/assign @MaciekPytel
/cc @MaciekPytel

Output for kubectl describe configmap cluster-autoscaler-status --namespace=kube-system

Name:         cluster-autoscaler-status
Namespace:    kube-system
Labels:       <none>
Annotations:  cluster-autoscaler.kubernetes.io/last-updated: 2023-12-20 11:04:54.684568722 +0000 UTC

Data
====
status:
----
time: 2023-12-20 11:04:54.684568722 +0000 UTC
autoscalerStatus: Running
clusterWide:
  health:
    status: Healthy
    nodeCounts:
      registered:
        total: 7
        ready: 7
        notStarted: 0
      longUnregistered: 0
      unregistered: 0
    lastProbeTime: "2023-12-20T11:04:54.684568722Z"
    lastTransitionTime: "2023-12-20T10:46:22.549253146Z"
  scaleUp:
    status: NoActivity
    lastProbeTime: "2023-12-20T11:04:54.684568722Z"
    lastTransitionTime: "2023-12-20T11:03:21.319836881Z"
  scaleDown:
    status: NoCandidates
    lastProbeTime: "2023-12-20T11:04:54.684568722Z"
nodeGroups:
- name: https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp
  health:
    status: Healthy
    nodeCounts:
      registered:
        total: 7
        ready: 7
        notStarted: 0
      longUnregistered: 0
      unregistered: 0
    cloudProviderTarget: 7
    minSize: 4
    maxSize: 10
    lastProbeTime: "2023-12-20T11:04:54.684568722Z"
    lastTransitionTime: "2023-12-20T10:46:22.549253146Z"
  scaleUp:
    status: Backoff
    backoffInfo:
      errorCode: QUOTA_EXCEEDED
      errorMessage: 'Instance ''gke-sample-cluster-default-pool-40ce0341-b82s'' creation
        failed: Quota ''CPUS'' exceeded.  Limit: 57.0 in region us-central1., Instance
        ''gke-sample-cluster-default-pool-40ce0341-n8d6'' creation failed: Quota ''CPUS''
        exceeded.  Limit: 57.0 in region us-central1., Instance ''gke-sample-cluster-default-pool-40ce0341-s5tz''
        creation failed: Quota ''CPUS'' exceeded.  Limit: 57.0 in region us-central1.'
    lastProbeTime: "2023-12-20T11:04:54.684568722Z"
    lastTransitionTime: "2023-12-20T11:03:21.319836881Z"
  scaleDown:
    status: NoCandidates
    lastProbeTime: "2023-12-20T11:04:54.684568722Z"


BinaryData
====

Events:
  Type     Reason         Age                 From                Message
  ----     ------         ----                ----                -------
  Warning  ScaleUpFailed  18m                 cluster-autoscaler  Failed adding 1 nodes to group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp due to OutOfResource.QUOTA_EXCEEDED; source errors: Instance 'gke-sample-cluster-default-pool-40ce0341-b9kr' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1.
  Warning  ScaleUpFailed  17m                 cluster-autoscaler  Failed adding 2 nodes to group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp due to OutOfResource.QUOTA_EXCEEDED; source errors: Instance 'gke-sample-cluster-default-pool-40ce0341-dh9k' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-swdz' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1.
  Warning  ScaleUpFailed  12m                 cluster-autoscaler  Failed adding 1 nodes to group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp due to OutOfResource.QUOTA_EXCEEDED; source errors: Instance 'gke-sample-cluster-default-pool-40ce0341-xm2m' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1.
  Warning  ScaleUpFailed  12m                 cluster-autoscaler  Failed adding 1 nodes to group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp due to OutOfResource.QUOTA_EXCEEDED; source errors: Instance 'gke-sample-cluster-default-pool-40ce0341-vgbp' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1.
  Warning  ScaleUpFailed  12m                 cluster-autoscaler  Failed adding 1 nodes to group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp due to OutOfResource.QUOTA_EXCEEDED; source errors: Instance 'gke-sample-cluster-default-pool-40ce0341-xf5d' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1.
  Normal   ScaledUpGroup  2m8s (x3 over 18m)  cluster-autoscaler  Scale-up: setting group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp size to 10 instead of 7 (max: 10)
  Normal   ScaledUpGroup  2m6s (x3 over 18m)  cluster-autoscaler  Scale-up: group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp size set to 10 instead of 7 (max: 10)
  Warning  ScaleUpFailed  95s                 cluster-autoscaler  Failed adding 3 nodes to group https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp due to OutOfResource.QUOTA_EXCEEDED; source errors: Instance 'gke-sample-cluster-default-pool-40ce0341-b82s' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-n8d6' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-s5tz' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1.

Output for kubectl get configmap cluster-autoscaler-status --namespace=kube-system -o yaml

apiVersion: v1
data:
  status: |
    time: 2023-12-20 11:05:35.373518676 +0000 UTC
    autoscalerStatus: Running
    clusterWide:
      health:
        status: Healthy
        nodeCounts:
          registered:
            total: 7
            ready: 7
            notStarted: 0
          longUnregistered: 0
          unregistered: 0
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
        lastTransitionTime: "2023-12-20T10:46:22.549253146Z"
      scaleUp:
        status: NoActivity
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
        lastTransitionTime: "2023-12-20T11:03:21.319836881Z"
      scaleDown:
        status: NoCandidates
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
    nodeGroups:
    - name: https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp
      health:
        status: Healthy
        nodeCounts:
          registered:
            total: 7
            ready: 7
            notStarted: 0
          longUnregistered: 0
          unregistered: 0
        cloudProviderTarget: 7
        minSize: 4
        maxSize: 10
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
        lastTransitionTime: "2023-12-20T10:46:22.549253146Z"
      scaleUp:
        status: Backoff
        backoffInfo:
          errorCode: QUOTA_EXCEEDED
          errorMessage: 'Instance ''gke-sample-cluster-default-pool-40ce0341-b82s'' creation
            failed: Quota ''CPUS'' exceeded.  Limit: 57.0 in region us-central1., Instance
            ''gke-sample-cluster-default-pool-40ce0341-n8d6'' creation failed: Quota ''CPUS''
            exceeded.  Limit: 57.0 in region us-central1., Instance ''gke-sample-cluster-default-pool-40ce0341-s5tz''
            creation failed: Quota ''CPUS'' exceeded.  Limit: 57.0 in region us-central1.'
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
        lastTransitionTime: "2023-12-20T11:03:21.319836881Z"
      scaleDown:
        status: NoCandidates
        lastProbeTime: "2023-12-20T11:05:35.373518676Z"
kind: ConfigMap
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/last-updated: 2023-12-20 11:05:35.373518676 +0000
      UTC
  creationTimestamp: "2023-12-20T10:46:10Z"
  name: cluster-autoscaler-status
  namespace: kube-system
  resourceVersion: "22056994"
  uid: 2be11b59-cf5c-4bd5-bb31-b7baf133bcbc

I changed the project name in the messages above.

I will refactor the config map tests in a subsequent PR, since they are pre-existing and out of scope for this PR.

@walidghallab
Contributor Author

/hold Holding submission to make sure this comment is addressed before the bot submits automatically.

@k8s-ci-robot added the do-not-merge/hold label Dec 28, 2023
@walidghallab
Contributor Author

walidghallab commented Dec 28, 2023

/unhold The comment has been addressed.
I've added a new field beingDeleted with the omitDefault tag, so that it is not displayed when it is 0 (which is the case most of the time).
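
A sketch of the idea, not the exact code from the PR: the comment above mentions an omitDefault tag; the snippet below illustrates the same omit-when-zero behaviour using the standard omitempty json tag, which sigs.k8s.io/yaml honours. Field names follow the example output earlier in this thread; the real definitions live in cluster-autoscaler/clusterstate/api/types.go.

```go
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

// registeredNodeCount is illustrative only. sigs.k8s.io/yaml marshals via the
// json tags, so omitempty drops BeingDeleted from the output when it is 0.
type registeredNodeCount struct {
	Total        int `json:"total"`
	Ready        int `json:"ready"`
	NotStarted   int `json:"notStarted"`
	BeingDeleted int `json:"beingDeleted,omitempty"`
}

func main() {
	out, err := yaml.Marshal(registeredNodeCount{Total: 7, Ready: 7})
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out)) // beingDeleted is absent because it is 0
}
```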

@walidghallab
Contributor Author

/unhold

@k8s-ci-robot removed the do-not-merge/hold label Dec 28, 2023

@MaciekPytel left a comment


/lgtm
/approve
I left a bunch of non-blocking comments for a potential discussion / follow-up. Let's not fix them on this PR.

/hold
For fixing one comment that I see as blocking - the one about using the "Initializing" literal instead of a const in actionable_cluster_processor. Please feel free to remove the hold as soon as you address this one.

@@ -839,130 +846,114 @@ func buildScaleUpStatusNodeGroup(isScaleUpInProgress bool, scaleUpSafety NodeGro
		condition.Status = api.ClusterAutoscalerUnhealthy
	} else if !scaleUpSafety.SafeToScale {
		condition.Status = api.ClusterAutoscalerBackoff
		condition.BackoffInfo = api.BackoffInfo{
			ErrorCode:    scaleUpSafety.BackoffStatus.ErrorInfo.ErrorCode,
			ErrorMessage: scaleUpSafety.BackoffStatus.ErrorInfo.ErrorMessage,
Contributor


I wonder if we should maybe trim this to some maximum length? The max size of a configmap is 1MB, and that's not much if you have a few hundred node groups.
A few hundred characters would likely be a reasonable limit for an error message?

Contributor Author


I actually don't print all the error messages. Only the first three error messages in each node group are printed.
They are also the same error messages used in events.

I added truncation logic for extra safety, PTAL.
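
An illustrative sketch of the kind of truncation described here; the function name, the limit, and the "<truncated>" marker are made up for this example and are not the exact values in the PR.

```go
package main

import "fmt"

// truncateMessage caps an error message at maxLen bytes, replacing the tail
// with a marker. Slicing is byte-based, which is fine for the ASCII error
// messages shown in this thread.
func truncateMessage(msg string, maxLen int) string {
	const marker = " <truncated>"
	if len(msg) <= maxLen {
		return msg
	}
	if maxLen <= len(marker) {
		return msg[:maxLen]
	}
	return msg[:maxLen-len(marker)] + marker
}

func main() {
	msg := "Instance creation failed: Quota 'CPUS' exceeded. Limit: 57.0 in region us-central1."
	fmt.Println(truncateMessage(msg, 60))
}
```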

@k8s-ci-robot added the do-not-merge/hold and lgtm labels Dec 28, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel, walidghallab

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Dec 28, 2023
@k8s-ci-robot added the size/XXL label and removed the lgtm and size/XL labels Dec 28, 2023
…o for backoff and more node counts.

Change-Id: Ic68e0d67b7ce9912b605b6c0a3356b4d0e177911
…roup.

Max size of configmap is 1MB.

Change-Id: I615d25781e4f8dafb6a08f752c085544bcd49e5a
@MaciekPytel
Contributor

/lgtm
/hold cancel

@k8s-ci-robot added the lgtm label and removed the do-not-merge/hold label Dec 29, 2023
@k8s-ci-robot merged commit c7ad47b into kubernetes:master Dec 29, 2023
6 checks passed