
Add backoff data to cluster-autoscaler-status and make the status easier for parsing #6318

Closed
walidghallab opened this issue Nov 27, 2023 · 11 comments · Fixed by #6375
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@walidghallab
Contributor

walidghallab commented Nov 27, 2023

Which component are you using?:
cluster-autoscaler

The problem this feature should solve:
The cluster-autoscaler-status config map shows which node groups are in backoff, but it doesn't show the reasons behind the backoff, which makes debugging harder, especially for large clusters.
Also, the status of cluster-autoscaler doesn't conform to a format that is easy to parse.

Describe the solution you'd like:
Add the backoff reason (errorCode & errorMessage) for each node group that is in backoff to the status field of the cluster-autoscaler-status config map.

Also, change the status field from its current format to YAML, which is both human-readable and easier to parse.

Example for the status field:

Current state (human-readable string without backoff data):

status: |
Cluster-autoscaler status at 2023-11-24 04:28:19.546750398 +0000 UTC:
Cluster-wide:
  Health:      Healthy (ready=7 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=7 longUnregistered=0)
               LastProbeTime:      2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876
               LastTransitionTime: 2023-11-23 14:52:02.009391854 +0000 UTC m=+10.793340938
  ScaleUp:     NoActivity (ready=7 registered=7)
               LastProbeTime:      2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876
               LastTransitionTime: 2023-11-24 04:15:34.972108466 +0000 UTC m=+48223.756057543
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876
               LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC

NodeGroups:
  Name:        https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp
  Health:      Healthy (ready=7 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=7 longUnregistered=0 cloudProviderTarget=7 (minSize=4, maxSize=10))
               LastProbeTime:      2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876
               LastTransitionTime: 2023-11-23 14:52:02.009391854 +0000 UTC m=+10.793340938
  ScaleUp:     Backoff (ready=7 cloudProviderTarget=7)
               LastProbeTime:      2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876
               LastTransitionTime: 2023-11-24 04:15:34.972108466 +0000 UTC m=+48223.756057543
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876
               LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC

Target state (YAML with backoff data):

status:
  time: "2023-11-24 04:28:19.546750398 +0000 UTC"
  autoscalerStatus: "Running"
  clusterWide:
    health: 
      status: "Healthy"
      nodeCounts:
        ready: 7
        unready: 0
        resourceUnready: 0
        notStarted: 0
        registered: 7
        longUnregistered: 0
      lastProbeTime: "2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876"
      lastTransitionTime: "2023-11-23 14:52:02.009391854 +0000 UTC m=+10.793340938"
    scaleUp: 
      status: "NoActivity"
      nodeCounts:
        ready: "7"
        registered: "7"
      lastProbeTime: "2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876"
      lastTransitionTime: "2023-11-24 04:15:34.972108466 +0000 UTC m=+48223.756057543"
    scaleDown:   
      status: "NoCandidates"
      candidates: 0
      lastProbeTime: "2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876"
      lastTransitionTime: "0001-01-01 00:00:00 +0000 UTC"

  nodeGroups:
    -
      name: "https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp"
      health:      
        status: "Healthy"
        nodeCounts:
          ready: 7
          unready: 0
          resourceUnready: 0
          notStarted: 0
          registered: 7
          cloudProviderTarget:
            targetSize: 7
            minSize: 4
            maxSize: 10
        lastProbeTime: "2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876"
        lastTransitionTime: "2023-11-23 14:52:02.009391854 +0000 UTC m=+10.793340938"
      scaleUp:   
        status: "Backoff"
        nodeCounts:
          ready: "7"
          cloudProviderTarget: "7"
        errorCode: "QUOTA_EXCEEDED"
        errorMessage: "Instance 'gke-sample-cluster-default-pool-40ce0341-t28s' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-8pz9' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-q8ck' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1."
        lastProbeTime: "2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876"
        lastTransitionTime: "2023-11-24 04:15:34.972108466 +0000 UTC m=+48223.756057543"
      scaleDown:   
        status: "NoCandidates"
        candidates: 0
        lastProbeTime: "2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876"
        lastTransitionTime: "0001-01-01 00:00:00 +0000 UTC"
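
For illustration, here is a minimal Go sketch of how such a status could be consumed, assuming gopkg.in/yaml.v3 and struct/field names that only mirror the example above (the final schema may differ):

package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Illustrative types only; they mirror the example above, not a final schema.
type ScaleUp struct {
	Status       string `yaml:"status"`
	ErrorCode    string `yaml:"errorCode"`
	ErrorMessage string `yaml:"errorMessage"`
}

type NodeGroup struct {
	Name    string  `yaml:"name"`
	ScaleUp ScaleUp `yaml:"scaleUp"`
}

type Status struct {
	Time             string      `yaml:"time"`
	AutoscalerStatus string      `yaml:"autoscalerStatus"`
	NodeGroups       []NodeGroup `yaml:"nodeGroups"`
}

func main() {
	// In practice, raw would be read from the "status" key of the
	// cluster-autoscaler-status ConfigMap (e.g. via client-go).
	raw := []byte(`
time: "2023-11-24 04:28:19.546750398 +0000 UTC"
autoscalerStatus: "Running"
nodeGroups:
  - name: "example-node-group"
    scaleUp:
      status: "Backoff"
      errorCode: "QUOTA_EXCEEDED"
      errorMessage: "Quota 'CPUS' exceeded."
`)

	var s Status
	if err := yaml.Unmarshal(raw, &s); err != nil {
		panic(err)
	}
	// List node groups that are currently backed off, with the reason.
	for _, ng := range s.NodeGroups {
		if ng.ScaleUp.Status == "Backoff" {
			fmt.Printf("%s: %s (%s)\n", ng.Name, ng.ScaleUp.ErrorCode, ng.ScaleUp.ErrorMessage)
		}
	}
}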

@walidghallab walidghallab added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 27, 2023
@MaciekPytel
Contributor

I have a few comments on the proposed format:

  • Do we need to repeat node counts in 'health' and 'scaleUp'? It seems redundant (I realize the current configmap repeats this info, but maybe that's an opportunity to clean up?).
  • I don't think minSize/maxSize should be under cloudProviderTarget. Again, the current formatting puts it that way, but I don't really think those are connected. Min/max size are cluster autoscaler constraints, not cloud provider ones (and many cloud providers have a min/max size for ASGs, so putting it here makes it easy to confuse those). In your example I'd change it to:
    health:
        nodeCounts:
             (...)
             longUnregistered: 0
        cloudProviderTarget: 7
        minSize: 4
        maxSize: 10
    
  • I'd consider putting backoff-related information under something like 'backoffInfo' or similar. Using your example (a rough type sketch also follows below):
  scaleUp:   
        status: "Backoff"
        backoffInfo:
            errorCode: "QUOTA_EXCEEDED"
            errorMessage: "Instance 'gke-sample-cluster-default-pool-40ce0341-t28s' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-8pz9' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-q8ck' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1."
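
For illustration only, the same grouping could be expressed as Go types along these lines (a sketch, not a proposed API; names are hypothetical):

// Hypothetical sketch of the 'backoffInfo' grouping suggested above.
package sketch

type BackoffInfo struct {
	ErrorCode    string `yaml:"errorCode"`
	ErrorMessage string `yaml:"errorMessage"`
}

type ScaleUpCondition struct {
	Status string `yaml:"status"`
	// Only populated while the node group is in backoff.
	BackoffInfo *BackoffInfo `yaml:"backoffInfo,omitempty"`
}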

@walidghallab
Contributor Author

Good points, thanks Maciek!

Do we need to repeat node counts in 'health' and 'scaleUp'? It seems redundant (I realize the current configmap repeats this info, but maybe that's an opportunity to clean up?).

I agree, I'll remove nodeCounts from scaleUp.

I don't think minSize/maxSize should be under cloudProviderTarget. Again, the current formatting puts it that way, but I don't really think those are connected. Min/max size are cluster autoscaler constraints, not cloud provider ones (and many cloud providers have a min/max size for ASGs, so putting it here makes it easy to confuse those).

I am not sure about this one. I don't know why it is named cloudProviderTarget rather than clusterAutoscalerTarget. In that case, minSize and maxSize are the min and max size of the autoscaler, not of the node group. For example, if the autoscaler's min size is 3 but the user manually removes nodes until the count is 2 and utilization is still low, the autoscaler won't scale up to 3, and the size will stay smaller than the min size in the autoscaler settings.
So, WDYT of still keeping minSize and maxSize nested as they are now and renaming cloudProviderTarget to clusterAutoscalerTarget?

I'd consider putting backoff-related information under something like 'backoffInfo' or similar:

I agree, I'll do that.

@MaciekPytel
Contributor

I still don't think we should connect cloudProviderTarget and min/max any more than we connect nodeCounts and min/max. This is undocumented, so there is no way to prove it, but even with the old configmap I think the min/max in parentheses was expected to be interpreted as a sort of comment on the entire line that is now represented as "nodeCounts", and not as specifically related to cloudProviderTarget. It just so happened that cloudProviderTarget was the last field.
I think that line could just as well have looked like "cloudProviderTarget=5 ready=5 unready=0 (minSize=1, maxSize=8)", with some other field being the one next to the min/max size information. Admittedly, the existence of "resourceUnready" in parentheses conflicts with this interpretation, as that one clearly refers to unready, but resourceUnready was only added recently and I don't think it reflects the original intent.

Regardless of the above:

  • I think 'cloudProviderTarget' makes sense as a name. This field represents the "target size" of whatever cloud provider concept is backing a CA NodeGroup (e.g. MIG on GCE, ASG on AWS). It doesn't matter whether it was Cluster Autoscaler that set this value or someone else (and if CA "thinks" this value is right - it may very well disagree and change it in the next loop). This field (and every other field in nodeCounts) is just CA's observation of the external world, not a representation of CA's intent.
  • MinSize and MaxSize, on the other hand, are just internal CA configuration. They don't actually have any meaning outside of CA. cloudProviderTarget or nodeCounts may very well be larger than maxSize or smaller than minSize. All that this field means is that CA will not resize the nodeGroup beyond that limit (but other components or the user still may do it). Hence why I suggested putting it separately from both cloudProviderTarget and nodeCounts.
  • It is especially confusing on providers where the underlying nodeGroup actually does have a min/max size. In that case the min/max size used by CA comes from CA configuration and is not the cloud provider min/max size. An example of this is AWS, where the CA min/max can be different from the ASG minimum/maximum capacity. Connecting min/max size to cloudProviderTarget would suggest that those are the "cloud provider min/max" when they're in fact the "CA min/max that may be different from the underlying cloud provider min/max".

@MaciekPytel
Contributor

MaciekPytel commented Dec 1, 2023

Actually this discussion makes me think about whether we should nest resourceUnready under unready, or even nest a bunch of stuff under registered. Technically the correct representation would be:

registered:
  total: 8
  ready: 4
  unready:
    total: 3
    resourceUnready: 2
  notStarted: 1

This actually illustrates how the different states aggregate. Registered should equal ready + unready + notStarted. ResourceUnready nodes are already counted as 'unready', and including them in the calculation would lead to double counting (and longUnregistered nodes are obviously not part of registered).
My guess is that the vast majority of CA users wouldn't know the relationships between those states. I wonder if a more structured representation like this would help make them clearer? Or if it would just make things harder to read and parse?
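
As a rough illustration of that aggregation (a sketch only, with hypothetical type names):

// Hypothetical types illustrating how the counts above aggregate; not a proposed API.
package sketch

type UnreadyCounts struct {
	Total           int // all unready nodes
	ResourceUnready int // subset of Total, not added on top of it
}

type RegisteredCounts struct {
	Total      int // should equal Ready + Unready.Total + NotStarted
	Ready      int
	Unready    UnreadyCounts
	NotStarted int
}

// aggregatesCorrectly checks registered = ready + unready + notStarted.
// With the example above: 8 == 4 + 3 + 1. longUnregistered nodes are
// tracked separately and are not part of registered.
func aggregatesCorrectly(r RegisteredCounts) bool {
	return r.Total == r.Ready+r.Unready.Total+r.NotStarted
}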

@x13n @towca @BigDarkClown opinions?

@walidghallab
Contributor Author

Thanks a lot @MaciekPytel, this is very helpful to know!

@x13n
Member

x13n commented Dec 1, 2023

The reason why resourceUnready is currently displayed in parentheses is to make it clear it is not at the same level as other counts, but rather "nested" under unready. We would lose this information if we put them at the same level. I quite like the idea of showing the relationships between different node counts as a tree. There's a bit of extra boilerplate coming from all the "totals" on each level, but in my opinion the extra readability is worth it.

@walidghallab
Contributor Author

walidghallab commented Dec 8, 2023

@MaciekPytel @x13n @towca I found that "deleted" is counted towards the registered count. WDYT of adding it as well? It was never printed before but I think it can be useful.

current.Deleted = append(current.Deleted, node.Name)

Also, WDYT of adding "unregistered", which wasn't there before?

@walidghallab
Contributor Author

/assign @walidghallab

@MaciekPytel
Contributor

+1 to printing both unregistered and deleted. Especially deleted; no idea why we haven't been printing them already. I assume it was just an omission and not an intentional decision.

@walidghallab
Contributor Author

Thanks @MaciekPytel
I've added them. I've also renamed deleted to beingDeleted, as I feel deleted might be confusing.

@walidghallab
Contributor Author

For future readers:

  • The schema used is defined here (it is subject to change in the future).
  • Example YAML here.
