
Add backoff data to cluster-autoscaler-status and make the status easier for parsing #6318

Closed
walidghallab opened this issue Nov 27, 2023 · 11 comments · Fixed by #6375
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@walidghallab
Contributor

walidghallab commented Nov 27, 2023

Which component are you using?:
cluster-autoscaler

The problem this feature should solve:
The cluster-autoscaler-status config map shows which node groups are in backoff, but it doesn't show the reasons behind the backoff, which makes debugging harder, especially for large clusters.
Also, the status of cluster-autoscaler doesn't conform to a format that is easy to parse.

Describe the solution you'd like:
Add the backoff reason (errorCode & errorMessage) for each node group that is in backoff to the status field of the cluster-autoscaler-status config map.

Also, change the status field from its current format to YAML, which is both human-readable and easier to parse.

Example for the status field:

Current state (human-readable string without backoff data):

status: |
Cluster-autoscaler status at 2023-11-24 04:28:19.546750398 +0000 UTC:
Cluster-wide:
  Health:      Healthy (ready=7 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=7 longUnregistered=0)
               LastProbeTime:      2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876
               LastTransitionTime: 2023-11-23 14:52:02.009391854 +0000 UTC m=+10.793340938
  ScaleUp:     NoActivity (ready=7 registered=7)
               LastProbeTime:      2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876
               LastTransitionTime: 2023-11-24 04:15:34.972108466 +0000 UTC m=+48223.756057543
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876
               LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC

NodeGroups:
  Name:        https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp
  Health:      Healthy (ready=7 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=7 longUnregistered=0 cloudProviderTarget=7 (minSize=4, maxSize=10))
               LastProbeTime:      2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876
               LastTransitionTime: 2023-11-23 14:52:02.009391854 +0000 UTC m=+10.793340938
  ScaleUp:     Backoff (ready=7 cloudProviderTarget=7)
               LastProbeTime:      2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876
               LastTransitionTime: 2023-11-24 04:15:34.972108466 +0000 UTC m=+48223.756057543
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876
               LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC

Target state (YAML with backoff data):

status:
  time: "2023-11-24 04:28:19.546750398 +0000 UTC"
  autoscalerStatus: "Running"
  clusterWide:
    health: 
      status: "Healthy"
      nodeCounts:
        ready: 7
        unready: 0
        resourceUnready: 0
        notStarted: 0
        registered: 7
        longUnregistered: 0
      lastProbeTime: "2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876"
      lastTransitionTime: "2023-11-23 14:52:02.009391854 +0000 UTC m=+10.793340938"
    scaleUp: 
      status: "NoActivity"
      nodeCounts:
        ready: "7"
        registered: "7"
      lastProbeTime: "2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876"
      lastTransitionTime: "2023-11-24 04:15:34.972108466 +0000 UTC m=+48223.756057543"
    scaleDown:   
      status: "NoCandidates"
      candidates: 0
      lastProbeTime: "2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876"
      lastTransitionTime: "0001-01-01 00:00:00 +0000 UTC"

  nodeGroups:
    -
      name: "https://www.googleapis.com/compute/v1/projects/sample-project/zones/us-central1-c/instanceGroups/gke-sample-cluster-default-pool-40ce0341-grp"
      health:      
        status: "Healthy"
        nodeCounts:
          ready: 7
          unready: 0
          resourceUnready: 0
          notStarted: 0
          registered: 7
          cloudProviderTarget:
            targetSize: 7
            minSize: 4
            maxSize: 10
        lastProbeTime: "2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876"
        lastTransitionTime: "2023-11-23 14:52:02.009391854 +0000 UTC m=+10.793340938"
      scaleUp:   
        status: "Backoff"
        nodeCounts:
          ready: "7"
          cloudProviderTarget: "7"
        errorCode: "QUOTA_EXCEEDED"
        errorMessage: "Instance 'gke-sample-cluster-default-pool-40ce0341-t28s' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-8pz9' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-q8ck' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1."
        lastProbeTime: "2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876"
        lastTransitionTime: "2023-11-24 04:15:34.972108466 +0000 UTC m=+48223.756057543"
      scaleDown:   
        status: "NoCandidates"
        candidates: 0
        lastProbeTime: "2023-11-24 04:28:19.408724799 +0000 UTC m=+48988.192673876"
        lastTransitionTime: "0001-01-01 00:00:00 +0000 UTC"
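
For illustration, here is a minimal Go sketch of how such a status could be consumed, assuming gopkg.in/yaml.v3 and struct/field names that only mirror the example above (the final schema may differ):

package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Illustrative types only; they mirror the example above, not a final schema.
type ScaleUp struct {
	Status       string `yaml:"status"`
	ErrorCode    string `yaml:"errorCode"`
	ErrorMessage string `yaml:"errorMessage"`
}

type NodeGroup struct {
	Name    string  `yaml:"name"`
	ScaleUp ScaleUp `yaml:"scaleUp"`
}

type Status struct {
	Time             string      `yaml:"time"`
	AutoscalerStatus string      `yaml:"autoscalerStatus"`
	NodeGroups       []NodeGroup `yaml:"nodeGroups"`
}

func main() {
	// In practice, raw would be read from the "status" key of the
	// cluster-autoscaler-status ConfigMap (e.g. via client-go).
	raw := []byte(`
time: "2023-11-24 04:28:19.546750398 +0000 UTC"
autoscalerStatus: "Running"
nodeGroups:
  - name: "example-node-group"
    scaleUp:
      status: "Backoff"
      errorCode: "QUOTA_EXCEEDED"
      errorMessage: "Quota 'CPUS' exceeded."
`)

	var s Status
	if err := yaml.Unmarshal(raw, &s); err != nil {
		panic(err)
	}
	// List node groups that are currently backed off, with the reason.
	for _, ng := range s.NodeGroups {
		if ng.ScaleUp.Status == "Backoff" {
			fmt.Printf("%s: %s (%s)\n", ng.Name, ng.ScaleUp.ErrorCode, ng.ScaleUp.ErrorMessage)
		}
	}
}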

@walidghallab walidghallab added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 27, 2023
@MaciekPytel
Contributor

I have a few comments on the proposed format:

  • Do we need to repeat node counts in 'health' and 'scaleUp'? It seems redundant (I realize the current configmap repeats this info, but maybe that's an opportunity to clean up?).
  • I don't think minSize/maxSize should be under cloudProviderTarget. Again, the current formatting puts it that way, but I don't really think those are connected. Min/max size are cluster autoscaler constraints, not cloud provider ones (and many cloud providers have a min/max size for ASGs, so putting it here makes it easy to confuse those). In your example I'd change it to:
    health:
        nodeCounts:
             (...)
             longUnregistered: 0
        cloudProviderTarget: 7
        minSize: 4
        maxSize: 10
    
  • I'd consider putting backoff-related information under something like 'backoffInfo' or similar. Using your example (a rough type sketch also follows below):
  scaleUp:   
        status: "Backoff"
        backoffInfo:
            errorCode: "QUOTA_EXCEEDED"
            errorMessage: "Instance 'gke-sample-cluster-default-pool-40ce0341-t28s' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-8pz9' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1., Instance 'gke-sample-cluster-default-pool-40ce0341-q8ck' creation failed: Quota 'CPUS' exceeded.  Limit: 57.0 in region us-central1."
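
For illustration only, the same grouping could be expressed as Go types along these lines (a sketch, not a proposed API; names are hypothetical):

// Hypothetical sketch of the 'backoffInfo' grouping suggested above.
package sketch

type BackoffInfo struct {
	ErrorCode    string `yaml:"errorCode"`
	ErrorMessage string `yaml:"errorMessage"`
}

type ScaleUpCondition struct {
	Status string `yaml:"status"`
	// Only populated while the node group is in backoff.
	BackoffInfo *BackoffInfo `yaml:"backoffInfo,omitempty"`
}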

@walidghallab
Contributor Author

Good points, thanks Maciek!

Do we need to repeat node counts in 'health' and 'scaleUp'? It seems redundant (I realize the current configmap repeats this info, but maybe that's an opportunity to clean up?).

I agree, I'll remove nodeCounts from scaleUp.

I don't think minSize/maxSize should be under cloudProviderTarget. Again, the current formatting puts it that way, but I don't really think those are connected. Min/max size are cluster autoscaler constraints, not cloud provider ones (and many cloud providers have a min/max size for ASGs, so putting it here makes it easy to confuse those).

I am not sure about this one. I don't know why it is named cloudProviderTarget rather than clusterAutoscalerTarget. In that case, minSize and maxSize are the min and max size of the autoscaler, not of the node group. For example, if the autoscaler's min size is 3 but the user manually removes nodes until the count is 2 and utilization is still low, the autoscaler won't scale up to 3, and the size will stay smaller than the min size in the autoscaler settings.
So, WDYT of still keeping minSize and maxSize nested as they are now and renaming cloudProviderTarget to clusterAutoscalerTarget?

I'd consider putting backoff-related information under something like 'backoffInfo' or similar:

I agree, I'll do that.

@MaciekPytel
Contributor

I still don't think we should connect cloudProviderTarget and min/max any more than we connect nodeCounts and min/max. This is undocumented, so there is no way to prove it, but even with the old configmap I think the min/max in parentheses was expected to be interpreted as a sort of comment on the entire line that is now represented as "nodeCounts", and not as specifically related to cloudProviderTarget. It just so happened that cloudProviderTarget was the last field.
I think that line could just as well have looked like "cloudProviderTarget=5 ready=5 unready=0 (minSize=1, maxSize=8)", with some other field being the one next to the min/max size information. Admittedly, the existence of "resourceUnready" in parentheses conflicts with this interpretation, as that one clearly refers to unready, but resourceUnready was only added recently and I don't think it reflects the original intent.

Regardless of the above:

  • I think 'cloudProviderTarget' makes sense as a name. This field represents the "target size" of whatever cloud provider concept is backing a CA NodeGroup (e.g. MIG on GCE, ASG on AWS). It doesn't matter whether it was Cluster Autoscaler that set this value or someone else (and if CA "thinks" this value is right - it may very well disagree and change it in the next loop). This field (and every other field in nodeCounts) is just CA's observation of the external world, not a representation of CA's intent.
  • MinSize and MaxSize, on the other hand, are just internal CA configuration. They don't actually have any meaning outside of CA. cloudProviderTarget or nodeCounts may very well be larger than maxSize or smaller than minSize. All that this field means is that CA will not resize the nodeGroup beyond that limit (but other components or the user still may do it). Hence why I suggested putting it separately from both cloudProviderTarget and nodeCounts.
  • It is especially confusing on providers where the underlying nodeGroup actually does have a min/max size. In that case the min/max size used by CA comes from CA configuration and is not the cloud provider min/max size. An example of this is AWS, where the CA min/max can be different from the ASG minimum/maximum capacity. Connecting min/max size to cloudProviderTarget would suggest that those are the "cloud provider min/max" when they're in fact the "CA min/max that may be different from the underlying cloud provider min/max".

@MaciekPytel
Contributor

MaciekPytel commented Dec 1, 2023

Actually this discussion makes me think about whether we should nest resourceUnready under unready, or even nest a bunch of stuff under registered. Technically the correct representation would be:

registered:
  total: 8
  ready: 4
  unready:
    total: 3
    resourceUnready: 2
  notStarted: 1

This actually illustrates how the different states aggregate. Registered should equal ready + unready + notStarted. ResourceUnready nodes are already counted as 'unready', and including them in the calculation would lead to double counting (and longUnregistered nodes are obviously not part of registered).
My guess is that the vast majority of CA users wouldn't know the relationships between those states. I wonder if a more structured representation like this would help make them clearer? Or if it would just make things harder to read and parse?
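
As a rough illustration of that aggregation (a sketch only, with hypothetical type names):

// Hypothetical types illustrating how the counts above aggregate; not a proposed API.
package sketch

type UnreadyCounts struct {
	Total           int // all unready nodes
	ResourceUnready int // subset of Total, not added on top of it
}

type RegisteredCounts struct {
	Total      int // should equal Ready + Unready.Total + NotStarted
	Ready      int
	Unready    UnreadyCounts
	NotStarted int
}

// aggregatesCorrectly checks registered = ready + unready + notStarted.
// With the example above: 8 == 4 + 3 + 1. longUnregistered nodes are
// tracked separately and are not part of registered.
func aggregatesCorrectly(r RegisteredCounts) bool {
	return r.Total == r.Ready+r.Unready.Total+r.NotStarted
}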

@x13n @towca @BigDarkClown opinions?

@walidghallab
Contributor Author

Thanks a lot @MaciekPytel, this is very helpful to know!

@x13n
Member

x13n commented Dec 1, 2023

The reason why resourceUnready is currently displayed in parentheses is to make it clear it is not at the same level as other counts, but rather "nested" under unready. We would lose this information if we put them at the same level. I quite like the idea of showing the relationships between different node counts as a tree. There's a bit of extra boilerplate coming from all the "totals" on each level, but in my opinion the extra readability is worth it.

@walidghallab
Contributor Author

walidghallab commented Dec 8, 2023

@MaciekPytel @x13n @towca I found that "deleted" is counted towards the registered count. WDYT of adding it as well? It was never printed before but I think it can be useful.

current.Deleted = append(current.Deleted, node.Name)

Also, WDYT of adding "unregistered", which wasn't there before?

@walidghallab
Contributor Author

/assign @walidghallab

@MaciekPytel
Contributor

+1 to printing both unregistered and deleted. Especially deleted; no idea why we haven't been printing them already. I assume it was just an omission and not an intentional decision.

@walidghallab
Contributor Author

Thanks @MaciekPytel
I've added them. I've also renamed deleted to beingDeleted, as I feel deleted might be confusing.

@walidghallab
Contributor Author

For future readers:

  • The schema used is defined here (it is subject to change in the future).
  • Example YAML here.
