
Regression since 1.28 with managing unregistered nodes #6524

Closed
yarinm opened this issue Feb 12, 2024 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@yarinm
Contributor

yarinm commented Feb 12, 2024

Which component are you using?: cluster-autoscaler
What version of the component are you using?: 1.28.X and also 1.29.0
What k8s version are you using (kubectl version)?: 1.28.5
What environment is this in?: AWS but I think this bug is possibly in Azure as well
What did you expect to happen?:
We have EKS clusters that use self-managed node groups. Since upgrading to 1.28, we noticed that the autoscaler sometimes stops scaling up a node group once it detects an unregistered node that exceeded the allotted provisioning timeout. This typically happens when the node's kubelet fails to register with the API server.

When that happens we see these logs:

Scale-up timed out for node group ng-1-9af2fac after 15m5.826849861s
Disabling scale-up for node group ng-1-9af2fac until 2024-02-11 10:47:42.875469718 +0000 UTC m=+1232.844506453; errorClass=Other; errorCode=timeout
Readiness for node group ng-1-9af2fac not found
Failed to find readiness information for ng-1-9af2fac

In version 1.27.X, in this scenario, the autoscaler detects the node as unregistered, eventually removes it, and keeps scaling up additional nodes:

    I0131 09:52:17.121052       1 static_autoscaler.go:746] Removing unregistered node aws:///us-east-2b/i-036148ac351ea2dde
    I0131 09:52:17.297294       1 static_autoscaler.go:746] Removing unregistered node aws:///us-east-2a/i-07de3550c695e78f9
    I0131 09:52:17.297402       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"b2174fad-f01d-4c43-8c72-255af3e7b752", APIVersion:"v1", ResourceVersion:"27833033", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-2b/i-036148ac351ea2dde
    I0131 09:52:17.487807       1 static_autoscaler.go:746] Removing unregistered node aws:///us-east-2c/i-0a5627f0a8e056899
    I0131 09:52:17.487911       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"b2174fad-f01d-4c43-8c72-255af3e7b752", APIVersion:"v1", ResourceVersion:"27833033", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-2a/i-07de3550c695e78f9
    I0131 09:52:17.661894       1 static_autoscaler.go:413] Some unregistered nodes were removed
    I0131 09:52:17.661952       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"b2174fad-f01d-4c43-8c72-255af3e7b752", APIVersion:"v1", ResourceVersion:"27833033", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-2c/i-0a5627f0a8e056899

To reproduce this locally, I set up an EKS cluster with self-managed node groups and edited the startup script to exit 1 before running the kubelet, so the node never registers.

This is a regression from 1.27 to 1.28+, and I found the culprit commit:
e5bc070

From debugging I noticed that instance.Status is always nil (this is taken from aws_cloud_provider.go):

func (ng *AwsNodeGroup) Nodes() ([]cloudprovider.Instance, error) {
	asgNodes, err := ng.awsManager.GetAsgNodes(ng.asg.AwsRef)
	if err != nil {
		return nil, err
	}

	instances := make([]cloudprovider.Instance, len(asgNodes))

	for i, asgNode := range asgNodes {
		var status *cloudprovider.InstanceStatus
		instanceStatusString, err := ng.awsManager.GetInstanceStatus(asgNode)
		if err != nil {
			klog.V(4).Infof("Could not get instance status, continuing anyways: %v", err)
		} else if instanceStatusString != nil && *instanceStatusString == placeholderUnfulfillableStatus {
			status = &cloudprovider.InstanceStatus{
				State: cloudprovider.InstanceCreating,
				ErrorInfo: &cloudprovider.InstanceErrorInfo{
					ErrorClass:   cloudprovider.OutOfResourcesErrorClass,
					ErrorCode:    placeholderUnfulfillableStatus,
					ErrorMessage: "AWS cannot provision any more instances for this node group",
				},
			}
		}
		instances[i] = cloudprovider.Instance{
			Id:     asgNode.ProviderID,
			Status: status,
		}
	}
	return instances, nil
}

So this check never adds unregistered nodes to the list. My suggestion is to either revert that commit or change expectedToRegister to:

func expectedToRegister(instance cloudprovider.Instance) bool {
	return instance.Status == nil || (instance.Status.State != cloudprovider.InstanceDeleting && instance.Status.ErrorInfo == nil)
}
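As a sanity check, here is a self-contained sketch of how the proposed predicate classifies each case. The type definitions below are hypothetical simplified stand-ins, not the real cluster-autoscaler cloudprovider types:

```go
package main

import "fmt"

// Simplified stand-ins for the cloudprovider types (hypothetical; the
// real definitions live in cluster-autoscaler's cloudprovider package).
type InstanceState int

const (
	InstanceRunning InstanceState = iota
	InstanceCreating
	InstanceDeleting
)

type InstanceErrorInfo struct {
	ErrorCode    string
	ErrorMessage string
}

type InstanceStatus struct {
	State     InstanceState
	ErrorInfo *InstanceErrorInfo
}

type Instance struct {
	Id     string
	Status *InstanceStatus
}

// Proposed fix: a nil Status (what the AWS provider returns for
// ordinary instances) is treated as "expected to register" again.
func expectedToRegister(instance Instance) bool {
	return instance.Status == nil ||
		(instance.Status.State != InstanceDeleting && instance.Status.ErrorInfo == nil)
}

func main() {
	cases := []Instance{
		{Id: "no-status (AWS default)", Status: nil},
		{Id: "creating", Status: &InstanceStatus{State: InstanceCreating}},
		{Id: "deleting", Status: &InstanceStatus{State: InstanceDeleting}},
		{Id: "placeholder-error", Status: &InstanceStatus{
			State:     InstanceCreating,
			ErrorInfo: &InstanceErrorInfo{ErrorCode: "placeholder-cannot-be-fulfilled"},
		}},
	}
	for _, c := range cases {
		fmt.Printf("%-25s expectedToRegister=%v\n", c.Id, expectedToRegister(c))
	}
}
```

With this change, nil-Status instances are again counted as expected to register, so a stuck instance is classified as unregistered and removed after the timeout, as in 1.27.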

@azylinski @x13n

@yarinm yarinm added the kind/bug Categorizes issue or PR as related to a bug. label Feb 12, 2024
@azylinski
Contributor

Thanks @yarinm. The change to the expectedToRegister func makes sense to me. @x13n, would you agree?

@x13n
Member

x13n commented Feb 12, 2024

Yup, that certainly looks better. Thanks @yarinm for investigating this!

@x13n
Member

x13n commented Feb 12, 2024

I'd also consider changing Status to be of type InstanceStatus rather than *InstanceStatus. Currently it is documented as optional:

// Status represents status of node. (Optional)
Status *InstanceStatus

But honestly I have no idea how to interpret it then.
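For illustration, a minimal sketch (again with hypothetical simplified types, not the real definitions) of the trade-off: with a value-typed Status there is no nil, so "status not reported" becomes indistinguishable from the zero value:

```go
package main

import "fmt"

// Hypothetical value-typed variant of InstanceStatus, to illustrate
// the pointer-vs-value trade-off discussed in the comment above.
type InstanceState int

const (
	InstanceRunning InstanceState = iota // zero value
	InstanceCreating
	InstanceDeleting
)

type InstanceStatus struct {
	State InstanceState
}

type Instance struct {
	Id     string
	Status InstanceStatus // value, not *InstanceStatus: no nil possible
}

func main() {
	unreported := Instance{Id: "no-status-reported"} // Status left at zero value
	running := Instance{Id: "really-running", Status: InstanceStatus{State: InstanceRunning}}
	// The two statuses compare equal: the zero value silently absorbs
	// the "optional / unknown" case that a nil pointer makes explicit.
	fmt.Println(unreported.Status == running.Status) // true
}
```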

@yarinm
Contributor Author

yarinm commented Feb 13, 2024

Created a PR with the fix #6528
@x13n

@Shubham82
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Feb 14, 2024
@fmuyassarov
Member

Perhaps this can be closed, as the fix has landed in #6528?

@Shubham82
Contributor

Closing this issue, as it is resolved by PR #6528
/close

Thanks!

@k8s-ci-robot
Contributor

@Shubham82: Closing this issue.

In response to this:

Closing this issue, as it is resolved by PR #6528
/close

Thanks!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

6 participants