
Regression since 1.28 with managing unregistered nodes #6524

Closed
yarinm opened this issue Feb 12, 2024 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@yarinm
Contributor

yarinm commented Feb 12, 2024

Which component are you using?: cluster-autoscaler
What version of the component are you using?: 1.28.X and also 1.29.0
What k8s version are you using (kubectl version)?: 1.28.5
What environment is this in?: AWS but I think this bug is possibly in Azure as well
What did you expect to happen?:
We have EKS clusters that use self-managed node groups. Since upgrading to 1.28, we noticed that the autoscaler sometimes stops scaling up a node group once it detects an unregistered node that exceeded the allotted provisioning timeout. This typically happens when the node's kubelet fails to register with the API server.

When that happens we see these logs:

Scale-up timed out for node group ng-1-9af2fac after 15m5.826849861s
Disabling scale-up for node group ng-1-9af2fac until 2024-02-11 10:47:42.875469718 +0000 UTC m=+1232.844506453; errorClass=Other; errorCode=timeout
Readiness for node group ng-1-9af2fac not found
Failed to find readiness information for ng-1-9af2fac

In version 1.27.X, in this scenario, the autoscaler detects the node as unregistered, eventually removes it, and keeps scaling up additional nodes:

    I0131 09:52:17.121052       1 static_autoscaler.go:746] Removing unregistered node aws:///us-east-2b/i-036148ac351ea2dde
    I0131 09:52:17.297294       1 static_autoscaler.go:746] Removing unregistered node aws:///us-east-2a/i-07de3550c695e78f9
    I0131 09:52:17.297402       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"b2174fad-f01d-4c43-8c72-255af3e7b752", APIVersion:"v1", ResourceVersion:"27833033", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-2b/i-036148ac351ea2dde
    I0131 09:52:17.487807       1 static_autoscaler.go:746] Removing unregistered node aws:///us-east-2c/i-0a5627f0a8e056899
    I0131 09:52:17.487911       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"b2174fad-f01d-4c43-8c72-255af3e7b752", APIVersion:"v1", ResourceVersion:"27833033", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-2a/i-07de3550c695e78f9
    I0131 09:52:17.661894       1 static_autoscaler.go:413] Some unregistered nodes were removed
    I0131 09:52:17.661952       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"b2174fad-f01d-4c43-8c72-255af3e7b752", APIVersion:"v1", ResourceVersion:"27833033", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-2c/i-0a5627f0a8e056899

To reproduce this locally, I set up an EKS cluster with self-managed node groups and edited the startup script to exit 1 before running the kubelet, so the node never registers.

This is a regression from 1.27 to 1.28+, and I found the culprit commit:
e5bc070

From debugging I noticed that instance.Status is always nil (this is taken from aws_cloud_provider.go):

func (ng *AwsNodeGroup) Nodes() ([]cloudprovider.Instance, error) {
	asgNodes, err := ng.awsManager.GetAsgNodes(ng.asg.AwsRef)
	if err != nil {
		return nil, err
	}

	instances := make([]cloudprovider.Instance, len(asgNodes))

	for i, asgNode := range asgNodes {
		var status *cloudprovider.InstanceStatus
		instanceStatusString, err := ng.awsManager.GetInstanceStatus(asgNode)
		if err != nil {
			klog.V(4).Infof("Could not get instance status, continuing anyways: %v", err)
		} else if instanceStatusString != nil && *instanceStatusString == placeholderUnfulfillableStatus {
			status = &cloudprovider.InstanceStatus{
				State: cloudprovider.InstanceCreating,
				ErrorInfo: &cloudprovider.InstanceErrorInfo{
					ErrorClass:   cloudprovider.OutOfResourcesErrorClass,
					ErrorCode:    placeholderUnfulfillableStatus,
					ErrorMessage: "AWS cannot provision any more instances for this node group",
				},
			}
		}
		instances[i] = cloudprovider.Instance{
			Id:     asgNode.ProviderID,
			Status: status,
		}
	}
	return instances, nil
}

So this check never adds unregistered nodes to the list. My suggestion is to either revert that commit or change expectedToRegister to:

func expectedToRegister(instance cloudprovider.Instance) bool {
	return instance.Status == nil || (instance.Status.State != cloudprovider.InstanceDeleting && instance.Status.ErrorInfo == nil)
}
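As a sanity check, here is a self-contained sketch of how the proposed predicate classifies each case. The type definitions below are hypothetical simplified stand-ins, not the real cluster-autoscaler cloudprovider types:

```go
package main

import "fmt"

// Simplified stand-ins for the cloudprovider types (hypothetical; the
// real definitions live in cluster-autoscaler's cloudprovider package).
type InstanceState int

const (
	InstanceRunning InstanceState = iota
	InstanceCreating
	InstanceDeleting
)

type InstanceErrorInfo struct {
	ErrorCode    string
	ErrorMessage string
}

type InstanceStatus struct {
	State     InstanceState
	ErrorInfo *InstanceErrorInfo
}

type Instance struct {
	Id     string
	Status *InstanceStatus
}

// Proposed fix: a nil Status (what the AWS provider returns for
// ordinary instances) is treated as "expected to register" again.
func expectedToRegister(instance Instance) bool {
	return instance.Status == nil ||
		(instance.Status.State != InstanceDeleting && instance.Status.ErrorInfo == nil)
}

func main() {
	cases := []Instance{
		{Id: "no-status (AWS default)", Status: nil},
		{Id: "creating", Status: &InstanceStatus{State: InstanceCreating}},
		{Id: "deleting", Status: &InstanceStatus{State: InstanceDeleting}},
		{Id: "placeholder-error", Status: &InstanceStatus{
			State:     InstanceCreating,
			ErrorInfo: &InstanceErrorInfo{ErrorCode: "placeholder-cannot-be-fulfilled"},
		}},
	}
	for _, c := range cases {
		fmt.Printf("%-25s expectedToRegister=%v\n", c.Id, expectedToRegister(c))
	}
}
```

With this change, nil-Status instances are again counted as expected to register, so a stuck instance is classified as unregistered and removed after the timeout, as in 1.27.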

@azylinski @x13n

@yarinm yarinm added the kind/bug Categorizes issue or PR as related to a bug. label Feb 12, 2024
@azylinski
Contributor

Thanks @yarinm. The change to the expectedToRegister func makes sense to me. @x13n, would you agree?

@x13n
Member

x13n commented Feb 12, 2024

Yup, that certainly looks better. Thanks @yarinm for investigating this!

@x13n
Member

x13n commented Feb 12, 2024

I'd also consider changing Status to be of type InstanceStatus rather than *InstanceStatus. Currently it is documented as optional:

// Status represents status of node. (Optional)
Status *InstanceStatus

But honestly I have no idea how to interpret it then.
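For illustration, a minimal sketch (again with hypothetical simplified types, not the real definitions) of the trade-off: with a value-typed Status there is no nil, so "status not reported" becomes indistinguishable from the zero value:

```go
package main

import "fmt"

// Hypothetical value-typed variant of InstanceStatus, to illustrate
// the pointer-vs-value trade-off discussed in the comment above.
type InstanceState int

const (
	InstanceRunning InstanceState = iota // zero value
	InstanceCreating
	InstanceDeleting
)

type InstanceStatus struct {
	State InstanceState
}

type Instance struct {
	Id     string
	Status InstanceStatus // value, not *InstanceStatus: no nil possible
}

func main() {
	unreported := Instance{Id: "no-status-reported"} // Status left at zero value
	running := Instance{Id: "really-running", Status: InstanceStatus{State: InstanceRunning}}
	// The two statuses compare equal: the zero value silently absorbs
	// the "optional / unknown" case that a nil pointer makes explicit.
	fmt.Println(unreported.Status == running.Status) // true
}
```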

@yarinm
Contributor Author

yarinm commented Feb 13, 2024

Created a PR with the fix #6528
@x13n

@Shubham82
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Feb 14, 2024
@fmuyassarov
Member

Perhaps this can be closed, as the fix has landed in #6528?

@Shubham82
Contributor

Closing this issue, as it is resolved by PR #6528
/close

Thanks!

@k8s-ci-robot
Contributor

@Shubham82: Closing this issue.

In response to this:

Closing this issue, as it is resolved by PR #6528
/close

Thanks!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

6 participants