
Node groups get "stuck" during node deletion #3949

Closed
alfredkrohmer opened this issue Mar 17, 2021 · 12 comments · Fixed by #4896 or #5054

Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@alfredkrohmer
Contributor

Which component are you using?: cluster-autoscaler

What version of the component are you using?: 1.17.4

Component version: 1.17.4

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-20T02:22:41Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: AWS EKS with self-managed node groups

What happened?:

The leading cluster-autoscaler pod was apparently disrupted in the middle of draining and deleting a node. The affected node carried both the ToBeDeletedByClusterAutoscaler and DeletionCandidateOfClusterAutoscaler taints, but pods were still running on it and it was not deleted from EC2. Afterwards, the node group this node belongs to was "stuck": cluster-autoscaler no longer scaled it up even though there was a pending pod that could have run on it. According to the logs, cluster-autoscaler believed a node was coming up to run this pod, but this was not the case: the Auto Scaling Group was not scaled up, and all nodes in the ASG were registered and ready in Kubernetes:

I0317 10:43:37.632876       1 static_autoscaler.go:194] Starting main loop
I0317 10:43:38.604818       1 filter_out_schedulable.go:66] Filtering out schedulables
I0317 10:43:38.616421       1 filter_out_schedulable.go:125] Pod my-pod-that-is-pending marked as unschedulable can be scheduled on upcoming node template-node-for-my-autoscalinggroup-name-1923849245781316735-0. Ignoring in scale up.
I0317 10:43:38.616448       1 filter_out_schedulable.go:131] 1 other pods marked as unschedulable can be scheduled.
I0317 10:43:38.622681       1 filter_out_schedulable.go:131] 0 other pods marked as unschedulable can be scheduled.
I0317 10:43:38.622713       1 filter_out_schedulable.go:88] Schedulable pods present
I0317 10:43:38.622731       1 static_autoscaler.go:343] No unschedulable pods
I0317 10:43:38.622762       1 static_autoscaler.go:390] Calculating unneeded nodes

The cluster-autoscaler-status config map in kube-system showed the following for the affected node group:

Health:      Healthy (ready=18 unready=0 notStarted=0 longNotStarted=0 registered=19 longUnregistered=0 cloudProviderTarget=19 (minSize=0, maxSize=50))

So it didn't consider the affected node as ready and (I assume) thought that a node was coming up because cloudProviderTarget - ready = 19 - 18 > 0. At the same time, it didn't delete the node because pods were still running on it.

Manually removing the ToBeDeletedByClusterAutoscaler taint from the affected node resolved the situation, causing the node group to be scaled up as expected.

@alfredkrohmer alfredkrohmer added the kind/bug Categorizes issue or PR as related to a bug. label Mar 17, 2021
@yashwanthkalva

+1

It is now a recurring issue. Looking for a resolution.

@alfredkrohmer
Contributor Author

Looking at the code, the problem seems to be here:

newNodes := ar.CurrentTarget - (readiness.Ready + readiness.Unready + readiness.LongUnregistered)

readiness.Deleted is not considered. However, just adding this to the equation will most likely only solve half of the problem. The other half is that the ToBeDeleted taint is not removed from a node when the autoscaler is interrupted after it has added the taint but before it has deleted the node. There are a couple of defer sections that are supposed to remove the taint if the node could not be drained successfully, but these obviously don't run if the autoscaler is killed. Old ToBeDeleted taints are ignored and the affected nodes become eligible for deletion again; however, in the meantime the overall cluster utilization may have shifted so that the node is no longer "unneeded", in which case there are no further attempts to delete it unless the utilization changes again.
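
For illustration, here is a minimal, self-contained sketch (not the actual cluster-autoscaler code; readinessStats is a stand-in type) showing how the upcoming-node count behaves with and without readiness.Deleted, using the numbers from the status config map above:

```go
package main

import "fmt"

// Simplified stand-ins for the clusterstate readiness counters referenced
// above; the real cluster-autoscaler types differ.
type readinessStats struct {
	Ready, Unready, LongUnregistered, Deleted int
}

func main() {
	currentTarget := 19                        // cloudProviderTarget from the status config map
	r := readinessStats{Ready: 18, Deleted: 1} // the drained node counts as Deleted, not Ready/Unready

	// Current computation: the node being drained is neither Ready nor
	// Unready, so it is silently counted as an upcoming node.
	newNodes := currentTarget - (r.Ready + r.Unready + r.LongUnregistered)
	fmt.Println("upcoming (current logic):", newNodes) // 1 -> scale-up is suppressed

	// Including Deleted in the sum removes the phantom upcoming node.
	newNodesWithDeleted := currentTarget - (r.Ready + r.Unready + r.LongUnregistered + r.Deleted)
	fmt.Println("upcoming (with Deleted):", newNodesWithDeleted) // 0 -> pending pod can trigger a scale-up
}
```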

@MaciekPytel
Contributor

Cluster Autoscaler is expected to remove all ToBeDeletedByClusterAutoscaler taints when starting up: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L174. The new CA leader should execute this logic on its first autoscaling loop.
My initial guess is that the node lister was not initialized quickly enough, so CA failed to remove the taints. It looks like that logic doesn't involve any retries (a failure in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L181 would still lead to initialized = true being set). A fix may be as simple as adding a return statement in the if err block.
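
To make the control flow concrete, here is a rough, self-contained sketch of the startup cleanup path being discussed, with hypothetical names (autoscalerSketch, cleanTaintsOnFirstLoop, nodeLister) standing in for the real static_autoscaler.go code; the comment marks where the suggested early return would go:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative types only; they approximate the startup taint-cleanup flow
// in static_autoscaler.go but are not the real code.
type nodeLister interface {
	List() ([]string, error)
}

type autoscalerSketch struct {
	readyNodeLister nodeLister
	initialized     bool
}

func (a *autoscalerSketch) cleanTaintsOnFirstLoop() {
	if a.initialized {
		return
	}
	nodes, err := a.readyNodeLister.List()
	if err != nil {
		fmt.Println("Failed to list ready nodes, not cleaning up taints:", err)
		// The suggested fix: return here so initialized stays false and the
		// cleanup is retried on the next autoscaling loop instead of being
		// skipped forever.
		return
	}
	fmt.Println("removing ToBeDeletedByClusterAutoscaler taints from", len(nodes), "nodes")
	a.initialized = true
}

// failingLister simulates a node lister whose informer cache is not synced yet.
type failingLister struct{}

func (failingLister) List() ([]string, error) { return nil, errors.New("informer cache not synced") }

func main() {
	a := &autoscalerSketch{readyNodeLister: failingLister{}}
	a.cleanTaintsOnFirstLoop()
	fmt.Println("initialized:", a.initialized) // false -> cleanup will be retried
}
```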

@alfredkrohmer
Contributor Author

alfredkrohmer commented Apr 28, 2021

@MaciekPytel I cannot find this error message:

Failed to list ready nodes, not cleaning up taints: ...

anywhere in the logs, even though it should have been logged if this had been the cause, so I suppose failing at this stage wouldn't have helped much. Is it possible that the lister didn't fail but simply returned an empty list because its cache wasn't populated yet?

@alfredkrohmer alfredkrohmer changed the title Node groups can get "stuck" when node deletion is interrupted Node groups get "stuck" during node deletion May 28, 2021
@alfredkrohmer
Contributor Author

alfredkrohmer commented May 28, 2021

I think I have finally figured out the mechanism behind this bug:

  • This only happens during binpacking operations where nodes are scaled down to move their pods to other nodes. Only in this scenario can node drain / deletion be initiated while there are still pods running on the node.
  • The node draining during binpacking can take a long time if pods have a long terminationGracePeriod (several hours in our case). During this time the node group will be in the previously described "stuck" state because the "deleted" nodes are still there but not subtracted from the number of upcoming nodes - hence cluster-autoscaler thinks a node is upcoming while in fact it's being drained.
  • The ToBeDeletedByClusterAutoscaler taint was not removed on startup in our case because the affected nodes had been cordoned. Cordoned nodes are not returned by the ready-node lister, so their taints are never cleaned up on startup (see the sketch below).
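
To illustrate the last point, here is a small, self-contained sketch (the node type and listReadyNodes are stand-ins, not the real listers) of why a cordoned node is missed by a ready-only listing during startup taint cleanup, while iterating over all nodes would reach it:

```go
package main

import "fmt"

// Minimal node model for illustration; the real autoscaler works with
// *apiv1.Node objects and informer-backed listers.
type node struct {
	name          string
	ready         bool
	unschedulable bool // i.e. cordoned
	taints        []string
}

// listReadyNodes mimics a ready-only lister: cordoned nodes are filtered out,
// so a drained-but-not-yet-deleted node never shows up in its results.
func listReadyNodes(all []node) []node {
	var ready []node
	for _, n := range all {
		if n.ready && !n.unschedulable {
			ready = append(ready, n)
		}
	}
	return ready
}

func main() {
	nodes := []node{
		{name: "healthy-node", ready: true},
		{name: "half-deleted-node", ready: true, unschedulable: true,
			taints: []string{"ToBeDeletedByClusterAutoscaler"}},
	}

	// Startup cleanup based on ready nodes only never sees the cordoned node,
	// so its ToBeDeletedByClusterAutoscaler taint is left in place.
	for _, n := range listReadyNodes(nodes) {
		fmt.Println("ready-only cleanup reaches:", n.name)
	}

	// Iterating over all nodes instead would also reach the cordoned node.
	for _, n := range nodes {
		fmt.Println("all-nodes cleanup reaches:", n.name, n.taints)
	}
}
```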

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 26, 2021
@alfredkrohmer
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 14, 2021

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2021
@alfredkrohmer
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2021

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 14, 2022
@alfredkrohmer
Contributor Author

/remove-lifecycle stale

