
Node groups get "stuck" during node deletion #3949

Closed
alfredkrohmer opened this issue Mar 17, 2021 · 12 comments · Fixed by #4896 or #5054

Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@alfredkrohmer
Contributor

Which component are you using?: cluster-autoscaler

What version of the component are you using?: 1.17.4

Component version: 1.17.4

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-20T02:22:41Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: AWS EKS with self-managed node groups

What happened?:

The leading cluster-autoscaler pod was apparently disrupted in the middle of draining and deleting a node. The affected node carried both the ToBeDeletedByClusterAutoscaler and DeletionCandidateOfClusterAutoscaler taints, but pods were still running on it and it was not deleted from EC2. Afterwards, the node group this node belongs to was "stuck": cluster-autoscaler no longer scaled it up even though there was a pending pod that could have run on it. According to the logs, cluster-autoscaler believed a node was coming up to run this pod, but this was not the case: the Auto Scaling Group was not scaled up, and all nodes in the ASG were registered and ready in Kubernetes:

I0317 10:43:37.632876       1 static_autoscaler.go:194] Starting main loop
I0317 10:43:38.604818       1 filter_out_schedulable.go:66] Filtering out schedulables
I0317 10:43:38.616421       1 filter_out_schedulable.go:125] Pod my-pod-that-is-pending marked as unschedulable can be scheduled on upcoming node template-node-for-my-autoscalinggroup-name-1923849245781316735-0. Ignoring in scale up.
I0317 10:43:38.616448       1 filter_out_schedulable.go:131] 1 other pods marked as unschedulable can be scheduled.
I0317 10:43:38.622681       1 filter_out_schedulable.go:131] 0 other pods marked as unschedulable can be scheduled.
I0317 10:43:38.622713       1 filter_out_schedulable.go:88] Schedulable pods present
I0317 10:43:38.622731       1 static_autoscaler.go:343] No unschedulable pods
I0317 10:43:38.622762       1 static_autoscaler.go:390] Calculating unneeded nodes

The cluster-autoscaler-status config map in kube-system showed the following for the affected node group:

Health:      Healthy (ready=18 unready=0 notStarted=0 longNotStarted=0 registered=19 longUnregistered=0 cloudProviderTarget=19 (minSize=0, maxSize=50))

So it didn't consider the affected node as ready and (I assume) thought that a node was coming up because cloudProviderTarget - ready = 19 - 18 > 0. At the same time, it didn't delete the node because pods were still running on it.

Manually removing the ToBeDeletedByClusterAutoscaler taint from the affected node resolved the situation, causing the node group to be scaled up as expected.

@alfredkrohmer alfredkrohmer added the kind/bug Categorizes issue or PR as related to a bug. label Mar 17, 2021
@yashwanthkalva

+1

It is now a recurring issue. Looking for a resolution.

@alfredkrohmer
Contributor Author

Looking at the code, the problem seems to be here:

newNodes := ar.CurrentTarget - (readiness.Ready + readiness.Unready + readiness.LongUnregistered)

readiness.Deleted is not considered. However, just adding this to the equation will most likely only solve half of the problem. The other half is that the ToBeDeleted taint is not removed from a node when the autoscaler is interrupted after it has added the taint but before it has deleted the node. There are a couple of defer sections that are supposed to remove the taint if the node could not be drained successfully, but these obviously don't run if the autoscaler is killed. Old ToBeDeleted taints are ignored and the affected nodes become eligible for deletion again; however, in the meantime the overall cluster utilization may have shifted so that the node is no longer "unneeded", in which case there are no further attempts to delete it unless the utilization changes again.
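
For illustration, here is a minimal, self-contained sketch (not the actual cluster-autoscaler code; readinessStats is a stand-in type) showing how the upcoming-node count behaves with and without readiness.Deleted, using the numbers from the status config map above:

```go
package main

import "fmt"

// Simplified stand-ins for the clusterstate readiness counters referenced
// above; the real cluster-autoscaler types differ.
type readinessStats struct {
	Ready, Unready, LongUnregistered, Deleted int
}

func main() {
	currentTarget := 19                        // cloudProviderTarget from the status config map
	r := readinessStats{Ready: 18, Deleted: 1} // the drained node counts as Deleted, not Ready/Unready

	// Current computation: the node being drained is neither Ready nor
	// Unready, so it is silently counted as an upcoming node.
	newNodes := currentTarget - (r.Ready + r.Unready + r.LongUnregistered)
	fmt.Println("upcoming (current logic):", newNodes) // 1 -> scale-up is suppressed

	// Including Deleted in the sum removes the phantom upcoming node.
	newNodesWithDeleted := currentTarget - (r.Ready + r.Unready + r.LongUnregistered + r.Deleted)
	fmt.Println("upcoming (with Deleted):", newNodesWithDeleted) // 0 -> pending pod can trigger a scale-up
}
```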

@MaciekPytel
Contributor

Cluster Autoscaler is expected to remove all ToBeDeletedByClusterAutoscaler taints when starting up: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L174. The new CA leader should execute this logic on its first autoscaling loop.
My initial guess is that the node lister was not initialized quickly enough, so CA failed to remove the taints. It looks like that logic doesn't involve any retries (a failure in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L181 would still lead to initialized = true being set). A fix may be as simple as adding a return statement in the if err block.
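
To make the control flow concrete, here is a rough, self-contained sketch of the startup cleanup path being discussed, with hypothetical names (autoscalerSketch, cleanTaintsOnFirstLoop, nodeLister) standing in for the real static_autoscaler.go code; the comment marks where the suggested early return would go:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative types only; they approximate the startup taint-cleanup flow
// in static_autoscaler.go but are not the real code.
type nodeLister interface {
	List() ([]string, error)
}

type autoscalerSketch struct {
	readyNodeLister nodeLister
	initialized     bool
}

func (a *autoscalerSketch) cleanTaintsOnFirstLoop() {
	if a.initialized {
		return
	}
	nodes, err := a.readyNodeLister.List()
	if err != nil {
		fmt.Println("Failed to list ready nodes, not cleaning up taints:", err)
		// The suggested fix: return here so initialized stays false and the
		// cleanup is retried on the next autoscaling loop instead of being
		// skipped forever.
		return
	}
	fmt.Println("removing ToBeDeletedByClusterAutoscaler taints from", len(nodes), "nodes")
	a.initialized = true
}

// failingLister simulates a node lister whose informer cache is not synced yet.
type failingLister struct{}

func (failingLister) List() ([]string, error) { return nil, errors.New("informer cache not synced") }

func main() {
	a := &autoscalerSketch{readyNodeLister: failingLister{}}
	a.cleanTaintsOnFirstLoop()
	fmt.Println("initialized:", a.initialized) // false -> cleanup will be retried
}
```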

@alfredkrohmer
Contributor Author

alfredkrohmer commented Apr 28, 2021

@MaciekPytel I cannot find this error message:

Failed to list ready nodes, not cleaning up taints: ...

anywhere in the logs, even though it should have been logged if this had been the cause, so I suppose failing at this stage wouldn't have helped much. Is it possible that the lister didn't fail but simply returned an empty list because its cache wasn't populated yet?

@alfredkrohmer alfredkrohmer changed the title Node groups can get "stuck" when node deletion is interrupted Node groups get "stuck" during node deletion May 28, 2021
@alfredkrohmer
Contributor Author

alfredkrohmer commented May 28, 2021

I think I have finally figured out the mechanism behind this bug:

  • This only happens during binpacking operations where nodes are scaled down to move their pods to other nodes. Only in this scenario can node drain / deletion be initiated while there are still pods running on the node.
  • The node draining during binpacking can take a long time if pods have a long terminationGracePeriod (several hours in our case). During this time the node group will be in the previously described "stuck" state because the "deleted" nodes are still there but not subtracted from the number of upcoming nodes - hence cluster-autoscaler thinks a node is upcoming while in fact it's being drained.
  • The ToBeDeletedByClusterAutoscaler taint was not removed on startup in our case because the affected nodes had been cordoned. Cordoned nodes are not returned by the ready-node lister, so their taints are never cleaned up on startup (see the sketch below).
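
To illustrate the last point, here is a small, self-contained sketch (the node type and listReadyNodes are stand-ins, not the real listers) of why a cordoned node is missed by a ready-only listing during startup taint cleanup, while iterating over all nodes would reach it:

```go
package main

import "fmt"

// Minimal node model for illustration; the real autoscaler works with
// *apiv1.Node objects and informer-backed listers.
type node struct {
	name          string
	ready         bool
	unschedulable bool // i.e. cordoned
	taints        []string
}

// listReadyNodes mimics a ready-only lister: cordoned nodes are filtered out,
// so a drained-but-not-yet-deleted node never shows up in its results.
func listReadyNodes(all []node) []node {
	var ready []node
	for _, n := range all {
		if n.ready && !n.unschedulable {
			ready = append(ready, n)
		}
	}
	return ready
}

func main() {
	nodes := []node{
		{name: "healthy-node", ready: true},
		{name: "half-deleted-node", ready: true, unschedulable: true,
			taints: []string{"ToBeDeletedByClusterAutoscaler"}},
	}

	// Startup cleanup based on ready nodes only never sees the cordoned node,
	// so its ToBeDeletedByClusterAutoscaler taint is left in place.
	for _, n := range listReadyNodes(nodes) {
		fmt.Println("ready-only cleanup reaches:", n.name)
	}

	// Iterating over all nodes instead would also reach the cordoned node.
	for _, n := range nodes {
		fmt.Println("all-nodes cleanup reaches:", n.name, n.taints)
	}
}
```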

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 26, 2021
@alfredkrohmer
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 14, 2021

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2021
@alfredkrohmer
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2021

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 14, 2022
@alfredkrohmer
Contributor Author

/remove-lifecycle stale

