
Cluster Autoscaler Forgets Nodes Scheduled for Deletion during Restart #5048

Open
jabdoa2 opened this issue Jul 25, 2022 · 36 comments
Labels
area/cluster-autoscaler, area/core-autoscaler (Denotes an issue that is related to the core autoscaler and is not specific to any provider.), kind/bug (Categorizes issue or PR as related to a bug.)

Comments

@jabdoa2

jabdoa2 commented Jul 25, 2022

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

  • 1.20.2
  • 1.22.3

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.15", GitCommit:"8f1e5bf0b9729a899b8df86249b56e2c74aebc55", GitTreeState:"clean", BuildDate:"2022-01-19T17:23:01Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS using Kops

What did you expect to happen?:

When cluster-autoscaler selects a node for deletion, it cordons the node and then deletes it after 10 minutes, regardless of circumstances.

What happened instead?:

When cluster-autoscaler is restarted (typically due to scheduling) it "forgets" about the cordoned node. We end up with nodes which are unused and no longer considered by cluster-autoscaler. We have seen this happen multiple times in different clusters. It always (and only) happens when cluster-autoscaler restarts after tainting/cordoning a node.

How to reproduce it (as minimally and precisely as possible):

  1. Wait for cluster-autoscaler to select and mark a node for deletion
  2. After cluster-autoscaler has cordoned the node, delete the cluster-autoscaler pod
  3. Cluster-autoscaler will be recreated (and usually the other cluster-autoscaler pod will take over)
  4. The cordoned node stays there forever
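
A quick way to spot a node stuck in this state (a hedged sketch; the ToBeDeletedByClusterAutoscaler taint key is what cluster-autoscaler applies in the versions we checked, but verify it on yours):

      # list cordoned nodes and check for the autoscaler's deletion taint
      kubectl get nodes | grep SchedulingDisabled
      kubectl describe node <node-name> | grep -A3 Taints
      # a stuck node typically still carries the ToBeDeletedByClusterAutoscaler taint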

Anything else we need to know?:

Config:

      ./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/xxxx
      --balance-similar-node-groups=false
      --cordon-node-before-terminating=true
      --ignore-daemonsets-utilization=true
      --ignore-mirror-pods-utilization=true
      --logtostderr=true
      --scale-down-utilization-threshold=0.99
      --skip-nodes-with-local-storage=false
      --stderrthreshold=info
      --v=4

Log on "old" pod instance:

scale_down.go:791] xxxx was unneeded for 9m52.964516731s
static_autoscaler.go:503] Scale down status: unneededOnly=false lastScaleUpTime=2022-07-25 13:02:50.704606468 +0000 UTC m=+17851.084374773 lastScaleDownDeleteTime=2022-07-25 13:17:25.415659793 +0000 UTC m=+18725.795428101 lastScaleDownFailTime=2022-07-25 13:02:50.704606636 +0000 UTC m=+17851.084374939 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
static_autoscaler.go:516] Starting scale down

Logs after the node has been "forgotten" in the new pod instance:

scale_down.go:407] Skipping xxxx from delete consideration - the node is currently being deleted
[...] # two hours later
scale_down.go:427] Node xxxx - memory utilization 0.000000S
static_autoscaler.go:492] xxxx is unneeded since 2022-07-25 13:28:13.284790432 +0000 UTC m=+1511.821407617 duration 2h8m33.755684792s

Autoscaler clearly still "sees" the node but it does not act on it anymore.

@jabdoa2 jabdoa2 added the kind/bug Categorizes issue or PR as related to a bug. label Jul 25, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2022
@jabdoa2
Author

jabdoa2 commented Oct 31, 2022

Issue still exists.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2023
@jabdoa2
Author

jabdoa2 commented Jan 29, 2023

Still exists and happening multiple times per week for us.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2023
@rcjsuen

rcjsuen commented Feb 9, 2023

Still exists and happening multiple times per week for us.

@jabdoa2 I am just a random person on GitHub, but I was wondering what version of the Cluster Autoscaler you are using now. You mentioned 1.20.2 when you opened the bug last year. Have you updated since then? We use 1.22.3 ourselves, so I am wondering if this is something we should keep an eye on as well.

Thank you for your information.

@jabdoa2
Author

jabdoa2 commented Feb 9, 2023

We have updated to 1.22 by now. The issue still persists.

You can work around it by running the autoscaler on nodes which are not scaled by the autoscaler (e.g. a master or a dedicated node group). However, the issue still occurs when those nodes are upgraded or experience disruptions for other reasons. It is still 100% reproducible on all our clusters if you delete the autoscaler within the 10-minute grace period before it deletes a node. We strongly recommend monitoring for nodes which have been cordoned for more than a few minutes: they prevent scale-ups in that node group later on and cost you money without any benefit.
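
A simple check for such nodes (a hedged sketch; the spec.unschedulable field selector and the taint key are assumptions to verify against your Kubernetes and autoscaler versions):

      # list cordoned nodes together with their taint keys
      kubectl get nodes --field-selector spec.unschedulable=true \
        -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key
      # as far as we can tell, the value of the ToBeDeletedByClusterAutoscaler taint is
      # the Unix timestamp at which the node was marked, which an alert can compare to "now"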

You might also want to monitor for instances which are no longer part of the cluster, which were an issue earlier. However, we have not seen this recently, as the autoscaler seems to remove those nodes after a few hours (if they still have the correct tags).

@vadasambar
Member

This issue can be seen in 1.21.x as well.
cluster-autoscaler sees the cordoned node and logs a message saying it's unneeded (as described in the issue description). It also considers the cordoned node as a possible destination for unschedulable pods when it runs simulations for scale-up. If the unschedulable pod can be scheduled on the cordoned node, cluster-autoscaler gives up on bringing up a new node. This leaves the pod stuck in Pending state forever, because it can't actually be scheduled on the cordoned node and cluster-autoscaler won't bring up a new node either.

@vadasambar
Member

As a short term solution, removing the cordoned node manually fixes the issue.
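
For reference, the manual cleanup usually looks something like this (a hedged sketch; node and instance names are placeholders, and the taint key, drain flags, and AWS CLI call are assumptions to verify for your setup):

      # option A: hand the node back to the scheduler
      kubectl taint nodes <node-name> ToBeDeletedByClusterAutoscaler-
      kubectl uncordon <node-name>

      # option B: actually remove the node (AWS example)
      kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
      kubectl delete node <node-name>
      aws autoscaling terminate-instance-in-auto-scaling-group \
        --instance-id <instance-id> --should-decrement-desired-capacity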

@vadasambar
Member

I wasn't able to reproduce the problem with the steps in the description. I wonder if it happens only sometimes or maybe I am doing something wrong.

@jabdoa2
Author

jabdoa2 commented Mar 9, 2023

For us this happens 100% reliably in multiple clusters. At which step did it behave differently for you?

@vadasambar
Member

@jabdoa2 I used slightly different flags in an attempt to perform the test quickly:

--scale-down-unneeded-time=1m
--unremovable-node-recheck-timeout=1m

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca

@vadasambar
Member

Maybe the issue only shows up with the default values, since you are using the defaults. When I saw the issue on my end, default values were being used as well.

@vadasambar
Member

vadasambar commented Mar 10, 2023

I was able to reproduce the issue without having to revert to the default flags. The trick is to kill the CA pod just before the scale-down-unneeded-time timeout expires (i.e. right before the last iteration of the scale-down loop, which runs every 10 seconds by default). Timing is important here, which makes it harder to reproduce manually.

New CA pod was able to delete the node after some time. :(
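
If you want to try reproducing this yourself, a rough way to approximate the timing (a hedged sketch; the namespace and label selector are assumptions that depend on how cluster-autoscaler is deployed):

      # watch for a node approaching the scale-down-unneeded-time threshold ...
      kubectl -n kube-system logs -f deployment/cluster-autoscaler | grep "was unneeded for"
      # ... and delete the CA pod just before the threshold is reached
      kubectl -n kube-system delete pod -l app=cluster-autoscaler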

@vadasambar
Member

vadasambar commented Mar 10, 2023

I noticed this issue happens when the cluster-autoscaler pod tries to scale down the node it's running on: it drains itself off the node and leaves the node in a cordoned and tainted state.

The real problem starts when a new cluster-autoscaler pod comes up: it sees an unschedulable pod and thinks it can schedule that pod on the cordoned and tainted node. This disables scale down and puts it into cooldown, effectively skipping the code that does the actual scale down until the cooldown is lifted (which never happens, because the unschedulable pod can never be scheduled on a cordoned and tainted node it doesn't tolerate).

@vadasambar
Member

vadasambar commented Mar 10, 2023

It seems like cluster-autoscaler doesn't consider the tainted and cordoned state of the node when running simulations.

@vadasambar
Member

One quick fix for this can be to make sure the cluster-autoscaler pod is never drained from the node on which it is running. This can be done by adding a strict PDB (e.g. maxUnavailable: 0) or by making sure the cluster-autoscaler pod satisfies the criteria for blocking the drain of the node it is running on.
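
A minimal sketch of such a PDB, assuming the cluster-autoscaler pods run in kube-system and carry an app=cluster-autoscaler label (adjust the namespace and selector to your deployment):

      # block all voluntary evictions of the cluster-autoscaler pods
      kubectl -n kube-system create poddisruptionbudget cluster-autoscaler \
        --selector=app=cluster-autoscaler --max-unavailable=0

With maxUnavailable set to 0, voluntary evictions of the cluster-autoscaler pod are rejected, so the node it runs on should not be drained out from under it.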

@jabdoa2
Author

jabdoa2 commented Mar 10, 2023

It seems like cluster-autoscaler doesn't consider the tainted and cordoned state of the node when running simulations.

Yeah, it reports the node but simply never acts on it. It looks weird, and it can cause a lot of havoc in your cluster when important workloads can no longer be scheduled.

@jabdoa2
Author

jabdoa2 commented Mar 10, 2023

One quick fix for this can be to make sure the cluster-autoscaler pod is never drained from the node on which it is running. This can be done by adding a strict PDB (e.g. maxUnavailable: 0) or by making sure the cluster-autoscaler pod satisfies the criteria for blocking the drain of the node it is running on.

It helps most of the time. You can also run the autoscaler on the master nodes or set safe-to-evict: false. But even with that, we have seen this bug when rolling the cluster using kops or during other disruptions (such as spot instance or maintenance node removal).
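
For reference, safe-to-evict is a pod annotation; one hedged way to set it on an existing deployment (the deployment name and namespace here are assumptions):

      # mark the cluster-autoscaler pods as not safe to evict during scale down
      kubectl -n kube-system patch deployment cluster-autoscaler --type merge -p \
        '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}}}}'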

@vadasambar
Member

vadasambar commented Mar 10, 2023

@jabdoa2 a dedicated nodegroup with taints so that nothing else gets scheduled on it except cluster-autoscaler should solve the issue (for all cases I think) until we've a better solution.

@jabdoa2
Author

jabdoa2 commented Mar 10, 2023

@jabdoa2 a dedicated nodegroup with taints so that nothing else gets scheduled on it except cluster-autoscaler should solve the issue (for all cases I think) until we've a better solution.

Unless you roll that group last during a cluster upgrade or you experience disruptions of any kind ;-). So it won't happen in the happy case, but when things go south this tends to persist the breakage and prevent clusters from recovering (e.g. because you can no longer schedule to a certain AZ).

@zaafar

zaafar commented Mar 10, 2023

The real problem starts when a new cluster-autoscaler pod comes up: it sees an unschedulable pod and thinks it can schedule that pod on the cordoned and tainted node.

Sounds like this is the root cause and should be fixed.

@vadasambar
Member

vadasambar commented Mar 13, 2023

Unless you roll that group last during a cluster upgrade or you experience disruptions of any kind ;-).

Sorry, I am not sure I understand this fully.

My understanding is,

  1. We will have multiple cluster-autoscaler replicas with PDB
  2. These replicas would be scheduled on a dedicated nodegroup where only cluster-autoscaler pods are scheduled (this can be achieved with taints)
  3. In case of any disruptions, say the node where cluster-autoscaler was running goes down OR cluster-autoscaler itself scales down the node, other replica can take over and scale down the node properly. Note that the node scale down issue happens only when there is a pending pod and cluster-autoscaler thinks it can schedule the pending pod on the cordoned node. This wouldn't happen if we run only cluster-autoscaler on a dedicated nodegroup. This is because cluster-autoscaler won't think the pending pods can be scheduled on the cordoned node since it has taints.

Do you see any problem with this approach (just trying to understand what I am missing)

Sounds like this is the root cause and should be fixed.

Agreed.

@jabdoa2
Author

jabdoa2 commented Mar 13, 2023

Unless you roll that group last during a cluster upgrade or you experience disruptions of any kind ;-).

Sorry, I am not sure I understand this fully.

My understanding is,

1. We will have multiple cluster-autoscaler replicas with PDB

2. These replicas would be scheduled on a dedicated nodegroup where only cluster-autoscaler pods are scheduled (this can be achieved with taints)

3. In case of any disruptions, say the node where cluster-autoscaler was running goes down OR cluster-autoscaler itself scales down the node, other replica can take over and scale down the node properly. Note that the node scale down issue happens only when there is a pending pod and cluster-autoscaler thinks it can schedule the pending pod on the cordoned node. This wouldn't happen if we run only cluster-autoscaler on a dedicated nodegroup. This is because cluster-autoscaler won't think the pending pods can be scheduled on the cordoned node since it has taints.

Do you see any problem with this approach (just trying to understand what I am missing)

The issue can still happen in other node groups. If a scale down is ongoing and a disruption hits the current autoscaler, there is a chance this will happen. You can make those disruptions less likely with a dedicated node group or by running the autoscaler on the master nodes, but that only reduces the chance. Rolling node groups, upgrading the autoscaler, or node disruptions still trigger this. We have a few clusters which use spot instances and scale a lot, so it keeps happening.

@vadasambar
Member

vadasambar commented Mar 13, 2023

I see the problem with the solution I proposed. Thanks for explaining.

@vadasambar
Member

Brought this up in the SIG meeting today. Based on discussion with @MaciekPytel, there seem to be 2 ways of going about fixing this:

  1. Make cluster-autoscaler remove all taints when it restarts
  2. Fix the code around scale-up simulation so that it considers taints/cordoned state of the node

@vadasambar
Member

Related issue: #4456

Looks like the problem might be fixed in version 1.26 of cluster-autoscaler: #5054

@vadasambar
Member

We would need another PR on top of #5054 as explained in #5054 (comment) to actually fix the issue.

@vadasambar
Member

vadasambar commented Mar 14, 2023

We have logic for removing all taints and uncordoning the nodes every time cluster-autoscaler restarts, but it is not called when --cordon-node-before-terminating=true is used, because the logic that lists the nodes to be untainted doesn't consider cordoned nodes (ref2, ref3). All of the links point to the 1.21 commit of cluster-autoscaler. Not sure if the issue still persists in the master branch.

If the flag is removed, taints should be removed for all nodes every time the cluster-autoscaler pod restarts.
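
Until that is fixed, the cleanup the restart logic is supposed to do can be approximated manually in bulk (a hedged sketch; the taint key matches what cluster-autoscaler uses in the versions we looked at, and note this also uncordons nodes you may have cordoned for other reasons):

      # strip the autoscaler's deletion taint from every node and uncordon them
      for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
        kubectl taint nodes "$node" ToBeDeletedByClusterAutoscaler- || true   # ignores nodes without the taint
        kubectl uncordon "$node"
      done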

@fookenc
Contributor

fookenc commented Mar 14, 2023

Hi @vadasambar, I'm not sure if it solves the issue mentioned, but there was a separate PR #5200. This was merged last year in September. It changed the behavior so that taints should be removed from all nodes instead of only those that were Ready. I've been reviewing the code to check, and the NewAllNodeLister doesn't appear to have a filter set, which should mean it targets all nodes, if I'm understanding correctly. Please correct me if I've misunderstood.

@vadasambar
Member

vadasambar commented Mar 15, 2023

@fookenc thanks for replying. Looks like we've already fixed the issue in 1.26 :)
I was looking at #4211 which had similar code and thought we decided not to merge it.

I've been reviewing the code to check, and the NewAllNodeLister doesn't appear to have a filter set which should target all nodes, if I'm understanding correctly. Please correct if I've misunderstood.

You are right. Your PR should fix the issue mentioned in #5048 (comment) i.e., the problem described in the description of this issue.

There is an overarching issue around scale up preventing scale down because CA thinks it can schedule pods on an existing node (when it can't, because the node has taints or is cordoned), for which we already have your PR #5054 merged. My understanding is that implementing those interfaces for a specific cloud provider should fix the issue in that cloud provider.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 13, 2023
@jabdoa2
Author

jabdoa2 commented Jun 13, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 13, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@jabdoa2
Author

jabdoa2 commented Jan 22, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@towca towca added the area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. label Mar 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2024
@jabdoa2
Author

jabdoa2 commented Jun 19, 2024

/remove-lifecycle stale

Bug still exists

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2024