
Cluster Autoscaler Forgets Nodes Scheduled for Deletion during Restart #5048

Open
jabdoa2 opened this issue Jul 25, 2022 · 36 comments
Labels
area/cluster-autoscaler, area/core-autoscaler (Denotes an issue that is related to the core autoscaler and is not specific to any provider.), kind/bug (Categorizes issue or PR as related to a bug.)

Comments

@jabdoa2

jabdoa2 commented Jul 25, 2022

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

  • 1.20.2
  • 1.22.3

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.15", GitCommit:"8f1e5bf0b9729a899b8df86249b56e2c74aebc55", GitTreeState:"clean", BuildDate:"2022-01-19T17:23:01Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS using Kops

What did you expect to happen?:

When cluster-autoscaler selects a node for deletion, it cordons the node and then deletes it after 10 minutes, regardless of circumstances.

What happened instead?:

When cluster-autoscaler is restarted (typically due to scheduling) it "forgets" about the cordoned node. We end up with nodes which are unused and no longer considered by cluster-autoscaler. We have seen this happen multiple times in different clusters. It always (and only) happens when cluster-autoscaler restarts after tainting/cordoning a node.

How to reproduce it (as minimally and precisely as possible):

  1. Wait for cluster-autoscaler to select and mark a node for deletion
  2. After cluster-autoscaler has cordoned the node, delete the cluster-autoscaler pod
  3. Cluster-autoscaler will be recreated (and usually the other cluster-autoscaler pod will take over)
  4. The cordoned node stays there forever
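
A quick way to spot a node stuck in this state (a hedged sketch; the ToBeDeletedByClusterAutoscaler taint key is what cluster-autoscaler applies in the versions we checked, but verify it on yours):

      # list cordoned nodes and check for the autoscaler's deletion taint
      kubectl get nodes | grep SchedulingDisabled
      kubectl describe node <node-name> | grep -A3 Taints
      # a stuck node typically still carries the ToBeDeletedByClusterAutoscaler taint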

Anything else we need to know?:

Config:

      ./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/xxxx
      --balance-similar-node-groups=false
      --cordon-node-before-terminating=true
      --ignore-daemonsets-utilization=true
      --ignore-mirror-pods-utilization=true
      --logtostderr=true
      --scale-down-utilization-threshold=0.99
      --skip-nodes-with-local-storage=false
      --stderrthreshold=info
      --v=4

Log on "old" pod instance:

scale_down.go:791] xxxx was unneeded for 9m52.964516731s
static_autoscaler.go:503] Scale down status: unneededOnly=false lastScaleUpTime=2022-07-25 13:02:50.704606468 +0000 UTC m=+17851.084374773 lastScaleDownDeleteTime=2022-07-25 13:17:25.415659793 +0000 UTC m=+18725.795428101 lastScaleDownFailTime=2022-07-25 13:02:50.704606636 +0000 UTC m=+17851.084374939 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
static_autoscaler.go:516] Starting scale down

Logs after the node has been "forgotten" in the new pod instance:

scale_down.go:407] Skipping xxxx from delete consideration - the node is currently being deleted
[...] # two hours later
scale_down.go:427] Node xxxx - memory utilization 0.000000S
static_autoscaler.go:492] xxxx is unneeded since 2022-07-25 13:28:13.284790432 +0000 UTC m=+1511.821407617 duration 2h8m33.755684792s

Autoscaler clearly still "sees" the node but it does not act on it anymore.

@jabdoa2 jabdoa2 added the kind/bug Categorizes issue or PR as related to a bug. label Jul 25, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2022
@jabdoa2
Author

jabdoa2 commented Oct 31, 2022

Issue still exists.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2023
@jabdoa2
Author

jabdoa2 commented Jan 29, 2023

Still exists and happening multiple times per week for us.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2023
@rcjsuen

rcjsuen commented Feb 9, 2023

Still exists and happening multiple times per week for us.

@jabdoa2 I am just a random person on GitHub, but I was wondering what version of the Cluster Autoscaler you are using now. You mentioned 1.20.2 when you opened the bug last year. Have you updated since then? We use 1.22.3 ourselves, so I am wondering if this is something we should keep an eye on as well.

Thank you for your information.

@jabdoa2
Author

jabdoa2 commented Feb 9, 2023

We have updated to 1.22 by now. The issue still persists.

You can work around it by running the autoscaler on nodes which are not scaled by the autoscaler (e.g. a master or a dedicated node group). However, the issue still occurs when those nodes are upgraded or experience disruptions for other reasons. It is still 100% reproducible on all our clusters if you delete the autoscaler within the 10-minute grace period before it deletes a node. We strongly recommend monitoring for nodes which have been cordoned for more than a few minutes: they prevent scale-ups in that node group later on and cost you money without any benefit.
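
A simple check for such nodes (a hedged sketch; the spec.unschedulable field selector and the taint key are assumptions to verify against your Kubernetes and autoscaler versions):

      # list cordoned nodes together with their taint keys
      kubectl get nodes --field-selector spec.unschedulable=true \
        -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key
      # as far as we can tell, the value of the ToBeDeletedByClusterAutoscaler taint is
      # the Unix timestamp at which the node was marked, which an alert can compare to "now"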

You might also want to monitor for instances which are no longer part of the cluster, which were an issue earlier. However, we have not seen this recently, as the autoscaler seems to remove those nodes after a few hours (if they still have the correct tags).

@vadasambar
Member

This issue can be seen in 1.21.x as well.
cluster-autoscaler sees the cordoned node and logs a message saying it's unneeded (as described in the issue description). It also considers the cordoned node as a possible destination for unschedulable pods when it runs simulations for scale-up. If the unschedulable pod can be scheduled on the cordoned node, cluster-autoscaler gives up on bringing up a new node. This leaves the pod stuck in Pending state forever, because it can't actually be scheduled on the cordoned node and cluster-autoscaler won't bring up a new node either.

@vadasambar
Member

As a short term solution, removing the cordoned node manually fixes the issue.
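
For reference, the manual cleanup usually looks something like this (a hedged sketch; node and instance names are placeholders, and the taint key, drain flags, and AWS CLI call are assumptions to verify for your setup):

      # option A: hand the node back to the scheduler
      kubectl taint nodes <node-name> ToBeDeletedByClusterAutoscaler-
      kubectl uncordon <node-name>

      # option B: actually remove the node (AWS example)
      kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
      kubectl delete node <node-name>
      aws autoscaling terminate-instance-in-auto-scaling-group \
        --instance-id <instance-id> --should-decrement-desired-capacity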

@vadasambar
Member

I wasn't able to reproduce the problem with the steps in the description. I wonder if it happens only sometimes or maybe I am doing something wrong.

@jabdoa2
Author

jabdoa2 commented Mar 9, 2023

For us this happens 100% reliably in multiple clusters. At which step did it behave differently for you?

@vadasambar
Member

@jabdoa2 I used slightly different flags in an attempt to perform the test quickly:

--scale-down-unneeded-time=1m
--unremovable-node-recheck-timeout=1m

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca

@vadasambar
Member

Maybe the issue only shows up with the default values, since you are using the defaults. When I saw the issue on my end, default values were being used as well.

@vadasambar
Member

vadasambar commented Mar 10, 2023

I was able to reproduce the issue without having to revert to the default flags. The trick is to kill the CA pod just before the scale-down-unneeded-time timeout expires (i.e. right before the last iteration of the scale-down loop, which runs every 10 seconds by default). Timing is important here, which makes it harder to reproduce manually.

New CA pod was able to delete the node after some time. :(
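
If you want to try reproducing this yourself, a rough way to approximate the timing (a hedged sketch; the namespace and label selector are assumptions that depend on how cluster-autoscaler is deployed):

      # watch for a node approaching the scale-down-unneeded-time threshold ...
      kubectl -n kube-system logs -f deployment/cluster-autoscaler | grep "was unneeded for"
      # ... and delete the CA pod just before the threshold is reached
      kubectl -n kube-system delete pod -l app=cluster-autoscaler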

@vadasambar
Member

vadasambar commented Mar 10, 2023

I noticed this issue happens when the cluster-autoscaler pod tries to scale down the node it's running on: it drains itself off the node and leaves the node in a cordoned and tainted state.

The real problem starts when a new cluster-autoscaler pod comes up: it sees an unschedulable pod and thinks it can schedule that pod on the cordoned and tainted node. This disables scale down and puts it into cooldown, effectively skipping the code that does the actual scale down until the cooldown is lifted (which never happens, because the unschedulable pod can never be scheduled on a cordoned and tainted node it doesn't tolerate).

@vadasambar
Member

vadasambar commented Mar 10, 2023

It seems like cluster-autoscaler doesn't consider the tainted and cordoned state of the node when running simulations.

@vadasambar
Member

One quick fix for this can be to make sure the cluster-autoscaler pod is never drained from the node on which it is running. This can be done by adding a strict PDB (e.g. maxUnavailable: 0) or by making sure the cluster-autoscaler pod satisfies the criteria for blocking the drain of the node it is running on.
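
A minimal sketch of such a PDB, assuming the cluster-autoscaler pods run in kube-system and carry an app=cluster-autoscaler label (adjust the namespace and selector to your deployment):

      # block all voluntary evictions of the cluster-autoscaler pods
      kubectl -n kube-system create poddisruptionbudget cluster-autoscaler \
        --selector=app=cluster-autoscaler --max-unavailable=0

With maxUnavailable set to 0, voluntary evictions of the cluster-autoscaler pod are rejected, so the node it runs on should not be drained out from under it.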

@jabdoa2
Author

jabdoa2 commented Mar 10, 2023

It seems like cluster-autoscaler doesn't consider the tainted and cordoned state of the node when running simulations.

Yeah, it reports the node but simply never acts on it. It looks weird, and it can cause a lot of havoc in your cluster when important workloads can no longer be scheduled.

@jabdoa2
Author

jabdoa2 commented Mar 10, 2023

One quick fix for this can be to make sure the cluster-autoscaler pod is never drained from the node on which it is running. This can be done by adding a strict PDB (e.g. maxUnavailable: 0) or by making sure the cluster-autoscaler pod satisfies the criteria for blocking the drain of the node it is running on.

It helps most of the time. You can also run the autoscaler on the master nodes or set safe-to-evict: false. But even with that, we have seen this bug when rolling the cluster using kops or during other disruptions (such as spot instance or maintenance node removal).
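
For reference, safe-to-evict is a pod annotation; one hedged way to set it on an existing deployment (the deployment name and namespace here are assumptions):

      # mark the cluster-autoscaler pods as not safe to evict during scale down
      kubectl -n kube-system patch deployment cluster-autoscaler --type merge -p \
        '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}}}}'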

@vadasambar
Member

vadasambar commented Mar 10, 2023

@jabdoa2 a dedicated nodegroup with taints so that nothing else gets scheduled on it except cluster-autoscaler should solve the issue (for all cases I think) until we've a better solution.

@jabdoa2
Author

jabdoa2 commented Mar 10, 2023

@jabdoa2 a dedicated nodegroup with taints so that nothing else gets scheduled on it except cluster-autoscaler should solve the issue (for all cases I think) until we've a better solution.

Unless you roll that group last during a cluster upgrade or you experience disruptions of any kind ;-). So it won't happen in the happy case, but when things go south this tends to persist the breakage and prevent clusters from recovering (e.g. because you can no longer schedule to a certain AZ).

@zaafar

zaafar commented Mar 10, 2023

The real problem starts when a new cluster-autoscaler pod comes up: it sees an unschedulable pod and thinks it can schedule that pod on the cordoned and tainted node.

Sounds like this is the root cause and should be fixed.

@vadasambar
Member

vadasambar commented Mar 13, 2023

Unless you roll that group last during a cluster upgrade or you experience disruptions of any kind ;-).

Sorry, I am not sure I understand this fully.

My understanding is,

  1. We will have multiple cluster-autoscaler replicas with PDB
  2. These replicas would be scheduled on a dedicated nodegroup where only cluster-autoscaler pods are scheduled (this can be achieved with taints)
  3. In case of any disruptions, say the node where cluster-autoscaler was running goes down OR cluster-autoscaler itself scales down the node, other replica can take over and scale down the node properly. Note that the node scale down issue happens only when there is a pending pod and cluster-autoscaler thinks it can schedule the pending pod on the cordoned node. This wouldn't happen if we run only cluster-autoscaler on a dedicated nodegroup. This is because cluster-autoscaler won't think the pending pods can be scheduled on the cordoned node since it has taints.

Do you see any problem with this approach (just trying to understand what I am missing)

Sounds like this is the root cause and should be fixed.

Agreed.

@jabdoa2
Author

jabdoa2 commented Mar 13, 2023

Unless you roll that group last during a cluster upgrade or you experience disruptions of any kind ;-).

Sorry, I am not sure I understand this fully.

My understanding is,

1. We will have multiple cluster-autoscaler replicas with PDB

2. These replicas would be scheduled on a dedicated nodegroup where only cluster-autoscaler pods are scheduled (this can be achieved with taints)

3. In case of any disruptions, say the node where cluster-autoscaler was running goes down OR cluster-autoscaler itself scales down the node, other replica can take over and scale down the node properly. Note that the node scale down issue happens only when there is a pending pod and cluster-autoscaler thinks it can schedule the pending pod on the cordoned node. This wouldn't happen if we run only cluster-autoscaler on a dedicated nodegroup. This is because cluster-autoscaler won't think the pending pods can be scheduled on the cordoned node since it has taints.

Do you see any problem with this approach (just trying to understand what I am missing)

The issue can still happen in other node groups. If a scale down is ongoing and a disruption hits the current autoscaler, there is a chance this will happen. You can make those disruptions less likely with a dedicated node group or by running the autoscaler on the master nodes, but that only reduces the chance. Rolling node groups, upgrading the autoscaler, or node disruptions still trigger this. We have a few clusters which use spot instances and scale a lot, so it keeps happening.

@vadasambar
Member

vadasambar commented Mar 13, 2023

I see the problem with the solution I proposed. Thanks for explaining.

@vadasambar
Member

Brought this up in the SIG meeting today. Based on discussion with @MaciekPytel, there seem to be 2 ways of going about fixing this:

  1. Make cluster-autoscaler remove all taints when it restarts
  2. Fix the code around scale-up simulation so that it considers taints/cordoned state of the node

@vadasambar
Member

Related issue: #4456

Looks like the problem might be fixed in version 1.26 of cluster-autoscaler: #5054

@vadasambar
Member

We would need another PR on top of #5054 as explained in #5054 (comment) to actually fix the issue.

@vadasambar
Member

vadasambar commented Mar 14, 2023

We have logic for removing all taints and uncordoning the nodes every time cluster-autoscaler restarts, but it is not called when --cordon-node-before-terminating=true is used, because the logic that lists the nodes to be untainted doesn't consider cordoned nodes (ref2, ref3). All of the links point to the 1.21 commit of cluster-autoscaler. Not sure if the issue still persists in the master branch.

If the flag is removed, taints should be removed for all nodes every time the cluster-autoscaler pod restarts.
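
Until that is fixed, the cleanup the restart logic is supposed to do can be approximated manually in bulk (a hedged sketch; the taint key matches what cluster-autoscaler uses in the versions we looked at, and note this also uncordons nodes you may have cordoned for other reasons):

      # strip the autoscaler's deletion taint from every node and uncordon them
      for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
        kubectl taint nodes "$node" ToBeDeletedByClusterAutoscaler- || true   # ignores nodes without the taint
        kubectl uncordon "$node"
      done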

@fookenc
Contributor

fookenc commented Mar 14, 2023

Hi @vadasambar, I'm not sure if it solves the issue mentioned, but there was a separate PR #5200. This was merged last year in September. It changed the behavior so that taints should be removed from all nodes instead of only those that were Ready. I've been reviewing the code to check, and the NewAllNodeLister doesn't appear to have a filter set, which should mean it targets all nodes, if I'm understanding correctly. Please correct me if I've misunderstood.

@vadasambar
Member

vadasambar commented Mar 15, 2023

@fookenc thanks for replying. Looks like we've already fixed the issue in 1.26 :)
I was looking at #4211 which had similar code and thought we decided not to merge it.

I've been reviewing the code to check, and the NewAllNodeLister doesn't appear to have a filter set which should target all nodes, if I'm understanding correctly. Please correct if I've misunderstood.

You are right. Your PR should fix the issue mentioned in #5048 (comment) i.e., the problem described in the description of this issue.

There is an overarching issue around scale up preventing scale down because CA thinks it can schedule pods on an existing node (when it can't, because the node has taints or is cordoned), for which we already have your PR #5054 merged. My understanding is that implementing those interfaces for a specific cloud provider should fix the issue in that cloud provider.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 13, 2023
@jabdoa2
Author

jabdoa2 commented Jun 13, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 13, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@jabdoa2
Author

jabdoa2 commented Jan 22, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@towca towca added the area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. label Mar 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2024
@jabdoa2
Author

jabdoa2 commented Jun 19, 2024

/remove-lifecycle stale

Bug still exists

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2024