Improve Node Utilization by Avoiding "safe-to-evict" Annotation for Druid-Managed Pods #766
Comments
When evicting a member pod from a clustered etcd, there are a few points to keep in mind:
To avoid a worst-case scenario, i.e. a long restoration, I would suggest letting backup-restore take the scheduled full snapshot (which is also performed during the maintenance window) before evicting a single-node etcd-main pod.
It will probably not amount to much/anything, but (1) could be done right away (compactor jobs), right? @ishan16696 How would you suggest doing that? This is about cost savings, cluster-autoscaler, evictions, PDBs. You do not have a real hook into these things, so in order for the above to be implemented, you would need to take control of everything (put the PDB to 3 minAvailable / 0 maxUnavailable and let druid do the job if it sees the "signs"), right? But you already run ETCD with a PDB of 2 minAvailable / 1 maxUnavailable (for clustered ETCD), so you as a team already decided to go with a flexible PDB, have you not?
@vlerenc Yes, but what does flexible mean here? The PDB allows for at most 1 pod to be down/evicted. So the PDB should most certainly block a pod eviction if another pod is already unhealthy/down, shouldn't it?
@ishan16696 but where would this precondition be? Druid doesn't have any say in the eviction of an etcd pod or the draining of a node. Those events can occur at any time, and druid really can't block them. Well, maybe it's possible if it puts a finalizer on the pods themselves and doesn't remove the finalizer unless it allows the pod to be evicted. This could possibly be another part of the scope for druid-controlled pod updates, maybe? @unmarshall
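A minimal sketch of the finalizer idea mentioned above, assuming a hypothetical finalizer name (druid does not do this today):

```yaml
# Hypothetical sketch only: druid could add a finalizer to the etcd pods so
# that, even if an eviction or node drain deletes a pod, the deletion is not
# completed until druid removes the finalizer (i.e. until it allows the eviction).
apiVersion: v1
kind: Pod
metadata:
  name: etcd-main-0                        # illustrative pod name
  finalizers:
    - druid.gardener.cloud/block-eviction  # hypothetical finalizer
```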
@ishan16696 if the node being scaled down is the one hosting the leader, then there would definitely be a leader election, no doubt. But the optimisation here would be for druid to manually move leadership from this pod to another, and then unblock the pod for deletion.
@shreyas-s-rao What I mean is that you already allow for 1 voluntary eviction (for clustered ETCDs). So, you made this decision already and are OK with it / that it may happen (in general). All I suggested, because I saw that PDB, was to drop the CA annotation, which only influences the CA. You already OKed voluntary evictions in general in the PDB. Or in other words: if there are general concerns with voluntary evictions (which is what PDBs govern, and which would be respected during scale-in of a node up until the timeout), why did we define a PDB allowing a voluntary eviction of 1?
Yes, I would guess so. Without this flexibility of allowing 1 replica to be evicted, it would be a nightmare to scale down any nodes at all, irrespective of the CA annotation.
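For reference, the PDB shape being discussed for a 3-member etcd would look roughly like this (names and labels are illustrative, not the exact manifest druid generates):

```yaml
# Illustrative only: minAvailable: 2 on a 3-member etcd permits at most one
# voluntary eviction at a time, which is the "flexibility" referred to above.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-main            # hypothetical name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      instance: etcd-main    # hypothetical label
```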
For a multi-node etcd cluster there is no need for this annotation to be present, since we already have a PDB defined which allows 1 eviction at the most. Today, for a single-node etcd cluster, we set this annotation.
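For reference, the annotation in question looks like this on the pod metadata (sketch; the surrounding StatefulSet fields are omitted):

```yaml
# The annotation discussed in this issue, as it would appear on a pod (or on a
# compaction job's pod template). It only influences the cluster-autoscaler.
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```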
Do we really need this? Leader election can happen at any time during the lifecycle of an etcd cluster. So how is a leader election caused by the CA scaling down a node any different?
So I checked the way the disruption controller computes available pods: it checks the Ready condition of the pod. See countHealthyPods, where it checks the Ready condition and, based on the current healthy and desired healthy counts, updates the PDB status. Now, the Ready condition of the pod is determined by the ready conditions of all of its containers, which is set by the kubelet. For each individual container a worker is created, whose run method is invoked as a goroutine that periodically runs the probes. The result of the readiness probe is then used to update the container status, which then, collectively across all containers, constitutes the pod readiness. Well, it seems I went down a rabbit hole just to make a point. So the point is that the readiness of the etcd pods is what the PDB status is ultimately based on.
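To make that chain concrete, the Ready condition ultimately comes from a container readiness probe along these lines (values are illustrative, not the actual etcd pod spec):

```yaml
# Illustrative readiness probe: the kubelet runs it periodically, its result
# sets the container (and thus the pod) Ready condition, and the disruption
# controller counts Ready pods against the PDB when deciding on evictions.
containers:
  - name: etcd
    readinessProbe:
      httpGet:
        path: /healthz   # hypothetical endpoint
        port: 8080       # hypothetical port
      periodSeconds: 5
      failureThreshold: 3
```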
In an out-of-band discussion between myself, @unmarshall, @ishan16696, @seshachalam-yv and @renormalize, we concluded the following:
@vlerenc PTAL.
Why would that happen? The annotation only influences the CA, nothing else, no?
Just a side note on this topic:
This is true, until you set unhealthyPodEvictionPolicy: AlwaysAllow on the PDB. We do that for many of the gardener-managed components, but not for etcd, AFAIK.
Thanks for sharing this, as I was unaware of this feature. Unhealthy translates to the Ready condition being false. For an etcd cluster, if there is a loss of quorum, then the Ready condition for all pods will change to false. This then allows even the healthier pod (1 out of 3) to also be evicted. So in case of a permanent quorum loss this would not have much of a side effect. But I am wondering how this will impact transient quorum-loss situations. Today we do not set this policy.
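Assuming the feature in question is the PDB's unhealthyPodEvictionPolicy, the relevant spec change would look like this (sketch only):

```yaml
# Sketch: with AlwaysAllow, pods whose Ready condition is false can be evicted
# even when the PDB budget would otherwise block the eviction, which is the
# quorum-loss concern discussed above.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-main                          # hypothetical name
spec:
  minAvailable: 2
  unhealthyPodEvictionPolicy: AlwaysAllow
  selector:
    matchLabels:
      instance: etcd-main                  # hypothetical label
```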
PDB behavior and CA behavior are meant to tackle two different problems entirely. One aims to maintain the minimum desired replicas for multi-node etcds, in order to keep them quorate, irrespective of what state the underlying nodes are in. Hence PDBs are necessary for multi-node etcds, and the CA does not adversely affect the quorum of a multi-node etcd. Setting the unhealthyPodEvictionPolicy field discussed above only affects how unhealthy pods are treated, not this basic behavior. On the other hand, the CA aims to control node lifecycle, including the draining of nodes that run single-node etcds on them, and expects that an etcd pod should not block the draining of a node indefinitely. PDBs play no part for single-node etcds, since a single-node etcd pod must be rolled sometime or another, leading to a temporary unavailability of the etcd. Here, setting the safe-to-evict annotation is what currently prevents the CA from draining such nodes.
Meeting minutes from an out-of-band discussion between myself, @vlerenc, @unmarshall, @ishan16696, @renormalize, @seshachalam-yv and @ashwani2k:
Action items:
At the moment, it wouldn't be worth switching HA on for all etcds.
That's all guessing. I was assuming that the ...

Then again, most ETCDs still have HVPA configured (HVPA is removed only for shoots on seeds ...).
So, in this case, we would actually reduce our costs by 25K, but that's comparing apples to ...

So, right now, we don't have the data/confidence to decide one way or another, i.e. at the moment we would neither decide to make all ...

What we can do: go forward with the HVPA removal and clarify the downscaling question that @voelzmo will bring up:
Once we have that in, we can collect the data and run the prediction again. Then we have a proper foundation to decide what we are going to do with the ETCD worker pool and the ...

WDYT?
/assign
How to categorize this issue?
/kind enhancement
Stories
What would you like to be added
- Do not add the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation to the etcd-main compactor jobs (just because it's in the etcd-main spec)
- Do not add the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation to the etcd-main pods for the clustered etcd-main statefulset
- Either switch all clusters to clustered etcd-main (except for evaluation clusters) and remove the annotation in general (cost-prohibitive) ...or consider a more elaborate approach for how to deal with the singleton etcd-main statefulset:
Implementation details:
Motivation (Why is this needed?)
We have adopters seeing 80-90% requests utilization on the regular seed cluster nodes, but only 40-50% for their dedicated ETCD seed cluster nodes.