Improve Node Utilization by Avoiding "safe-to-evict" Annotation for Druid-Managed Pods #766
Comments
When evicting a member pod from a clustered etcd, there are a few points to keep in mind:
To avoid a worst-case scenario, i.e. a long restoration, I would suggest letting backup-restore take the scheduled full snapshot (which is also performed during the maintenance window) before evicting a single-node etcd-main pod.
It will probably not amount to much/anything, but (1) could be done right away (compactor jobs), right? @ishan16696 How would you suggest doing that? This is about cost savings, cluster-autoscaler, evictions, PDBs. You do not have a real hook into these things, so in order for the above to be implemented, you would need to take control of everything (put the PDB to 3 minAvailable / 0 maxUnavailable and let druid do the job if it sees the "signs"), right? But you already run ETCD with a PDB of 2 minAvailable / 1 maxUnavailable (for clustered ETCD), so you as a team already decided to go with a flexible PDB, have you not?
@vlerenc Yes, but what does flexible mean here? The PDB allows for at most 1 pod to be down/evicted. So the PDB should most certainly block a pod eviction if another pod is already unhealthy/down, shouldn't it?
@ishan16696 but where would this precondition be? Druid doesn't have any say in the eviction of an etcd pod or the draining of a node. Those events can occur at any time, and druid really can't block them. Well, maybe it's possible if it puts a finalizer on the pods themselves and doesn't remove the finalizer unless it allows the pod to be evicted. This could possibly be another part of the scope for druid-controlled pod updates, maybe? @unmarshall
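A minimal sketch of the finalizer idea mentioned above, assuming a hypothetical finalizer name (druid does not do this today):

```yaml
# Hypothetical sketch only: druid could add a finalizer to the etcd pods so
# that, even if an eviction or node drain deletes a pod, the deletion is not
# completed until druid removes the finalizer (i.e. until it allows the eviction).
apiVersion: v1
kind: Pod
metadata:
  name: etcd-main-0                        # illustrative pod name
  finalizers:
    - druid.gardener.cloud/block-eviction  # hypothetical finalizer
```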
@ishan16696 if the node being scaled down is the one hosting the leader, then there would definitely be a leader election, no doubt. But the optimisation here would be for druid to manually move leadership from this pod to another, and then unblock the pod for deletion.
@shreyas-s-rao What I mean is that you already allow for 1 voluntary eviction (for clustered ETCDs). So, you made this decision already and are OK with it / that it may happen (in general). All I suggested, because I saw that PDB, was to drop the CA annotation, which only influences the CA. You already OKed voluntary evictions in general in the PDB. Or in other words: if there are general concerns with voluntary evictions (which is what PDBs govern, and which would be respected during scale-in of a node up until the timeout), why did we define a PDB allowing a voluntary eviction of 1?
Yes, I would guess so. Without this flexibility of allowing 1 replica to be evicted, it would be a nightmare to scale down any nodes at all, irrespective of the CA annotation.
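For reference, the PDB shape being discussed for a 3-member etcd would look roughly like this (names and labels are illustrative, not the exact manifest druid generates):

```yaml
# Illustrative only: minAvailable: 2 on a 3-member etcd permits at most one
# voluntary eviction at a time, which is the "flexibility" referred to above.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-main            # hypothetical name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      instance: etcd-main    # hypothetical label
```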
For a multi-node etcd cluster there is no need for this annotation to be present, since we already have a PDB defined which allows 1 eviction at the most. Today, for a single-node etcd cluster, we set this annotation.
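For reference, the annotation in question looks like this on the pod metadata (sketch; the surrounding StatefulSet fields are omitted):

```yaml
# The annotation discussed in this issue, as it would appear on a pod (or on a
# compaction job's pod template). It only influences the cluster-autoscaler.
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```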
Do we really need this? Leader election can happen at any time during the lifecycle of an etcd cluster. So how is a leader election caused by the CA scaling down a node any different?
So I checked the way the disruption controller computes available pods: it checks the Ready condition of the pod. See countHealthyPods, where it checks the Ready condition and, based on the current healthy and desired healthy counts, updates the PDB status. Now, the Ready condition of the pod is determined by the ready conditions of all of its containers, which is set by the kubelet. For each individual container a worker is created, whose run method is invoked as a goroutine that periodically runs the probes. The result of the readiness probe is then used to update the container status, which then, collectively across all containers, constitutes the pod readiness. Well, it seems I went down a rabbit hole just to make a point. So the point is that the readiness of the etcd pods is what the PDB status is ultimately based on.
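To make that chain concrete, the Ready condition ultimately comes from a container readiness probe along these lines (values are illustrative, not the actual etcd pod spec):

```yaml
# Illustrative readiness probe: the kubelet runs it periodically, its result
# sets the container (and thus the pod) Ready condition, and the disruption
# controller counts Ready pods against the PDB when deciding on evictions.
containers:
  - name: etcd
    readinessProbe:
      httpGet:
        path: /healthz   # hypothetical endpoint
        port: 8080       # hypothetical port
      periodSeconds: 5
      failureThreshold: 3
```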
In an out-of-band discussion between myself, @unmarshall, @ishan16696, @seshachalam-yv and @renormalize, we concluded the following:
@vlerenc PTAL.
Why would that happen? The annotation only influences the CA, nothing else, no?
Just a side note on this topic:
This is true, until you set unhealthyPodEvictionPolicy: AlwaysAllow on the PDB. We do that for many of the gardener-managed components, but not for etcd, AFAIK.
Thanks for sharing this, as I was unaware of this feature. Unhealthy translates to the Ready condition being false. For an etcd cluster, if there is a loss of quorum, then the Ready condition for all pods will change to false. This then allows even the healthier pod (1 out of 3) to also be evicted. So in case of a permanent quorum loss this would not have much of a side effect. But I am wondering how this will impact transient quorum-loss situations. Today we do not set this policy.
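Assuming the feature in question is the PDB's unhealthyPodEvictionPolicy, the relevant spec change would look like this (sketch only):

```yaml
# Sketch: with AlwaysAllow, pods whose Ready condition is false can be evicted
# even when the PDB budget would otherwise block the eviction, which is the
# quorum-loss concern discussed above.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-main                          # hypothetical name
spec:
  minAvailable: 2
  unhealthyPodEvictionPolicy: AlwaysAllow
  selector:
    matchLabels:
      instance: etcd-main                  # hypothetical label
```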
PDB behavior and CA behavior are meant to tackle two different problems entirely. One aims to maintain the minimum desired replicas for multi-node etcds, in order to keep them quorate, irrespective of what state the underlying nodes are in. Hence PDBs are necessary for multi-node etcds, and the CA does not adversely affect the quorum of a multi-node etcd. Setting the unhealthyPodEvictionPolicy field discussed above only affects how unhealthy pods are treated, not this basic behavior. On the other hand, the CA aims to control node lifecycle, including the draining of nodes that run single-node etcds on them, and expects that an etcd pod should not block the draining of a node indefinitely. PDBs play no part for single-node etcds, since a single-node etcd pod must be rolled sometime or another, leading to a temporary unavailability of the etcd. Here, setting the safe-to-evict annotation is what currently prevents the CA from draining such nodes.
Meeting minutes from an out-of-band discussion between myself, @vlerenc, @unmarshall, @ishan16696, @renormalize, @seshachalam-yv and @ashwani2k:
Action items:
At the moment, it wouldn't be worth switching HA on for all etcds.
That's all guessing. I was assuming that the ...

Then again, most ETCDs still have HVPA configured (HVPA is removed only for shoots on seeds ...).
So, in this case, we would actually reduce our costs by 25K, but that's comparing apples to ...

So, right now, we don't have the data/confidence to decide one way or another, i.e. at the moment we would neither decide to make all ...

What we can do: go forward with the HVPA removal and clarify the downscaling question that @voelzmo will bring up:
Once we have that in, we can collect the data and run the prediction again. Then we have a proper foundation to decide what we are going to do with the ETCD worker pool and the ...

WDYT?
/assign
How to categorize this issue?
/kind enhancement
Stories
What would you like to be added
- Do not add the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation to the etcd-main compactor jobs (just because it's in the etcd-main spec)
- Do not add the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation to the etcd-main pods for the clustered etcd-main statefulset
- Either switch all clusters to clustered etcd-main (except for evaluation clusters) and remove the annotation in general (cost-prohibitive) ...or consider a more elaborate approach for how to deal with the singleton etcd-main statefulset:
Implementation details:
Motivation (Why is this needed?)
We have adopters seeing 80-90% requests utilization on the regular seed cluster nodes, but only 40-50% for their dedicated ETCD seed cluster nodes.