NTH should issue lifecycle heartbeats #493

Open
jrsherry opened this issue Oct 6, 2021 · 14 comments
Labels
Priority: High, stalebot-ignore, Status: Help Wanted, Type: Enhancement

Comments


jrsherry commented Oct 6, 2021

I've been using the NTH in queue processor mode. This implementation uses a lifecycle hook associated with the node instance to trigger the NTH to cordon/drain. Lifecycle hooks support two timeouts; the global timeout (max 48hrs) and the heartbeat timeout (max 7200 seconds).
https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_LifecycleHook.html

This means that if NTH doesn't issue lifecycle heartbeats during the draining process, the node will be terminated (assuming CONTINUE on timeout rather than ABANDON) once the hook's heartbeat timeout elapses, which is at most 7200 seconds.

This is problematic if you have termination grace periods that can exceed 7200 seconds: the node will be terminated before the pods can be safely evicted.

If NTH issued lifecycle heartbeats during the node drain, this would effectively support grace periods extending up to the 48-hour global timeout.
https://docs.aws.amazon.com/cli/latest/reference/autoscaling/record-lifecycle-action-heartbeat.html
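For reference, the API behind that CLI command is RecordLifecycleActionHeartbeat. A minimal Go sketch of the call NTH would need to make periodically while draining (using aws-sdk-go-v2, which may not be the SDK version NTH uses; the ASG, hook, and instance names are placeholders, not taken from NTH's code):

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	asgClient := autoscaling.NewFromConfig(cfg)

	// Resets the hook's heartbeat timeout for this instance, keeping it in
	// Terminating:Wait (up to the 48h global timeout). All names are placeholders.
	_, err = asgClient.RecordLifecycleActionHeartbeat(ctx, &autoscaling.RecordLifecycleActionHeartbeatInput{
		AutoScalingGroupName: aws.String("my-asg"),
		LifecycleHookName:    aws.String("my-termination-hook"),
		InstanceId:           aws.String("i-0123456789abcdef0"),
	})
	if err != nil {
		log.Fatalf("failed to record lifecycle heartbeat: %v", err)
	}
}
```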

@bwagner5 bwagner5 added the Type: Enhancement New feature or request label Oct 8, 2021

bwagner5 commented Oct 8, 2021

Interesting, I hadn't expected nodes taking hours to drain. We can look into adding the heartbeat to cover the long draining case. Thanks for reporting!

@jillmon jillmon added Priority: Low This issue will not be seen by most users. The issue is a very specific use case or corner case stalebot-ignore To NOT let the stalebot update or close the Issue / PR labels Oct 19, 2021
@gabegorelick

I would ask that this be made configurable so that installs that intentionally use the heartbeat timeout as the limiting factor (which is how NTH works now) can still do so. For cases where NTH can't reliably determine whether progress is being made (which would probably be most of the time?), the heartbeat would be counterproductive if you want to time out before the global timeout.


varkey commented Nov 16, 2021

We'd be interested in this as well; we have some applications which we run with a graceful termination threshold of up to 3 hours. In most cases the pods terminate gracefully well before the 3-hour mark, but there are cases where it can take close to 3 hours.

> For cases where NTH can't reliably determine if progress is being made (which would probably be most of the time?), the heartbeat would be counterproductive if you want to timeout before the global timeout.

Yes, a configurable maximum period for sending heartbeats would be helpful. NTH could then be configured to send heartbeats for a given period (3 hours in our case) while the lifecycle hook's heartbeat timeout is set to around 5 minutes; that way the node would still be terminated shortly after the heartbeats stop.
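A rough sketch of that scheme (a hypothetical helper, not NTH code; the names, interval, and cap are illustrative): send a heartbeat on a fixed interval until the drain finishes or a configured maximum duration is reached, so a short hook heartbeat timeout still cleans the node up once the loop stops.

```go
package drain

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

// sendHeartbeats resets the lifecycle hook's heartbeat timeout every interval
// until drainDone is closed or maxDuration elapses. Once it returns, the
// (short) heartbeat timeout on the hook expires and the ASG proceeds with
// its default action for the instance.
func sendHeartbeats(ctx context.Context, client *autoscaling.Client,
	asg, hook, instanceID string, interval, maxDuration time.Duration,
	drainDone <-chan struct{}) {

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	deadline := time.After(maxDuration)

	for {
		select {
		case <-ctx.Done():
			return
		case <-drainDone: // drain finished; stop heartbeating
			return
		case <-deadline: // configured cap (e.g. 3h) reached; let the hook time out
			return
		case <-ticker.C:
			_, err := client.RecordLifecycleActionHeartbeat(ctx, &autoscaling.RecordLifecycleActionHeartbeatInput{
				AutoScalingGroupName: aws.String(asg),
				LifecycleHookName:    aws.String(hook),
				InstanceId:           aws.String(instanceID),
			})
			if err != nil {
				log.Printf("heartbeat for %s failed: %v", instanceID, err)
			}
		}
	}
}
```

The interval would need to stay well under the hook's heartbeat timeout, e.g. heartbeating every 1–2 minutes against a 5-minute timeout.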

@jillmon jillmon added Priority: High This issue will be seen by most users and removed Priority: Low This issue will not be seen by most users. The issue is a very specific use case or corner case labels Nov 17, 2021

jillmon commented Nov 17, 2021

It looks like this issue has a good amount of interest. We would absolutely be open to accepting a PR for this, but right now we are focusing on the next version of NTH (V2). In V2 we hope to eliminate the need for adding so many additional configurations and solve a number of other issues on this repository.

@gabegorelick

> right now we are focusing on the next version of NTH (V2)

Is there any documentation on V2, and specifically for this issue how it would handle heartbeats?

@stevehipwell

@snay2 @cjerad is this implemented in v2?


cjerad commented Nov 3, 2022

@stevehipwell Currently, v2 does not issue heartbeats.

@stevehipwell

@cjerad is that a conscious decision, or is it something you'd like to do if you had the resources to implement it?


cjerad commented Nov 16, 2022

@stevehipwell It just hasn't been investigated yet.

@stevehipwell

@cjerad are you looking for contributions?

@LikithaVemulapalli LikithaVemulapalli self-assigned this Jul 25, 2023

0x91 commented Nov 8, 2023

bump? it's been years since there was a mention of V2

@LikithaVemulapalli LikithaVemulapalli removed their assignment Nov 8, 2023

riuvshyn commented Nov 21, 2023

any update on this? 🙏🏽
This is a very important feature for running stateful workloads like Kafka clusters, where graceful rotation of nodes is critical and also quite slow.


riuvshyn commented Nov 23, 2023

> Interesting, I hadn't expected nodes taking hours to drain. We can look into adding the heartbeat to cover the long draining case. Thanks for reporting!

@bwagner5 This problem affects not only pods that take a long time to terminate, but also any workload with many pods spread by podAntiAffinity plus a PDB that limits how many pods can be terminated at once.
With these constraints, every node running such pods is counting down against the 2h timeout while the PDB blocks evicting multiple pods at the same time, so progress depends on how quickly each evicted pod is replaced and lets the next node proceed with eviction. If start-up or termination takes longer than usual, or AWS capacity issues prevent new nodes from being scheduled, there is a high chance that a significant number of nodes pending termination hit the 2h timeout and are killed without draining, causing downtime for the workloads.

This is a pretty serious limitation which has to be considered before using NTH in large clusters.


AndiDog commented Dec 11, 2024

I noticed that if NTH receives ASG lifecycle events, spot interruptions, and EC2 state change events via SQS, it may still leave spot instances in Terminating:Wait because it doesn't receive an ASG event for every interrupted spot instance. Maybe that's because of a missing event on the AWS ASG side – I haven't checked yet. So the spot instance gets drained due to the interruption event, but stays around until the heartbeat timeout runs out.

By implementing heartbeat sending, we'd gain the ability to set a low heartbeat timeout (= instances not lingering around forever if NTH isn't running or didn't get an event) while also using the ASG lifecycle hook's "global timeout" to cap the maximum time NTH can keep an instance alive. Not sure whether the spot+ASG case above could be much improved as well.
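To sketch how those two timeouts would split responsibilities (again a hypothetical helper, not NTH's actual flow): heartbeats keep the instance in Terminating:Wait while eviction runs, the happy path completes the hook as soon as the drain succeeds, and if NTH is down or never saw the event, the low heartbeat timeout ends the wait state on its own, with the global timeout as the hard cap.

```go
package drain

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

// finishLifecycleAction tells the ASG that draining is done so the instance
// terminates immediately instead of waiting out the heartbeat or global
// timeout. If this is never called (NTH down, event lost), the short
// heartbeat timeout ends the wait state once heartbeats stop arriving.
func finishLifecycleAction(ctx context.Context, client *autoscaling.Client,
	asg, hook, instanceID string) error {

	_, err := client.CompleteLifecycleAction(ctx, &autoscaling.CompleteLifecycleActionInput{
		AutoScalingGroupName:  aws.String(asg),
		LifecycleHookName:     aws.String(hook),
		InstanceId:            aws.String(instanceID),
		LifecycleActionResult: aws.String("CONTINUE"),
	})
	return err
}
```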
