NTH should issue lifecycle heartbeats #493

jrsherry · 2021-10-06T11:26:53Z

I've been using the NTH in queue processor mode. This implementation uses a lifecycle hook associated with the node instance to trigger the NTH to cordon/drain. Lifecycle hooks support two timeouts; the global timeout (max 48hrs) and the heartbeat timeout (max 7200 seconds).
https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_LifecycleHook.html

This means that if the NTH doesn't issue lifecycle heartbeats during the draining process, the node will be terminated (assuming CONTINUE on timeout vice ABANDON) within 7200 seconds (whatever the hook's heartbeat timeout is configured to).

This is problematic if you've got termination grace periods that can exceed 7200 seconds. The node will be terminated before the pod can safely evict.

If the NTH was issuing lifecycle heartbeats during the node drain, then this would effectively support grace periods that extend to the 48 hour global timeout.
https://docs.aws.amazon.com/cli/latest/reference/autoscaling/record-lifecycle-action-heartbeat.html

bwagner5 · 2021-10-08T02:11:34Z

Interesting, I hadn't expected nodes taking hours to drain. We can look into adding the heartbeat to cover the long draining case. Thanks for reporting!

gabegorelick · 2021-10-28T19:25:57Z

I would ask that this be made configurable so that installs that intentionally use the heartbeat timeout as the limiting factor (which is how NTH works now) can still do so. For cases where NTH can't reliably determine if progress is being made (which would probably be most of the time?), the heartbeat would be counterproductive if you want to timeout before the global timeout.

varkey · 2021-11-16T15:47:01Z

We'd be interested in this as well, we have some applications which we run with a graceful termination threshold of upto 3 hours. In most cases the pods gracefully terminate much before the 3 hours but there would be cases where it can take close to 3 hours for the pods to terminate gracefully.

For cases where NTH can't reliably determine if progress is being made (which would probably be most of the time?), the heartbeat would be counterproductive if you want to timeout before the global timeout.

Yes, a configurable max period for sending heartbeats would be helpful. So NTH can probably be configured to send heartbeats for a given period (3 hours in our case) and the lifecycle heartbeat timeout can be set to 5 min or so, this way the node would be terminated anyway once the heartbeats stop.

jillmon · 2021-11-17T19:41:45Z

It looks like this issue has a good amount of interest. We would absolutely be open to accepting a PR for this, but right now we are focusing on the next version of NTH (V2). In V2 we hope to eliminate the need for adding so many additional configurations and solve a number of other issues on this repository.

gabegorelick · 2021-11-17T19:48:18Z

right now we are focusing on the next version of NTH (V2)

Is there any documentation on V2, and specifically for this issue how it would handle heartbeats?

stevehipwell · 2022-11-03T17:02:59Z

@snay2 @cjerad is this implemented in v2?

cjerad · 2022-11-03T18:20:30Z

@stevehipwell Currently, v2 does not issue heartbeats.

stevehipwell · 2022-11-03T18:37:26Z

@cjerad is that a conscious decision or is it something you'd like to do if you had to resource to implement it?

cjerad · 2022-11-16T20:08:43Z

@stevehipwell It just hasn't been investigated yet.

stevehipwell · 2022-11-16T21:49:39Z

@cjerad are you looking for contributions?

0x91 · 2023-11-08T12:17:18Z

bump? it's been years since there was a mention of V2

riuvshyn · 2023-11-21T10:06:30Z

any update on this? 🙏🏽
This is very important feature required for running stateful workloads like kafka clusters where graceful rotation of nodes is critical and also quite slow.

riuvshyn · 2023-11-23T18:07:34Z

Interesting, I hadn't expected nodes taking hours to drain. We can look into adding the heartbeat to cover the long draining case. Thanks for reporting!

@bwagner5 This problem affects not only pods that are taking long time to terminate but also this affects any workload where many pods involved with a podAntiAffinity + pdb limiting termination multiple pods at once.
So with these constraints if all nodes where such pods are running falling under 2h timeout due to PDB not allowing terminate many pods at the same time, so it really depends on how quickly pod is being replaced and allows next node to proceed with eviction. Sometimes when for some reason start up or termination might take longer than usual or hitting AWS capacity issues so new nodes can't be scheduled there is a hight chance to hit 2h time out by significant amount of pending termination nodes and they will be killed without draining which will cause down time for the workloads.

This is pretty serious limitation which has to be considered before using NTH in large clusters.

AndiDog · 2024-12-11T07:54:29Z

I noticed that if NTH receives ASG lifecycle events, spot interruptions and EC2 state change events via SQS, it may still leave spot instances in Terminating:Wait because it doesn't receive an ASG event for all interrupted spot instances. Maybe that's because of a missing event from AWS ASG side – I didn't check yet. So the spot instance would be drained due to the interruption event, but stay around until the heartbeat timeout runs out.

By implementing sending of heartbeats, we'd get the advantage of setting a low heartbeat timeout (= instances not lingering around forever if NTH isn't running or didn't get an event) but also make use of the ASG lifecycle hook "global timeout" to set the maximum time NTH can keep an instance alive. Not sure if the above spot+ASG case could be much improved as well.

bwagner5 added the Type: Enhancement New feature or request label Oct 8, 2021

jillmon added Priority: Low This issue will not be seen by most users. The issue is a very specific use case or corner case stalebot-ignore To NOT let the stalebot update or close the Issue / PR labels Oct 19, 2021

jillmon added Priority: High This issue will be seen by most users and removed Priority: Low This issue will not be seen by most users. The issue is a very specific use case or corner case labels Nov 17, 2021

jillmon added the Status: Help Wanted label Nov 17, 2021

LikithaVemulapalli self-assigned this Jul 25, 2023

LikithaVemulapalli removed their assignment Nov 8, 2023

Lu-David assigned hyeong01 Dec 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NTH should issue lifecycle heartbeats #493

NTH should issue lifecycle heartbeats #493

jrsherry commented Oct 6, 2021

bwagner5 commented Oct 8, 2021

gabegorelick commented Oct 28, 2021

varkey commented Nov 16, 2021 •

edited

Loading

jillmon commented Nov 17, 2021

gabegorelick commented Nov 17, 2021

stevehipwell commented Nov 3, 2022

cjerad commented Nov 3, 2022

stevehipwell commented Nov 3, 2022

cjerad commented Nov 16, 2022

stevehipwell commented Nov 16, 2022

0x91 commented Nov 8, 2023

riuvshyn commented Nov 21, 2023 •

edited

Loading

riuvshyn commented Nov 23, 2023 •

edited

Loading

AndiDog commented Dec 11, 2024

NTH should issue lifecycle heartbeats #493

NTH should issue lifecycle heartbeats #493

Comments

jrsherry commented Oct 6, 2021

bwagner5 commented Oct 8, 2021

gabegorelick commented Oct 28, 2021

varkey commented Nov 16, 2021 • edited Loading

jillmon commented Nov 17, 2021

gabegorelick commented Nov 17, 2021

stevehipwell commented Nov 3, 2022

cjerad commented Nov 3, 2022

stevehipwell commented Nov 3, 2022

cjerad commented Nov 16, 2022

stevehipwell commented Nov 16, 2022

0x91 commented Nov 8, 2023

riuvshyn commented Nov 21, 2023 • edited Loading

riuvshyn commented Nov 23, 2023 • edited Loading

AndiDog commented Dec 11, 2024

varkey commented Nov 16, 2021 •

edited

Loading

riuvshyn commented Nov 21, 2023 •

edited

Loading

riuvshyn commented Nov 23, 2023 •

edited

Loading