
add(fluentd,fluent-bit): added lifecycle object #185

Merged · 2 commits merged into fluent:main on Nov 29, 2021

Conversation

applike-ss
Contributor

This PR adds the option to configure a lifecycle object on the pods. This can be helpful for keeping the pod running long enough that it can be properly deregistered from a load balancer.
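For illustration, a values override along these lines could be used (a minimal sketch; the sleep duration is arbitrary and the command assumes the image ships a shell):

```yaml
# Keep the container serving for a while after termination is requested,
# so a load balancer has time to deregister the target first.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 20"]
```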

Signed-off-by: Sebastian Struß <struss@justdice.io>
@stevehipwell
Collaborator

@applike-ss do you have a particular use case for this?

@applike-ss
Contributor Author

Yes, as I stated: we want to give the load balancer enough time to deregister the target and not let fluentd exit right away. Since new tasks take some time to become healthy, we would lose logs for inputs that are not retried.

@applike-ss
Contributor Author

@stevehipwell mind merging it? Or are there open questions?

@stevehipwell
Collaborator

@applike-ss I am still struggling to see why you need a lifecycle hook instead of a customisable termination grace period? Is there something non-standard that Fluent Bit does with SIGTERM?

@applike-ss
Contributor Author

The termination grace period (if I'm recalling it correctly) tells Kubernetes how long fluentd/fluent-bit are allowed to take to shut down. However, our problem is NOT that it shuts down slowly (the opposite is the case).
Our problem is that we want the following sequence (see the sketch after this list):

  • stop all connections to fluentd by deregistering the target completely
  • shut down fluentd
  • start the new fluentd
  • register it in the target group
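To make the distinction concrete, here is a minimal pod-spec sketch (not part of this PR; values are placeholders): the grace period only caps total shutdown time, while a preStop hook runs before SIGTERM and therefore keeps fluentd serving while the target is deregistered.

```yaml
spec:
  # Upper bound on total shutdown time (preStop hook plus SIGTERM handling)
  # before the kubelet sends SIGKILL. It adds no delay by itself, so it
  # cannot keep a fast-exiting fluentd alive.
  terminationGracePeriodSeconds: 60
  containers:
    - name: fluentd
      image: fluent/fluentd:v1.14-1   # example image
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent; fluentd keeps accepting traffic
            # while the load balancer deregisters the target.
            command: ["/bin/sh", "-c", "sleep 25"]
```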

@stevehipwell
Collaborator

@applike-ss I'm struggling to see why this doesn't already work when the correct K8s configuration is provided (update strategy, PDB, termination grace period)? See Kubernetes best practices: terminating with grace.

@applike-ss
Contributor Author

Ahhh, now I see where the confusion is coming from. When I was talking about the load balancer, I wasn't thinking of the internal load balancing of k8s, but rather the load balancer of our cloud provider. With the current setup, requests still go to a pod that is no longer alive during a deployment. That is because fluentd shuts down quite fast, while the load balancer only detects it after 2 failed health checks at a ~10 second interval.

@stevehipwell
Collaborator

@applike-ss technically this should still work, but I'm aware of a number of cases where there can be problems. I assume you've been seeing issues with this already? Which cloud and LB implementation are you using?

@applike-ss
Contributor Author


Yes, we have seen issues there. We are using an AWS NLB. It requires a minimum of 2 healthy/unhealthy health check results before changing a target's status, and the health check interval needs to be at least 10 seconds there, so that would potentially be ~20 seconds of no logs being ingested into the pipeline for clients that do not retry.
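A rough sizing sketch based on those numbers (the threshold and interval are taken from the comment above; the resulting sleep value is an assumption, not a verified setting):

```yaml
# NLB deregistration detection time:
#   unhealthy_threshold (2) x health_check_interval (10s) ~= 20s
# so a preStop delay of at least ~20-25s should keep fluentd reachable
# until the NLB has stopped routing new connections to it.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 25"]
```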

@stevehipwell
Collaborator

@applike-ss thanks for the details, this makes sense (kubernetes-sigs/aws-load-balancer-controller#2366 has some more details).

If you're still using the legacy in-tree controller or instance mode you might want to look at the new AWS Load Balancer Controller and IP mode as this removes a lot of the legacy issues.
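For anyone following along, a Service sketch for IP mode might look like this (annotations as documented for the AWS Load Balancer Controller v2.2+; the service name, selector, and port are hypothetical, not taken from these charts):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: fluentd-forward   # hypothetical name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: fluentd   # hypothetical selector
  ports:
    - port: 24224        # fluentd forward port
      targetPort: 24224
      protocol: TCP
```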

@stevehipwell (Collaborator) left a review comment

I've got a minor suggested change and then I'd be happy to merge.

Review comment on charts/fluent-bit/templates/_pod.tpl (outdated, resolved)
Signed-off-by: Sebastian Struß <struss@justdice.io>
@applike-ss
Contributor Author

In fact, I recently switched to using IP mode. Prior to that (months ago) I read an article from an AWS engineer who recommended adding a preStop lifecycle hook, but that's so long ago that my browser history wouldn't find it.

I implemented your suggested change in both charts. Note though that other properties were handled the same way I did it previously (welcome to copy-paste hell).
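Roughly, an optional block like this gets wired into the pod template along these lines (a sketch of the pattern, not the exact diff; indentation and helper usage assumed from how the chart handles similar optional values):

```yaml
# charts/fluent-bit/templates/_pod.tpl (sketch)
{{- with .Values.lifecycle }}
  lifecycle:
    {{- toYaml . | nindent 4 }}
{{- end }}
```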

@stevehipwell (Collaborator) left a review comment

/lgtm

@stevehipwell merged commit e7e56cd into fluent:main on Nov 29, 2021