
chore: Configure leader election timeout #1443

Merged

Conversation

Tomasz-Smelcerz-SAP
Member

Description

Fixes #1351

@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP requested a review from a team as a code owner April 3, 2024 11:25
@kyma-bot kyma-bot added cla: yes Indicates the PR's author has signed the CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 3, 2024
@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP changed the title configure leader election timeout chore: Configure leader election timeout Apr 3, 2024
@Tomasz-Smelcerz-SAP
Member Author

Tomasz-Smelcerz-SAP commented Apr 3, 2024

Note for the reviewer: Testing this is hard. You have to make the API Server unavailable for some time. I found a simple way to test it locally, but it requires a code change, so it's not suitable for an automated test.

How to test locally:

Add the following to cmd/main.go, near line 146 (see the sketch after the steps below for where it fits):

LeaderElectionNamespace: "default",
  1. Start the local k3d cluster
  2. Run the LM locally: go run cmd/main.go --leader-elect=true -leader-election-renew-deadline 20s
  3. Kill the local k3d cluster
  4. The LM process should terminate with exit code 2 after 20 seconds.
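
For orientation, here is a minimal sketch of where that line could sit in the manager setup for this local test scenario (20-second renew deadline). It is not the actual cmd/main.go; the surrounding fields are assumptions based on controller-runtime's manager.Options, with the lock name taken from the log excerpt below.

// Sketch only, not the real cmd/main.go: shows where LeaderElectionNamespace sits
// among controller-runtime's leader-election options for the local test scenario
// (20s renew deadline). The other fields are assumed; the lock name is taken from
// the log excerpt below.
package main

import (
	"log"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	renewDeadline := 20 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "893110f7.kyma-project.io", // lock name as seen in the logs below
		LeaderElectionNamespace: "default",                  // the line added for local testing
		RenewDeadline:           &renewDeadline,
	})
	if err != nil {
		log.Fatalf("unable to create manager: %v", err)
	}

	// Losing the lease (e.g. after killing the k3d cluster) makes Start return an error.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatalf("problem running manager: %v", err)
	}
}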

Example run:

[...]
E0403 13:30:12.213201 1006013 leaderelection.go:332] error retrieving resource lock default/893110f7.kyma-project.io: Get "https://0.0.0.0:43177/apis/coordination.k8s.io/v1/namespaces/default/leases/893110f7.kyma-project.io": dial tcp 0.0.0.0:43177: connect: connection refused
E0403 13:30:14.212719 1006013 leaderelection.go:332] error retrieving resource lock default/893110f7.kyma-project.io: Get "https://0.0.0.0:43177/apis/coordination.k8s.io/v1/namespaces/default/leases/893110f7.kyma-project.io": dial tcp 0.0.0.0:43177: connect: connection refused
E0403 13:30:16.212918 1006013 leaderelection.go:332] error retrieving resource lock default/893110f7.kyma-project.io: Get "https://0.0.0.0:43177/apis/coordination.k8s.io/v1/namespaces/default/leases/893110f7.kyma-project.io": dial tcp 0.0.0.0:43177: connect: connection refused
E0403 13:30:18.213662 1006013 leaderelection.go:332] error retrieving resource lock default/893110f7.kyma-project.io: Get "https://0.0.0.0:43177/apis/coordination.k8s.io/v1/namespaces/default/leases/893110f7.kyma-project.io": dial tcp 0.0.0.0:43177: connect: connection refused
E0403 13:30:20.212556 1006013 leaderelection.go:332] error retrieving resource lock default/893110f7.kyma-project.io: client rate limiter Wait returned an error: context deadline exceeded
I0403 13:30:20.212574 1006013 leaderelection.go:285] failed to renew lease default/893110f7.kyma-project.io: timed out waiting for the condition
{"level":"ERROR","date":"2024-04-03T13:30:20.212595966+02:00","logger":"setup","caller":"cmd/main.go:194","msg":"problem running manager","context":{"error":"leader election lost"},"stacktrace":"main.setupManager\n\t/[...]/src/github.com/kyma-project/lifecycle-manager/cmd/main.go:194\nmain.main\n\t/[...]/src/github.com/kyma-project/lifecycle-manager/cmd/main.go:105\nruntime.main\n\t/usr/lib/go/src/runtime/proc.go:271"}
exit status 2

Note that the time difference between the first and the last log entry in the example above is not exactly 20 seconds, but rather around 18 seconds. This is because the controller logs the first error message about 2 seconds after API Server connectivity is lost.

@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP force-pushed the chore/leader-election-timeout branch from b20e74b to b5cac34 Compare April 4, 2024 05:54
Contributor

@c-pius c-pius left a comment


Thanks for the good description; I can replicate the behavior as you described.

Two general things I am wondering about:

a) When does the leader election actually get used? I looked at Argo for stage and prod and can see that we only have one pod there. Are we only using it when a new version of KLM is deployed?
b) In the ticket you described that API Server outages last about 2 minutes. Why are we then setting the renew deadline to 90 seconds and not something slightly above 2 minutes?

@Tomasz-Smelcerz-SAP
Copy link
Member Author

Tomasz-Smelcerz-SAP commented Apr 4, 2024

Two general things I am wondering about:

a) When does the leader election actually get used? I looked at Argo for stage and prod and can see that we only have one pod there. Are we only using it when a new version of KLM is deployed?

A quote from some k8s book: Some master components, such as the scheduler and the controller manager, can't have multiple instances active at the same time. This will be chaos, as multiple schedulers try to schedule the same pod into multiple nodes or multiple times into the same node. The correct way to have a highly-scalable Kubernetes cluster is to have these components run in leader election mode. This means that multiple instances are running, but only one is active at a time, and if it fails, another one is elected as leader and takes its place.
In our case, this mechanism is mainly used to prevent two instances of KLM from running in parallel due to misconfiguration.

b) In the ticket you described that API Server outages last about 2 minutes. Why are we then setting the renew deadline to 90 seconds and not something slightly above 2 minutes?

Short answer: I think shorter is better, but we decided to test how it behaves with longer timeouts.
Long answer: We are raising this mainly to "see what happens" during etcd outages, hoping that KLM survives the short ones. Other than that, we don't have any strict recommendation. I think 90 seconds is OK: it is almost an order of magnitude larger than the default of 10 seconds. Raising it also requires raising the lease duration, and that has consequences: two KLM instances running in parallel will only "detect" that situation with a time resolution determined by the lease duration. So the longer the lease duration, the greater the risk of having two KLMs reconciling the same cluster.

Edit: Despite my preference for shorter values, I decided to raise the renew deadline to 2 minutes and the lease duration to 3 minutes (keeping their ratio the same as in the library's defaults).
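
As a minimal sketch (not the actual cmd/main.go), assuming controller-runtime's manager.Options: the 2-minute renew deadline and 3-minute lease duration keep the 2:3 ratio of the library defaults (10s renew deadline, 15s lease duration). Only --leader-election-renew-deadline appears in the test instructions above; the lease-duration flag name below is my assumption.

// Sketch only: possible flag wiring for the chosen leader-election timeouts.
// The lease-duration flag name is assumed; the lock name is taken from the logs above.
package main

import (
	"flag"
	"log"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	renewDeadline := flag.Duration("leader-election-renew-deadline", 2*time.Minute,
		"how long the leader keeps retrying to refresh its lease before giving up leadership")
	leaseDuration := flag.Duration("leader-election-lease-duration", 3*time.Minute,
		"how long non-leader candidates wait before trying to acquire the lease")
	flag.Parse()

	_, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "893110f7.kyma-project.io",
		RenewDeadline:    renewDeadline, // *time.Duration, as returned by flag.Duration
		LeaseDuration:    leaseDuration,
	})
	if err != nil {
		log.Fatalf("unable to create manager: %v", err)
	}
}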

@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP force-pushed the chore/leader-election-timeout branch from b5cac34 to 548ea6d Compare April 4, 2024 11:51
Contributor

@c-pius c-pius left a comment


Thanks for the explanations! Good for me; just one thing I remembered: we once agreed to add all flag defaults to the tests file, which has not been done yet: https://github.com/kyma-project/lifecycle-manager/blob/main/internal/pkg/flags/flags_test.go
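
Not the actual flags_test.go, but a hypothetical sketch of the kind of default-pinning test meant here: register the flag with the agreed default in a throwaway FlagSet and assert the unparsed value, so any change to the default breaks the test.

// Hypothetical example of pinning a flag default in a test; the real
// internal/pkg/flags/flags_test.go may structure this differently.
package flags_test

import (
	"flag"
	"testing"
	"time"
)

func TestLeaderElectionRenewDeadlineDefault(t *testing.T) {
	fs := flag.NewFlagSet("klm", flag.ContinueOnError)
	renewDeadline := fs.Duration("leader-election-renew-deadline", 2*time.Minute, "")
	if err := fs.Parse(nil); err != nil {
		t.Fatal(err)
	}
	if *renewDeadline != 2*time.Minute {
		t.Errorf("default renew deadline changed: got %s, want 2m0s", *renewDeadline)
	}
}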

@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP force-pushed the chore/leader-election-timeout branch from 854e8f9 to 01838f1 Compare April 5, 2024 07:42
@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP force-pushed the chore/leader-election-timeout branch from 01838f1 to dc0385d Compare April 5, 2024 08:07
@kyma-bot kyma-bot added the lgtm Looks good to me! label Apr 5, 2024
@kyma-bot kyma-bot merged commit ffb9491 into kyma-project:main Apr 5, 2024
41 checks passed
@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP deleted the chore/leader-election-timeout branch April 5, 2024 08:59