
chore: Configure leader election timeout #1443

Merged

Conversation

Tomasz-Smelcerz-SAP
Member

Description

Fixes #1351

@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP requested a review from a team as a code owner April 3, 2024 11:25
@kyma-bot kyma-bot added cla: yes Indicates the PR's author has signed the CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 3, 2024
@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP changed the title configure leader election timeout chore: Configure leader election timeout Apr 3, 2024
@Tomasz-Smelcerz-SAP
Member Author

Tomasz-Smelcerz-SAP commented Apr 3, 2024

Note for the reviewer: Testing this is hard. You have to make the API Server unavailable for some time. I found a simple way to test it locally, but it requires a code change, so it's not suitable for an automated test.

How to test locally:

Add the following to cmd/main.go, near line 146 (see the sketch after the steps below for where it fits):

LeaderElectionNamespace: "default",
  1. Start the local k3d cluster
  2. Run the LM locally: go run cmd/main.go --leader-elect=true -leader-election-renew-deadline 20s
  3. Kill the local k3d cluster
  4. The LM process should terminate with exit code 2 after 20 seconds.
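
For orientation, here is a minimal sketch of where that line could sit in the manager setup for this local test scenario (20-second renew deadline). It is not the actual cmd/main.go; the surrounding fields are assumptions based on controller-runtime's manager.Options, with the lock name taken from the log excerpt below.

// Sketch only, not the real cmd/main.go: shows where LeaderElectionNamespace sits
// among controller-runtime's leader-election options for the local test scenario
// (20s renew deadline). The other fields are assumed; the lock name is taken from
// the log excerpt below.
package main

import (
	"log"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	renewDeadline := 20 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "893110f7.kyma-project.io", // lock name as seen in the logs below
		LeaderElectionNamespace: "default",                  // the line added for local testing
		RenewDeadline:           &renewDeadline,
	})
	if err != nil {
		log.Fatalf("unable to create manager: %v", err)
	}

	// Losing the lease (e.g. after killing the k3d cluster) makes Start return an error.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatalf("problem running manager: %v", err)
	}
}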

Example run:

[...]
E0403 13:30:12.213201 1006013 leaderelection.go:332] error retrieving resource lock default/893110f7.kyma-project.io: Get "https://0.0.0.0:43177/apis/coordination.k8s.io/v1/namespaces/default/leases/893110f7.kyma-project.io": dial tcp 0.0.0.0:43177: connect: connection refused
E0403 13:30:14.212719 1006013 leaderelection.go:332] error retrieving resource lock default/893110f7.kyma-project.io: Get "https://0.0.0.0:43177/apis/coordination.k8s.io/v1/namespaces/default/leases/893110f7.kyma-project.io": dial tcp 0.0.0.0:43177: connect: connection refused
E0403 13:30:16.212918 1006013 leaderelection.go:332] error retrieving resource lock default/893110f7.kyma-project.io: Get "https://0.0.0.0:43177/apis/coordination.k8s.io/v1/namespaces/default/leases/893110f7.kyma-project.io": dial tcp 0.0.0.0:43177: connect: connection refused
E0403 13:30:18.213662 1006013 leaderelection.go:332] error retrieving resource lock default/893110f7.kyma-project.io: Get "https://0.0.0.0:43177/apis/coordination.k8s.io/v1/namespaces/default/leases/893110f7.kyma-project.io": dial tcp 0.0.0.0:43177: connect: connection refused
E0403 13:30:20.212556 1006013 leaderelection.go:332] error retrieving resource lock default/893110f7.kyma-project.io: client rate limiter Wait returned an error: context deadline exceeded
I0403 13:30:20.212574 1006013 leaderelection.go:285] failed to renew lease default/893110f7.kyma-project.io: timed out waiting for the condition
{"level":"ERROR","date":"2024-04-03T13:30:20.212595966+02:00","logger":"setup","caller":"cmd/main.go:194","msg":"problem running manager","context":{"error":"leader election lost"},"stacktrace":"main.setupManager\n\t/[...]/src/github.com/kyma-project/lifecycle-manager/cmd/main.go:194\nmain.main\n\t/[...]/src/github.com/kyma-project/lifecycle-manager/cmd/main.go:105\nruntime.main\n\t/usr/lib/go/src/runtime/proc.go:271"}
exit status 2

Note that the time difference between the first and the last log entry in the example above is not exactly 20 seconds, but rather around 18 seconds. This is because the controller logs the first error message about 2 seconds after API Server connectivity is lost.

@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP force-pushed the chore/leader-election-timeout branch from b20e74b to b5cac34 Compare April 4, 2024 05:54
Contributor

@c-pius c-pius left a comment


Thanks for the good description; I can replicate the behavior as you described.

Two general things I am wondering about:

a) When does the leader election actually get used? I looked at Argo for stage and prod and can see that we only have one pod there. Are we only using it when a new version of KLM is deployed?
b) In the ticket you described that API Server outages last about 2 minutes. Why are we then setting the renew deadline to 90 seconds and not something slightly above 2 minutes?

@Tomasz-Smelcerz-SAP
Copy link
Member Author

Tomasz-Smelcerz-SAP commented Apr 4, 2024

Two general things I am wondering about:

a) When does the leader election actually get used? I looked at Argo for stage and prod and can see that we only have one pod there. Are we only using it when a new version of KLM is deployed?

A quote from some k8s book: Some master components, such as the scheduler and the controller manager, can't have multiple instances active at the same time. This will be chaos, as multiple schedulers try to schedule the same pod into multiple nodes or multiple times into the same node. The correct way to have a highly-scalable Kubernetes cluster is to have these components run in leader election mode. This means that multiple instances are running, but only one is active at a time, and if it fails, another one is elected as leader and takes its place.
In our case, this mechanism is mainly used to prevent two instances of KLM from running in parallel due to misconfiguration.

b) In the ticket you described that API Server outages last about 2 minutes. Why are we then setting the renew deadline to 90 seconds and not something slightly above 2 minutes?

Short answer: I think shorter is better, but we decided to test how it behaves with longer timeouts.
Long answer: We are raising this mainly to "see what happens" during etcd outages, hoping that KLM survives the short ones. Other than that, we don't have any strict recommendation. I think 90 seconds is OK: it is almost an order of magnitude larger than the default of 10 seconds. Raising it also requires raising the lease duration, and that has consequences: two KLM instances running in parallel will only "detect" that situation with a time resolution determined by the lease duration. So the longer the lease duration, the greater the risk of having two KLMs reconciling the same cluster.

Edit: Despite my preference for shorter values, I decided to raise the renew deadline to 2 minutes and the lease duration to 3 minutes (keeping their ratio the same as in the library's defaults).
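
As a minimal sketch (not the actual cmd/main.go), assuming controller-runtime's manager.Options: the 2-minute renew deadline and 3-minute lease duration keep the 2:3 ratio of the library defaults (10s renew deadline, 15s lease duration). Only --leader-election-renew-deadline appears in the test instructions above; the lease-duration flag name below is my assumption.

// Sketch only: possible flag wiring for the chosen leader-election timeouts.
// The lease-duration flag name is assumed; the lock name is taken from the logs above.
package main

import (
	"flag"
	"log"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	renewDeadline := flag.Duration("leader-election-renew-deadline", 2*time.Minute,
		"how long the leader keeps retrying to refresh its lease before giving up leadership")
	leaseDuration := flag.Duration("leader-election-lease-duration", 3*time.Minute,
		"how long non-leader candidates wait before trying to acquire the lease")
	flag.Parse()

	_, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "893110f7.kyma-project.io",
		RenewDeadline:    renewDeadline, // *time.Duration, as returned by flag.Duration
		LeaseDuration:    leaseDuration,
	})
	if err != nil {
		log.Fatalf("unable to create manager: %v", err)
	}
}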

@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP force-pushed the chore/leader-election-timeout branch from b5cac34 to 548ea6d Compare April 4, 2024 11:51
Contributor

@c-pius c-pius left a comment


Thanks for the explanations! Good for me; just one thing I remembered: we once agreed to add all flag defaults to the tests file, which has not been done yet: https://github.com/kyma-project/lifecycle-manager/blob/main/internal/pkg/flags/flags_test.go
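
Not the actual flags_test.go, but a hypothetical sketch of the kind of default-pinning test meant here: register the flag with the agreed default in a throwaway FlagSet and assert the unparsed value, so any change to the default breaks the test.

// Hypothetical example of pinning a flag default in a test; the real
// internal/pkg/flags/flags_test.go may structure this differently.
package flags_test

import (
	"flag"
	"testing"
	"time"
)

func TestLeaderElectionRenewDeadlineDefault(t *testing.T) {
	fs := flag.NewFlagSet("klm", flag.ContinueOnError)
	renewDeadline := fs.Duration("leader-election-renew-deadline", 2*time.Minute, "")
	if err := fs.Parse(nil); err != nil {
		t.Fatal(err)
	}
	if *renewDeadline != 2*time.Minute {
		t.Errorf("default renew deadline changed: got %s, want 2m0s", *renewDeadline)
	}
}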

@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP force-pushed the chore/leader-election-timeout branch from 854e8f9 to 01838f1 Compare April 5, 2024 07:42
@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP force-pushed the chore/leader-election-timeout branch from 01838f1 to dc0385d Compare April 5, 2024 08:07
@kyma-bot kyma-bot added the lgtm Looks good to me! label Apr 5, 2024
@kyma-bot kyma-bot merged commit ffb9491 into kyma-project:main Apr 5, 2024
41 checks passed
@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP deleted the chore/leader-election-timeout branch April 5, 2024 08:59