Help! What caused the leader election lost? #1813
Comments
@AllenZMC based on that error message, it sounds like you're using the leader election built into controller-runtime. Are you attempting to run multiple replicas of your operator? It sounds like another instance is getting the leader lock and the instance with this log is losing that election.
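For context, a minimal sketch (assuming a recent controller-runtime; lock name and namespace below are hypothetical) of how the leader-with-lease election mentioned here is typically switched on through the manager options:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())

	// Leader election is enabled through the manager options; every replica
	// competes for the same lock resource identified by LeaderElectionID.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "my-operator-lock",   // hypothetical lock name
		LeaderElectionNamespace: "my-operator-system", // hypothetical namespace
	})
	if err != nil {
		os.Exit(1)
	}

	// If this replica cannot renew its lease, Start returns an error and the
	// process exits, which is what surfaces as "leader election lost".
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```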
@joelanford, yes, I am using the leader election built into controller-runtime.
@AllenZMC Sounds good. Yeah, my guess is that you have a rogue replica running somewhere: maybe an old in-cluster replica that didn't get cleaned up, or something running in another terminal or in the background locally. If you're sure you're only running one replica and you keep seeing this, and it does turn out to be a bug, let us know and we'll help you out with an issue or PR upstream in controller-runtime.
I guess the reason is that
@AllenZMC I'm not sure. I'm definitely not an expert on controller-runtime's leader election internals, but I don't think a single replica should ever lose an election, which seems to be the case here.
@joelanford Thanks, I think I have found the problem.
Hi @AllenZMC, what's the root cause, please?
Since we migrated to operator-sdk 1.0.0, there have been many container restarts caused by the "leader election lost" error. The issue has been resolved by increasing the leader-election timeouts.
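For reference, a hedged sketch of what raising those timeouts can look like through controller-runtime's manager options. The durations and lock name below are illustrative assumptions, not the values used in the comment above; controller-runtime's defaults are roughly a 15s lease duration, 10s renew deadline, and 2s retry period.

```go
package main

import (
	"os"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Longer timings give a single replica more headroom to renew its lease
	// against a slow API server, at the cost of slower failover.
	leaseDuration := 60 * time.Second
	renewDeadline := 45 * time.Second
	retryPeriod := 10 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "my-operator-lock", // hypothetical lock name
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	})
	if err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

Note that the lease duration must stay longer than the renew deadline, which in turn must exceed the retry period, or the underlying client-go leader elector rejects the configuration.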
Yes @snorwin, see #1813 (comment).
Following advice from: operator-framework/operator-sdk#1813 (comment)
It only reduced the frequency of the failures; it did not solve the problem.
I found the error below in a long-running MC controller, so I increased the timeout: considering we only have one controller running, leader election should always be able to get the lease.

```
E0106 07:29:05.501113 1 leaderelection.go:361] Failed to update lock: context deadline exceeded
I0106 07:29:05.895992 1 leaderelection.go:278] failed to renew lease antrea-mcs-ns/6536456a.crd.antrea.io: timed out waiting for the condition
2022-01-06T07:29:05.896Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"antrea-mcs-ns","name":"6536456a.crd.antrea.io","uid":"a4de74cd-0441-4140-a78b-acf163055f91","apiVersion":"v1","resourceVersion":"23629919"}, "reason": "LeaderElection", "message": "antrea-mc-controller-6dcb88b9d6-vxqvm_e1b1b0a9-b2b5-471f-b424-b11a34343d64 stopped leading"}
2022-01-06T07:29:05.999Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"Lease","namespace":"antrea-mcs-ns","name":"6536456a.crd.antrea.io","uid":"6709c340-ee00-459b-b186-e56c15fbde67","apiVersion":"coordination.k8s.io/v1","resourceVersion":"23629901"}, "reason": "LeaderElection", "message": "antrea-mc-controller-6dcb88b9d6-vxqvm_e1b1b0a9-b2b5-471f-b424-b11a34343d64 stopped leading"}
2022-01-06T07:29:05.598Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/validate-multicluster-crd-antrea-io-v1alpha1-memberclusterannounce", "UID": "da938dc5-cbda-4714-a9f3-f25d7f105353", "kind": "multicluster.crd.antrea.io/v1alpha1, Kind=MemberClusterAnnounce", "resource": {"group":"multicluster.crd.antrea.io","version":"v1alpha1","resource":"memberclusterannounces"}}
F0106 07:29:06.099280 1 leader.go:41] Error running controller: error running Manager: leader election lost
```

Refer to operator-framework/operator-sdk#1813.

Signed-off-by: Lan Luo <luola@vmware.com>
So... when deploying Service Binding Operator on a bare-metal environment I'm running into THIS issue *a lot*: operator-framework/operator-sdk#1813

Why? Because my bare-metal cluster is running hundreds of containers, and unfortunately it takes a while to get the leader. Usually it's within a few seconds, but if there is some HDD usage, it can take a little while.

What a lot of operator maintainers have done is implement LeaseDuration and RenewDeadline options in the operator so that bare-metal users can increase the timeouts.

In this PR I have:
- Added this to the `main.go` file
- Set the default 30 / 20 second timeouts

so that we have the ability to change the settings via a ConfigMap / YAML within k8s:

```yaml
leaderElection:
  leaderElect: false
  resourceName: <example-domain>.io
  leaseDuration: "30s"
  renewDeadline: "20s"
```

This would have no impact on current functionality, but it would help those who are experiencing a high number of restarts / CrashLoopBackOffs of the service-binding-operator pod:

```sh
$ k get pods -A
operators   pgo-f96b88c9d-z2nfx                         1/1   Running            0                 18h
operators   service-binding-operator-7795b785b4-wh265   0/1   CrashLoopBackOff   225 (2m30s ago)   23h
```

And the output here:

```sh
{"level":"error","ts":1643209250.576585,"logger":"controller","msg":"Reconciler error","reconcilerGroup":"apiextensions.k8s.io","reconcilerKind":"CustomResourceDefinition","controller":"customresourcedefinition","name":"orders.acme.cert-manager.io","namespace":"","error":"no matches for kind \"Order\" in version \"acme.cert-manager.io/v1alpha2\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/workspace/vendor/github.com/go-logr/zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:246\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}
E0126 15:00:50.850700 1 leaderelection.go:321] error retrieving resource lock operators/8fa65150.coreos.com: Get "https://10.96.0.1:443/api/v1/namespaces/operators/configmaps/8fa65150.coreos.com": context deadline exceeded
I0126 15:00:50.850783 1 leaderelection.go:278] failed to renew lease operators/8fa65150.coreos.com: timed out waiting for the condition
{"level":"info","ts":1643209250.8508487,"logger":"controller","msg":"Stopping workers","reconcilerGroup":"servicebinding.io","reconcilerKind":"ServiceBinding","controller":"servicebinding"}
{"level":"error","ts":1643209250.8508205,"logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/workspace/vendor/github.com/go-logr/zapr/zapr.go:132\nmain.main\n\t/workspace/main.go:175\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204"}
{"level":"info","ts":1643209250.8509405,"logger":"controller-runtime.webhook","msg":"shutting down webhook server"}
```

Which would fix: Fixes redhat-developer/odo#5396
What did you do?
I use Leader-with-lease in my operator code.
What did you expect to see?
A clear and concise description of what you expected to happen (or insert a code snippet).
What did you see instead? Under which circumstances?
"leader election lost" will appear every other time.
Environment
operator-sdk version: 0.9.1
insert release or Git SHA here: master
Kubernetes version information: v1.12
insert output of kubectl version here
Kubernetes cluster kind:
Additional context
Add any other context about the question here.