Help! What caused the leader election lost? #1813
Comments
@AllenZMC based on that error message, it sounds like you're using the leader election built into controller-runtime. Are you attempting to run multiple replicas of your operator? It sounds like another instance is getting the leader lock and the instance with this log is losing that election.
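For context, a minimal sketch (assuming a recent controller-runtime; lock name and namespace below are hypothetical) of how the leader-with-lease election mentioned here is typically switched on through the manager options:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())

	// Leader election is enabled through the manager options; every replica
	// competes for the same lock resource identified by LeaderElectionID.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "my-operator-lock",   // hypothetical lock name
		LeaderElectionNamespace: "my-operator-system", // hypothetical namespace
	})
	if err != nil {
		os.Exit(1)
	}

	// If this replica cannot renew its lease, Start returns an error and the
	// process exits, which is what surfaces as "leader election lost".
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```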
@joelanford, yes, I am using the leader election built into controller-runtime.
@AllenZMC Sounds good. Yeah, my guess is that you have a rogue replica running somewhere: maybe an old in-cluster replica that didn't get cleaned up, or something running in another terminal or in the background locally. If you're sure you're only running one replica and you keep seeing this, and it does turn out to be a bug, let us know and we'll help you out with an issue or PR upstream in controller-runtime.
I guess the reason is that
@AllenZMC I'm not sure. I'm definitely not an expert on controller-runtime's leader election internals, but I don't think a single replica should ever lose an election, which seems to be the case here.
@joelanford Thanks, I think I have found the problem.
Hi @AllenZMC, what's the root cause, please?
Since we migrated to operator-sdk 1.0.0, there have been many container restarts caused by the "leader election lost" error. The issue has been resolved by increasing the leader-election timeouts.
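For reference, a hedged sketch of what raising those timeouts can look like through controller-runtime's manager options. The durations and lock name below are illustrative assumptions, not the values used in the comment above; controller-runtime's defaults are roughly a 15s lease duration, 10s renew deadline, and 2s retry period.

```go
package main

import (
	"os"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Longer timings give a single replica more headroom to renew its lease
	// against a slow API server, at the cost of slower failover.
	leaseDuration := 60 * time.Second
	renewDeadline := 45 * time.Second
	retryPeriod := 10 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "my-operator-lock", // hypothetical lock name
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	})
	if err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

Note that the lease duration must stay longer than the renew deadline, which in turn must exceed the retry period, or the underlying client-go leader elector rejects the configuration.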
Yes @snorwin, see #1813 (comment).
Following advice from: operator-framework/operator-sdk#1813 (comment)
It only reduced the frequency of the failures; it did not solve the problem.
I found the error below in a long-running MC controller, so I increased the timeout: considering we only have one controller running, leader election should always be able to get the lease.

```
E0106 07:29:05.501113 1 leaderelection.go:361] Failed to update lock: context deadline exceeded
I0106 07:29:05.895992 1 leaderelection.go:278] failed to renew lease antrea-mcs-ns/6536456a.crd.antrea.io: timed out waiting for the condition
2022-01-06T07:29:05.896Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"antrea-mcs-ns","name":"6536456a.crd.antrea.io","uid":"a4de74cd-0441-4140-a78b-acf163055f91","apiVersion":"v1","resourceVersion":"23629919"}, "reason": "LeaderElection", "message": "antrea-mc-controller-6dcb88b9d6-vxqvm_e1b1b0a9-b2b5-471f-b424-b11a34343d64 stopped leading"}
2022-01-06T07:29:05.999Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"Lease","namespace":"antrea-mcs-ns","name":"6536456a.crd.antrea.io","uid":"6709c340-ee00-459b-b186-e56c15fbde67","apiVersion":"coordination.k8s.io/v1","resourceVersion":"23629901"}, "reason": "LeaderElection", "message": "antrea-mc-controller-6dcb88b9d6-vxqvm_e1b1b0a9-b2b5-471f-b424-b11a34343d64 stopped leading"}
2022-01-06T07:29:05.598Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/validate-multicluster-crd-antrea-io-v1alpha1-memberclusterannounce", "UID": "da938dc5-cbda-4714-a9f3-f25d7f105353", "kind": "multicluster.crd.antrea.io/v1alpha1, Kind=MemberClusterAnnounce", "resource": {"group":"multicluster.crd.antrea.io","version":"v1alpha1","resource":"memberclusterannounces"}}
F0106 07:29:06.099280 1 leader.go:41] Error running controller: error running Manager: leader election lost
```

Refer to operator-framework/operator-sdk#1813.

Signed-off-by: Lan Luo <luola@vmware.com>
So... when deploying Service Binding Operator on a bare-metal environment I'm running into THIS issue *a lot*: operator-framework/operator-sdk#1813

Why? Because my bare-metal cluster is running hundreds of containers, and unfortunately it takes a while to get the leader. Usually it's within a few seconds, but if there is some HDD usage, it can take a little while.

What a lot of operator maintainers have done is implement LeaseDuration and RenewDeadline options in the operator so that bare-metal users can increase the timeouts.

In this PR I have:
- Added this to the `main.go` file
- Set the default 30 / 20 second timeouts

so that we have the ability to change the settings via a ConfigMap / YAML within k8s:

```yaml
leaderElection:
  leaderElect: false
  resourceName: <example-domain>.io
  leaseDuration: "30s"
  renewDeadline: "20s"
```

This would have no impact on current functionality, but it would help those who are experiencing a high number of restarts / CrashLoopBackOffs of the service-binding-operator pod:

```sh
$ k get pods -A
operators   pgo-f96b88c9d-z2nfx                         1/1   Running            0                 18h
operators   service-binding-operator-7795b785b4-wh265   0/1   CrashLoopBackOff   225 (2m30s ago)   23h
```

And the output here:

```sh
{"level":"error","ts":1643209250.576585,"logger":"controller","msg":"Reconciler error","reconcilerGroup":"apiextensions.k8s.io","reconcilerKind":"CustomResourceDefinition","controller":"customresourcedefinition","name":"orders.acme.cert-manager.io","namespace":"","error":"no matches for kind \"Order\" in version \"acme.cert-manager.io/v1alpha2\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/workspace/vendor/github.com/go-logr/zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:246\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}
E0126 15:00:50.850700 1 leaderelection.go:321] error retrieving resource lock operators/8fa65150.coreos.com: Get "https://10.96.0.1:443/api/v1/namespaces/operators/configmaps/8fa65150.coreos.com": context deadline exceeded
I0126 15:00:50.850783 1 leaderelection.go:278] failed to renew lease operators/8fa65150.coreos.com: timed out waiting for the condition
{"level":"info","ts":1643209250.8508487,"logger":"controller","msg":"Stopping workers","reconcilerGroup":"servicebinding.io","reconcilerKind":"ServiceBinding","controller":"servicebinding"}
{"level":"error","ts":1643209250.8508205,"logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/workspace/vendor/github.com/go-logr/zapr/zapr.go:132\nmain.main\n\t/workspace/main.go:175\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204"}
{"level":"info","ts":1643209250.8509405,"logger":"controller-runtime.webhook","msg":"shutting down webhook server"}
```

Which would fix: Fixes redhat-developer/odo#5396
What did you do?
I use Leader-with-lease in my operator code.
What did you expect to see?
A clear and concise description of what you expected to happen (or insert a code snippet).
What did you see instead? Under which circumstances?
"leader election lost" will appear every other time.
Environment
operator-sdk version: 0.9.1
insert release or Git SHA here: master
Kubernetes version information: v1.12
insert output of kubectl version here
Kubernetes cluster kind:
Additional context
Add any other context about the question here.