
Help! What caused the "leader election lost" error? #1813

Closed
AllenZMC opened this issue Aug 12, 2019 · 12 comments
Assignees: joelanford
Labels: triage/support (Indicates an issue that is a support question.)

Comments

@AllenZMC

What did you do?
I use Leader-with-lease in my operator code.

What did you expect to see?
The operator keeps its leader lease and continues running normally.

What did you see instead? Under which circumstances?
The "leader election lost" error appears intermittently:

```
2019-08-12T03:05:56.866Z	ERROR	cmd	Manager exited non-zero	{"error": "leader election lost"}
github.com/go-logr/zapr.(*zapLogger).Error
	redis-operator/vendor/github.com/go-logr/zapr/zapr.go:128
main.main
	redis-operator/cmd/manager/main.go:174
runtime.main
```

Environment

  • operator-sdk version: 0.9.1 (Git SHA: master)

  • Kubernetes version: v1.12

  • Kubernetes cluster kind: not specified

joelanford self-assigned this Aug 13, 2019
joelanford added the triage/support label Aug 13, 2019
@joelanford (Member)

@AllenZMC based on that error message, it sounds like you're using the leader election built into the controller-runtime Manager. Is that correct? If so, that error is returned by controller-runtime when the Manager loses, or fails to renew, its leader lock.

Are you attempting to run multiple replicas of your operator? It sounds like another instance is getting the leader lock and the instance with this log is losing that election.
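
For context, using the Manager's built-in leader election typically looks something like the sketch below (placeholder names; field names as in recent controller-runtime releases, not necessarily the exact code in this operator):

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// The Manager owns the leader-election loop. If the lease is lost or
	// cannot be renewed after it has been acquired, the Manager stops and
	// Start returns an error, which is what surfaces in the log as
	// "Manager exited non-zero ... leader election lost".
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "redis-operator-lock", // placeholder lock name
		LeaderElectionNamespace: "default",             // placeholder namespace
	})
	if err != nil {
		ctrl.Log.Error(err, "unable to create manager")
		return
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		ctrl.Log.Error(err, "Manager exited non-zero")
	}
}
```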

@AllenZMC (Author)

@joelanford, yes, I am using the leader election built into the controller-runtime Manager.
I only run one replica of my operator. I will keep an eye on it.

@joelanford (Member)

@AllenZMC Sounds good. Yeah, my guess is that you have a rogue replica running somewhere. Maybe an old in-cluster replica that didn't get cleaned up, or something running in another terminal or in the background locally with `operator-sdk up local`?

If you're sure you're only running one replica and you keep seeing this, and it does turn out to be a bug, let us know and we'll help you out with an issue or PR upstream in controller-runtime.

@AllenZMC (Author)

I guess the reason is that LeaseDuration or RenewDeadline is too short. @joelanford

@joelanford (Member)

@AllenZMC I'm not sure. I'm definitely not an expert on controller-runtime's leader election internals, but I don't think a single replica should ever lose an election, which seems to be the case here.

@AllenZMC (Author)

@joelanford Thanks, I think I have found the problem.

@jessehu commented Dec 4, 2019

Hi @AllenZMC, what's the root cause, please?

@wlmvp commented Jan 9, 2020

Hi @AllenZMC, what's the root cause, please?

@snorwin commented Oct 30, 2020

Hi @AllenZMC, what's the root cause, please?

@snorwin commented Nov 25, 2020

Since we migrated to operator-sdk 1.0.0, there have been many container restarts caused by the "leader election lost" error. We use a single replica of an operator on OpenShift v3.11.170 and Kubernetes v1.11.0+d4cacc0.

The issue has been resolved by increasing the LeaseDuration to 30s and the RenewDeadline to 20s.
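
For anyone else hitting this with a single replica, a minimal sketch of what that change can look like on recent controller-runtime versions (the lock name is a placeholder; controller-runtime's defaults are LeaseDuration=15s, RenewDeadline=10s, RetryPeriod=2s):

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Longer lease/renew windows make a slow or briefly unreachable API
	// server less likely to trigger "leader election lost" on a single replica.
	leaseDuration := 30 * time.Second // default 15s
	renewDeadline := 20 * time.Second // default 10s

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "my-operator-lock", // placeholder lock name
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
	})
	if err != nil {
		ctrl.Log.Error(err, "unable to create manager")
		return
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		ctrl.Log.Error(err, "problem running manager")
	}
}
```

Note that RenewDeadline must stay shorter than LeaseDuration, and RetryPeriod shorter than RenewDeadline, or the leader-election configuration is rejected.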

@AllenZMC (Author) commented Nov 25, 2020

Yes, @snorwin, see #1813 (comment).

@Bryce-huang

> Since we migrated to operator-sdk 1.0.0, there have been many container restarts caused by the "leader election lost" error. We use a single replica of an operator on OpenShift v3.11.170 and Kubernetes v1.11.0+d4cacc0.
>
> The issue has been resolved by increasing the LeaseDuration to 30s and the RenewDeadline to 20s.

That just reduced the frequency; it did not solve the underlying problem.

luolanzone added a commit to luolanzone/antrea that referenced this issue Jan 7, 2022
I found the error below in a long-running MC controller, so increase the
timeout; since we only have one controller running, leader election
should always be able to acquire the lease.

```
E0106 07:29:05.501113       1 leaderelection.go:361] Failed to update lock: context deadline exceeded
I0106 07:29:05.895992       1 leaderelection.go:278] failed to renew lease antrea-mcs-ns/6536456a.crd.antrea.io: timed out waiting for the condition
2022-01-06T07:29:05.896Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"ConfigMap","namespace":"antrea-mcs-ns","name":"6536456a.crd.antrea.io","uid":"a4de74cd-0441-4140-a78b-acf163055f91","apiVersion":"v1","resourceVersion":"23629919"}, "reason": "LeaderElection", "message": "antrea-mc-controller-6dcb88b9d6-vxqvm_e1b1b0a9-b2b5-471f-b424-b11a34343d64 stopped leading"}
2022-01-06T07:29:05.999Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Lease","namespace":"antrea-mcs-ns","name":"6536456a.crd.antrea.io","uid":"6709c340-ee00-459b-b186-e56c15fbde67","apiVersion":"coordination.k8s.io/v1","resourceVersion":"23629901"}, "reason": "LeaderElection", "message": "antrea-mc-controller-6dcb88b9d6-vxqvm_e1b1b0a9-b2b5-471f-b424-b11a34343d64 stopped leading"}
2022-01-06T07:29:05.598Z	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/validate-multicluster-crd-antrea-io-v1alpha1-memberclusterannounce", "UID": "da938dc5-cbda-4714-a9f3-f25d7f105353", "kind": "multicluster.crd.antrea.io/v1alpha1, Kind=MemberClusterAnnounce", "resource": {"group":"multicluster.crd.antrea.io","version":"v1alpha1","resource":"memberclusterannounces"}}
F0106 07:29:06.099280       1 leader.go:41] Error running controller: error running Manager: leader election lost
```
refer to operator-framework/operator-sdk#1813

Signed-off-by: Lan Luo <luola@vmware.com>
luolanzone added two more commits to luolanzone/antrea on Jan 7, 2022 that referenced this issue, with the same commit message as above.
cdrage added a commit to cdrage/service-binding-operator that referenced this issue Jan 27, 2022
So... when deploying the Service Binding Operator on a bare-metal
environment I'm running into THIS issue *a lot*:

operator-framework/operator-sdk#1813

Why? Because my bare-metal cluster is running hundreds of containers and
unfortunately it sometimes takes a while to acquire or renew the leader lock.
Usually it happens within a few seconds, but if there is some HDD pressure,
it can take a little longer.

What a lot of operator maintainers have done is expose the
LeaseDuration and RenewDeadline settings in the operator so that
bare-metal users can increase the timeouts.

In this PR I have:
- Added these settings to the `main.go` file
- Set the defaults to 30 / 20 second timeouts

So that we can change the settings via a ConfigMap /
YAML within k8s:
```yaml
    leaderElection:
      leaderElect: false
      resourceName: <example-domain>.io
      leaseDuration: "30s"
      renewDeadline: "20s"
```

This would have no impact on current functionality, but it would help those who
are experiencing a high number of restarts / CrashLoopBackOffs of the
service binding operator pod:

```sh
$ k get pods -A
operators       pgo-f96b88c9d-z2nfx                                               1/1     Running                 0                 18h
operators       service-binding-operator-7795b785b4-wh265                         0/1     CrashLoopBackOff        225 (2m30s ago)   23h
```

And the output here:

```sh
{"level":"error","ts":1643209250.576585,"logger":"controller","msg":"Reconciler error","reconcilerGroup":"apiextensions.k8s.io","reconcilerKind":"CustomResourceDefinition","controller":"customresourcedefinition","name":"orders.acme.cert-manager.io","namespace":"","error":"no matches for kind \"Order\" in version \"acme.cert-manager.io/v1alpha2\"","stacktrace":"g$
thub.com/go-logr/zapr.(*zapLogger).Error\n\t/workspace/vendor/github.com/go-logr/zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:246\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWork$
tem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/workspace/vendor/k8s.io/apimachinery/$
kg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}
E0126 15:00:50.850700       1 leaderelection.go:321] error retrieving resource lock operators/8fa65150.coreos.com: Get "https://10.96.0.1:443/api/v1/namespaces/operators/configmaps/8fa65150.coreos.com": context deadline exceeded
I0126 15:00:50.850783       1 leaderelection.go:278] failed to renew lease operators/8fa65150.coreos.com: timed out waiting for the condition
{"level":"info","ts":1643209250.8508487,"logger":"controller","msg":"Stopping workers","reconcilerGroup":"servicebinding.io","reconcilerKind":"ServiceBinding","controller":"servicebinding"}
{"level":"error","ts":1643209250.8508205,"logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/workspace/vendor/github.com/go-logr/zapr/zapr.go:132\nmain.main\n\t/workspace/main.go:175\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204"}
{"level":"info","ts":1643209250.8509405,"logger":"controller-runtime.webhook","msg":"shutting down webhook server"}
```

Which would fix:

Fixes redhat-developer/odo#5396
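
A rough sketch (not the actual PR code) of how LeaseDuration and RenewDeadline are commonly exposed in a controller-runtime `main.go`, here via hypothetical command-line flags rather than the ComponentConfig file shown above:

```go
package main

import (
	"flag"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Hypothetical flag names; defaults match the 30s/20s values proposed in the PR.
	leaseDuration := flag.Duration("leader-elect-lease-duration", 30*time.Second,
		"how long non-leader candidates wait before trying to take over the lease")
	renewDeadline := flag.Duration("leader-elect-renew-deadline", 20*time.Second,
		"how long the leader keeps retrying lease renewal before giving up leadership")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "service-binding-operator-lock", // placeholder lock name
		LeaseDuration:    leaseDuration,                   // flag.Duration already returns *time.Duration
		RenewDeadline:    renewDeadline,
	})
	if err != nil {
		ctrl.Log.Error(err, "unable to create manager")
		return
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		ctrl.Log.Error(err, "problem running manager")
	}
}
```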
cdrage added further commits to cdrage/service-binding-operator that referenced this issue on Feb 7, Feb 9, and Jul 27, 2022, all with the same commit message as above.