Helm operator leader election lost #5186

Closed
dolfim opened this issue Sep 3, 2021 · 12 comments

dolfim commented Sep 3, 2021

Bug Report

What did you do?

Our operator is based on helm-operator v1.11.0. When the manager is running, the pod constantly crash-loops with the message "leader election lost".

Here are the logs produced across the restarts:

{"level":"info","ts":1630651003.0890665,"logger":"cmd","msg":"Version","Go Version":"go1.16.7","GOOS":"linux","GOARCH":"amd64","helm-operator":"v1.11.0","commit":"28dcd12a776d8a8ff597e1d8527b08792e7312fd"}
{"level":"info","ts":1630651003.09068,"logger":"cmd","msg":"Watch namespaces not configured by environment variable WATCH_NAMESPACE or file. Watching all namespaces.","Namespace":""}
I0903 06:36:44.244793       1 request.go:668] Waited for 1.039457424s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/elasticsearch.k8s.elastic.co/v1beta1?timeout=32s
{"level":"info","ts":1630651006.893125,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1630651006.8952067,"logger":"helm.controller","msg":"Watching resource","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp","namespace":"","reconcilePeriod":"1m0s"}
{"level":"info","ts":1630651006.8960173,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
I0903 06:36:46.896255       1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/ibm-cps-operator...
I0903 06:37:05.245608       1 leaderelection.go:253] successfully acquired lease openshift-operators/ibm-cps-operator
{"level":"info","ts":1630651025.2462952,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting EventSource","source":"kind source: apps.cps.deepsearch.ibm.com/v1alpha1, Kind=KgAmqp"}
{"level":"info","ts":1630651025.2464244,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting Controller"}
{"level":"info","ts":1630651025.7481437,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting workers","worker count":16}
I0903 06:37:08.088098       1 request.go:668] Waited for 1.030828222s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/operators.coreos.com/v1alpha1?timeout=32s
I0903 06:37:19.237634       1 request.go:668] Waited for 1.048805637s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/tuned.openshift.io/v1?timeout=32s
I0903 06:37:31.488584       1 request.go:668] Waited for 1.093715185s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/certificates.k8s.io/v1beta1?timeout=32s
E0903 06:37:41.490874       1 leaderelection.go:361] Failed to update lock: Put "https://172.21.0.1:443/api/v1/namespaces/openshift-operators/configmaps/ibm-cps-operator": context deadline exceeded
I0903 06:37:41.491073       1 leaderelection.go:278] failed to renew lease openshift-operators/ibm-cps-operator: timed out waiting for the condition
{"level":"error","ts":1630651061.4911356,"logger":"cmd","msg":"Manager exited non-zero.","Namespace":"","error":"leader election lost","stacktrace":"github.com/operator-framework/operator-sdk/internal/cmd/helm-operator/run.NewCmd.func1\n\t/workspace/internal/cmd/helm-operator/run/cmd.go:74\ngit.luolix.top/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngit.luolix.top/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngit.luolix.top/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\nmain.main\n\t/workspace/cmd/helm-operator/main.go:40\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}

Environment

Operator type:

/language helm

Kubernetes cluster type:

OpenShift 4.7.

$ operator-sdk version

operator-sdk version: "v1.11.0", commit: "28dcd12a776d8a8ff597e1d8527b08792e7312fd", kubernetes version: "1.20.2", go version: "go1.16.7", GOOS: "darwin", GOARCH: "amd64"

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T21:04:39Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0+558d959", GitCommit:"558d959332b3f1f7bf786673bf294e6e0932bb18", GitTreeState:"clean", BuildDate:"2021-07-13T20:45:42Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
openshift-ci bot added the language/helm label on Sep 3, 2021
dolfim (Author) commented Sep 3, 2021

It might be a side discussion, but from the logs it looks like the helm-operator is fetching APIs that should be unrelated to our Helm chart (https://172.21.0.1:443/apis/elasticsearch.k8s.elastic.co/v1beta1, etc.), which is causing some client-side throttling.
Could this be related?

jberkhahn added the triage/support label on Sep 13, 2021
jberkhahn added this to the v1.13.0 milestone on Sep 13, 2021
jberkhahn self-assigned this on Sep 13, 2021
jberkhahn (Contributor) commented:

If your operator is hitting some API that you've never heard of, that's a pretty good indicator that you've generated your operator incorrectly. It looks like you might be getting rate-limited due to hammering the API, and that's causing the leader election to time out. What Helm chart are you using, and what commands did you run to generate your operator?

dolfim (Author) commented Sep 15, 2021

The operator was created using the tutorial on the operator-sdk page:

operator-sdk init --domain cps.deepsearch.ibm.com --plugins helm
operator-sdk create api --group apps --version v1alpha1 --kind KgAmqp

jberkhahn (Contributor) commented Sep 17, 2021

Is "KgAmqp" a pre-existing Helm chart or are you just making a void Helm chart like the example? I can't get the void operator example to fail like this.

I suspect that you are getting rate-limited, but it looks like it's happening client-side (on the controller) rather than at the API server. This can be configured via flags passed to the controller-manager - is there anything funky when you look at the startup command there? "--kube-api-qps" in particular, as it looks like that's what sets the client-side rate limit for the controller-manager.
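
For illustration only (not the operator's actual code): the limit in question ultimately lives in client-go's rest.Config, and a flag like --kube-api-qps, if the binary exposes it, would feed these fields. When QPS and Burst are left at zero, client-go falls back to 5 requests/sec with a burst of 10, and anything beyond that budget shows up as the "client-side throttling" messages in your logs:

package example

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// newLessThrottledClient is a hypothetical helper: client-go defaults to
// QPS=5 and Burst=10 when these fields are zero, and requests beyond that
// budget are delayed and logged as "client-side throttling".
func newLessThrottledClient(cfg *rest.Config) (*kubernetes.Clientset, error) {
    cfg.QPS = 50    // assumed value, only to illustrate raising the limit
    cfg.Burst = 100
    return kubernetes.NewForConfig(cfg)
}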

dolfim (Author) commented Sep 17, 2021

Our current intuition is that the error is visible only on a large OCP 4.6 cluster (60+ nodes, 2500+ Pods, etc.).

As advised, we tried introducing selectors matching the labels of some KgAmqp CRs.
When the controller was not matching any CR, the logs were quiet and the memory usage was very low.
After patching a few (not all) KgAmqp CRs, we saw a huge increase in memory usage, and the logs show 1) constant "Reconciled release" messages and 2) "client-side throttling" for APIs we would not expect to be queried (tekton.dev, knative, etc.). Those APIs certainly do not match our labels and do not contain the CRD defined in watches.yaml.
For reference, here is our watches.yaml:

# Use the 'create api' subcommand to add watches to this file.
- group: apps.cps.deepsearch.ibm.com
  version: v1alpha1
  kind: KgAmqp
  chart: helm-charts/kgamqp
  watchDependentResources: false # adding this doesn't help
  selector: # adding this doesn't help, when
    matchLabels:
      app.kubernetes.io/name: kgamqp
      app.kubernetes.io/managed-by: ibm-cps-operator
#+kubebuilder:scaffold:watch

Regarding the previous questions:

The Helm chart was started from the vanilla example and modified to a) add a few more values, b) remove the HPA, and c) add a ConfigMap and Secrets.

The controller (read from the running Deployment) has the following args:

args:
- --health-probe-bind-address=:8081
- --metrics-bind-address=127.0.0.1:8080
- --leader-elect
- --leader-election-id=ibm-cps-operator

varshaprasad96 modified the milestones: v1.13.0, Backlog on Sep 20, 2021
varshaprasad96 added the triage/needs-information label on Sep 20, 2021
varshaprasad96 (Member) commented:

@dolfim Thanks for raising the issue. With the current information we are not able to figure out the reason behind this error either. And, as you mentioned, since the client-side throttling happens only on large clusters, it will be difficult for us to reproduce. After brainstorming about this in our community meeting, we came up with a few pointers:

  1. How many CRDs are installed in the cluster?
  2. Could you try setting --zap-log-level (ref) to increase the verbosity, so that we have more logs to debug?
  3. If possible, it would be helpful if you could share the operator code. We can try reproducing it locally, or at least verify whether there is something that could be improved or is wrong.

dolfim (Author) commented Sep 21, 2021

Thanks for looking at our issue and brainstorming about possibilities.

  1. kubectl get crd shows 152 CRDs
  2. We launched the controller with the following flags:
          args:
            - '--health-probe-bind-address=:8081'
            - '--metrics-bind-address=127.0.0.1:8080'
            - '--leader-elect'
            - '--leader-election-id=ibm-cps-operator'
            - '--zap-log-level=debug'

The debug logs now look like:

{"level":"info","ts":1632210241.791085,"logger":"cmd","msg":"Version","Go Version":"go1.16.8","GOOS":"linux","GOARCH":"amd64","helm-operator":"v1.11.0","commit":"28dcd12a776d8a8ff597e1d8527b08792e7312fd"}
{"level":"info","ts":1632210241.7931075,"logger":"cmd","msg":"Watch namespaces not configured by environment variable WATCH_NAMESPACE or file. Watching all namespaces.","Namespace":""}
I0921 07:44:03.538887       1 request.go:668] Waited for 1.046670127s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/storage.k8s.io/v1?timeout=32s
{"level":"info","ts":1632210245.8048182,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1632210245.8914208,"logger":"helm.controller","msg":"Watching resource","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp","namespace":"","reconcilePeriod":"1m0s"}
I0921 07:44:05.892824       1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/ibm-cps-operator...
{"level":"info","ts":1632210245.8931193,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
I0921 07:44:22.484699       1 leaderelection.go:253] successfully acquired lease openshift-operators/ibm-cps-operator
{"level":"debug","ts":1632210262.4847672,"logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"ConfigMap","namespace":"openshift-operators","name":"ibm-cps-operator","uid":"4600a34a-c21e-4698-b83f-0bcbdfc4929c","apiVersion":"v1","resourceVersion":"610631877"},"reason":"LeaderElection","message":"ibm-cps-operator-controller-manager-557c88f7f6-mhlt2_be7ef8ca-2fb9-41b1-9d2d-c3d044ab3633 became leader"}
{"level":"info","ts":1632210262.4856992,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting EventSource","source":"kind source: apps.cps.deepsearch.ibm.com/v1alpha1, Kind=KgAmqp"}
{"level":"info","ts":1632210262.485977,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting Controller"}
{"level":"info","ts":1632210262.9877303,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting workers","worker count":16}
{"level":"debug","ts":1632210262.9914088,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-edda45b6-kgamqp-90f46298","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210262.9916945,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-30b90812","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210262.9957469,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-f21574fe-kgamqp-5f323a8a","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210262.998314,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-59041f3c","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.091478,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-6dc398bc-kgamqp-5e842217","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0931582,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-0bbf559a","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.093546,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-fa7b5983","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0950687,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-801751ea","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0951493,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-7f18f401","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0968616,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-617ffb14-kgamqp-5b8ab839","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.190764,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-17e384d1","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.193185,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-a4f4d947","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0953085,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-1f739374","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.4915593,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-99cf5236-kgamqp-72952c4d","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.491967,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-fa5a863e","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.4956002,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-26239ca0-kgamqp-503b8ba5","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
I0921 07:44:25.491421       1 request.go:668] Waited for 1.078316488s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/node.k8s.io/v1beta1?timeout=32s
I0921 07:44:35.494616       1 request.go:668] Waited for 2.301755739s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/rbac.authorization.k8s.io/v1beta1?timeout=32s
I0921 07:44:50.992501       1 request.go:668] Waited for 1.000852157s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/scheduling.k8s.io/v1beta1?timeout=32s
E0921 07:44:53.894135       1 leaderelection.go:325] error retrieving resource lock openshift-operators/ibm-cps-operator: Get "https://172.21.0.1:443/api/v1/namespaces/openshift-operators/configmaps/ibm-cps-operator": context deadline exceeded
I0921 07:44:53.990918       1 leaderelection.go:278] failed to renew lease openshift-operators/ibm-cps-operator: timed out waiting for the condition
{"level":"error","ts":1632210293.9922085,"logger":"cmd","msg":"Manager exited non-zero.","Namespace":"","error":"leader election lost","stacktrace":"github.com/operator-framework/operator-sdk/internal/cmd/helm-operator/run.NewCmd.func1\n\t/workspace/internal/cmd/helm-operator/run/cmd.go:74\ngit.luolix.top/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngit.luolix.top/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngit.luolix.top/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\nmain.main\n\t/workspace/cmd/helm-operator/main.go:40\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}

It looks like it acquires the leader lease once, and then fails to renew it.

Since the timeout seems to be on the ConfigMap, I tried counting all of them:

# count configmaps
❯ kubectl get cm --all-namespaces -o name |wc -l
    1211

At the moment the controller doesn't survive more than 2 minutes, so I cannot inspect the API calls made while the memory usage rises.

dolfim (Author) commented Sep 21, 2021

I don't think the debug zap-log-level introduced more verbose output. The only debug entries are "Reconciling" and "Reconciled release".

As posted before, I see lots of API requests which (at least to me) look weird for our controller:

https://172.21.0.1:443/apis/migration.k8s.io/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/autoscaling/v2beta1?timeout=32s
https://172.21.0.1:443/apis/serving.knative.dev/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/image.openshift.io/v1?timeout=32s
https://172.21.0.1:443/apis/elasticsearch.k8s.elastic.co/v1?timeout=32s
https://172.21.0.1:443/apis/snapshot.storage.k8s.io/v1?timeout=32s
https://172.21.0.1:443/apis/cloudcredential.openshift.io/v1?timeout=32s
https://172.21.0.1:443/apis/operators.coreos.com/v1?timeout=32s
https://172.21.0.1:443/apis/extensions/v1beta1?timeout=32s
https://172.21.0.1:443/apis/autoscaling/v2beta2?timeout=32s
https://172.21.0.1:443/apis/whereabouts.cni.cncf.io/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/operators.coreos.com/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/triggers.tekton.dev/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/kibana.k8s.elastic.co/v1beta1?timeout=32s

I don't know the inner logic of the helm-operator, but I don't really see a reason for it to query things like Tekton, Knative, etc.
But I can imagine that if the controller is caching the output of all Tekton jobs, that might explain the origin of the memory usage.
Does anybody understand why those APIs are called? Is this maybe a Helm SDK issue?

jberkhahn (Contributor) commented:

When the Helm controller comes up, it dynamically queries the API server to discover the available API groups and build a client spec for talking to the cluster, so it may hit a bunch of APIs that look unrelated; see the sketch below. Going to have to do some more digging.
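
Roughly, that discovery pass looks like this (illustrative only, not the operator's exact code path): one GET per registered group/version. With 150+ CRDs plus the built-in groups that is well over a hundred requests, which at the default client-side QPS of 5 takes long enough to surface as throttling.

package example

import (
    "fmt"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/rest"
)

// listServerResources approximates the discovery sweep: the API server is
// asked for every registered group/version, which is why unrelated groups
// such as tekton.dev or serving.knative.dev show up in the request log.
func listServerResources(cfg *rest.Config) error {
    dc, err := discovery.NewDiscoveryClientForConfig(cfg)
    if err != nil {
        return err
    }
    lists, err := dc.ServerPreferredResources()
    if err != nil {
        return err
    }
    for _, l := range lists {
        fmt.Println(l.GroupVersion) // e.g. "serving.knative.dev/v1alpha1"
    }
    return nil
}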

jberkhahn (Contributor) commented:

So, trying to reproduce this on my rinky-dink minikube with 150+ CRDs deployed, my controller comes up fine with no hint of client-side rate limiting. I suspect that some OpenShift-specific stuff or configuration might be causing this.

jberkhahn (Contributor) commented:

Tried this again with the cluster also saturated with ConfigMaps; still unable to reproduce. Would it be possible to get access to the cluster you're seeing this error on? There is not much we can do locally without the ability to reproduce the error.

jberkhahn (Contributor) commented:

Closing this as the user is no longer experiencing the problem and we're unable to reproduce it.
