Helm operator leader election lost #5186

Closed
dolfim opened this issue Sep 3, 2021 · 12 comments

dolfim commented Sep 3, 2021

Bug Report

What did you do?

Our operator is based on helm-operator v1.11.0. When the manager is running, the pod constantly crash-loops with the message "leader election lost".

Here are the logs produced across the restarts:

{"level":"info","ts":1630651003.0890665,"logger":"cmd","msg":"Version","Go Version":"go1.16.7","GOOS":"linux","GOARCH":"amd64","helm-operator":"v1.11.0","commit":"28dcd12a776d8a8ff597e1d8527b08792e7312fd"}
{"level":"info","ts":1630651003.09068,"logger":"cmd","msg":"Watch namespaces not configured by environment variable WATCH_NAMESPACE or file. Watching all namespaces.","Namespace":""}
I0903 06:36:44.244793       1 request.go:668] Waited for 1.039457424s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/elasticsearch.k8s.elastic.co/v1beta1?timeout=32s
{"level":"info","ts":1630651006.893125,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1630651006.8952067,"logger":"helm.controller","msg":"Watching resource","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp","namespace":"","reconcilePeriod":"1m0s"}
{"level":"info","ts":1630651006.8960173,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
I0903 06:36:46.896255       1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/ibm-cps-operator...
I0903 06:37:05.245608       1 leaderelection.go:253] successfully acquired lease openshift-operators/ibm-cps-operator
{"level":"info","ts":1630651025.2462952,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting EventSource","source":"kind source: apps.cps.deepsearch.ibm.com/v1alpha1, Kind=KgAmqp"}
{"level":"info","ts":1630651025.2464244,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting Controller"}
{"level":"info","ts":1630651025.7481437,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting workers","worker count":16}
I0903 06:37:08.088098       1 request.go:668] Waited for 1.030828222s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/operators.coreos.com/v1alpha1?timeout=32s
I0903 06:37:19.237634       1 request.go:668] Waited for 1.048805637s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/tuned.openshift.io/v1?timeout=32s
I0903 06:37:31.488584       1 request.go:668] Waited for 1.093715185s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/certificates.k8s.io/v1beta1?timeout=32s
E0903 06:37:41.490874       1 leaderelection.go:361] Failed to update lock: Put "https://172.21.0.1:443/api/v1/namespaces/openshift-operators/configmaps/ibm-cps-operator": context deadline exceeded
I0903 06:37:41.491073       1 leaderelection.go:278] failed to renew lease openshift-operators/ibm-cps-operator: timed out waiting for the condition
{"level":"error","ts":1630651061.4911356,"logger":"cmd","msg":"Manager exited non-zero.","Namespace":"","error":"leader election lost","stacktrace":"github.com/operator-framework/operator-sdk/internal/cmd/helm-operator/run.NewCmd.func1\n\t/workspace/internal/cmd/helm-operator/run/cmd.go:74\ngit.luolix.top/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngit.luolix.top/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngit.luolix.top/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\nmain.main\n\t/workspace/cmd/helm-operator/main.go:40\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}

Environment

Operator type:

/language helm

Kubernetes cluster type:

OpenShift 4.7.

$ operator-sdk version

operator-sdk version: "v1.11.0", commit: "28dcd12a776d8a8ff597e1d8527b08792e7312fd", kubernetes version: "1.20.2", go version: "go1.16.7", GOOS: "darwin", GOARCH: "amd64"

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T21:04:39Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0+558d959", GitCommit:"558d959332b3f1f7bf786673bf294e6e0932bb18", GitTreeState:"clean", BuildDate:"2021-07-13T20:45:42Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
openshift-ci bot added the language/helm label on Sep 3, 2021
dolfim (Author) commented Sep 3, 2021

It might be a side discussion, but from the logs it looks like the helm-operator is fetching APIs that should be unrelated to our Helm chart (https://172.21.0.1:443/apis/elasticsearch.k8s.elastic.co/v1beta1, etc.), which is causing some client-side throttling.
Could this be related?

jberkhahn added the triage/support label on Sep 13, 2021
jberkhahn added this to the v1.13.0 milestone on Sep 13, 2021
jberkhahn self-assigned this on Sep 13, 2021
jberkhahn (Contributor) commented:

If your operator is hitting some API that you've never heard of, that's a pretty good indicator that you've generated your operator incorrectly. It looks like you might be getting rate-limited due to hammering the API, and that's causing the leader election to time out. What Helm chart are you using, and what commands did you run to generate your operator?

dolfim (Author) commented Sep 15, 2021

The operator was created using the tutorial on the operator-sdk page:

operator-sdk init --domain cps.deepsearch.ibm.com --plugins helm
operator-sdk create api --group apps --version v1alpha1 --kind KgAmqp

jberkhahn (Contributor) commented Sep 17, 2021

Is "KgAmqp" a pre-existing Helm chart or are you just making a void Helm chart like the example? I can't get the void operator example to fail like this.

I suspect that you are getting rate-limited, but it looks like it's happening client-side (on the controller) rather than at the API server. This can be configured via flags passed to the controller-manager - is there anything funky when you look at the startup command there? "--kube-api-qps" in particular, as it looks like that's what sets the client-side rate limit for the controller-manager.
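
For illustration only (not the operator's actual code): the limit in question ultimately lives in client-go's rest.Config, and a flag like --kube-api-qps, if the binary exposes it, would feed these fields. When QPS and Burst are left at zero, client-go falls back to 5 requests/sec with a burst of 10, and anything beyond that budget shows up as the "client-side throttling" messages in your logs:

package example

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// newLessThrottledClient is a hypothetical helper: client-go defaults to
// QPS=5 and Burst=10 when these fields are zero, and requests beyond that
// budget are delayed and logged as "client-side throttling".
func newLessThrottledClient(cfg *rest.Config) (*kubernetes.Clientset, error) {
    cfg.QPS = 50    // assumed value, only to illustrate raising the limit
    cfg.Burst = 100
    return kubernetes.NewForConfig(cfg)
}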

dolfim (Author) commented Sep 17, 2021

Our current intuition is that the error is visible only on a large OCP 4.6 cluster (60+ nodes, 2500+ Pods, etc.).

As advised, we tried introducing selectors matching the labels of some KgAmqp CRs.
When the controller was not matching any CR, the logs were quiet and the memory usage was very low.
After patching a few (not all) KgAmqp CRs, we saw a huge increase in memory usage, and the logs show 1) constant "Reconciled release" messages and 2) "client-side throttling" for APIs we would not expect to be queried (tekton.dev, knative, etc.). Those APIs certainly do not match our labels and do not contain the CRD defined in watches.yaml.
For reference, here is our watches.yaml:

# Use the 'create api' subcommand to add watches to this file.
- group: apps.cps.deepsearch.ibm.com
  version: v1alpha1
  kind: KgAmqp
  chart: helm-charts/kgamqp
  watchDependentResources: false # adding this doesn't help
  selector: # adding this doesn't help, when
    matchLabels:
      app.kubernetes.io/name: kgamqp
      app.kubernetes.io/managed-by: ibm-cps-operator
#+kubebuilder:scaffold:watch

Regarding the previous questions:

The Helm chart was started from the vanilla example and modified to a) add a few more values, b) remove the HPA, and c) add a ConfigMap and Secrets.

The controller (read from the running Deployment) has the following args:

args:
- --health-probe-bind-address=:8081
- --metrics-bind-address=127.0.0.1:8080
- --leader-elect
- --leader-election-id=ibm-cps-operator

varshaprasad96 modified the milestones: v1.13.0, Backlog on Sep 20, 2021
varshaprasad96 added the triage/needs-information label on Sep 20, 2021
varshaprasad96 (Member) commented:

@dolfim Thanks for raising the issue. With the current information we are not able to figure out the reason behind this error either. And, as you mentioned, since the client-side throttling happens only on large clusters, it will be difficult for us to reproduce. After brainstorming about this in our community meeting, we came up with a few pointers:

  1. How many CRDs are installed in the cluster?
  2. Could you try setting --zap-log-level (ref) to increase the verbosity, so that we have more logs to debug?
  3. If possible, it would be helpful if you could share the operator code. We can try reproducing it locally, or at least verify whether there is something that could be improved or is wrong.

dolfim (Author) commented Sep 21, 2021

Thanks for looking at our issue and brainstorming about possibilities.

  1. kubectl get crd shows 152 CRDs
  2. We launched the controller with the following flags:
          args:
            - '--health-probe-bind-address=:8081'
            - '--metrics-bind-address=127.0.0.1:8080'
            - '--leader-elect'
            - '--leader-election-id=ibm-cps-operator'
            - '--zap-log-level=debug'

The debug logs now look like:

{"level":"info","ts":1632210241.791085,"logger":"cmd","msg":"Version","Go Version":"go1.16.8","GOOS":"linux","GOARCH":"amd64","helm-operator":"v1.11.0","commit":"28dcd12a776d8a8ff597e1d8527b08792e7312fd"}
{"level":"info","ts":1632210241.7931075,"logger":"cmd","msg":"Watch namespaces not configured by environment variable WATCH_NAMESPACE or file. Watching all namespaces.","Namespace":""}
I0921 07:44:03.538887       1 request.go:668] Waited for 1.046670127s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/storage.k8s.io/v1?timeout=32s
{"level":"info","ts":1632210245.8048182,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1632210245.8914208,"logger":"helm.controller","msg":"Watching resource","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp","namespace":"","reconcilePeriod":"1m0s"}
I0921 07:44:05.892824       1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/ibm-cps-operator...
{"level":"info","ts":1632210245.8931193,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
I0921 07:44:22.484699       1 leaderelection.go:253] successfully acquired lease openshift-operators/ibm-cps-operator
{"level":"debug","ts":1632210262.4847672,"logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"ConfigMap","namespace":"openshift-operators","name":"ibm-cps-operator","uid":"4600a34a-c21e-4698-b83f-0bcbdfc4929c","apiVersion":"v1","resourceVersion":"610631877"},"reason":"LeaderElection","message":"ibm-cps-operator-controller-manager-557c88f7f6-mhlt2_be7ef8ca-2fb9-41b1-9d2d-c3d044ab3633 became leader"}
{"level":"info","ts":1632210262.4856992,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting EventSource","source":"kind source: apps.cps.deepsearch.ibm.com/v1alpha1, Kind=KgAmqp"}
{"level":"info","ts":1632210262.485977,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting Controller"}
{"level":"info","ts":1632210262.9877303,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting workers","worker count":16}
{"level":"debug","ts":1632210262.9914088,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-edda45b6-kgamqp-90f46298","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210262.9916945,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-30b90812","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210262.9957469,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-f21574fe-kgamqp-5f323a8a","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210262.998314,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-59041f3c","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.091478,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-6dc398bc-kgamqp-5e842217","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0931582,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-0bbf559a","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.093546,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-fa7b5983","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0950687,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-801751ea","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0951493,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-7f18f401","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0968616,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-617ffb14-kgamqp-5b8ab839","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.190764,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-17e384d1","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.193185,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-a4f4d947","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0953085,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-1f739374","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.4915593,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-99cf5236-kgamqp-72952c4d","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.491967,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-fa5a863e","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.4956002,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-26239ca0-kgamqp-503b8ba5","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
I0921 07:44:25.491421       1 request.go:668] Waited for 1.078316488s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/node.k8s.io/v1beta1?timeout=32s
I0921 07:44:35.494616       1 request.go:668] Waited for 2.301755739s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/rbac.authorization.k8s.io/v1beta1?timeout=32s
I0921 07:44:50.992501       1 request.go:668] Waited for 1.000852157s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/scheduling.k8s.io/v1beta1?timeout=32s
E0921 07:44:53.894135       1 leaderelection.go:325] error retrieving resource lock openshift-operators/ibm-cps-operator: Get "https://172.21.0.1:443/api/v1/namespaces/openshift-operators/configmaps/ibm-cps-operator": context deadline exceeded
I0921 07:44:53.990918       1 leaderelection.go:278] failed to renew lease openshift-operators/ibm-cps-operator: timed out waiting for the condition
{"level":"error","ts":1632210293.9922085,"logger":"cmd","msg":"Manager exited non-zero.","Namespace":"","error":"leader election lost","stacktrace":"github.com/operator-framework/operator-sdk/internal/cmd/helm-operator/run.NewCmd.func1\n\t/workspace/internal/cmd/helm-operator/run/cmd.go:74\ngit.luolix.top/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngit.luolix.top/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngit.luolix.top/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\nmain.main\n\t/workspace/cmd/helm-operator/main.go:40\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}

It looks like it acquires the leader lease once, and then fails to renew it.

Since the timeout seems to be on the ConfigMap, I tried counting all of them:

# count configmaps
❯ kubectl get cm --all-namespaces -o name |wc -l
    1211

At the moment the controller doesn't survive more than 2 minutes, so I cannot inspect the API calls made while the memory usage rises.

dolfim (Author) commented Sep 21, 2021

I don't think the debug zap-log-level introduced more verbose output. The only debug entries are "Reconciling" and "Reconciled release".

As posted before, I see lots of API requests which (at least to me) look weird for our controller:

https://172.21.0.1:443/apis/migration.k8s.io/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/autoscaling/v2beta1?timeout=32s
https://172.21.0.1:443/apis/serving.knative.dev/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/image.openshift.io/v1?timeout=32s
https://172.21.0.1:443/apis/elasticsearch.k8s.elastic.co/v1?timeout=32s
https://172.21.0.1:443/apis/snapshot.storage.k8s.io/v1?timeout=32s
https://172.21.0.1:443/apis/cloudcredential.openshift.io/v1?timeout=32s
https://172.21.0.1:443/apis/operators.coreos.com/v1?timeout=32s
https://172.21.0.1:443/apis/extensions/v1beta1?timeout=32s
https://172.21.0.1:443/apis/autoscaling/v2beta2?timeout=32s
https://172.21.0.1:443/apis/whereabouts.cni.cncf.io/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/operators.coreos.com/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/triggers.tekton.dev/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/kibana.k8s.elastic.co/v1beta1?timeout=32s

I don't know the inner logic of the helm-operator, but I don't really see a reason for it to query things like Tekton, Knative, etc.
But I can imagine that if the controller is caching the output of all Tekton jobs, that might explain the origin of the memory usage.
Does anybody understand why those APIs are called? Is this maybe a Helm SDK issue?

jberkhahn (Contributor) commented:

When the Helm controller comes up, it dynamically queries the API server to discover the available API groups and build a client spec for talking to the cluster, so it may hit a bunch of APIs that look unrelated; see the sketch below. Going to have to do some more digging.
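
Roughly, that discovery pass looks like this (illustrative only, not the operator's exact code path): one GET per registered group/version. With 150+ CRDs plus the built-in groups that is well over a hundred requests, which at the default client-side QPS of 5 takes long enough to surface as throttling.

package example

import (
    "fmt"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/rest"
)

// listServerResources approximates the discovery sweep: the API server is
// asked for every registered group/version, which is why unrelated groups
// such as tekton.dev or serving.knative.dev show up in the request log.
func listServerResources(cfg *rest.Config) error {
    dc, err := discovery.NewDiscoveryClientForConfig(cfg)
    if err != nil {
        return err
    }
    lists, err := dc.ServerPreferredResources()
    if err != nil {
        return err
    }
    for _, l := range lists {
        fmt.Println(l.GroupVersion) // e.g. "serving.knative.dev/v1alpha1"
    }
    return nil
}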

jberkhahn (Contributor) commented:

So, trying to reproduce this on my rinky-dink minikube with 150+ CRDs deployed, my controller comes up fine with no hint of client-side rate limiting. I suspect that some OpenShift-specific stuff or configuration might be causing this.

jberkhahn (Contributor) commented:

Tried this again with the cluster also saturated with ConfigMaps; still unable to reproduce. Would it be possible to get access to the cluster you're seeing this error on? There is not much we can do locally without the ability to reproduce the error.

jberkhahn (Contributor) commented:

Closing this as the user is no longer experiencing the problem and we're unable to reproduce it.
