-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Helm operator leader election lost #5186
Comments
It might be a side discussion, but from the logs it looks like the helm-operator is fetching APIs which should not be related to our Helm chart |
If your operator is hitting some API that you've never heard of, that's a pretty good indicator that you've generated your operator incorrectly. It looks like you might be getting rate limited due to hammering the API, and that's causing the leader elections to time out. What helm chart are you using and what commands did you run to generate your operator? |
The operator was created using the tutorial on the operator-sdk page: operator-sdk init --domain cps.deepsearch.ibm.com --plugins helm
operator-sdk create api --group apps --version v1alpha1 --kind KgAmqp |
Is "KgAmqp" a pre-existing Helm chart or are you just making a void Helm chart like the example? I can't get the void operator example to fail like this. I suspect that you are getting rate-limited, but it looks like it's happening client side (on the controller) rather than by the API server. This can be configured via flags thrown on the controller-manager - is there anything funky when you look at the startup command there? "--kube-api-qps" in particular, as it looks like that's what sets the client side rate limit for the controller-manager. |
Our current intuition is that the error is visible only on a large OCP 4.6 cluster (+60 nodes, +2500 Pods, etc). As advised we tried out to introduce the # Use the 'create api' subcommand to add watches to this file.
- group: apps.cps.deepsearch.ibm.com
version: v1alpha1
kind: KgAmqp
chart: helm-charts/kgamqp
watchDependentResources: false # adding this doesn't help
selector: # adding this doesn't help, when
matchLabels:
app.kubernetes.io/name: kgamqp
app.kubernetes.io/managed-by: ibm-cps-operator
#+kubebuilder:scaffold:watch Regarding the previous questions. The helm-chart was started with the vanilla example, and modified with a) a bit more values, b) remove hpa, c) add configmap and secrets. The controller (reading from the running deployment) has the following args args:
- --health-probe-bind-address=:8081
- --metrics-bind-address=127.0.0.1:8080
- --leader-elect
- --leader-election-id=ibm-cps-operator |
@dolfim Thanks for raising the issue. With the current information we are also not able to figure out the reason behind this error. And as you have mentioned since the client-side throttling is happening only on large clusters, it would be difficult for us to reproduce. On brainstorming about this in our community meeting, we could think of a few pointers:
|
Thanks for looking at our issue and brainstorming about possibilities.
The debug logs now look like: {"level":"info","ts":1632210241.791085,"logger":"cmd","msg":"Version","Go Version":"go1.16.8","GOOS":"linux","GOARCH":"amd64","helm-operator":"v1.11.0","commit":"28dcd12a776d8a8ff597e1d8527b08792e7312fd"}
{"level":"info","ts":1632210241.7931075,"logger":"cmd","msg":"Watch namespaces not configured by environment variable WATCH_NAMESPACE or file. Watching all namespaces.","Namespace":""}
I0921 07:44:03.538887 1 request.go:668] Waited for 1.046670127s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/storage.k8s.io/v1?timeout=32s
{"level":"info","ts":1632210245.8048182,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1632210245.8914208,"logger":"helm.controller","msg":"Watching resource","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp","namespace":"","reconcilePeriod":"1m0s"}
I0921 07:44:05.892824 1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/ibm-cps-operator...
{"level":"info","ts":1632210245.8931193,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
I0921 07:44:22.484699 1 leaderelection.go:253] successfully acquired lease openshift-operators/ibm-cps-operator
{"level":"debug","ts":1632210262.4847672,"logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"ConfigMap","namespace":"openshift-operators","name":"ibm-cps-operator","uid":"4600a34a-c21e-4698-b83f-0bcbdfc4929c","apiVersion":"v1","resourceVersion":"610631877"},"reason":"LeaderElection","message":"ibm-cps-operator-controller-manager-557c88f7f6-mhlt2_be7ef8ca-2fb9-41b1-9d2d-c3d044ab3633 became leader"}
{"level":"info","ts":1632210262.4856992,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting EventSource","source":"kind source: apps.cps.deepsearch.ibm.com/v1alpha1, Kind=KgAmqp"}
{"level":"info","ts":1632210262.485977,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting Controller"}
{"level":"info","ts":1632210262.9877303,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting workers","worker count":16}
{"level":"debug","ts":1632210262.9914088,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-edda45b6-kgamqp-90f46298","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210262.9916945,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-30b90812","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210262.9957469,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-f21574fe-kgamqp-5f323a8a","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210262.998314,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-59041f3c","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.091478,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-6dc398bc-kgamqp-5e842217","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0931582,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-0bbf559a","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.093546,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-fa7b5983","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0950687,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-801751ea","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0951493,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-7f18f401","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0968616,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-617ffb14-kgamqp-5b8ab839","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.190764,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-17e384d1","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.193185,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-a4f4d947","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0953085,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-1f739374","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.4915593,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-99cf5236-kgamqp-72952c4d","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.491967,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-fa5a863e","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.4956002,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-26239ca0-kgamqp-503b8ba5","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
I0921 07:44:25.491421 1 request.go:668] Waited for 1.078316488s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/node.k8s.io/v1beta1?timeout=32s
I0921 07:44:35.494616 1 request.go:668] Waited for 2.301755739s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/rbac.authorization.k8s.io/v1beta1?timeout=32s
I0921 07:44:50.992501 1 request.go:668] Waited for 1.000852157s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/scheduling.k8s.io/v1beta1?timeout=32s
E0921 07:44:53.894135 1 leaderelection.go:325] error retrieving resource lock openshift-operators/ibm-cps-operator: Get "https://172.21.0.1:443/api/v1/namespaces/openshift-operators/configmaps/ibm-cps-operator": context deadline exceeded
I0921 07:44:53.990918 1 leaderelection.go:278] failed to renew lease openshift-operators/ibm-cps-operator: timed out waiting for the condition
{"level":"error","ts":1632210293.9922085,"logger":"cmd","msg":"Manager exited non-zero.","Namespace":"","error":"leader election lost","stacktrace":"github.com/operator-framework/operator-sdk/internal/cmd/helm-operator/run.NewCmd.func1\n\t/workspace/internal/cmd/helm-operator/run/cmd.go:74\ngit.luolix.top/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngit.luolix.top/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngit.luolix.top/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\nmain.main\n\t/workspace/cmd/helm-operator/main.go:40\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"} It looks like it is getting the leader election once, and then it fails. Since the timeout seems to be on the configmap, I tried count all of them: # count configmaps
❯ kubectl get cm --all-namespaces -o name |wc -l
1211 At the moment the controller doesn't survive more than 2min, so I cannot inspect the API calls done when raising the memory. |
I don't think the debug zap-log-level introduced more verbose output. The only debug entries are Reconciling and Reconciled release. As posted before, I see lots of API request which (at least to me) look weird for our controller.
I don't know the inner logic of the helm-controller but I don't really see a reason for it querying stuff like Tekton, Knative, etc. |
When the helm controller comes up it dynamically queries the API to build a spec for talking to the cluster, so it might hit a bunch of APIs that look weird. Going to have to do some more digging. |
So, trying to reproduce this on my rinky-dink minikube with 150+ CRDS deployed, my controller comes up fine with no hint of self-rate limiting. I suspect that maybe some Openshift specific stuff or configuration might be causing this. |
Tried this again with the cluster also saturated with CMs, still unable to reproduce this. Would it be possible to get access to the cluster you're experiencing this error on? Not much we can do locally without the ability to reproduce the error. |
Closing this as the user is no longer experiencing the problem and we're unable to reproduce it. |
Bug Report
What did you do?
Our operator is based on the helm-operator v1.11.0. When the manager is running, we get constant CrashLoop with the message
leader election lost
.Here are the logs produced by all the reboots:
Environment
Operator type:
/language helm
Kubernetes cluster type:
OpenShift 4.7.
$ operator-sdk version
$ kubectl version
The text was updated successfully, but these errors were encountered: