
Lost leader election may be leading to restarts #810

Closed
pjbgf opened this issue Jul 1, 2022 · 5 comments

Comments

pjbgf (Member) commented Jul 1, 2022

Source controller seems to be restarting abruptly, potentially due to lost leader election.

❯ k get pod -n flux-system -l app=source-controller --context stag-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-lz9zq   1/1     Running   73         20d

❯ k get pod -n flux-system -l app=source-controller --context prod-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-pjt4q   1/1     Running   7          20d

Logs before last restart:

❯ kubectl logs -n flux-system -l app=source-controller --previous --context stag-eun
{"level":"info","ts":"2022-06-30T16:20:54.545Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.7.1-anova-k8s-deployment/be9df238f8324b0f62c2338a991a50164fe66500'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-test","namespace":"marketer"}
{"level":"info","ts":"2022-06-30T16:20:54.825Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/cb507d8bcd12660b03286f04f4cec3e4e5048fed'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"ml","namespace":"ml"}
{"level":"info","ts":"2022-06-30T16:20:55.236Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/d695fe4c1576eafe37e1bcdc5c85ce7781bb2283'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"transcend","namespace":"transcend"}
{"level":"info","ts":"2022-06-30T16:21:02.600Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/dee5837abc258ed1e920429cef70dd1484cd3dae'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"data","namespace":"data"}
{"level":"info","ts":"2022-06-30T16:21:05.469Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/6e3d4626c10d9fc02cb6876546740eb66ec6db2d'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}
{"level":"info","ts":"2022-06-30T16:21:12.310Z","logger":"controller.gitrepository","msg":"garbage collected 2 artifacts","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"core","namespace":"core"}
{"level":"info","ts":"2022-06-30T16:21:14.302Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/3474c21267453b67cbac67f2eae727321bf23f46'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"core","namespace":"core"}
E0630 16:21:15.031656       1 leaderelection.go:367] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/source-controller-leader-election": context deadline exceeded
I0630 16:21:15.031717       1 leaderelection.go:283] failed to renew lease flux-system/source-controller-leader-election: timed out waiting for the condition
{"level":"error","ts":"2022-06-30T16:21:23.502Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

In this specific case, the AKS clusters are syncing with the same Azure DevOps Git repositories (gitImplementation: libgit2).
The deployment has clusters in multiple regions, most of which work as expected, apart from the clusters based in North Europe, which get restarted every so often.

First reported by @mfamador at #402 (comment).
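
A possible mitigation, assuming the restarts really are caused by lease-renewal timeouts under API-server latency rather than a controller bug, is to relax the leader-election timings. The flag names below are the ones exposed by fluxcd/pkg/runtime; the values are illustrative only, not recommendations:

```yaml
# Illustrative kustomize patch for the source-controller Deployment.
# Longer lease/renew windows make the election more tolerant of a slow
# API server, at the cost of slower failover if the leader actually dies.
# Note: a strategic-merge patch replaces the whole args list, so any
# existing flags must be repeated here as well.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --enable-leader-election
            - --leader-election-lease-duration=60s
            - --leader-election-renew-deadline=50s
            - --leader-election-retry-period=10s
```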

pjbgf added the bug label on Jul 1, 2022
hiddeco (Member) commented Jul 1, 2022

How do we know this is actually a bug on our side, and not instability of the cluster?

On a side note: once we upgrade controller-runtime, leader election will start using Leases instead of ConfigMaps: https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.12.0
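
When the controller does use the Lease API, the lock state can be inspected directly, which helps distinguish a flapping leader from a crashed one (the lease name below is taken from the error log above):

```shell
# Inspect the leader-election Lease used by source-controller.
# spec.renewTime should advance roughly every retry period; large gaps
# between renewals point at API-server latency rather than at the
# controller itself.
kubectl get lease source-controller-leader-election \
  -n flux-system -o yaml
```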

pjbgf removed the bug label on Jul 1, 2022
pjbgf (Member, Author) commented Jul 1, 2022

@hiddeco we don't, therefore I removed the bug label.

pjbgf (Member, Author) commented Jul 1, 2022

@mfamador I noticed that the staging cluster has a 10-fold occurrence rate for this issue. Have you noticed any other metric that could be correlated, to indicate why that is?

I am assuming that, from an API server perspective, the only difference between the two environments would be the availability that Azure provides. Or are both environments configured with the same API server availability?

Can you please also confirm whether this is happening only in North Europe, or across all regions?

mfamador commented Jul 1, 2022

@pjbgf we're only using the rc-a00d0edc tag on the North Europe regions due to #402.
Regarding API server availability, yes, prod-eun has the uptime SLA activated while stag-eun hasn't.

In the other regions we're using v0.25.5, and they don't seem to have as many restarts, but those pods have only been running for a week.

❯ k get pod -n flux-system -l app=source-controller --context stag-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-lz9zq   1/1     Running   83         20d
❯ k get pod -n flux-system -l app=source-controller --context stag-ue
NAME                                READY   STATUS    RESTARTS   AGE
source-controller-dcd8fc8cb-vsqvg   1/1     Running   3          8d
❯ k get pod -n flux-system -l app=source-controller --context prod-ue
NAME                                READY   STATUS    RESTARTS   AGE
source-controller-dcd8fc8cb-5xt94   1/1     Running   0          8d
❯ k get pod -n flux-system -l app=source-controller --context prod-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-pjt4q   1/1     Running   7          20d
❯ k describe deploy source-controller -n flux-system --context stag-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-a00d0edc
❯ k describe deploy source-controller -n flux-system --context stag-ue | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:v0.25.5
❯ k describe deploy source-controller -n flux-system --context prod-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-a00d0edc
❯ k describe deploy source-controller -n flux-system --context prod-ue | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:v0.25.5

I'll leave here the latest logs from the most problematic region, stag-eun:

❯ kubectl logs -n flux-system -l app=source-controller --previous --context stag-eun
{"level":"info","ts":"2022-07-01T10:45:07.434Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.3.6-anova-k8s-deployment/61e09533622381155e867cff9c0beeccfff0049b'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-test","namespace":"vtm"}
{"level":"info","ts":"2022-07-01T10:45:12.235Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v0.4.7-anova-argo-workflow/5caa3023d61b6a11b3de52ad00232853296e92ae'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-workflow-helm-chart-stable","namespace":"connectivity"}
{"level":"info","ts":"2022-07-01T10:45:21.347Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v0.4.7-anova-argo-workflow/5caa3023d61b6a11b3de52ad00232853296e92ae'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-workflow-helm-chart-stable","namespace":"data"}
{"level":"info","ts":"2022-07-01T10:45:23.474Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/58e004f856d450bb873fef4786623edca2b562f2'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"devices","namespace":"devices"}
{"level":"info","ts":"2022-07-01T10:45:24.055Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.3.6-anova-k8s-deployment/61e09533622381155e867cff9c0beeccfff0049b'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-stable","namespace":"ml"}
{"level":"info","ts":"2022-07-01T10:45:24.778Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.8.0-anova-k8s-deployment/9e3640a46a17b08e6c9bc506d697b81fd32b5c32'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-test","namespace":"transcend"}
I0701 10:45:25.600013       1 trace.go:205] Trace[1120697951]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167 (01-Jul-2022 10:44:23.736) (total time: 61863ms):
Trace[1120697951]: ---"Objects listed" error:<nil> 61861ms (10:45:25.597)
Trace[1120697951]: [1m1.863913038s] [1m1.863913038s] END
{"level":"error","ts":"2022-07-01T10:45:25.768Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

pjbgf (Member, Author) commented Jul 1, 2022

@mfamador thanks for sharing the above, I will try to reproduce the issue and get back to you.

By the way, the changes in rc-a00d0edc are now included in the official releases from v0.25.6 onwards, so feel free to upgrade to the latest; that way you can have the same version across your fleet.

stefanprodan closed this as not planned (won't fix / can't reproduce / duplicate / stale) on Mar 4, 2024