
Lost leader election may be leading to restarts #810

Closed
pjbgf opened this issue Jul 1, 2022 · 5 comments

Comments

pjbgf (Member) commented Jul 1, 2022

Source controller seems to be restarting abruptly, potentially due to lost leader election.

❯ k get pod -n flux-system -l app=source-controller --context stag-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-lz9zq   1/1     Running   73         20d

❯ k get pod -n flux-system -l app=source-controller --context prod-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-pjt4q   1/1     Running   7          20d

Logs before last restart:

❯ kubectl logs -n flux-system -l app=source-controller --previous --context stag-eun
{"level":"info","ts":"2022-06-30T16:20:54.545Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.7.1-anova-k8s-deployment/be9df238f8324b0f62c2338a991a50164fe66500'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-test","namespace":"marketer"}
{"level":"info","ts":"2022-06-30T16:20:54.825Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/cb507d8bcd12660b03286f04f4cec3e4e5048fed'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"ml","namespace":"ml"}
{"level":"info","ts":"2022-06-30T16:20:55.236Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/d695fe4c1576eafe37e1bcdc5c85ce7781bb2283'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"transcend","namespace":"transcend"}
{"level":"info","ts":"2022-06-30T16:21:02.600Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/dee5837abc258ed1e920429cef70dd1484cd3dae'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"data","namespace":"data"}
{"level":"info","ts":"2022-06-30T16:21:05.469Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/6e3d4626c10d9fc02cb6876546740eb66ec6db2d'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}
{"level":"info","ts":"2022-06-30T16:21:12.310Z","logger":"controller.gitrepository","msg":"garbage collected 2 artifacts","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"core","namespace":"core"}
{"level":"info","ts":"2022-06-30T16:21:14.302Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/3474c21267453b67cbac67f2eae727321bf23f46'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"core","namespace":"core"}
E0630 16:21:15.031656       1 leaderelection.go:367] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/source-controller-leader-election": context deadline exceeded
I0630 16:21:15.031717       1 leaderelection.go:283] failed to renew lease flux-system/source-controller-leader-election: timed out waiting for the condition
{"level":"error","ts":"2022-06-30T16:21:23.502Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

In this specific case, the AKS clusters are syncing with the same Azure DevOps Git repositories (gitImplementation: libgit2).
The deployment has clusters in multiple regions, most of which work as expected, apart from the clusters based in North Europe, which get restarted every so often.

First reported by @mfamador at #402 (comment).
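
A possible mitigation, assuming the restarts really are caused by lease-renewal timeouts under API-server latency rather than a controller bug, is to relax the leader-election timings. The flag names below are the ones exposed by fluxcd/pkg/runtime; the values are illustrative only, not recommendations:

```yaml
# Illustrative kustomize patch for the source-controller Deployment.
# Longer lease/renew windows make the election more tolerant of a slow
# API server, at the cost of slower failover if the leader actually dies.
# Note: a strategic-merge patch replaces the whole args list, so any
# existing flags must be repeated here as well.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --enable-leader-election
            - --leader-election-lease-duration=60s
            - --leader-election-renew-deadline=50s
            - --leader-election-retry-period=10s
```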

pjbgf added the bug label on Jul 1, 2022
hiddeco (Member) commented Jul 1, 2022

How do we know this is actually a bug on our side, and not instability of the cluster?

On a side note: once we upgrade controller-runtime, leader election will start using Leases instead of ConfigMaps: https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.12.0
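
When the controller does use the Lease API, the lock state can be inspected directly, which helps distinguish a flapping leader from a crashed one (the lease name below is taken from the error log above):

```shell
# Inspect the leader-election Lease used by source-controller.
# spec.renewTime should advance roughly every retry period; large gaps
# between renewals point at API-server latency rather than at the
# controller itself.
kubectl get lease source-controller-leader-election \
  -n flux-system -o yaml
```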

pjbgf removed the bug label on Jul 1, 2022
pjbgf (Member, Author) commented Jul 1, 2022

@hiddeco we don't, therefore I removed the bug label.

pjbgf (Member, Author) commented Jul 1, 2022

@mfamador I noticed that the staging cluster has a 10-fold occurrence rate for this issue. Have you noticed any other metric that could be correlated, to indicate why that is?

I am assuming that, from an API server perspective, the only difference between the two environments would be the availability that Azure provides. Or are both environments configured with the same API server availability?

Can you please also confirm whether this is happening only in North Europe, or across all regions?

mfamador commented Jul 1, 2022

@pjbgf we're only using the rc-a00d0edc tag on the North Europe regions due to #402.
Regarding API server availability, yes, prod-eun has the uptime SLA activated while stag-eun hasn't.

In the other regions we're using v0.25.5, and they don't seem to have as many restarts, but those pods have only been running for a week.

❯ k get pod -n flux-system -l app=source-controller --context stag-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-lz9zq   1/1     Running   83         20d
❯ k get pod -n flux-system -l app=source-controller --context stag-ue
NAME                                READY   STATUS    RESTARTS   AGE
source-controller-dcd8fc8cb-vsqvg   1/1     Running   3          8d
❯ k get pod -n flux-system -l app=source-controller --context prod-ue
NAME                                READY   STATUS    RESTARTS   AGE
source-controller-dcd8fc8cb-5xt94   1/1     Running   0          8d
❯ k get pod -n flux-system -l app=source-controller --context prod-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-pjt4q   1/1     Running   7          20d
❯ k describe deploy source-controller -n flux-system --context stag-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-a00d0edc
❯ k describe deploy source-controller -n flux-system --context stag-ue | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:v0.25.5
❯ k describe deploy source-controller -n flux-system --context prod-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-a00d0edc
❯ k describe deploy source-controller -n flux-system --context prod-ue | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:v0.25.5

I'll leave here the latest logs from the most problematic region, stag-eun:

❯ kubectl logs -n flux-system -l app=source-controller --previous --context stag-eun
{"level":"info","ts":"2022-07-01T10:45:07.434Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.3.6-anova-k8s-deployment/61e09533622381155e867cff9c0beeccfff0049b'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-test","namespace":"vtm"}
{"level":"info","ts":"2022-07-01T10:45:12.235Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v0.4.7-anova-argo-workflow/5caa3023d61b6a11b3de52ad00232853296e92ae'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-workflow-helm-chart-stable","namespace":"connectivity"}
{"level":"info","ts":"2022-07-01T10:45:21.347Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v0.4.7-anova-argo-workflow/5caa3023d61b6a11b3de52ad00232853296e92ae'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-workflow-helm-chart-stable","namespace":"data"}
{"level":"info","ts":"2022-07-01T10:45:23.474Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/58e004f856d450bb873fef4786623edca2b562f2'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"devices","namespace":"devices"}
{"level":"info","ts":"2022-07-01T10:45:24.055Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.3.6-anova-k8s-deployment/61e09533622381155e867cff9c0beeccfff0049b'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-stable","namespace":"ml"}
{"level":"info","ts":"2022-07-01T10:45:24.778Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.8.0-anova-k8s-deployment/9e3640a46a17b08e6c9bc506d697b81fd32b5c32'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-test","namespace":"transcend"}
I0701 10:45:25.600013       1 trace.go:205] Trace[1120697951]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167 (01-Jul-2022 10:44:23.736) (total time: 61863ms):
Trace[1120697951]: ---"Objects listed" error:<nil> 61861ms (10:45:25.597)
Trace[1120697951]: [1m1.863913038s] [1m1.863913038s] END
{"level":"error","ts":"2022-07-01T10:45:25.768Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

pjbgf (Member, Author) commented Jul 1, 2022

@mfamador thanks for sharing the above, I will try to reproduce the issue and get back to you.

By the way, the changes in rc-a00d0edc are now included in the official releases from v0.25.6 onwards, so feel free to upgrade to the latest; that way you can have the same version across your fleet.

stefanprodan closed this as not planned (won't fix / can't reproduce / duplicate / stale) on Mar 4, 2024