Azure DevOps: Source controller getting stuck #402

Closed
Tracked by #2593
mfamador opened this issue Jul 20, 2021 · 47 comments · Fixed by #606 or #649

@mfamador

mfamador commented Jul 20, 2021

Hello.

We have 3 AKS clusters, all running the exact same versions of flux (0.16.1) in two different Azure regions (North Europe and East US).

The source-controller version is 0.15.3.

❯ k describe deploy source-controller -n flux-system --context aks-stag-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:v0.15.3
❯ k describe deploy source-controller -n flux-system --context aks-stag-ue | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:v0.15.3

Both clusters are syncing with the same Azure DevOps git repositories (gitImplementation: libgit2).

Everything is working great on the East US clusters, but in North Europe source-controller gets stuck multiple times a day, and only killing it seems to make the sources reconcile again (we've created a cronjob to restart source-controller every half an hour).

Even restarting every half an hour, we're still getting a lot of gaps where there's no source reconciliation.

Screenshot 2021-07-20 at 09 46 51

In this state, any manual reconciliation also gets stuck and never finishes:

>  flux reconcile source git core -n core --context aks-stag-eun

► annotating GitRepository core in core namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation

There are no logs from source-controller when it's in this locked state ...

I'm pretty sure it's a connectivity problem to Azure DevOps or something not directly related to source-controller, but maybe it should recover or timeout from whatever it's trying to do (?)
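The recover-or-time-out idea can be sketched in Go roughly as follows. This is only an illustration of the general pattern, not source-controller's code: `runWithTimeout` is a hypothetical helper, and the hung operation is simulated with an empty `select`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithTimeout is a hypothetical helper: it runs op on its own goroutine and
// stops waiting once the timeout elapses, even if op never returns.
func runWithTimeout(timeout time.Duration, op func() error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan error, 1) // buffered so a late op does not block on send
	go func() { done <- op() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("operation abandoned: %w", ctx.Err())
	}
}

func main() {
	// Simulates a git operation that hangs forever on the network.
	hung := func() error { select {} }

	err := runWithTimeout(50*time.Millisecond, hung)
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```

Note that this only stops the caller from waiting; the abandoned goroutine keeps running, so a real fix would also need to close the underlying connection.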

I've also increased --concurrent from the default 2 to 6, but it doesn't seem to make any difference.

Thanks!

@o-otte

o-otte commented Jul 21, 2021

We are facing the same behavior with our AKS clusters running in Azure west europe and china east 2. The problem only occurs in the china region.

@mfamador
Author

Hey @o-otte, are you also using an Azure DevOps git source (libgit2), or another git host such as GitHub?

@o-otte

o-otte commented Jul 21, 2021

Hi @mfamador, yes we also use Azure DevOps

@gmaiztegi

Hello,

We just had the same issue in one of our clusters (AWS EKS in us-east-2), which is synchronising from GitHub. We updated to the latest version of Flux last Friday and have seen this behaviour only this one time.

Screenshot 2021-08-02 at 15 28 35

@kingdonb kingdonb added the bug Something isn't working label Aug 19, 2021
@kingdonb kingdonb self-assigned this Aug 20, 2021
@natarajmb

Hello,
Seeing the same issue with an Azure DevOps hosted git repo with libgit2. It has worked flawlessly for the last 6 months until now. The same git repo called from another cluster works fine. Any workarounds?

The source-controller version is the same as above: v0.15.3.

@kingdonb
Member

kingdonb commented Mar 1, 2022

The latest source-controller is v0.21.2 from Flux CLI v0.27.2. There have been substantial changes in all parts of Flux since June, when the source-controller v0.15.3 was released. Some of the most recent changes are updates that targeted issues like these, and those updates might have already resolved this issue for the original poster.

Are you able to reproduce it consistently @natarajmb, or does it go away when you restart? (Are you able to try an upgrade?)

If you can reproduce it with a current release, then we can dedicate some resources to trying to reproduce it again. I have a feeling this issue is either solved now, or it will be solved soon; but it seems tricky to reproduce. It will not be possible to investigate effectively based on reports for an older version, if we do not have a report stating for sure if this issue remains with the current version of Flux.

@natarajmb

@kingdonb I have now restarted source-controller and it started to pull the latest version of the code. FYI, we run Flux from a single repo, using path separation for multiple environments/clusters. I think this occurred because we had a git revert on the path specific to this cluster, after which it never pulled. Restarting source-controller fixed it. Thanks for the heads-up, I will upgrade to the latest Flux.

@pjbgf
Member

pjbgf commented Mar 28, 2022

Re-opening until we get confirmation that the issue has been fixed.

@mfamador @natarajmb do you mind trying the fixes included in the new experimental Managed Transport and let us know whether that fixes your issue please?

@mfamador
Author

@pjbgf, will do that, thanks. Removed the cronjobs that are restarting the source-controller every 30 mins and will monitor and let you know if the issue is gone.

@pjbgf pjbgf self-assigned this Mar 29, 2022
@pjbgf pjbgf moved this to In Progress in Maintainers' Focus Mar 29, 2022
@mfamador
Author

mfamador commented Mar 29, 2022

I forgot to apply the patch with the new env var for the experimental managed transport. Until I added it, v0.22.4 actually worked pretty well and didn't block for almost 4 hours.
After applying the patch with the EXPERIMENTAL_GIT_TRANSPORT env var, it's now crashing:

❯ k describe deploy source-controller -n flux-system | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:v0.22.4
❯ stern source-controller -n flux-system --tail 0

+ source-controller-7776dc897b-npppk › manager
source-controller-7776dc897b-npppk manager panic: runtime error: invalid memory address or nil pointer dereference
source-controller-7776dc897b-npppk manager [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a7cf53]
source-controller-7776dc897b-npppk manager
source-controller-7776dc897b-npppk manager goroutine 534 [running]:
source-controller-7776dc897b-npppk manager github.com/fluxcd/source-controller/pkg/git/libgit2/managed.(*sshSmartSubtransport).Close(0xc00efd3900)
source-controller-7776dc897b-npppk manager 	github.com/fluxcd/source-controller/pkg/git/libgit2/managed/ssh.go:268 +0x93
source-controller-7776dc897b-npppk manager github.com/libgit2/git2go/v33.smartSubtransportCloseCallback(0x404e06, 0xc000603ba0)
source-controller-7776dc897b-npppk manager 	github.com/libgit2/git2go/v33@v33.0.9/transport.go:409 +0x6f
source-controller-7776dc897b-npppk manager github.com/libgit2/git2go/v33._Cfunc_git_clone(0xc000a5edb8, 0x7fd51cfefbb0, 0x7fd51cfefc00, 0xc005d79520)
source-controller-7776dc897b-npppk manager 	_cgo_gotypes.go:3244 +0x4c
source-controller-7776dc897b-npppk manager github.com/libgit2/git2go/v33.Clone.func3(0xc005d79520, 0xc001b3ac60, 0xc0192420c0, 0x1b4db45)
source-controller-7776dc897b-npppk manager 	github.com/libgit2/git2go/v33@v33.0.9/clone.go:43 +0x91
source-controller-7776dc897b-npppk manager github.com/libgit2/git2go/v33.Clone({0xc0008b50c0, 0xc009690a80}, {0xc002322d40, 0x3d}, 0xc001b3ac60)
source-controller-7776dc897b-npppk manager 	github.com/libgit2/git2go/v33@v33.0.9/clone.go:43 +0x19e
source-controller-7776dc897b-npppk manager github.com/fluxcd/source-controller/pkg/git/libgit2.(*CheckoutTag).Checkout(0xc019242060, {0x27ca660, 0xc009690a80}, {0xc002322d40, 0x3d}, {0xc0008b50c0, 0x3e}, 0x0)
source-controller-7776dc897b-npppk manager 	github.com/fluxcd/source-controller/pkg/git/libgit2/checkout.go:97 +0x1e5
source-controller-7776dc897b-npppk manager github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).reconcileSource(0xc0006c0d70, {0x27ca698, 0xc00272c750}, 0xc001a15200, 0xc001ff95f0, 0x18, {0xc002322d40, 0x3d})
source-controller-7776dc897b-npppk manager 	github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:404 +0x99f
source-controller-7776dc897b-npppk manager github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).reconcile(0x2834958, {0x27ca698, 0xc00272c750}, 0xc001a15200, {0xc001225be8, 0x4, 0x40e494})
source-controller-7776dc897b-npppk manager 	github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:244 +0x3d5
source-controller-7776dc897b-npppk manager github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).Reconcile(0xc0006c0d70, {0x27ca698, 0xc00272c750}, {{{0xc000a5fb97, 0x2384b60}, {0xc000a20d08, 0x30}}})
source-controller-7776dc897b-npppk manager 	github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:205 +0x4bb
source-controller-7776dc897b-npppk manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00062a2c0, {0x27ca698, 0xc00272c180}, {{{0xc000a5fb97, 0x2384b60}, {0xc000a20d08, 0x415034}}})
source-controller-7776dc897b-npppk manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114 +0x26f
source-controller-7776dc897b-npppk manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00062a2c0, {0x27ca5f0, 0xc000434600}, {0x2226280, 0xc014bb97a0})
source-controller-7776dc897b-npppk manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311 +0x33e
source-controller-7776dc897b-npppk manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00062a2c0, {0x27ca5f0, 0xc000434600})
source-controller-7776dc897b-npppk manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266 +0x205
source-controller-7776dc897b-npppk manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
source-controller-7776dc897b-npppk manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227 +0x85
source-controller-7776dc897b-npppk manager created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
source-controller-7776dc897b-npppk manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:223 +0x357
- source-controller-7776dc897b-npppk › manager
+ source-controller-7776dc897b-npppk › manager

The source-controller deployment:

❯ k get deploy source-controller -oyaml | grep -C10 EXP
      - args:
        - --concurrent=6
        - --events-addr=http://notification-controller.flux-system.svc.cluster.local./
        - --watch-all-namespaces=true
        - --log-level=info
        - --log-encoding=json
        - --enable-leader-election
        - --storage-path=/data
        - --storage-adv-addr=source-controller.$(RUNTIME_NAMESPACE).svc.cluster.local.
        env:
        - name: EXPERIMENTAL_GIT_TRANSPORT
          value: "true"
        - name: RUNTIME_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: ghcr.io/fluxcd/source-controller:v0.22.4
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3

@mfamador
Author

I have most of my GitRepositories with gitImplementation: libgit2 but also have a few with gitImplementation: go-git, maybe that's the reason for source-controller and image-automation-controller crashing when EXPERIMENTAL_GIT_TRANSPORT is set to true (?)

@pjbgf pjbgf added this to the GA milestone Mar 30, 2022
@pjbgf
Member

pjbgf commented Mar 30, 2022

@mfamador Thank you so much for providing this information. I am investigating further into this and can see a few improvements to be done already.

Would you be able to change your log-level to trace and provide the logs preceding the error so I can make sure I am fixing your specific problem? Feel free to redact repository names.

Are you experiencing the error at every reconcile, or is it happening intermittently? In other words, does the GitRepository that leads to the error ever get reconciled correctly? (The trace log level may help answer this.)

@mfamador
Author

@pjbgf, I've set trace log level and EXPERIMENTAL_GIT_TRANSPORT to true.

It seems that source-controller is crashing (most of the time, not 100%) when I reconcile this libgit2 GitRepository, despite the reconcile command returning success:

❯ flux reconcile source git core -n core
► annotating GitRepository core in core namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ fetched revision master/e350b0cd214caec23994de0c41dd5e0491904654

Some logs when reconciling:




source-controller-6bf8dc44c7-z4ngs manager {"level":"Level(-2)","ts":"2022-03-30T11:44:11.063Z","logger":"managed-transport","msg":"[ssh]: sshSmartSubtransport.Close()"}
source-controller-6bf8dc44c7-z4ngs manager {"level":"Level(-2)","ts":"2022-03-30T11:44:11.063Z","logger":"managed-transport","msg":"[ssh]: cache hit","remoteAddress":"ssh.dev.azure.com:22"}
source-controller-6bf8dc44c7-z4ngs manager {"level":"Level(-2)","ts":"2022-03-30T11:44:11.063Z","logger":"managed-transport","msg":"[ssh]: creating new ssh session"}
source-controller-6bf8dc44c7-z4ngs manager {"level":"Level(-2)","ts":"2022-03-30T11:44:11.063Z","logger":"managed-transport","msg":"[ssh]: discard cached ssh client"}
source-controller-6bf8dc44c7-z4ngs manager {"level":"Level(-2)","ts":"2022-03-30T11:44:11.063Z","logger":"managed-transport","msg":"[ssh]: sshSmartSubtransport.Close()"}
source-controller-6bf8dc44c7-z4ngs manager panic: runtime error: invalid memory address or nil pointer dereference
source-controller-6bf8dc44c7-z4ngs manager [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a7cf53]
source-controller-6bf8dc44c7-z4ngs manager
source-controller-6bf8dc44c7-z4ngs manager goroutine 579 [running]:
source-controller-6bf8dc44c7-z4ngs manager github.com/fluxcd/source-controller/pkg/git/libgit2/managed.(*sshSmartSubtransport).Close(0xc002889ea0)
source-controller-6bf8dc44c7-z4ngs manager 	github.com/fluxcd/source-controller/pkg/git/libgit2/managed/ssh.go:268 +0x93
source-controller-6bf8dc44c7-z4ngs manager github.com/libgit2/git2go/v33.smartSubtransportCloseCallback(0x404e06, 0xc000892b60)
source-controller-6bf8dc44c7-z4ngs manager 	github.com/libgit2/git2go/v33@v33.0.9/transport.go:409 +0x6f
source-controller-6bf8dc44c7-z4ngs manager github.com/libgit2/git2go/v33._Cfunc_git_clone(0xc00577ace0, 0x7f22ed276700, 0x7f22ed221b60, 0xc0013e3040)
source-controller-6bf8dc44c7-z4ngs manager 	_cgo_gotypes.go:3244 +0x4c
source-controller-6bf8dc44c7-z4ngs manager github.com/libgit2/git2go/v33.Clone.func3(0xc00577abe4, 0x6, 0xc00390f3c0, 0x1b4db45)
source-controller-6bf8dc44c7-z4ngs manager 	github.com/libgit2/git2go/v33@v33.0.9/clone.go:43 +0x91
source-controller-6bf8dc44c7-z4ngs manager github.com/libgit2/git2go/v33.Clone({0xc002bca6c0, 0xc00528d740}, {0xc0028d0120, 0x27}, 0xc00580eb40)
source-controller-6bf8dc44c7-z4ngs manager 	github.com/libgit2/git2go/v33@v33.0.9/clone.go:43 +0x19e
source-controller-6bf8dc44c7-z4ngs manager github.com/fluxcd/source-controller/pkg/git/libgit2.(*CheckoutBranch).Checkout(0xc00390f360, {0x27ca660, 0xc00528d740}, {0xc0028d0120, 0x27}, {0xc002bca6c0, 0x3e}, 0x0)
source-controller-6bf8dc44c7-z4ngs manager 	github.com/fluxcd/source-controller/pkg/git/libgit2/checkout.go:64 +0x22d
source-controller-6bf8dc44c7-z4ngs manager github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).reconcileSource(0xc0007e8820, {0x27ca698, 0xc00d7b2c30}, 0xc005365200, 0xc00375b860, 0x18, {0xc0028d0120, 0x27})
source-controller-6bf8dc44c7-z4ngs manager 	github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:404 +0x99f
source-controller-6bf8dc44c7-z4ngs manager github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).reconcile(0x2834958, {0x27ca698, 0xc00d7b2c30}, 0xc005365200, {0xc0011ddbe8, 0x4, 0x32c0033d1d40})
source-controller-6bf8dc44c7-z4ngs manager 	github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:244 +0x3d5
source-controller-6bf8dc44c7-z4ngs manager github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).Reconcile(0xc0007e8820, {0x27ca698, 0xc00d7b2c30}, {{{0xc00577abbc, 0x2384b60}, {0xc00577abb8, 0x30}}})
source-controller-6bf8dc44c7-z4ngs manager 	github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:205 +0x4bb
source-controller-6bf8dc44c7-z4ngs manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000276bb0, {0x27ca698, 0xc00d7b29f0}, {{{0xc00577abbc, 0x2384b60}, {0xc00577abb8, 0x415034}}})
source-controller-6bf8dc44c7-z4ngs manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114 +0x26f
source-controller-6bf8dc44c7-z4ngs manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000276bb0, {0x27ca5f0, 0xc000358f40}, {0x2226280, 0xc00de68860})
source-controller-6bf8dc44c7-z4ngs manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311 +0x33e
source-controller-6bf8dc44c7-z4ngs manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000276bb0, {0x27ca5f0, 0xc000358f40})
source-controller-6bf8dc44c7-z4ngs manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266 +0x205
source-controller-6bf8dc44c7-z4ngs manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
source-controller-6bf8dc44c7-z4ngs manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227 +0x85
source-controller-6bf8dc44c7-z4ngs manager created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
source-controller-6bf8dc44c7-z4ngs manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:223 +0x357
- source-controller-6bf8dc44c7-z4ngs › manager

@mfamador
Author

mfamador commented Mar 30, 2022

I guess it crashes when reconciling for the second time only. (sorry, it's hard to be deterministic, I have a lot of GitRepos on this cluster)

Repository owner moved this from In Progress to Done in Maintainers' Focus Mar 30, 2022
@pjbgf pjbgf reopened this Mar 30, 2022
@pjbgf
Member

pjbgf commented Mar 30, 2022

@mfamador that's absolutely fine, thank you for all the information. We will be releasing a minor patch today including a potential fix for this under version v0.22.5.

@mfamador
Author

mfamador commented Mar 30, 2022

@pjbgf I've just tried v0.22.5 and, although it no longer crashes, I'm not able to sync the libgit2 GitRepos.

source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:48.516Z","logger":"managed-transport","msg":"[ssh]: sshSmartSubtransport.Close()"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:48.516Z","logger":"managed-transport","msg":"[ssh]: cache hit","remoteAddress":"ssh.dev.azure.com:22"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:48.516Z","logger":"managed-transport","msg":"[ssh]: creating new ssh session"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:48.516Z","logger":"managed-transport","msg":"[ssh]: discard cached ssh client"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:48.516Z","logger":"managed-transport","msg":"[ssh]: sshSmartSubtransport.Close()"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:48.516Z","logger":"managed-transport","msg":"[ssh]: sshSmartSubtransport.Free()"}
source-controller-64b99775d6-6wc46 manager {"level":"error","ts":"2022-03-30T17:16:48.516Z","msg":"failed to checkout and determine revision: unable to clone 'ssh://git@ssh.dev.azure.com/v3/anovateam/Mapleleaf/gitops-core': EOF","name":"core","namespace":"core","reconciler kind":"GitRepository","annotations":null,"error":"GitOperationFailed","stacktrace":"github.com/fluxcd/pkg/runtime/events.(*Recorder).Eventf\n\tgithub.com/fluxcd/pkg/runtime@v0.13.4/events/recorder.go:113\ngithub.com/fluxcd/source-controller/internal/reconcile/summarize.RecordContextualError\n\tgithub.com/fluxcd/source-controller/internal/reconcile/summarize/processor.go:47\ngithub.com/fluxcd/source-controller/internal/reconcile/summarize.(*Helper).SummarizeAndPatch\n\tgithub.com/fluxcd/source-controller/internal/reconcile/summarize/summary.go:180\ngithub.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).Reconcile.func1\n\tgithub.com/fluxcd/source-controller/controllers/gitrepository_controller.go:179\ngithub.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).Reconcile\n\tgithub.com/fluxcd/source-controller/controllers/gitrepository_controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227"}
source-controller-64b99775d6-6wc46 manager {"level":"debug","ts":"2022-03-30T17:16:48.517Z","logger":"events","msg":"Warning","object":{"kind":"GitRepository","namespace":"core","name":"core","uid":"28c9b056-ee19-4846-9011-eaee25ebccc5","apiVersion":"source.toolkit.fluxcd.io/v1beta2","resourceVersion":"204627799"},"reason":"GitOperationFailed","message":"failed to checkout and determine revision: unable to clone 'ssh://git@ssh.dev.azure.com/v3/anovateam/Mapleleaf/gitops-core': EOF"}
source-controller-64b99775d6-6wc46 manager {"level":"error","ts":"2022-03-30T17:16:48.537Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"core","namespace":"core","error":"failed to checkout and determine revision: unable to clone 'ssh://git@ssh.dev.azure.com/v3/anovateam/Mapleleaf/gitops-core': EOF","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:48.544Z","logger":"managed-transport","msg":"[ssh]: sshSmartSubtransport.Close()"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:48.544Z","logger":"managed-transport","msg":"[ssh]: cache miss","remoteAddress":"ssh.dev.azure.com:22"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:48.717Z","logger":"managed-transport","msg":"[ssh]: creating new ssh session"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:48.727Z","logger":"managed-transport","msg":"[ssh]: run on remote","cmd":"git-upload-pack '/v3/anovateam/Mapleleaf/gitops-core'"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:52.734Z","logger":"managed-transport","msg":"[ssh]: sshSmartSubtransportStream.Free()"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:52.734Z","logger":"managed-transport","msg":"[ssh]: sshSmartSubtransport.Close()"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:52.734Z","logger":"managed-transport","msg":"[ssh]: skipping session.wait"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:52.734Z","logger":"managed-transport","msg":"[ssh]: session.Close()"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:52.748Z","logger":"managed-transport","msg":"[ssh]: sshSmartSubtransport.Close()"}
source-controller-64b99775d6-6wc46 manager {"level":"Level(-2)","ts":"2022-03-30T17:16:52.748Z","logger":"managed-transport","msg":"[ssh]: sshSmartSubtransport.Free()"}
source-controller-64b99775d6-6wc46 manager {"level":"debug","ts":"2022-03-30T17:16:52.748Z","logger":"controller.gitrepository","msg":"git repository checked out","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"core","namespace":"core","url":"ssh://git@ssh.dev.azure.com/v3/MYREPOgitops-core","revision":"master/5d7fd027616e468ab29846558fb5d91b6083e17d"}
source-controller-64b99775d6-6wc46 manager {"level":"info","ts":"2022-03-30T17:16:52.748Z","logger":"controller.gitrepository","msg":"artifact up-to-date with remote revision: 'master/5d7fd027616e468ab29846558fb5d91b6083e17d'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"core","namespace":"core"}
source-controller-64b99775d6-6wc46 manager {"level":"info","ts":"2022-03-30T17:16:52.748Z","msg":"artifact up-to-date with remote revision: 'master/5d7fd027616e468ab29846558fb5d91b6083e17d'","name":"core","namespace":"core","reconciler kind":"GitRepository","reason":"ArtifactUpToDate","annotations":null}
source-controller-64b99775d6-6wc46 manager {"level":"debug","ts":"2022-03-30T17:16:52.748Z","logger":"events","msg":"Normal","object":{"kind":"GitRepository","namespace":"core","name":"core","uid":"28c9b056-ee19-4846-9011-eaee25ebccc5","apiVersion":"source.toolkit.fluxcd.io/v1beta2","resourceVersion":"204627804"},"reason":"ArtifactUpToDate","message":"artifact up-to-date with remote revision: 'master/5d7fd027616e468ab29846558fb5d91b6083e17d'"}

❯ flux reconcile source git core -n core
► annotating GitRepository core in core namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✗ GitRepository reconciliation failed: 'failed to checkout and determine revision: unable to clone 'ssh://git@ssh.dev.azure.com/v3/MYREPO/gitops-core': EOF'

Removing the EXPERIMENTAL_GIT_TRANSPORT env var it reconciles just fine.

@pjbgf
Member

pjbgf commented Apr 14, 2022

@mfamador thank you for reporting back. I have found a few scenarios in which this happens that are linked to some upstream issues (links below). In most of the cases I reproduced (with managed transport), it was related to concurrency. In my tests, depending on the point at which a second SSH connection is created, it gets stuck before completing its handshake.

Would you mind reducing your controller's concurrency to 1 and checking again, please? Fewer concurrent reconciliations should hopefully decrease the likelihood of the controller hanging.

Could you also tell me how many repositories you have running in your setup? Are all of them on a 5min interval?

golang/go#27140
golang/go#51926

@mfamador
Author

mfamador commented Apr 14, 2022

Sure, I'll try to reduce the concurrency and let it run over the weekend.

On each cluster, we have around 36 AzureDevOps repos. Most of them have 5m interval but I'm seeing some that have 1m which is not needed, I'll increase the interval as well.

❯ k get gitrepository -A | grep azure | wc -l
      36

My current concurrency is 6, btw.

@mfamador
Author

mfamador commented Apr 22, 2022

Got stuck again with this configuration
Screenshot 2022-04-22 at 17 14 50

@mfamador
Author

And always in the same region, North Europe. In the other regions everything's working just fine.

@stefanprodan stefanprodan changed the title Source controller getting stuck Azure DevOps: Source controller getting stuck Apr 28, 2022
@pjbgf
Member

pjbgf commented Apr 28, 2022

@mfamador would you be able to check what the latency is like on the clusters that have the issue vs the ones that don't?

@mfamador
Author

mfamador commented May 3, 2022

@pjbgf I didn't run any formal test, but I'm pretty sure that the latency is higher on the ones failing, yes.
All the source Git repositories are on Azure DevOps located in Canada; the source-controllers not getting stuck are in East US and the ones getting stuck are all in Northern Europe.

I've created a bastion container on every cluster and cloned the git repos a few times from there. Although the latency is usually higher in Europe, I wouldn't consider it significant: it clones the repos pretty fast.

US (not getting stuck):

Receiving objects: 100% (13850/13850), 5.32 MiB | 22.63 MiB/s, done.

Europe (getting stuck):

Receiving objects: 100% (13850/13850), 5.32 MiB | 13.43 MiB/s, done.

@pjbgf
Member

pjbgf commented May 12, 2022

@mfamador thank you for sharing this. I think the latency may be causing the issue: the crypto library has a known behaviour where, if the stdout of an SSH session is not serviced fast enough, the SSH connection may block.
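To illustrate the kind of blocking described here, below is a minimal Go sketch under stated assumptions. It does not use the SSH library at all: an unbuffered `io.Pipe` stands in for a session's stdout, and `serviceOutput` and `produceAndDrain` are hypothetical names. The point it shows is that the writer can only make progress while a reader is actively consuming the stream.

```go
package main

import (
	"fmt"
	"io"
)

// serviceOutput reads r to EOF on its own goroutine and delivers the bytes on
// the returned channel, so the writing side is never left waiting for a reader.
func serviceOutput(r io.Reader) <-chan []byte {
	out := make(chan []byte, 1)
	go func() {
		data, _ := io.ReadAll(r) // read error ignored to keep the sketch short
		out <- data
	}()
	return out
}

// produceAndDrain writes the given number of chunks into an unbuffered pipe
// (a stand-in for SSH session stdout) and returns how many bytes were read.
func produceAndDrain(chunks int) int {
	pr, pw := io.Pipe()         // each Write blocks until the reader consumes it
	result := serviceOutput(pr) // start reading before anything is written
	for i := 0; i < chunks; i++ {
		fmt.Fprintf(pw, "chunk %d\n", i)
	}
	pw.Close() // signals EOF to the reader
	return len(<-result)
}

func main() {
	fmt.Println(produceAndDrain(3), "bytes received")
}
```

If the call to `serviceOutput` were moved after the write loop, the first write would block forever, which is the same shape of hang as a stream that nobody services.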

Here's a release candidate version; I would be keen to see whether it resolves the issues you are observing:
ghcr.io/fluxcd/source-controller:rc-6d517589

Quick summary of the changes vs the current last published image:

  • New approach to establish SSH connections that ensures that session stdout is serviced as soon as possible.
  • Optimised git clones - if there was no new commit since last reconciliation, the clone operation is skipped. Expect decreased network usage and faster reconciliations.
  • Panic recovery from libgit2/git2go - no more crashes.
  • No SSH cached connection. This should decrease number of long-running active TCP connections and eventual errors.
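Two of the changes listed above, skipping the clone when nothing changed and recovering from panics, follow common Go patterns that can be sketched as below. This is not the release candidate's code: `checkoutFunc`, `safeCheckout` and `reconcile` are hypothetical names, and the crashing checkout is simulated with a nil pointer dereference similar to the one in the earlier stack trace.

```go
package main

import "fmt"

// checkoutFunc is a hypothetical stand-in for a call into libgit2/git2go.
type checkoutFunc func(url string) (revision string, err error)

// safeCheckout converts a panic raised inside fn into an ordinary error, so a
// single failing repository does not take the whole process down.
func safeCheckout(fn checkoutFunc, url string) (rev string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from git panic: %v", r)
		}
	}()
	return fn(url)
}

// reconcile skips the clone when the remote head matches the last observed
// revision; otherwise it performs a panic-safe checkout.
func reconcile(lastObserved, remoteHead, url string, fn checkoutFunc) (string, error) {
	if lastObserved != "" && lastObserved == remoteHead {
		return lastObserved, nil // nothing new, no network-heavy clone needed
	}
	return safeCheckout(fn, url)
}

func main() {
	// Simulated checkout that dereferences a nil pointer.
	crashing := func(string) (string, error) {
		var s *struct{ n int }
		return fmt.Sprint(s.n), nil
	}

	_, err := reconcile("", "master/abc", "ssh://example.invalid/repo", crashing)
	fmt.Println("error returned instead of crash:", err != nil)

	rev, _ := reconcile("master/abc", "master/abc", "ssh://example.invalid/repo", crashing)
	fmt.Println("clone skipped, revision:", rev)
}
```

A recovered panic only keeps the process alive; whatever state the failed call left behind still has to be discarded before the next attempt.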

@mfamador
Author

mfamador commented May 12, 2022

@pjbgf, thanks, I'll use this version on the problematic regions for a while.

❯ k describe deploy source-controller -n flux-system | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-6d517589

❯ flux reconcile source git data -n data
► annotating GitRepository data in data namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ fetched revision master/1b43817fee880c71f86c40aa7e6e80f0526d399d

I'm using "--concurrent=1" as you've recommended a while ago. Do you think I can change it back to 4 or 6?

@pjbgf
Member

pjbgf commented May 13, 2022

@mfamador yes, with the release candidate version you can revert to 4-6 as appropriate.

@mfamador
Author

mfamador commented May 14, 2022

@pjbgf, unfortunately, it still seems to get stuck in the same way.

I've been running the release candidate in both north European regions since yesterday.

❯ k describe deploy source-controller -n flux-system --context stag-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-6d517589

While most of them are still reconciling:

❯ date
Sat May 14 09:16:48 WEST 2022
❯ flux reconcile source git core -n core --context stag-eun
► annotating GitRepository core in core namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ fetched revision master/b05a155ca571621bb1e64ee30702c8c6120dfd49
❯ date
Sat May 14 09:17:09 WEST 2022

one of them is stuck and never reconciles:

❯ date
Sat May 14 09:11:01 WEST 2022
❯ flux reconcile source git data -n data --context stag-eun
► annotating GitRepository data in data namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation

✗ context deadline exceeded
❯
❯ flux reconcile source git data -n data --context stag-eun
► annotating GitRepository data in data namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✗ context deadline exceeded

Nothing shows up in the logs when I issue the reconcile command on the stuck git repo, but reconciling the other ones does produce log entries:

source-controller-6cc67bddbf-r8zx7 manager {"level":"info","ts":"2022-05-14T08:24:55.471Z","logger":"controller.gitrepository","msg":"reconciliation waiting","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"core","namespace":"core","reason":"no changes since last reconciliation: observed revision 'master/b05a155ca571621bb1e64ee30702c8c6120dfd49'","duration":300}

As usual, restarting the source-controller makes everything work again:

❯ k rollout restart deploy source-controller -n flux-system --context stag-eun
deployment.apps/source-controller restarted
❯ flux reconcile source git data -n data --context stag-eun
► annotating GitRepository data in data namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ fetched revision master/26538ab36319721a896f5f67405ee22926e8f4a2

@mfamador
Author

After a few days I got a different error:

❯ flux reconcile source git data -n data --context stag-eun
► annotating GitRepository data in data namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✗ GitRepository reconciliation failed: 'failed to checkout and determine revision: unable to fetch-connect to remote 'ssh://git@ssh.dev.azure.com/v3/<MYREPO>/gitops-data': ssh: handshake failed: EOF'
❯ k rollout restart deploy source-controller -n flux-system --context stag-eun
deployment.apps/source-controller restarted
❯ flux reconcile source git data -n data --context stag-ue
► annotating GitRepository data in data namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ fetched revision master/26538ab36319721a896f5f67405ee22926e8f4a2

@pjbgf
Member

pjbgf commented May 27, 2022

@mfamador Thanks again for always following through with the tests. 🙇

We are releasing a new version soon with a bunch of improvements for libgit2. Would you mind giving it a go on the release candidate to confirm whether this fixes your problem? ghcr.io/fluxcd/source-controller:rc-4b3e0f9a

@mfamador
Author

@pjbgf after 2 days running this new RC, it hasn't got stuck yet.

@mfamador
Author

mfamador commented Jun 2, 2022

@pjbgf after 5 days running this new RC, it hasn't got stuck yet.
I've noticed, though, that it has had a few restarts; I'm not sure why yet. We have high resource limits, so it shouldn't be an OOM kill.

❯ k get pod -n flux-system -l app=source-controller 
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-78675947c6-r89wn   1/1     Running   18         5d18h

❯ k -n flux-system top pod -l app=source-controller
NAME                                 CPU(cores)   MEMORY(bytes)
source-controller-7498d49b68-wrkwl   29m          711Mi

❯ k -n flux-system get deploy source-controller -oyaml | grep -A5 limits
          limits:
            cpu: "1"
            memory: 1500Mi
          requests:
            cpu: 50m
            memory: 350Mi

@mfamador
Author

mfamador commented Jun 2, 2022

We're only using the rc in the "problematic" region (North Europe, multiple clusters).
On the other ones, we're still using 0.24.4 which gets stuck in North Europe but not in the US regions, for instance.

It appears that the RC may only avoid getting stuck because it restarts from time to time, and it seems far less stable than the other versions: 23 restarts in 6 days vs 0 restarts in 5 days.

❯ k describe deploy source-controller -n flux-system --context stag-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-4b3e0f9a
    
❯ k describe deploy source-controller -n flux-system --context stag-ue | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:v0.24.4
❯ k get pod -n flux-system -l app=source-controller --context stag-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-78675947c6-r89wn   1/1     Running   23         6d6h

❯ k get pod -n flux-system -l app=source-controller --context stag-ue
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-7498d49b68-wrkwl   1/1     Running   0          5d14h

@aryan9600
Member

Hi @mfamador, would it be possible for you to post the logs of the controller before it restarts? It'd go a long way to help in understanding what's happening. Thanks!

@pjbgf
Member

pjbgf commented Jun 6, 2022

@mfamador thanks again for helping us debug this. Would you mind running the command below against the clusters that are misbehaving?

kubectl logs -n flux-system -l app=source-controller --previous

@mfamador
Author

mfamador commented Jun 6, 2022

@pjbgf

Here it goes:

❯ kubectl logs -n flux-system -l app=source-controller --previous

goroutine 1843 [chan receive]:
golang.org/x/crypto/ssh.(*handshakeTransport).readPacket(0xc00ca1e6e0)
	golang.org/x/crypto@v0.0.0-20220518034528-6f7dac969898/ssh/handshake.go:187 +0x39
golang.org/x/crypto/ssh.(*mux).onePacket(0xc000111500)
	golang.org/x/crypto@v0.0.0-20220518034528-6f7dac969898/ssh/mux.go:215 +0x2d
golang.org/x/crypto/ssh.(*mux).loop(0xc000111500)
	golang.org/x/crypto@v0.0.0-20220518034528-6f7dac969898/ssh/mux.go:190 +0x28
created by golang.org/x/crypto/ssh.newMux
	golang.org/x/crypto@v0.0.0-20220518034528-6f7dac969898/ssh/mux.go:128 +0x195

❯ k describe deploy source-controller -n flux-system --context stag-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-4b3e0f9a

and in another cluster in the same problematic region:

❯ v0.0.0-20220518034528-6f7dac969898/ssh/handshake.go
❯ kubectl logs -n flux-system -l app=source-controller --previous
golang.org/x/crypto/ssh.(*handshakeTransport).readLoop(0xc002c662c0)
	golang.org/x/crypto@v0.0.0-20220518034528-6f7dac969898/ssh/handshake.go:197 +0x45
created by golang.org/x/crypto/ssh.newClientTransport
	golang.org/x/crypto@v0.0.0-20220518034528-6f7dac969898/ssh/handshake.go:134 +0x1fb

goroutine 1734 [select]:
golang.org/x/crypto/ssh.(*handshakeTransport).kexLoop(0xc002c662c0)
	golang.org/x/crypto@v0.0.0-20220518034528-6f7dac969898/ssh/handshake.go:268 +0x485
created by golang.org/x/crypto/ssh.newClientTransport
	golang.org/x/crypto@v0.0.0-20220518034528-6f7dac969898/ssh/handshake.go:135 +0x23d

❯ k describe deploy source-controller -n flux-system --context prod-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-4b3e0f9a

@pjbgf
Member

pjbgf commented Jun 9, 2022

@mfamador after some additional changes I think we have an RC that may also mitigate the restarting issue.
The changes improve connection management and resolve a leak that we were experiencing in specific scenarios.
Would you mind giving it a try, please?

ghcr.io/fluxcd/source-controller:rc-a00d0edc

@mfamador
Author

@pjbgf after two days, we have 0 restarts and it is still reconciling with no issues.

❯ k describe deploy source-controller -n flux-system --context stag-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-a00d0edc
❯ k describe deploy source-controller -n flux-system --context prod-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-a00d0edc

❯ k get pod -n flux-system -l app=source-controller --context stag-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-lz9zq   1/1     Running   0          2d20h
❯ k get pod -n flux-system -l app=source-controller --context prod-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-pjt4q   1/1     Running   0          2d20h

❯ flux reconcile source git data -n data --context stag-eun
► annotating GitRepository data in data namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ fetched revision master/fd49c50dadae2a2ca451b12b98600651903b8e7e
❯ flux reconcile source git data -n data --context prod-eun
► annotating GitRepository data in data namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ fetched revision master/fd49c50dadae2a2ca451b12b98600651903b8e7e

@pjbgf
Member

pjbgf commented Jun 13, 2022

@mfamador that's great news, thank you for helping us through this. 🙇

We will release a new patch with this fix later this week.

@mfamador
Author

@pjbgf after 4 days it had a few restarts; adding the logs:

❯ kubectl logs -n flux-system -l app=source-controller --previous --context stag-eun
golang.org/x/crypto/ssh.(*handshakeTransport).kexLoop(0xc007f48420)
	golang.org/x/crypto@v0.0.0-20220525230936-793ad666bf5e/ssh/handshake.go:268 +0x485
created by golang.org/x/crypto/ssh.newClientTransport
	golang.org/x/crypto@v0.0.0-20220525230936-793ad666bf5e/ssh/handshake.go:135 +0x23d

goroutine 4740 [select]:
golang.org/x/crypto/ssh.(*handshakeTransport).kexLoop(0xc004a066e0)
	golang.org/x/crypto@v0.0.0-20220525230936-793ad666bf5e/ssh/handshake.go:268 +0x485
created by golang.org/x/crypto/ssh.newClientTransport
	golang.org/x/crypto@v0.0.0-20220525230936-793ad666bf5e/ssh/handshake.go:135 +0x23d

@mfamador
Author

mfamador commented Jun 15, 2022

0 restarts since yesterday, and in the other cluster, same region, 0 restarts over the last 5 days, pretty stable and working fine:

❯ k get pod -n flux-system -l app=source-controller --context prod-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-pjt4q   1/1     Running   0          5d3h

❯ flux reconcile source git data -n data --context prod-eun
► annotating GitRepository data in data namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ fetched revision master/a751b3d9e3851afd41859e509f376a8cf35b6127

@pjbgf
Member

pjbgf commented Jun 30, 2022

@mfamador have the restarts stayed at 0 now that we are a few weeks in?

@mfamador
Author

mfamador commented Jun 30, 2022

@pjbgf there are a few restarts in both problematic North European clusters:

❯ k get pod -n flux-system -l app=source-controller --context stag-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-lz9zq   1/1     Running   73         20d

❯ k get pod -n flux-system -l app=source-controller --context prod-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-pjt4q   1/1     Running   7          20d

Not sure if it helps but here are some logs:

❯ kubectl logs -n flux-system -l app=source-controller --previous --context stag-eun
{"level":"info","ts":"2022-06-30T16:20:54.545Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.7.1-anova-k8s-deployment/be9df238f8324b0f62c2338a991a50164fe66500'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-test","namespace":"marketer"}
{"level":"info","ts":"2022-06-30T16:20:54.825Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/cb507d8bcd12660b03286f04f4cec3e4e5048fed'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"ml","namespace":"ml"}
{"level":"info","ts":"2022-06-30T16:20:55.236Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/d695fe4c1576eafe37e1bcdc5c85ce7781bb2283'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"transcend","namespace":"transcend"}
{"level":"info","ts":"2022-06-30T16:21:02.600Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/dee5837abc258ed1e920429cef70dd1484cd3dae'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"data","namespace":"data"}
{"level":"info","ts":"2022-06-30T16:21:05.469Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/6e3d4626c10d9fc02cb6876546740eb66ec6db2d'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}
{"level":"info","ts":"2022-06-30T16:21:12.310Z","logger":"controller.gitrepository","msg":"garbage collected 2 artifacts","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"core","namespace":"core"}
{"level":"info","ts":"2022-06-30T16:21:14.302Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/3474c21267453b67cbac67f2eae727321bf23f46'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"core","namespace":"core"}
E0630 16:21:15.031656       1 leaderelection.go:367] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/source-controller-leader-election": context deadline exceeded
I0630 16:21:15.031717       1 leaderelection.go:283] failed to renew lease flux-system/source-controller-leader-election: timed out waiting for the condition
{"level":"error","ts":"2022-06-30T16:21:23.502Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

❯ kubectl logs -n flux-system -l app=source-controller --previous --context prod-eun
k8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc015771020, {0xc001b88400, 0x400, 0x400})
	k8s.io/apimachinery@v0.24.1/pkg/util/framer/framer.go:152 +0x19c
k8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc0170f2140, 0x2, {0x2a36f38, 0xc0026bc780})
	k8s.io/apimachinery@v0.24.1/pkg/runtime/serializer/streaming/streaming.go:77 +0xa7
k8s.io/client-go/rest/watch.(*Decoder).Decode(0xc0055643a0)
	k8s.io/client-go@v0.24.1/rest/watch/decoder.go:49 +0x4f
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc0026bc740)
	k8s.io/apimachinery@v0.24.1/pkg/watch/streamwatcher.go:105 +0x11c
created by k8s.io/apimachinery/pkg/watch.NewStreamWatcher
	k8s.io/apimachinery@v0.24.1/pkg/watch/streamwatcher.go:76 +0x135

Anyway, there's no stuck controller now, which mitigates our initial issue. Thanks for your help.
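
In the stag-eun log above, the last lines show the lease renewal failing and the manager exiting with `leader election lost`, which is consistent with the container being restarted. A small sketch for pulling the final error entry out of such mixed output (JSON lines from the controller plus klog-style lines) could look like this; `lastError` is a hypothetical helper and the field names are taken from the log lines above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the fields used here from the controller's JSON log lines.
type logEntry struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
	Error string `json:"error"`
}

// lastError returns the last JSON log entry with level "error".
// Lines that are not JSON objects (for example klog-style lines) are skipped.
func lastError(logs string) (logEntry, bool) {
	var found logEntry
	ok := false
	for _, line := range strings.Split(logs, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "{") {
			continue
		}
		var e logEntry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			continue
		}
		if e.Level == "error" {
			found, ok = e, true
		}
	}
	return found, ok
}

func main() {
	// Two lines copied from the stag-eun output above.
	logs := `I0630 16:21:15.031717       1 leaderelection.go:283] failed to renew lease flux-system/source-controller-leader-election: timed out waiting for the condition
{"level":"error","ts":"2022-06-30T16:21:23.502Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}`

	if e, ok := lastError(logs); ok {
		fmt.Printf("%s: %s\n", e.Msg, e.Error)
	}
}
```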

@pjbgf
Member

pjbgf commented Jul 1, 2022

@mfamador as you mentioned, the restarts are orthogonal to the initially reported issue so I created a new issue for that one, whilst I will be closing this one.

Thank you so much for all the help getting this resolved.

@pjbgf pjbgf closed this as completed Jul 1, 2022
Repository owner moved this from In Progress to Done in Maintainers' Focus Jul 1, 2022