source-controller appears to hang while checking a git repo over ssh #1154

Open
maargenton opened this issue Jul 5, 2023 · 5 comments

@maargenton

I have the source-controller configured to watch a single git repo over ssh, with an interval of 1 minute and no explicit timeout (should default to 60s). After a little while (about 10 minutes since reboot in my latest case), the source controller stops checking the repo, stops logging anything (logging bumped to debug to investigate), and never recovers from that state.

The kustomize-controller, configured to reconcile every 10 minutes, keeps working and logging properly, but never sees any update after that point.

$ flux version
flux: v2.0.0-rc.5
helm-controller: v0.34.1
kustomize-controller: v1.0.0-rc.4
notification-controller: v1.0.0-rc.4
source-controller: v1.0.0-rc.5

From http://...:8080/metrics:

# HELP workqueue_unfinished_work_seconds How many seconds of work has been done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.
# TYPE workqueue_unfinished_work_seconds gauge
workqueue_unfinished_work_seconds{name="bucket"} 0
workqueue_unfinished_work_seconds{name="gitrepository"} 4992.46690147
workqueue_unfinished_work_seconds{name="helmchart"} 0
workqueue_unfinished_work_seconds{name="helmrepository"} 0
workqueue_unfinished_work_seconds{name="ocirepository"} 0
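
A gauge like the one above can be checked mechanically. The following is a minimal Go sketch, assuming the metrics text has already been fetched from the endpoint; the function name `stuckQueues` and the 120-second threshold are illustrative choices, not part of Flux or controller-runtime.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// stuckQueues scans Prometheus text-format output and returns the names of
// workqueues whose workqueue_unfinished_work_seconds value exceeds threshold.
// Hypothetical helper for illustration only.
func stuckQueues(metrics string, threshold float64) []string {
	const prefix = `workqueue_unfinished_work_seconds{name="`
	var stuck []string
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, prefix) {
			continue // skips "# HELP", "# TYPE" and unrelated metrics
		}
		rest := strings.TrimPrefix(line, prefix)
		name, value, ok := strings.Cut(rest, `"} `)
		if !ok {
			continue
		}
		seconds, err := strconv.ParseFloat(strings.TrimSpace(value), 64)
		if err != nil {
			continue
		}
		if seconds > threshold {
			stuck = append(stuck, name)
		}
	}
	return stuck
}

func main() {
	sample := `workqueue_unfinished_work_seconds{name="bucket"} 0
workqueue_unfinished_work_seconds{name="gitrepository"} 4992.46690147
workqueue_unfinished_work_seconds{name="helmchart"} 0`
	// 120s is an assumed threshold: twice the 60s timeout mentioned above.
	fmt.Println(stuckQueues(sample, 120)) // prints [gitrepository]
}
```

With the values reported here, only the `gitrepository` queue is flagged, which matches the single hung reconciliation.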

Additional context:

  • This is running on an Intel MacBook Pro, using vagrant to run an Ubuntu 22.04 VM, itself running a single-node k3s Kubernetes that I use for development and experimentation.
  • I initially thought this could be caused by a clock synchronization issue when the host goes to sleep, but in the latest instance, no sleep has occurred since vagrant up.
  • The repository is a private repo on GitHub. Synchronizations and reconciliations are working fine most of the time.
  • I experienced some upstream connectivity issues earlier today, with some instances of failed reconciliations and timeouts. This could be related, with some code paths handling connectivity issues hanging instead of timing out. I had this setup (these versions) running for a couple of weeks, and it was working properly until recently, as far as I can tell.

I'll be happy to provide any further details if needed.
Please let me know how I can help resolve this issue.

Thanks

@aryan9600
Member

What happens when you reconcile the GitRepository manually by running flux reconcile source git <gitrepo-name>?

@maargenton
Author

I tried that before; it hung as well, on ◎ waiting for GitRepository reconciliation.

@maargenton
Author

I rebooted my router, which killed the hanging connection and generated two error messages with stack traces; maybe that can help:

{
  "level": "error",
  "ts": "2023-07-05T08:16:47.392Z",
  "msg": "failed to checkout and determine revision: unable to list remote for 'ssh://git@github.com/...': ssh: handshake failed: read tcp 10.42.0.15:56574->140.82.114.4:22: read: connection reset by peer",
  "controller": "gitrepository",
  "controllerGroup": "source.toolkit.fluxcd.io",
  "controllerKind": "GitRepository",
  "GitRepository": {
    "name": "flux-system",
    "namespace": "flux-system"
  },
  "namespace": "flux-system",
  "name": "flux-system",
  "reconcileID": "b19c779d-28aa-4163-aa59-1cb7ed4f3373",
  "error": "failed to checkout and determine revision: unable to list remote for 'ssh://git@github.com/...': ssh: handshake failed: read tcp 10.42.0.15:56574->140.82.114.4:22: read: connection reset by peer",
  "stacktrace": "github.com/fluxcd/source-controller/internal/reconcile/summarize.logError\n\tgithub.com/fluxcd/source-controller/internal/reconcile/summarize/processor.go:99\ngithub.com/fluxcd/source-controller/internal/reconcile/summarize.ErrorActionHandler\n\tgithub.com/fluxcd/source-controller/internal/reconcile/summarize/processor.go:77\ngithub.com/fluxcd/source-controller/internal/reconcile/summarize.(*Helper).SummarizeAndPatch\n\tgithub.com/fluxcd/source-controller/internal/reconcile/summarize/summary.go:193\ngithub.com/fluxcd/source-controller/internal/controller.(*GitRepositoryReconciler).Reconcile.func1\n\tgithub.com/fluxcd/source-controller/internal/controller/gitrepository_controller.go:204\ngithub.com/fluxcd/source-controller/internal/controller.(*GitRepositoryReconciler).Reconcile\n\tgithub.com/fluxcd/source-controller/internal/controller/gitrepository_controller.go:240\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226"
}
{
  "level": "debug",
  "ts": "2023-07-05T08:16:47.393Z",
  "logger": "events",
  "msg": "failed to checkout and determine revision: unable to list remote for 'ssh://git@github.com/...': ssh: handshake failed: read tcp 10.42.0.15:56574->140.82.114.4:22: read: connection reset by peer",
  "type": "Warning",
  "object": {
    "kind": "GitRepository",
    "namespace": "flux-system",
    "name": "flux-system",
    "uid": "df67c776-a9e3-4d82-8534-30823b917661",
    "apiVersion": "source.toolkit.fluxcd.io/v1",
    "resourceVersion": "293766"
  },
  "reason": "GitOperationFailed"
}
{
  "level": "error",
  "ts": "2023-07-05T08:16:47.412Z",
  "msg": "Reconciler error",
  "controller": "gitrepository",
  "controllerGroup": "source.toolkit.fluxcd.io",
  "controllerKind": "GitRepository",
  "GitRepository": {
    "name": "flux-system",
    "namespace": "flux-system"
  },
  "namespace": "flux-system",
  "name": "flux-system",
  "reconcileID": "b19c779d-28aa-4163-aa59-1cb7ed4f3373",
  "error": "failed to checkout and determine revision: unable to list remote for 'ssh://git@github.com/...': ssh: handshake failed: read tcp 10.42.0.15:56574->140.82.114.4:22: read: connection reset by peer",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226"
}

@aryan9600
Member

ssh: handshake failed: read tcp 10.42.0.15:56574->140.82.114.4:22: read: connection reset by peer

This error, combined with the fact that the worker thread gets stuck, leads me to believe the cause is a network problem: the connection hangs indefinitely without completing or terminating, and it is only dropped when the router is rebooted.

@maargenton
Author

That sounds like a reasonable explanation.
But shouldn't that be covered by the default 60s timeout?
