
Controller stops reconciling, needs restart #282

Closed · Tracked by #2593
stefanprodan opened this issue Dec 17, 2021 · 38 comments
Labels: bug (Something isn't working)

@stefanprodan (Member) commented Dec 17, 2021

Reported here: fluxcd/flux2#2219

Having an automation that should reconcile every 7 minutes:

apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: flux-system
  namespace: flux-system
spec:
  git:
    checkout:
      ref:
        branch: master
    commit:
      author:
        email: me@example.com
        name: me
      messageTemplate: '{{range .Updated.Images}}{{println .}}{{end}}'
    push:
      branch: master
  interval: 7m0s
  sourceRef:
    kind: GitRepository
    name: flux-system
  update:
    path: ./staging
    strategy: Setters

The reconciliation stopped two days ago for unknown reasons:

$ date
Fri Dec 17 16:25:48 EET 2021

$ flux get image update

NAME       	READY	MESSAGE                                                     	LAST RUN                 	SUSPENDED
flux-system	True 	no updates made; last commit 8574614 at 2021-12-14T22:47:08Z	2021-12-15T08:15:01-07:00	False
@stefanprodan added the bug label Dec 17, 2021
@jwerre commented Dec 17, 2021

This coincided with outages at AWS in us-west-2 where I'm running Kubernetes on EKS. My entire cluster went down on that day for about 10 minutes and when it came back up I ran into this issue.

@bondido commented Dec 30, 2021

Hi,
While waiting for a permanent fix, is there any advice on how to detect a "stuck" image-automation-controller?
Are there any metrics or logs we should monitor?
We have been suffering from this issue (roughly a couple of times per week) for some time, and the only alert comes from users who pushed their images to the container registry and didn't see a cluster deployment for quite some time.

Another observation is that in most cases restarting image-automation-controller is sufficient, but there were two occasions when we also needed to restart source-controller.

@jwerre commented Dec 30, 2021

@bondido how about something like this:

kubectl get pod -n flux-system

NAME                                          READY   STATUS    RESTARTS   AGE
helm-controller-96dc99bfd-p9g4q               1/1     Running   0          9d
image-automation-controller-64c955c59-ckbft   1/1     Running   0          9d  <<< ------
image-reflector-controller-55fb7f765d-cr8zn   1/1     Running   0          9d
kustomize-controller-7bc878f768-xf2xb         1/1     Running   0          9d
notification-controller-945795558-q8nht       1/1     Running   0          9d
source-controller-65665fd68f-n7qqz            1/1     Running   0          9d

@bondido commented Dec 30, 2021

@jwerre
I'm afraid it won't help, as a long-running controller is not the problem.
The problem is that it's running but not doing its job, and not reflecting that in any log or metric (I still hope I missed something here :-) ).

@jwerre commented Dec 31, 2021

Ah yes... I had the same problem; you need to restart the pod, e.g.:

kubectl delete pods image-automation-controller-64c955c59-ckbft -n flux-system

@bondido commented Jan 3, 2022

Sure @jwerre, I know. Thanks :-)
As I mentioned earlier, sometimes restarting image-automation-controller is not enough and we need to restart the source-controller pod as well.

As a temporary kind of "automatic mitigation" we introduced a scheduled external script that checks .status.lastAutomationRunTime of the imageupdateautomation resource and invokes 'kubectl delete pod' if the controller seems to have been stuck for too long.
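
A minimal sketch of such a watchdog, assuming GNU date, the default flux-system names, the standard app=<controller-name> pod labels, and that anything older than 30 minutes counts as stuck - all assumptions to adjust for your own setup:

#!/bin/sh
# Hypothetical watchdog: restart the controller if the last automation run is too old.
NS=flux-system
THRESHOLD_SECONDS=1800  # assumption: well above the configured 7m interval

LAST_RUN=$(kubectl -n "$NS" get imageupdateautomation flux-system \
  -o jsonpath='{.status.lastAutomationRunTime}')
LAST_RUN_EPOCH=$(date -d "$LAST_RUN" +%s)   # GNU date

if [ $(( $(date +%s) - LAST_RUN_EPOCH )) -gt "$THRESHOLD_SECONDS" ]; then
  kubectl -n "$NS" delete pod -l app=image-automation-controller
  # If that is not enough (see above), also restart source-controller:
  # kubectl -n "$NS" delete pod -l app=source-controller
fi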

We'd love to see this bug fixed soon, anyway :-)

@jwerre commented Jan 4, 2022

@bondido, I misunderstood your question. Sorry for the confusion.

@squaremo (Member) commented Jan 5, 2022

sometimes restarting image-automation-controller is not enough and we need to restart the source-controller pod as well.

@bondido Do you mean that source-controller has the same problem; or, that to get image-automation-controller to start working again, you needed to restart source-controller?

@squaremo self-assigned this Jan 5, 2022
@bondido commented Jan 5, 2022

@bondido Do you mean that source-controller has the same problem; or, that to get image-automation-controller to start working again, you needed to restart source-controller?

We have to restart source-controller for image-automation-controller to start working. In fact we restart both - first image-automation-controller, and if we don't see any improvement within a couple of minutes - source-controller.

So far, we haven't tried restarting just source-controller.

@squaremo (Member) commented Jan 5, 2022

@bondido Thanks for elaborating! On the face of it, I wouldn't expect restarting source-controller to have any effect on image-automation-controller, because it works independently: it only coincidentally refers to the same GitRepository objects, and doesn't alter anything at the upstream git repository (that might "unlock" image-automation-controller). Do you have good evidence that restarting source-controller is exactly what unblocks image-automation-controller; or could it be a sort of "reliable coincidence"?

@stefanprodan (Member, Author)

@bondido @jwerre can you please exec into the controller pod and see if there is anything left in tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!
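
For reference, a non-interactive way to run those checks - assuming the default flux-system namespace and Deployment name:

kubectl -n flux-system exec deploy/image-automation-controller -- ls -lah /tmp
# Quote the glob so it expands inside the container rather than on your workstation:
kubectl -n flux-system exec deploy/image-automation-controller -- sh -c 'du -sh /tmp/*'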

@bondido commented Jan 5, 2022

Do you have good evidence that restarting source-controller is exactly what unblocks image-automation-controller; or could it be a sort of "reliable coincidence"?

I can't be 100% sure, as I couldn't get to any logs or metrics confirming what was actually happening.

In the first two cases, restarting image-automation-controller was enough and new images were applied to the cluster just seconds after the restart.
But then - on the third occurrence - we restarted image-automation-controller and nothing happened for over 10 minutes. So it was just a blind shot by my colleague to restart source-controller, after which everything started working.

The situation repeated exactly like the above one more time.

@bondido commented Jan 5, 2022

@bondido @jwerre can you please exec into the controller pod and see if there is anything left in tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!

You mean checking the directory contents inside the "stuck" controller pod, don't you?
(In a "working OK" pod I've checked it and it's empty.)

@jwerre commented Jan 6, 2022

@jwerre can you please exec into the controller pod and see if there is anything left in tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!

ls -lah /tmp
total 0
drwxrwsrwx 2 root 1337  6 Jan  6 15:22 .
drwxr-xr-x 1 root root 17 Jan  4 16:20 ..
du -sh /tmp/*
du: cannot access '/tmp/AlTest1.err': No such file or directory
du: cannot access '/tmp/AlTest1.out': No such file or directory
du: cannot access '/tmp/adobegc.log': No such file or directory
du: cannot access '/tmp/com.apple.launchd.yGWoPxlEXv': No such file or directory
du: cannot access '/tmp/com.google.Keystone': No such file or directory
du: cannot access '/tmp/fseventsd-uuid': No such file or directory
du: cannot access '/tmp/powerlog': No such file or directory
command terminated with exit code 1

@jwerre commented Jan 6, 2022

I should mention that I haven't had any problems since I restarted the controller pod the first time.

@ahisette commented Jan 10, 2022

Hello,

can you please exec into the controller pod and see if there is anything left in tmp by running ls -lah /tmp

In my case, on the stuck controller, /tmp contains a directory named after the GitRepository source of the frozen ImageUpdateAutomation.

And a simple restart of the automation controller is enough to unblock the frozen ImageUpdateAutomation.

@kingdonb (Member) commented Jan 11, 2022

I've gone to some lengths to try to reproduce this issue. I ran image-automation-controller with a larger-than-average git repo (stuffed with several mp4 video files), ramped up all of the unfavorable network conditions (packet loss, latency) with Chaos Mesh, and reconfigured the liveness checks so that image-automation-controller wouldn't be restarted for network reasons (which was tricky, because it actually needs the network in order to perform the leader election).

With all webhooks configured as receivers for image and git events to make sure everything happens quickly after each commit/image release, I ran this for several hours with updates every 45 seconds, and I wasn't able to get the image-automation-controller into any stuck or hanging state. I was able to cause it to stop working due to heavy packet loss, but nothing I did seemed to induce any sort of hanging behavior. (When the unfavorable conditions abated, the controller always recovered and went back to committing and pushing changes for me.)

If anyone knows what type of network issue or abnormal response from GitHub triggers the condition, then surely I can reproduce it and make progress on this issue, but right now I have not made significant progress on it.

@bondido commented Jan 20, 2022

@bondido @jwerre can you please exec into the controller pod and see if there is anything left in tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!

After more than 16 days since the last problems, the controller has just gotten "stuck" on one of our clusters.
The output of the above commands run in the stuck pod is:

total 12K
drwxrwsrwx 3 root       1337 4.0K Jan 20 09:49 .
drwxr-xr-x 1 root       root 4.0K Jan  3 09:24 ..
drwx--S--- 6 controller 1337 4.0K Jan 20 09:49 flux-system-flux-system56439449

3.0M	/tmp

@ahisette

Another example: on a cluster with 16 different ImageUpdateAutomation resources, I have 3 of them "stuck".

controller@image-automation-controller-65ccc7d5c6-7mgv9:/tmp$ ls -alh
total 20K
drwxrwsrwx 5 root       1337 4.0K Jan 21 12:37 .
drwxr-xr-x 1 root       root 4.0K Jan 17 09:21 ..
drwx--S--- 3 controller 1337 4.0K Jan 20 08:55 xxxx-xxxx-flux1428284464
drwx--S--- 3 controller 1337 4.0K Jan 20 10:17 xxxx-xxxx-flux3351794846
drwx--S--- 3 controller 1337 4.0K Jan 20 10:17 yyyy-yyyy-flux2915045790
controller@image-automation-controller-65ccc7d5c6-7mgv9:/tmp$ du -sh /tmp/*
2.0M	/tmp/xxxx-xxxx-flux1428284464
696K	/tmp/xxxx-xxxx-flux3351794846
3.5M	/tmp/yyyy-yyyy-flux2915045790

Can the image from #297 resolve this issue?
I ask because I sometimes have errors like Unable to clone: Connection timed out or Unable to clone: failed to start SSH session: Unable to exchange encryption keys.

@stefanprodan (Member, Author)

@ahisette yes, the libgit2 timeout callback could be the reason; please try out the image from #297 and see if the problem goes away.

@ahisette commented Feb 1, 2022

Hello,
I put the new image on 2 of 4 clusters. After one week of testing, I have no good news: one freeze on a cluster with the new image and also one freeze with an "old" image.
With the same symptoms.

nobody@image-automation-controller-84bfbc78dc-89bqv:/$ ls -alh /tmp
total 12K
drwxrwsrwx 3 root   1337 4.0K Feb  1 08:54 .
drwxr-xr-x 1 root   root 4.0K Jan 24 13:25 ..
drwx--S--- 8 nobody 1337 4.0K Feb  1 07:13 xxxx-xxxx-flux2375737287
nobody@image-automation-controller-84bfbc78dc-89bqv:/$ du -sh /tmp/*
5.9M	/tmp/xxxx-xxxx-flux2375737287

@hiddeco (Member) commented Feb 7, 2022

With the release of Flux v0.26.2, we would like to kindly ask folks with issues to update to the latest image releases. Since we changed our build process around libgit2 for the source-controller and image-automation-controller, we have observed that some of the issues described here have vanished (as confirmed by others, per fluxcd/source-controller#439 (comment)).

@raress96 commented Feb 8, 2022

@hiddeco I installed the new Flux yesterday and pushed an image today; it seems to work for me now.

@pjbgf self-assigned this Mar 7, 2022
@pjbgf (Member) commented Mar 7, 2022

I managed to reproduce this locally. I am running against the latest IAC version v0.20.1, with a single SSH GitHub repository using libgit2, with 1-minute intervals and the same for timeouts.

Just to help further investigation, I will relay some of my observations/assumptions here as I progress.

By analysing the pprof endpoint, I noticed that the time difference in minutes since the last IAC reconciliation log message seems to match the running time of the thread/goroutine below, leading me to think that libgit2's git_remote_fetch is hanging. This was previously reported upstream and fixed a few years back; however, the issues may be correlated.

goroutine 332 [syscall, 4209 minutes, locked to thread]:
github.com/libgit2/git2go/v33._Cfunc_git_remote_fetch(0x7f37efdf97c0, 0xc0006e9290, 0xc00073cd00, 0x0)
	_cgo_gotypes.go:6954 +0x4c
github.com/libgit2/git2go/v33.(*Remote).Fetch.func2(0xc00073cd00, 0xc00010fe60, 0x40e414, 0xab)
	github.com/libgit2/git2go/v33@v33.0.7/remote.go:1044 +0xa7
github.com/libgit2/git2go/v33.(*Remote).Fetch(0xc00056e720, {0xc000f98ec0, 0xc001331e60, 0x1}, 0x2, {0x0, 0x50b8a6})
	github.com/libgit2/git2go/v33@v33.0.7/remote.go:1044 +0x1e9
github.com/fluxcd/image-automation-controller/controllers.fetch({0x1fc1c40, 0xc00056e600}, {0xc0013c6690, 0x25}, {0xc000aac370, 0x4}, {0xc001331e60, {0xc000dbef80, 0xc000676200}})
	github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:743 +0x32e
github.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).Reconcile(0xc0003f2680, {0x1fc1c78, 0xc0012cad80}, {{{0xc0001867e0, 0x1c92880}, {0xc0001867d0, 0x30}}})
	github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:270 +0x1b15
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00010c000, {0x1fc1c78, 0xc0012cacc0}, {{{0xc0001867e0, 0x1c92880}, {0xc0001867d0, 0x414fb4}}})
	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114 +0x26f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00010c000, {0x1fc1bd0, 0xc0003f2540}, {0x1bb1ea0, 0xc0011343a0})
	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311 +0x33e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00010c000, {0x1fc1bd0, 0xc0003f2540})
	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:223 +0x357

The container is still operational (I can exec into it) and other goroutines seem to be working as expected.

For an automatic restart, users could leverage the workqueue_longest_running_processor_seconds metric whilst we try to get to the bottom of the issue; a sketch of such a check follows the metrics dump below. Here's a dump of some of the metrics for my hung IAC:

workqueue_longest_running_processor_seconds{name="imageupdateautomation"} 256535.347854138
# HELP workqueue_queue_duration_seconds How long in seconds an item stays in workqueue before being requested
# TYPE workqueue_queue_duration_seconds histogram
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1e-08"} 0
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1e-07"} 0
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1e-06"} 0
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="9.999999999999999e-06"} 4
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="9.999999999999999e-05"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="0.001"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="0.01"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="0.1"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1"} 84
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="10"} 86
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="+Inf"} 91
workqueue_queue_duration_seconds_sum{name="imageupdateautomation"} 1000.9602119180001
workqueue_queue_duration_seconds_count{name="imageupdateautomation"} 91
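
A rough sketch of a check on that metric, assuming the controller metrics are exposed on port 8080 (the Flux default) and that a reconcile running longer than an hour means the controller is stuck - both assumptions to adjust:

#!/bin/sh
THRESHOLD=3600  # seconds; pick something well above your configured interval

kubectl -n flux-system port-forward deploy/image-automation-controller 8080:8080 >/dev/null 2>&1 &
PF_PID=$!; sleep 2

# Scrape the metric mentioned above.
LONGEST=$(curl -s http://localhost:8080/metrics \
  | grep '^workqueue_longest_running_processor_seconds{name="imageupdateautomation"}' \
  | awk '{print $2}')
kill "$PF_PID"

if [ -n "$LONGEST" ] && awk -v v="$LONGEST" -v t="$THRESHOLD" 'BEGIN {exit !(v > t)}'; then
  kubectl -n flux-system rollout restart deployment/image-automation-controller
fi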

@pjbgf (Member) commented Mar 22, 2022

The image-automation-controller version v0.21.0 introduces an experimental transport that fixes the issue in which the controller stops working in some specific scenarios.

The experimental transport needs to be opted into by setting the environment variable EXPERIMENTAL_GIT_TRANSPORT to true in the controller's Deployment (see the sketch below). Once this feature has been tested extensively, it may later be enabled by default.

Due to changes in other Flux components, it is recommended that all components be deployed at their latest versions. The recommended approach is via flux bootstrap using the flux CLI version v0.28.0, which will be released tomorrow.

It would be great if users experiencing this issue could test again with the experimental transport enabled and let us know whether the issue persists.
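
For a quick test, one way to flip the flag - assuming the default Deployment name, and noting that kustomize-controller will revert ad-hoc edits to flux-system objects on its next reconciliation, so a bootstrap patch is the durable route:

kubectl -n flux-system set env deployment/image-automation-controller \
  EXPERIMENTAL_GIT_TRANSPORT=true

# Verify it landed in the running pod:
kubectl -n flux-system exec deploy/image-automation-controller -- \
  printenv EXPERIMENTAL_GIT_TRANSPORT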

@maxbrunet commented Mar 22, 2022

Hi @pjbgf, thank you for the update. I deployed v0.21.0 as you were posting your message, and unfortunately the issue seems to persist.

Sample metrics:

workqueue_longest_running_processor_seconds{name="imageupdateautomation"} 3302.405028565
workqueue_queue_duration_seconds_bucket{le="+Inf", name="imageupdateautomation"} 3

EXPERIMENTAL_GIT_TRANSPORT is set:

$ kubectl --namespace=flux-system exec -ti image-automation-controller-7995f48c77-g99qd -- \
  printenv EXPERIMENTAL_GIT_TRANSPORT
true

Version:

$ kubectl --namespace=flux-system get pod image-automation-controller-7995f48c77-g99qd \
  --output=jsonpath='{.spec.containers[?(@.name=="manager")].image}'
ghcr.io/fluxcd/image-automation-controller:v0.21.0

Nothing specific in the logs. Our interval is 1m and it does commit on restarts.
The latest working version for us is still v0.19.0.

@pjbgf (Member) commented Mar 22, 2022

@maxbrunet thank you for the quick response. Would you be able to collect a profile and share either through here or slack please?

@maxbrunet commented Mar 23, 2022

Here is the output of /debug/pprof/goroutine?debug=2: https://gist.github.com/maxbrunet/8e304510d506e83be6ea2bb8a78ab42b
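
For anyone else needing to capture a similar dump, a sketch - assuming the Flux convention of serving the pprof handlers on the controller's metrics port (8080 by default):

kubectl -n flux-system port-forward deploy/image-automation-controller 8080:8080 &
curl -s 'http://localhost:8080/debug/pprof/goroutine?debug=2' > goroutine-dump.txt
# Stop the port-forward afterwards (kill %1, or Ctrl-C the backgrounded job).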

@pjbgf (Member) commented Mar 29, 2022

@maxbrunet thank you again for testing and providing the details so promptly.
We have since fixed a few more issues in the new transport and released a new minor version. Would you be able to give it another try and confirm whether that fixes the problems you were experiencing?

Here's more information on how to test: fluxcd/source-controller#636 (comment)

@maxbrunet

Hi @pjbgf, I have tried to deploy the latest versions, image-automation-controller v0.21.2 and source-controller v0.22.4, but they both panic shortly after startup:

image-automation-controller - panic trace
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1471573]

goroutine 520 [running]:
github.com/fluxcd/source-controller/pkg/git/libgit2/managed.(*sshSmartSubtransport).Close(0xc000623450)
        github.com/fluxcd/source-controller@v0.22.4/pkg/git/libgit2/managed/ssh.go:268 +0x93
github.com/libgit2/git2go/v33.smartSubtransportCloseCallback(0x404da6, 0xc0002f1a00)
        github.com/libgit2/git2go/v33@v33.0.9/transport.go:409 +0x6f
github.com/libgit2/git2go/v33._Cfunc_git_clone(0xc000343120, 0x7ffa2c44e680, 0x7ffa2c42b520, 0xc00087f040)
        _cgo_gotypes.go:3244 +0x4c
github.com/libgit2/git2go/v33.Clone.func3(0xc000d1d938, 0x6, 0xc0006c4c70, 0x14eefc5)
        github.com/libgit2/git2go/v33@v33.0.9/clone.go:43 +0x91
github.com/libgit2/git2go/v33.Clone({0xc0009bb500, 0xc0008ab2c0}, {0xc0000479e0, 0x28}, 0xc0007639e0)
        github.com/libgit2/git2go/v33@v33.0.9/clone.go:43 +0x19e
github.com/fluxcd/source-controller/pkg/git/libgit2.(*CheckoutBranch).Checkout(0xc0006c4c10, {0x1fd7b40, 0xc0008ab2c0}, {0xc0000479e0, 0x28}, {0xc0009bb500, 0x3a}, 0x0)
        github.com/fluxcd/source-controller@v0.22.4/pkg/git/libgit2/checkout.go:64 +0x22d
github.com/fluxcd/image-automation-controller/controllers.cloneInto({0x1fd7b40, 0xc0008ab2c0}, {0xc0009d93b0, {0xc0009bb500, 0x22}}, 0xc0008ab260, {0xc0000479e0, 0x28})
        github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:547 +0x1a3
github.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).Reconcile(0xc000772140, {0x1fd7b78, 0xc0009d6e40}, {{{0xc000d1d960, 0x1ca56a0}, {0xc000d1d950, 0x30}}})
        github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:283 +0x1805
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc0001622c0, {0x1fd7b78, 0xc0009d6db0}, {{{0xc000d1d960, 0x1ca56a0}, {0xc000d1d950, 0x414fd4}}})
        sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114 +0x26f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001622c0, {0x1fd7ad0, 0xc000772080}, {0x1bc41e0, 0xc000465940})
        sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311 +0x33e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001622c0, {0x1fd7ad0, 0xc000772080})
        sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:223 +0x357
source-controller - panic trace
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a7cf53]

goroutine 529 [running]:
github.com/fluxcd/source-controller/pkg/git/libgit2/managed.(*sshSmartSubtransport).Close(0xc00036bc70)
        github.com/fluxcd/source-controller/pkg/git/libgit2/managed/ssh.go:268 +0x93
github.com/libgit2/git2go/v33.smartSubtransportCloseCallback(0x404e06, 0xc0009d9d40)
        github.com/libgit2/git2go/v33@v33.0.9/transport.go:409 +0x6f
github.com/libgit2/git2go/v33._Cfunc_git_clone(0xc0003512d0, 0x7ff2fa7a30e0, 0x7ff2fa7a3120, 0xc0008ef380)
        _cgo_gotypes.go:3244 +0x4c
github.com/libgit2/git2go/v33.Clone.func3(0xc000866e44, 0x6, 0xc00061aec0, 0x1b4db45)
        github.com/libgit2/git2go/v33@v33.0.9/clone.go:43 +0x91
github.com/libgit2/git2go/v33.Clone({0xc000c7fd40, 0xc000304ae0}, {0xc0004b0800, 0x36}, 0xc0005657a0)
        github.com/libgit2/git2go/v33@v33.0.9/clone.go:43 +0x19e
github.com/fluxcd/source-controller/pkg/git/libgit2.(*CheckoutBranch).Checkout(0xc00061ae60, {0x27ca660, 0xc000304ae0}, {0xc0004b0800, 0x36}, {0xc000c7fd40, 0x3a}, 0x0)
        github.com/fluxcd/source-controller/pkg/git/libgit2/checkout.go:64 +0x22d
github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).reconcileSource(0xc00095f3b0, {0x27ca698, 0xc000771e90}, 0xc000876200, 0xc0009335f0, 0x18, {0xc0004b0800, 0x36})
        github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:404 +0x99f
github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).reconcile(0x2834958, {0x27ca698, 0xc000771e90}, 0xc000876200, {0xc00110dbe8, 0x4, 0x32000040e494})
        github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:244 +0x3d5
github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).Reconcile(0xc00095f3b0, {0x27ca698, 0xc000771e90}, {{{0xc000450f40, 0x2384b60}, {0xc000450f20, 0x30}}})
        github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:205 +0x4bb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000118bb0, {0x27ca698, 0xc000771e00}, {{{0xc000450f40, 0x2384b60}, {0xc000450f20, 0x415034}}})
        sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114 +0x26f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000118bb0, {0x27ca5f0, 0xc000a3bf00}, {0x2226280, 0xc000635340})
        sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311 +0x33e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000118bb0, {0x27ca5f0, 0xc000a3bf00})
        sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:223 +0x357
GitRepository + Secret
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: my-repo
  namespace: flux-system
spec:
  gitImplementation: libgit2
  interval: 1m0s
  ref:
    branch: master
  url: ssh://git@bitbucket.org/my-workspace/my-repo.git
  secretRef:
    name: flux-git-credentials
---
apiVersion: v1
kind: Secret
metadata:
  name: flux-git-credentials
  namespace: flux-system
stringData:
  identity: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    ...
    -----END OPENSSH PRIVATE KEY-----
  known_hosts: bitbucket.org ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAubiN81eDcafrgMeLzaFPsw2kNvEcqTKl/VqLat/MaB33pZy0y3rJZtnqwR2qOOvbwKZYKiEO1O6VqNEBxKvJJelCq0dTXWT5pbO2gDXC6h6QDXCaHo6pOHGPUy+YBaGQRGuSusMEASYiWunYN0vCAI8QaXnWMXNMdFP3jHAJH0eDsoiGnLPBlBp4TNm6rYI74nMzgz3B9IikW4WVK+dc8KZJZWYjAuORU3jc1c/NPskD2ASinf8v3xnfXeukU0sJ5N6m5E8VLjObPEO+mN2t/FZTMZLiFqPWc/ALSqnMnnhwrNi2rbfg/rd/IpL8Le3pSBne8+seeFVBoGqzHM9yXw==
type: Opaque

FYI I had not realized I needed to change spec.gitImplementation to libgit2 before.

@pjbgf (Member) commented May 12, 2022

@maxbrunet we have made some improvements that may fix the issue you are experiencing. I have a release candidate for source-controller (below) that is based on a pending PR.

Can you please test the image below and let us know whether that fixes your problem?
ghcr.io/fluxcd/source-controller:rc-6d517589

@maxbrunet

Hey @pjbgf, no, sorry, I used Flux with my previous employer, and I am not working with it at the moment

@pjbgf added this to the GA milestone May 13, 2022
@pjbgf moved this to In Progress in Maintainers' Focus May 27, 2022
@pjbgf (Member) commented May 27, 2022

@maxbrunet no worries, thank you for all the help so far.

@pjbgf (Member) commented May 27, 2022

This should be fixed as part of the managed transport improvements and the enforcement of context timeouts.
Release candidates for both controllers:

ghcr.io/fluxcd/source-controller:rc-4b3e0f9a
ghcr.io/fluxcd/image-automation-controller:rc-48bcca59
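
One way to try these on a test cluster - assuming the container in both Deployments is named manager, and suspending the flux-system Kustomization first so kustomize-controller doesn't revert the ad-hoc image change:

flux suspend kustomization flux-system

kubectl -n flux-system set image deployment/source-controller \
  manager=ghcr.io/fluxcd/source-controller:rc-4b3e0f9a
kubectl -n flux-system set image deployment/image-automation-controller \
  manager=ghcr.io/fluxcd/image-automation-controller:rc-48bcca59

# Resuming rolls the Deployments back to the released images when you are done:
# flux resume kustomization flux-system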

@pjbgf (Member) commented Jun 9, 2022

Closing this for lack of activity. Similarly reported issues have been confirmed to be fixed.

Now with Managed Transport enforcing timeouts for Git operations, this should be resolved.

If it reoccurs, given the sheer amount of changes that happened on the Git implementation in the last 6 months, we are better off creating a new issue, linking back to this one.

@pjbgf closed this as completed Jun 9, 2022
Repository owner moved this from In Progress to Done in Maintainers' Focus Jun 9, 2022
@mickeahlinder

Hi @pjbgf, I have tried to deploy the latest versions, image-automation-controller v0.21.2 and source-controller v0.22.4, but they both panic shortly after startup:

image-automation-controller - panic trace
source-controller - panic trace
GitRepository + Secret
FYI I had not realized I needed to change spec.gitImplementation to libgit2 before.

@maxbrunet how do you obtain these panic traces? I've port-forwarded the pprof endpoints and can get debug info on request, but how do you get this information when the process panics?

@maxbrunet

Stack traces are dumped in the logs when the process panics; you can get the logs from the last restart with kubectl logs --previous POD_NAME

@mickeahlinder

Stacktraces are dump in the logs when the process panics, you can get the logs from the last restart with kubectl logs --previous POD_NAME

Ah, ok, thank you.
