image-automation-controller not reconnecting after operation timed out #209
@minh-nguyenquang if this happens again, can you please take a profile snapshot as described here: https://fluxcd.io/docs/gitops-toolkit/debugging/#endpoints? I assume there is a socket leak and the container hits the file descriptor limit. Having a Go profile would help us pinpoint the library that has this leak. Thanks |
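(A minimal sketch of how such a snapshot can be collected, assuming the default flux-system namespace and the pprof endpoints exposed on port 8080 as in the debugging guide linked above:)

$ kubectl -n flux-system port-forward deploy/image-automation-controller 8080
# in a second terminal, capture heap and goroutine profiles
$ curl -Sk http://localhost:8080/debug/pprof/heap > heap.out
$ curl -Sk http://localhost:8080/debug/pprof/goroutine > goroutine.out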
@stefanprodan Here is the attached file for profiles. Thanks |
@minh-nguyenquang did you take the profile while the controller was stuck? |
@stefanprodan I took the profiles 10-15 minutes after the controller got stuck |
@stefanprodan this time I got issues on source-controller; I attach the profiles here |
Here is the log
|
@minh-nguyenquang can you please post here the GitRepository object? |
Here is the GitRepository object
|
We ran into a similar issue today, after running Flux CD v2 for 24 days. My heap files are attached (I added a ".zip" extension so I could attach them to GitHub)
|
Same issue with:
$ flux version
---
flux: v0.24.1
helm-controller: v0.14.1
image-automation-controller: v0.18.0
image-reflector-controller: v0.14.0
kustomize-controller: v0.18.2
notification-controller: v0.19.0
source-controller: v0.19.2 |
@demisx why would you restart source-controller? Image automation does not use source-controller to push changes to Git. Does source-controller stop cloning? Can you please do #282 (comment)? |
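(A quick way to check whether source-controller is still cloning, assuming the flux-system namespace and the GitRepository name used later in this thread, is to watch the artifact timestamp advance:)

$ flux get sources git -n flux-system
# or check when the artifact was last refreshed
$ kubectl -n flux-system get gitrepository flux-system \
    -o jsonpath='{.status.artifact.lastUpdateTime}{"\n"}'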
@demisx can you post your GitRepository and ImageUpdateAutomation manifests here please. |
You know, I don't recall why exactly. Maybe I read it in some comment, or restarting image-automation-controller alone didn't work for me and I thought source-controller needed to be restarted as well. I can try restarting only image-automation-controller next time.
Should I do this now or right after it stops working? |
Here they are: $ kubectl get GitRepository -n flux-system -oyaml
---
apiVersion: v1
items:
- apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
annotations:
reconcile.fluxcd.io/requestedAt: "2021-12-28T11:06:55.648944-08:00"
creationTimestamp: "2021-12-03T16:00:21Z"
finalizers:
- finalizers.fluxcd.io
generation: 1
labels:
kustomize.toolkit.fluxcd.io/name: flux-system
kustomize.toolkit.fluxcd.io/namespace: flux-system
name: flux-system
namespace: flux-system
resourceVersion: "20344562"
uid: c9b44abf-e2c8-476a-81d9-98fc7147acbb
spec:
gitImplementation: go-git
interval: 1m0s
ref:
branch: main
secretRef:
name: flux-system
timeout: 20s
url: ssh://git@github.com/ChoiHoldings/infra
status:
artifact:
checksum: fed591398523fc39be00246e8ef4ec702e93e053837a3e33f6f3d4e4de0f2e37
lastUpdateTime: "2022-01-07T04:40:44Z"
path: gitrepository/flux-system/flux-system/0ab92a793f666e8043bbc90e44f3384304c8eee8.tar.gz
revision: main/0ab92a793f666e8043bbc90e44f3384304c8eee8
url: http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/flux-system/0ab92a793f666e8043bbc90e44f3384304c8eee8.tar.gz
conditions:
- lastTransitionTime: "2022-01-06T18:03:59Z"
message: 'Fetched revision: main/0ab92a793f666e8043bbc90e44f3384304c8eee8'
reason: GitOperationSucceed
status: "True"
type: Ready
lastHandledReconcileAt: "2021-12-28T11:06:55.648944-08:00"
observedGeneration: 1
url: http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/flux-system/latest.tar.gz
kind: List
metadata:
resourceVersion: ""
selfLink: ""

$ kubectl get ImageUpdateAutomation -n flux-system -oyaml
---
apiVersion: v1
items:
- apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
creationTimestamp: "2021-12-03T16:00:35Z"
generation: 1
labels:
kustomize.toolkit.fluxcd.io/name: flux-system
kustomize.toolkit.fluxcd.io/namespace: flux-system
name: image-update-automation
namespace: flux-system
resourceVersion: "20348437"
uid: 392ea6d4-063d-43aa-8f3a-ba6aef4af4ec
spec:
git:
checkout:
ref:
branch: main
commit:
author:
email: fluxcdbot@users.noreply.github.com
name: fluxcdbot
messageTemplate: '{{range .Updated.Images}}{{println .}}{{end}}'
push:
branch: main
interval: 1m
sourceRef:
kind: GitRepository
name: flux-system
update:
path: ./k8s/stg
strategy: Setters
status:
conditions:
- lastTransitionTime: "2022-01-07T03:36:20Z"
message: no updates made; last commit 0ab92a7 at 2022-01-07T04:39:39Z
reason: ReconciliationSucceeded
status: "True"
type: Ready
lastAutomationRunTime: "2022-01-07T04:49:39Z"
lastPushCommit: 0ab92a793f666e8043bbc90e44f3384304c8eee8
lastPushTime: "2022-01-07T04:39:39Z"
observedGeneration: 1
kind: List
metadata:
resourceVersion: ""
selfLink: "" |
@stefanprodan Here is what I see now after the pod has been restarted (11 hours ago): $ kubectl exec -it -n flux-system deploy/image-automation-controller -- sh
$ ls -lah /tmp
---
drwxrwsrwx 2 root 1337 6 Jan 7 05:02 .
drwxr-xr-x 1 root root 17 Jan 6 18:03 ..
$ du -sh /tmp/*
du: cannot access '/tmp/*': No such file or directory |
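(To test the file-descriptor-leak hypothesis mentioned earlier, one rough check, assuming the container image ships a shell and /proc is readable as the controller user, is to compare the number of open descriptors of PID 1 against its limit:)

$ kubectl exec -n flux-system deploy/image-automation-controller -- sh -c \
    'ls /proc/1/fd | wc -l; grep "open files" /proc/1/limits'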
@demisx thanks for providing us with all the details thus far. Would you be able to provide a trace profile as well next time you notice the freeze (and before killing the container)? This can be done by running the following: $ kubectl port-forward -n <namespace> deploy/<component> 8080
$ curl -Sk -v http://localhost:8080/debug/pprof/trace?seconds=10 > trace.out |
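(The resulting trace can then be inspected locally, assuming a Go toolchain is installed on the workstation:)

$ go tool trace trace.out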
@hiddeco My pleasure. I will do that as soon as I notice the freeze again. Just to clarify, is the |
That's correct! |
Our heap and trace profiles of image-automation-controller are attached. See also: #296 (comment) |
FYI, we experienced privoxy |
Happened to us in production again. Had to bounce the controller. Heap profile attached: heap.image-automation-controller.out.zip

$ flux version
---
flux: v0.24.1
helm-controller: v0.14.1
image-automation-controller: v0.18.0
image-reflector-controller: v0.14.0
kustomize-controller: v0.18.2
notification-controller: v0.19.0
source-controller: v0.19.2 |
It is worth keeping an eye out for #326, which, if all goes according to plan, will be out next week. |
The image-automation-controller version v0.21.0 introduces an experimental transport that fixes the issue in which the controller stops working in some specific scenarios. The experimental transport needs to be opted into by setting the environment variable EXPERIMENTAL_GIT_TRANSPORT to "true" on both source-controller and image-automation-controller. This will require a redeploy of those components. Can you test it again with the experimental transport enabled and let us know how you get on, please? |
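(For a quick, imperative test, the variable can also be set directly on both deployments; this is a sketch only, and since these deployments are managed by Flux itself, kustomize-controller will eventually revert the drift, so the declarative patch discussed in the following comments is the durable approach:)

$ kubectl -n flux-system set env deployment/source-controller EXPERIMENTAL_GIT_TRANSPORT=true
$ kubectl -n flux-system set env deployment/image-automation-controller EXPERIMENTAL_GIT_TRANSPORT=true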
We'll give it a try once |
@pjbgf Thank you. I just upgraded to the latest in our environments. I'll let you guys know if I run into any issues. This is what we are currently running after the upgrade:

$ flux version
flux: v0.28.4
helm-controller: v0.18.2
image-automation-controller: v0.21.2
image-reflector-controller: v0.17.1
kustomize-controller: v0.22.2
notification-controller: v0.23.1
source-controller: v0.22.4 |
I've noticed this error popping up in the upgraded image-automation-controller log, even though the new images seem to get pulled and deployed in the cluster: {
"level":"error",
"ts":"2022-03-30T00:40:29.636Z",
"logger":"controller.imageupdateautomation",
"msg":"Reconciler error",
"reconciler group":"image.toolkit.fluxcd.io",
"reconciler kind":"ImageUpdateAutomation",
"name":"image-update-automation",
"namespace":"flux-system",
"error":"unable to clone 'ssh://git@github.com/<org>/<repo-name>': SSH could not read data: Error waiting on socket"
} |
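(To tell whether such errors are transient retries or the controller is stuck again, one option, assuming the object names from earlier in this thread, is to watch the Ready condition and the last automation run timestamp:)

$ kubectl -n flux-system get imageupdateautomation image-update-automation \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}{"\n"}{.status.lastAutomationRunTime}{"\n"}'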
@demisx thank you for reporting back. Do you mind confirming whether the controller's deployment had the environment variable EXPERIMENTAL_GIT_TRANSPORT set to "true"? |
@pjbgf Oh, no. I missed that part. My understanding is that I should set the EXPERIMENTAL_GIT_TRANSPORT environment variable then. |
Just to make sure I enable it the right way: I am planning to edit the flux-system kustomization.yaml as follows:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- gotk-components.yaml
- gotk-sync.yaml
patches:
- patch: |
- op: add
path: /spec/template/spec/containers/0/env/0
value:
name: EXPERIMENTAL_GIT_TRANSPORT
value: "true"
target:
kind: Deployment
name: "(source-controller|image-automation-controller)"

Am I missing anything? Do I need to do anything else besides this? |
@demisx that should be all. You can confirm that the variable is set on your controller deployments once the change is applied.
Once that is committed and pushed into your repository, also ensure that the correct version of your controllers is running.
And finally, we released a new patch yesterday for both source-controller and image-automation-controller, so I would also recommend using the very latest versions if possible.
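(A sketch of how both checks could be done from the command line, assuming the flux-system namespace:)

# confirm the variable landed on both controller deployments
$ kubectl -n flux-system get deploy source-controller image-automation-controller \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[0].env}{"\n"}{end}'
# confirm the running controller versions
$ flux check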
Done. Thank you very much for the detailed instructions. I see most controller pods have been restarted. I will let you know if I run into any issues. This is what I have right now:
|
@demisx have you experienced the issue again since the upgrade? |
@pjbgf So far, so good. 🤞🏻 If I notice any issues, I'll make sure to post here right away. |
We have a new release candidate that further improves the controller. Two important changes: a) Managed Transport is enabled by default, and b) context timeouts are now enforced. |
Closing this based on similar reports from users who confirmed this is no longer happening. If that changes, we can always reopen the issue.
Describe the bug
image-automation-controller doesn't reconnect to GitHub after an operation times out. I have to delete the pod to restart it.
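(A stop-gap, assuming the default flux-system namespace, is to restart the deployment instead of deleting the pod by hand:)

$ kubectl -n flux-system rollout restart deployment image-automation-controller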
Below is the log from image-automation-controller.
Steps to reproduce
I don't know how to reproduce this, because the operation timeout can happen at any time.
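(One possible way to provoke the timeout in a test cluster, offered only as an untested sketch that assumes a CNI enforcing NetworkPolicy and the app: image-automation-controller pod label, is to briefly deny egress from the controller while a reconcile is in flight, then remove the policy and check whether pushes resume:)

$ kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-iac-egress
  namespace: flux-system
spec:
  podSelector:
    matchLabels:
      app: image-automation-controller
  policyTypes: ["Egress"]
  egress: []
EOF
$ sleep 120 && kubectl -n flux-system delete networkpolicy deny-iac-egress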
Expected behavior
image-automation-controller should reconnect automatically.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
0.16.1
Flux check
► checking prerequisites
✔ kubectl 1.21.0 >=1.18.0-0
✔ Kubernetes 1.18.8-aliyun.1 >=1.16.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.11.1
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.14.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.11.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.13.2
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.15.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.15.3
✔ all checks passed
Git provider
github
Container Registry provider
No response
Additional context
No response