Goroutine leak in image-automation-controller with EXPERIMENTAL_GIT_TRANSPORT #334

Closed
Tracked by #2593
jiphex opened this issue Mar 25, 2022 · 8 comments · Fixed by #393

jiphex commented Mar 25, 2022

Before v0.21.0, there seemed to be an issue with Image Automation that caused the source-controller to leak goroutines. Each leaked goroutine held an active, working connection (responding to ClientAliveInterval pings) to the SSH server defined in the GitRepository source for the automation. One of the reasons I started looking into EXPERIMENTAL_GIT_TRANSPORT was to see whether it would fix this, because these connections remain open forever on our GitLab server and eventually stop SSH checkouts from working.

With EXPERIMENTAL_GIT_TRANSPORT=true, it seems like this goroutine leak has now moved to the image-automation-controller. Here's a Prometheus graph of go_goroutines for the image-automation-controller:

image

At this point, the image-automation-controller pod has been up for 20 hours and there are 95 open SSH connections to the GitLab server. To confirm that the controller is the culprit, I restarted the image-automation-controller pod and counted the sessions before and after:

james@gitlab$ loginctl|grep git|wc -l
95
james@pc$ kubectl rollout restart deployment image-automation-controller
deployment.apps/image-automation-controller restarted
james@gitlab$ loginctl|grep git|wc -l
9

The resources (imageautomation,gitrepository) are the same as my previous issue, and available here: https://github.com/fluxcd/image-automation-controller/files/8333914/image-repo.yaml.txt

The image-automation-controller deployment is as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "22"
  creationTimestamp: "2022-03-08T13:23:21Z"
  generation: 30
  labels:
    app.kubernetes.io/instance: flux-system
    app.kubernetes.io/part-of: flux
    app.kubernetes.io/version: v0.28.0
    control-plane: controller
    kustomize.toolkit.fluxcd.io/name: flux-system
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: image-automation-controller
  namespace: flux-system
  resourceVersion: "268568182"
  uid: 8dd25eab-a5c6-4bab-81a8-b7d8c8f38c06
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: image-automation-controller
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2022-03-25T08:37:27Z"
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: image-automation-controller
    spec:
      containers:
      - args:
        - --events-addr=http://notification-controller.flux-system.svc.cluster.local./
        - --watch-all-namespaces=true
        - --log-level=info
        - --log-encoding=json
        - --enable-leader-election
        env:
        - name: RUNTIME_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: EXPERIMENTAL_GIT_TRANSPORT
          value: "true"
        image: ghcr.io/fluxcd/image-automation-controller:v0.21.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: healthz
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: manager
        ports:
        - containerPort: 8080
          name: http-prom
          protocol: TCP
        - containerPort: 9440
          name: healthz
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: healthz
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 64Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: temp
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1337
      serviceAccount: image-automation-controller
      serviceAccountName: image-automation-controller
      terminationGracePeriodSeconds: 10
      volumes:
      - emptyDir: {}
        name: temp
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-03-23T17:20:44Z"
    lastUpdateTime: "2022-03-23T17:20:44Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2022-03-08T13:23:21Z"
    lastUpdateTime: "2022-03-25T08:37:34Z"
    message: ReplicaSet "image-automation-controller-69678f6d4c" has successfully
      progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 30
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
Author

jiphex commented Mar 25, 2022

(since writing this, I've upgraded to v0.28.2 and will report back if that makes a difference)

Member

hiddeco commented Mar 25, 2022

Related to fluxcd/source-controller#636

Author

jiphex commented Mar 25, 2022

For now, this doesn't seem to be happening after the v0.28.2 upgrade. I'll re-open if it recurs:

flux: v0.28.2
helm-controller: v0.18.1
image-automation-controller: v0.21.1
image-reflector-controller: v0.17.1
kustomize-controller: v0.22.1
notification-controller: v0.23.1
source-controller: v0.22.2

jiphex closed this as completed Mar 25, 2022
Author

jiphex commented Mar 30, 2022

Screenshot 2022-03-30 at 16 20 10

This seems to be still ongoing, with

    - name: EXPERIMENTAL_GIT_TRANSPORT
      value: "true"

set on the pod

flux: v0.28.2
helm-controller: v0.18.1
image-automation-controller: v0.21.1
image-reflector-controller: v0.17.1
kustomize-controller: v0.22.1
notification-controller: v0.23.1
source-controller: v0.22.2

jiphex reopened this Mar 30, 2022
pjbgf added this to the GA milestone Mar 31, 2022
Member

pjbgf commented May 27, 2022

@jiphex do you mind checking whether you can still reproduce this with our latest release candidate?
ghcr.io/fluxcd/image-automation-controller:rc-48bcca59

This RC is based on #369, therefore you no longer need to set the environment variable to enable the experimental transport, as Managed Transport will be enabled by default.

Member

pjbgf commented Jun 7, 2022

I have tested version ghcr.io/fluxcd/image-automation-controller:v0.23.0 and this still seems to be a problem, although the goroutines do appear to be released over time (12h period):

image

Upon further investigation, it becomes clear that the SSH connections are not being closed:

image

As a result, all the SSH handshake and connection-servicing goroutines are kept running (note that the number of occurrences matches the number of established connections):
image

This should never be the case considering there are only 3 automations configured:

flux get images update
NAME                    LAST RUN                        SUSPENDED       READY   MESSAGE
ssh-ecdsa-bitbucket     2022-05-27T20:32:45+01:00       True            True    no updates made; last commit 2c3cd78 at 2022-05-27T19:29:40Z
ssh-ed25519-bitbucket                                   True            False   waiting to be reconciled
ssh-rsa-bitbucket                                       True            False   waiting to be reconciled

pjbgf added the bug label Jun 7, 2022
Member

pjbgf commented Jun 8, 2022

A fix is now being tested and should be released soon. The number of goroutines for the same setup is now steady under 100, which is a great improvement on the 1500+ in the previous graph.

image

Member

pjbgf commented Jun 10, 2022

We have a release candidate version with the fix:
ghcr.io/fluxcd/image-automation-controller:rc-843074dd

pjbgf pushed a commit that referenced this issue Jun 14, 2022
This version of source-controller introduces a fix for the
SSH connections leak issue reported at:
#334

Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>
pjbgf pushed a commit to pjbgf/image-automation-controller that referenced this issue Jun 16, 2022
This version of source-controller introduces a fix for the
SSH connections leak issue reported at:
fluxcd#334

Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>
pjbgf pushed a commit that referenced this issue Jun 16, 2022
This version of source-controller introduces a fix for the
SSH connections leak issue reported at:
#334

Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>
darkowlzz pushed a commit that referenced this issue Jun 21, 2022
This version of source-controller introduces a fix for the
SSH connections leak issue reported at:
#334

Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>
darkowlzz pushed a commit that referenced this issue Jun 22, 2022
This version of source-controller introduces a fix for the
SSH connections leak issue reported at:
#334

Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>
souleb pushed a commit to souleb/image-automation-controller that referenced this issue Mar 12, 2024
This version of source-controller introduces a fix for the
SSH connections leak issue reported at:
fluxcd#334

Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>