
[argo-cd] Enabling ha with autoscaling results in redis-ha-haproxy crashing with OOMKilled #1958

Closed
lknite opened this issue Apr 12, 2023 · 9 comments
Labels: argo-cd, bug

Comments


lknite commented Apr 12, 2023

Describe the bug

Set values.yaml as described here to enable HA with autoscaling:
https://github.com/argoproj/argo-helm/tree/main/charts/argo-cd

The redis-ha-haproxy pods are crashing with OOMKilled.

NAME                                                     READY   STATUS             RESTARTS        AGE
pod/argocd-application-controller-0                     1/1     Running            0               14m
pod/argocd-applicationset-controller-5797cd75dd-dbwcm   1/1     Running            0               12m
pod/argocd-applicationset-controller-5797cd75dd-m5r2t   1/1     Running            0               14m
pod/argocd-notifications-controller-5fc57946c7-fjvpx    1/1     Running            0               12m
pod/argocd-redis-ha-haproxy-d67fc9b6-gsddn              0/1     CrashLoopBackOff   7 (4m40s ago)   12m
pod/argocd-redis-ha-haproxy-d67fc9b6-gttv9              0/1     CrashLoopBackOff   9 (112s ago)    17m
pod/argocd-redis-ha-haproxy-d67fc9b6-vdgrw              0/1     CrashLoopBackOff   9 (18s ago)     14m
pod/argocd-redis-ha-server-0                            3/3     Running            0               11m
pod/argocd-redis-ha-server-1                            3/3     Running            0               16m
pod/argocd-redis-ha-server-2                            3/3     Running            0               13m
pod/argocd-repo-server-59d6fd5d45-f2fnp                 1/1     Running            0               12m
pod/argocd-repo-server-59d6fd5d45-vnwc7                 1/1     Running            0               14m
pod/argocd-server-67cfc6c877-rnkrf                      1/1     Running            0               12m
pod/argocd-server-67cfc6c877-zdns4                      1/1     Running            0               14m
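
The termination reason can be confirmed from the container's last state; something like the following should show Reason: OOMKilled (the argocd namespace and the pod name from the listing above are assumptions):

# inspect why the container was last killed
kubectl -n argocd describe pod argocd-redis-ha-haproxy-d67fc9b6-gsddn | grep -A 5 'Last State'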

Related helm chart

argo-cd

Helm chart version

5.28.2

To Reproduce

argo-cd:

  redis-ha:
    enabled: true

  controller:
    replicas: 1

    metrics:
      enabled: true

  dex:
    enabled: false

  server:
    autoscaling:
      enabled: true
      minReplicas: 2

    extraArgs:
    - --insecure

    ingress:
      enabled: true
      ingressClassName: nginx
      hosts:
      - argocd.k.home.net
      tls:
      - secretName: argocd.k.home.net-tls
        hosts:
        - argocd.k.home.net
      annotations:
        cert-manager.io/issuer: "cluster-adcs-issuer"                   # use the specific name of the issuer
        cert-manager.io/issuer-kind: "ClusterAdcsIssuer"                # AdcsIssuer or ClusterAdcsIssuer
        cert-manager.io/issuer-group: "adcs.certmanager.csf.nokia.com"
        nginx.ingress.kubernetes.io/rewrite-target: /
        nginx.ingress.kubernetes.io/proxy-body-size: 1000m
        nginx.ingress.kubernetes.io/proxy-buffer-size: 16k

    volumeMounts:
    - mountPath: "/etc/ssl/certs"
      name: ca-bundle
    volumes:
    - name: ca-bundle
      secret:
        secretName: ca-bundle

    config:
      url: "https://argocd.k.home.net"
      oidc.config: |
        name: Azure
        issuer: https://login.microsoftonline.com/<snip>/v2.0
        clientID: <snip>
        clientSecret: <snip>
        requestedIDTokenClaims:
          groups:
            essential: true
        requestedScopes:
        - openid
        - profile
        - email
        - offline_access

    rbacConfig:
      policy.csv: |
        # Grant all members of the group 'my-org:team-alpha' the ability to sync apps in 'my-project'
        #p, my-org:team-alpha, applications, sync, my-project/*, allow
        # Grant all members of 'k-app-argocd-admin' the admin role
        g, k-app-argocd-admin, role:admin

  repoServer:
    autoscaling:
      enabled: true
      minReplicas: 2

    volumeMounts:
    - mountPath: "/etc/ssl/certs"
      name: ca-bundle
    volumes:
    - name: ca-bundle
      secret:
        secretName: ca-bundle

  applicationSet:
    replicaCount: 2

Expected behavior

Argo CD comes up with the HA configuration.

Screenshots


Additional context

Kubernetes v1.25.8
OS: Red Hat 9
k8s installed via kubeadm

I boosted the memory by 4 GB at a time, up to 48 GB on each of the 3 worker nodes. This is a newly set-up cluster, and this Argo CD deployment is pretty much the only thing running. If I ssh into each of the worker nodes, haproxy shows as using up all the CPU and memory.


lknite commented Apr 12, 2023

Could it be haproxy-specific? Which image is being used for redis-ha-haproxy?

Maybe this?
haproxy/haproxy#1834


mkilchhofer commented Apr 13, 2023

Hmm, the chart version you mentioned includes haproxy 2.6.4, while the issue you linked targets 2.6.3.
Also, OOM kills are about memory, not CPU.

Can you try adding some limits on the haproxy pods? E.g.

redis-ha:
  haproxy:
    resources:
      limits:
        cpu: 1
        memory: 512Mi
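
To apply the change, a plain helm upgrade with the updated values should do; the release name and namespace here are assumptions:

# re-apply the chart with the updated values (release/namespace are illustrative)
helm upgrade argocd argo/argo-cd --namespace argocd -f values.yaml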


lknite commented Apr 13, 2023

With limits set, it crashes much faster now.


lknite commented Apr 13, 2023

I'm trying out different versions of haproxy using:

    haproxy:
      # -- HAProxy tag
      image:
        tag: 2.7.0

Reference:
https://github.com/DandyDeveloper/charts/tree/master/charts/redis-ha
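
In the argo-cd chart these values nest under the redis-ha key, so an equivalent one-off override (release and namespace names are assumptions) would be something like:

# override just the haproxy image tag (release/namespace are illustrative)
helm upgrade argocd argo/argo-cd --namespace argocd -f values.yaml \
  --set redis-ha.haproxy.image.tag=2.7.0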


lknite commented Apr 13, 2023

Maybe this? docker-library/haproxy#194 (comment)


lknite commented Apr 14, 2023

Looks like Kubernetes relies on this being fixed at the container runtime level; in my case that's containerd, fixed like this:

# sed -i 's/LimitNOFILE=infinity/LimitNOFILE=65535/' /usr/lib/systemd/system/containerd.service
# systemctl daemon-reload
# systemctl restart containerd
# k delete deployment <asdf>
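
To verify the new limit took effect after the restart, a quick check (not part of the original fix) is:

# confirm containerd's effective open-file limit
systemctl show containerd | grep LimitNOFILE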

@sspreitzer

> Looks like Kubernetes relies on this being fixed at the container runtime level; in my case that's containerd, fixed like this:
>
> # sed -i 's/LimitNOFILE=infinity/LimitNOFILE=65535/' /usr/lib/systemd/system/containerd.service
> # systemctl daemon-reload
> # systemctl restart containerd
> # k delete deployment <asdf>

I do not understand how «out of memory» is related to «limit of number of open files».
And how does setting a lower-than-infinite number of open files fix the out-of-memory killing?
Can someone elaborate?

@sspreitzer

> I do not understand how «out of memory» is related to «limit of number of open files». And how does setting a lower-than-infinite number of open files fix the out-of-memory killing? Can someone elaborate?

Adding this comment as an elaboration source: kubernetes/kubernetes#3595 (comment)
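
If I read that thread correctly: containerd's stock systemd unit sets LimitNOFILE=infinity, so containers inherit an enormous file-descriptor limit, and haproxy sizes its fd tables and maxconn from ulimit -n at startup, pre-allocating memory in proportion to that limit, hence the OOM. The effective limit inside a pod can be checked with something like the following (namespace and pod name are taken from the listing above; adjust as needed):

# show the open-file limit the haproxy container actually inherits
kubectl -n argocd exec argocd-redis-ha-haproxy-d67fc9b6-gsddn -- sh -c 'ulimit -n'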

@sspreitzer

> Looks like Kubernetes relies on this being fixed at the container runtime level; in my case that's containerd, fixed like this:
>
> # sed -i 's/LimitNOFILE=infinity/LimitNOFILE=65535/' /usr/lib/systemd/system/containerd.service
> # systemctl daemon-reload
> # systemctl restart containerd
> # k delete deployment <asdf>

Creating a systemd drop-in via Ansible relaxed the situation for me, setting process defaults of soft 1024 and hard 524288 open files for the containerd service.

- name: Set RHEL9 ulimit
  hosts: all
  tasks:
    - name: Create dropin directory if not exists
      ansible.builtin.file:
        path: /etc/systemd/system/containerd.service.d
        state: directory
      when:
        - ansible_os_family == "RedHat"
        - ansible_distribution_major_version == "9"
    - name: Add ulimits dropin
      ansible.builtin.copy:
        dest: /etc/systemd/system/containerd.service.d/ulimits.conf
        content: |
          [Service]
          LimitNOFILE=
          LimitNOFILE=1024:524288
      when:
        - ansible_os_family == "RedHat"
        - ansible_distribution_major_version == "9"
      notify:
        - Restart containerd
  handlers:
    - name: Restart containerd
      ansible.builtin.systemd_service:
        daemon_reload: true
        name: containerd
        state: restarted
        enabled: true
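
A sketch of running it, assuming the playbook is saved as ulimits.yml (the inventory path is illustrative):

# apply the drop-in to all hosts in the inventory
ansible-playbook -i inventory/hosts.ini ulimits.yml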
