
[argo-cd] Enabling ha with autoscaling results in redis-ha-haproxy crashing with OOMKilled #1958

Closed
lknite opened this issue Apr 12, 2023 · 9 comments
Labels: argo-cd, bug

Comments


lknite commented Apr 12, 2023

Describe the bug

Set values.yaml as described here to enable HA with autoscaling:
https://github.com/argoproj/argo-helm/tree/main/charts/argo-cd

The redis-ha-haproxy pods are crashing with OOMKilled.

NAME                                                     READY   STATUS             RESTARTS        AGE
pod/argocd-application-controller-0                     1/1     Running            0               14m
pod/argocd-applicationset-controller-5797cd75dd-dbwcm   1/1     Running            0               12m
pod/argocd-applicationset-controller-5797cd75dd-m5r2t   1/1     Running            0               14m
pod/argocd-notifications-controller-5fc57946c7-fjvpx    1/1     Running            0               12m
pod/argocd-redis-ha-haproxy-d67fc9b6-gsddn              0/1     CrashLoopBackOff   7 (4m40s ago)   12m
pod/argocd-redis-ha-haproxy-d67fc9b6-gttv9              0/1     CrashLoopBackOff   9 (112s ago)    17m
pod/argocd-redis-ha-haproxy-d67fc9b6-vdgrw              0/1     CrashLoopBackOff   9 (18s ago)     14m
pod/argocd-redis-ha-server-0                            3/3     Running            0               11m
pod/argocd-redis-ha-server-1                            3/3     Running            0               16m
pod/argocd-redis-ha-server-2                            3/3     Running            0               13m
pod/argocd-repo-server-59d6fd5d45-f2fnp                 1/1     Running            0               12m
pod/argocd-repo-server-59d6fd5d45-vnwc7                 1/1     Running            0               14m
pod/argocd-server-67cfc6c877-rnkrf                      1/1     Running            0               12m
pod/argocd-server-67cfc6c877-zdns4                      1/1     Running            0               14m
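
The termination reason can be confirmed from the container's last state; something like the following should show Reason: OOMKilled (the argocd namespace and the pod name from the listing above are assumptions):

# inspect why the container was last killed
kubectl -n argocd describe pod argocd-redis-ha-haproxy-d67fc9b6-gsddn | grep -A 5 'Last State'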

Related helm chart

argo-cd

Helm chart version

5.28.2

To Reproduce

argo-cd:

  redis-ha:
    enabled: true

  controller:
    replicas: 1

    metrics:
      enabled: true

  dex:
    enabled: false

  server:
    autoscaling:
      enabled: true
      minReplicas: 2

    extraArgs:
    - --insecure

    ingress:
      enabled: true
      ingressClassName: nginx
      hosts:
      - argocd.k.home.net
      tls:
      - secretName: argocd.k.home.net-tls
        hosts:
        - argocd.k.home.net
      annotations:
        cert-manager.io/issuer: "cluster-adcs-issuer"                   # use the specific name of the issuer
        cert-manager.io/issuer-kind: "ClusterAdcsIssuer"                # AdcsIssuer or ClusterAdcsIssuer
        cert-manager.io/issuer-group: "adcs.certmanager.csf.nokia.com"
        nginx.ingress.kubernetes.io/rewrite-target: /
        nginx.ingress.kubernetes.io/proxy-body-size: 1000m
        nginx.ingress.kubernetes.io/proxy-buffer-size: 16k

    volumeMounts:
    - mountPath: "/etc/ssl/certs"
      name: ca-bundle
    volumes:
    - name: ca-bundle
      secret:
        secretName: ca-bundle

    config:
      url: "https://argocd.k.home.net"
      oidc.config: |
        name: Azure
        issuer: https://login.microsoftonline.com/<snip>/v2.0
        clientID: <snip>
        clientSecret: <snip>
        requestedIDTokenClaims:
          groups:
            essential: true
        requestedScopes:
        - openid
        - profile
        - email
        - offline_access

    rbacConfig:
      policy.csv: |
        # Grant all members of the group 'my-org:team-alpha' the ability to sync apps in 'my-project'
        #p, my-org:team-alpha, applications, sync, my-project/*, allow
        # Grant all members of 'k-app-argocd-admin' the admin role
        g, k-app-argocd-admin, role:admin

  repoServer:
    autoscaling:
      enabled: true
      minReplicas: 2

    volumeMounts:
    - mountPath: "/etc/ssl/certs"
      name: ca-bundle
    volumes:
    - name: ca-bundle
      secret:
        secretName: ca-bundle

  applicationSet:
    replicaCount: 2

Expected behavior

Argo CD comes up with the HA configuration.

Screenshots


Additional context

Kubernetes v1.25.8
OS: Red Hat 9
k8s installed via kubeadm

I boosted the memory by 4 GB at a time, up to 48 GB on each of the 3 worker nodes. This is a newly set-up cluster, and this Argo CD deployment is pretty much the only thing running. If I ssh into each of the worker nodes, haproxy shows as using up all the CPU and memory.


lknite commented Apr 12, 2023

Could it be haproxy-specific? Which image is being used for redis-ha-haproxy?

Maybe this?
haproxy/haproxy#1834


mkilchhofer commented Apr 13, 2023

Hmm, the chart version you mentioned includes haproxy 2.6.4, while the issue you linked targets 2.6.3.
Also, OOM kills are about memory, not CPU.

Can you try adding some limits on the haproxy pods? E.g.

redis-ha:
  haproxy:
    resources:
      limits:
        cpu: 1
        memory: 512Mi
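
To apply the change, a plain helm upgrade with the updated values should do; the release name and namespace here are assumptions:

# re-apply the chart with the updated values (release/namespace are illustrative)
helm upgrade argocd argo/argo-cd --namespace argocd -f values.yaml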


lknite commented Apr 13, 2023

With limits set, it crashes much faster now.


lknite commented Apr 13, 2023

I'm trying out different versions of haproxy using:

    haproxy:
      # -- HAProxy tag
      image:
        tag: 2.7.0

Reference:
https://github.com/DandyDeveloper/charts/tree/master/charts/redis-ha
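
In the argo-cd chart these values nest under the redis-ha key, so an equivalent one-off override (release and namespace names are assumptions) would be something like:

# override just the haproxy image tag (release/namespace are illustrative)
helm upgrade argocd argo/argo-cd --namespace argocd -f values.yaml \
  --set redis-ha.haproxy.image.tag=2.7.0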


lknite commented Apr 13, 2023

Maybe this? docker-library/haproxy#194 (comment)


lknite commented Apr 14, 2023

Looks like Kubernetes relies on this being fixed at the container runtime level; in my case that's containerd, fixed like this:

# sed -i 's/LimitNOFILE=infinity/LimitNOFILE=65535/' /usr/lib/systemd/system/containerd.service
# systemctl daemon-reload
# systemctl restart containerd
# k delete deployment <asdf>
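
To verify the new limit took effect after the restart, a quick check (not part of the original fix) is:

# confirm containerd's effective open-file limit
systemctl show containerd | grep LimitNOFILE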

@sspreitzer

> Looks like Kubernetes relies on this being fixed at the container runtime level; in my case that's containerd, fixed like this:
>
> # sed -i 's/LimitNOFILE=infinity/LimitNOFILE=65535/' /usr/lib/systemd/system/containerd.service
> # systemctl daemon-reload
> # systemctl restart containerd
> # k delete deployment <asdf>

I do not understand how «out of memory» is related to «limit of number of open files».
And how does setting a lower-than-infinite number of open files fix the out-of-memory killing?
Can someone elaborate?

@sspreitzer

> I do not understand how «out of memory» is related to «limit of number of open files». And how does setting a lower-than-infinite number of open files fix the out-of-memory killing? Can someone elaborate?

Adding this comment as an elaboration source: kubernetes/kubernetes#3595 (comment)
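
If I read that thread correctly: containerd's stock systemd unit sets LimitNOFILE=infinity, so containers inherit an enormous file-descriptor limit, and haproxy sizes its fd tables and maxconn from ulimit -n at startup, pre-allocating memory in proportion to that limit, hence the OOM. The effective limit inside a pod can be checked with something like the following (namespace and pod name are taken from the listing above; adjust as needed):

# show the open-file limit the haproxy container actually inherits
kubectl -n argocd exec argocd-redis-ha-haproxy-d67fc9b6-gsddn -- sh -c 'ulimit -n'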

@sspreitzer

> Looks like Kubernetes relies on this being fixed at the container runtime level; in my case that's containerd, fixed like this:
>
> # sed -i 's/LimitNOFILE=infinity/LimitNOFILE=65535/' /usr/lib/systemd/system/containerd.service
> # systemctl daemon-reload
> # systemctl restart containerd
> # k delete deployment <asdf>

Creating a systemd drop-in via Ansible relaxed the situation for me, setting process defaults of soft 1024 and hard 524288 open files for the containerd service.

- name: Set RHEL9 ulimit
  hosts: all
  tasks:
    - name: Create dropin directory if not exists
      ansible.builtin.file:
        path: /etc/systemd/system/containerd.service.d
        state: directory
      when:
        - ansible_os_family == "RedHat"
        - ansible_distribution_major_version == "9"
    - name: Add ulimits dropin
      ansible.builtin.copy:
        dest: /etc/systemd/system/containerd.service.d/ulimits.conf
        content: |
          [Service]
          LimitNOFILE=
          LimitNOFILE=1024:524288
      when:
        - ansible_os_family == "RedHat"
        - ansible_distribution_major_version == "9"
      notify:
        - Restart containerd
  handlers:
    - name: Restart containerd
      ansible.builtin.systemd_service:
        daemon_reload: true
        name: containerd
        state: restarted
        enabled: true
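
A sketch of running it, assuming the playbook is saved as ulimits.yml (the inventory path is illustrative):

# apply the drop-in to all hosts in the inventory
ansible-playbook -i inventory/hosts.ini ulimits.yml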
