HA install's argocd-redis-ha-haproxy pods have runaway memory consumption #15319
Comments
After doing some testing with haproxy.cfg's fd-hard-limit, I think the "Related info" and "Proposed fix" above are the wrong direction. I was able to use fd-hard-limit to control e.g. haproxy's session limits, and verified that on haproxy's stats page (e.g. with fd-hard-limit 2500). But the limit changes did not affect the overall behavior of the haproxy pods' pod_memory_working_set_bytes metric shown above.

I also think the other GitHub issues linked above were probably describing different behavior from what we're seeing: those issues describe immediate memory consumption (within minutes), whereas we see steady growth of about 25 MB an hour, so it takes a day or so before we hit a real problem.

After also looking at ps and /proc/<pid>/status inside the haproxy containers, I'm questioning whether the haproxy container processes are really consuming memory, or just producing an inaccurate pod_memory_working_set_bytes metric. I now suspect the root cause of our situation is some disconnect with the pod_memory_working_set_bytes metric in our cluster, though I don't know why an issue like that would affect the haproxy containers and not all the others.
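For reference, the directive I was testing sits in the global section of haproxy.cfg; a minimal excerpt (the rest of the stock config was left unchanged):

```
global
    # Cap the file descriptors haproxy will use; haproxy derives its
    # connection/session limits from this, which is what showed up on the
    # stats page. It did not change the memory-growth behavior, though.
    fd-hard-limit 2500
```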
Applied the latest release of the Argo CD HA manifest (2.9.7) to a brand-new PoC cluster and found this issue because the haproxy containers were getting OOMKilled immediately on startup.
Found that the quickest way to fix this is by setting a global option in the haproxy config.
I have verified on our setup that
I found the culprit in our case. In our environment we updated to … Previously ArgoCD's plain haproxy manifests worked without issues in … The newer OS changed something, and haproxy goes into an OOMKill crash loop if … Defining …
Hi @pre, I'm facing the same issue. It seems to be related to the node's OS image, kernel version, and container runtime:
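(For anyone else comparing environments, those are the OS-IMAGE, KERNEL-VERSION and CONTAINER-RUNTIME columns from the node listing; a quick way to pull them:)

```console
kubectl get nodes -o wide
# relevant columns: OS-IMAGE, KERNEL-VERSION, CONTAINER-RUNTIME
```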
Describe the bug
We are running the HA install from the argo-cd official manifests (argoproj/argo-cd manifests/ha/namespace-install.yaml).
Over the course of a week, the argocd-redis-ha-haproxy pods grow unbounded in memory until they consume all the memory on the node. (We should also set a memory limit, but judging from other posted issues and threads, it seems that would just trade the problem for an OOMKilled CrashLoopBackOff.)
E.g. the haproxy pods are taking 10x the memory of anything else, totaling about 7.5 GB after 3 days:
They start small in memory but grow linearly the whole time (see monitoring screenshot below covering 3 days).
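The chart can be reproduced with a query along these lines (a sketch; it assumes the standard cAdvisor working-set metric and default labels, whereas our dashboard exposes a derived pod_memory_working_set_bytes series):

```promql
# Per-pod working-set memory of the haproxy pods in the argocd namespace.
# Assumes cAdvisor metrics scraped by Prometheus; adjust labels to your setup.
sum by (pod) (
  container_memory_working_set_bytes{namespace="argocd", pod=~"argocd-redis-ha-haproxy-.*", container!=""}
)
```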
To Reproduce
Just running the HA install with this version of Argo CD. There are only 3 small Applications defined and very little web traffic (maybe 2 users accessing it a couple of times a day). From reading other issues and threads, it sounds like this is environment-dependent to a degree, but it definitely affects multiple environments.
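For completeness, the install amounts to applying the HA namespace-install manifest into the argocd namespace (sketch; substitute the tag listed under Version below):

```console
kubectl create namespace argocd
# <version> is the Argo CD tag listed under Version below
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/<version>/manifests/ha/namespace-install.yaml
```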
Expected behavior
Given a constant amount of web traffic, each haproxy pod should stay roughly constant in memory usage, hopefully within the ~300 MB range that the other pods are using.
Screenshots
Version
Related info
1. Haproxy pod keeps OOM crashing on Rocky Linux 9 with kernel 5.14 #12289 was opened earlier this year describing similar memory exhaustion in Argo CD's haproxy pods, but was closed because the issue was deemed environment-specific (Rocky Linux rather than CentOS).
2. [argo-cd] Enabling ha with autoscaling results in redis-ha-haproxy crashing with OOMKilled argo-helm#1958 was opened a little later this year describing similar memory exhaustion in Argo CD's haproxy pods; the reporter closed it when they discovered a workaround they could apply at the K8s node level (editing /usr/lib/systemd/system/containerd.service). Note that this reporter's environment was different from the one above (RHEL 9, I think, rather than Rocky Linux).
3. OOMkilled haproxy/haproxy#2121 was opened by the same reporter as no. 2 and was ultimately closed there with the same K8s node workaround.
4. Memory exhaustion using haproxy image docker-library/haproxy#194 was opened last year describing similar memory exhaustion in the haproxy image, and the issue is still open. A workaround for the Docker engine (docker run --ulimit) is described. This reporter's environment was also different (Fedora).
The K8s node workaround from no. 2 and no. 3 is not a good solution for us: we're using a K8s distribution (TKG) that is fairly dynamic about creating new nodes and doesn't give us a good way to hook in customizations like editing containerd.service. Our TKG clusters use nodes running VMware's Photon OS with containerd.
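For context, if I'm reading those issues right, the containerd.service edit boils down to capping the file-descriptor limit that containerd hands to containers (haproxy sizes its internal tables from that limit). A hypothetical sketch of that kind of change, written as a systemd drop-in rather than editing the unit in place; the value is illustrative only:

```ini
# /etc/systemd/system/containerd.service.d/99-limit-nofile.conf
# Hypothetical: replace containerd's default fd limit (often "infinity") with a finite cap.
[Service]
LimitNOFILE=1048576
```

followed by a systemctl daemon-reload and a containerd restart on each node, which is exactly the kind of per-node intervention we can't easily automate in TKG.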
But no. 3 above also suggests tuning haproxy directly instead, which may be a better solution:
Proposed fix
Could Argo CD's HA manifests follow the suggestion about tuning haproxy with the global parameter fd-hard-limit, e.g. in the haproxy.cfg key of argocd-redis-ha-configmap? A sketch of what that might look like is below.
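A minimal sketch of the idea (not a tested patch; it assumes the existing argocd-redis-ha-configmap layout with a haproxy.cfg key, and the 2500 value is only carried over from the experiment described in the comments, so the right default would need discussion):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-redis-ha-configmap
data:
  haproxy.cfg: |
    global
      # Cap haproxy's file-descriptor usage so it stops sizing itself off the
      # (potentially huge) container rlimit. Value is illustrative only.
      fd-hard-limit 2500
      # ... rest of the existing global section and the remaining
      # haproxy.cfg content stays as it is today ...
```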