HA install's argocd-redis-ha-haproxy pods have runaway memory consumption #15319
Comments
After doing some testing with haproxy.cfg's fd-hard-limit, I think the "Related info" and "Proposed fix" above are the wrong direction. I was able to use fd-hard-limit to control e.g. haproxy's session limits, and verified that on haproxy's stats page (e.g. with fd-hard-limit 2500). But the limit changes did not affect the overall behavior of the haproxy pods' pod_memory_working_set_bytes metric shown above.

I also think the other GitHub issues linked above were probably describing different behavior from what we're seeing: those issues describe immediate memory consumption (within minutes), whereas we see steady growth of about 25 MB an hour, so it takes a day or so before we hit a real problem.

After also looking at ps and /proc/<pid>/status inside the haproxy containers, I'm questioning whether the haproxy container processes are really consuming memory, or just producing an inaccurate pod_memory_working_set_bytes metric. I now suspect the root cause of our situation is some disconnect with the pod_memory_working_set_bytes metric in our cluster, though I don't know why an issue like that would affect the haproxy containers and not all the others.
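For reference, the directive I was testing sits in the global section of haproxy.cfg; a minimal excerpt (the rest of the stock config was left unchanged):

```
global
    # Cap the file descriptors haproxy will use; haproxy derives its
    # connection/session limits from this, which is what showed up on the
    # stats page. It did not change the memory-growth behavior, though.
    fd-hard-limit 2500
```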
Applied the latest release of the Argo CD HA manifest (2.9.7) to a brand-new PoC cluster and found this issue because the haproxy containers were getting OOMKilled immediately on startup.
Found that the quickest way to fix this is by setting a global option in the haproxy config.
I have verified on our setup that
I found the culprit in our case. In our environment we updated to … Previously ArgoCD's plain haproxy manifests worked without issues in … The newer OS changed something, and haproxy goes into an OOMKill crash loop if … Defining …
Hi @pre, I'm facing the same issue. It seems to be related to the node's OS image, kernel version, and container runtime:
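(For anyone else comparing environments, those are the OS-IMAGE, KERNEL-VERSION and CONTAINER-RUNTIME columns from the node listing; a quick way to pull them:)

```console
kubectl get nodes -o wide
# relevant columns: OS-IMAGE, KERNEL-VERSION, CONTAINER-RUNTIME
```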
Describe the bug
We are running the HA install from the argo-cd official manifests (argoproj/argo-cd manifests/ha/namespace-install.yaml).
Over the course of a week, the argocd-redis-ha-haproxy pods grow unbounded in memory until they consume all the memory on the node. (We should also set a memory limit, but judging from other posted issues and threads, it seems that would just trade the problem for an OOMKilled CrashLoopBackOff.)
E.g. the haproxy pods are taking 10x the memory of anything else, totaling about 7.5 GB after 3 days:
They start small in memory but grow linearly the whole time (see monitoring screenshot below covering 3 days).
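The chart can be reproduced with a query along these lines (a sketch; it assumes the standard cAdvisor working-set metric and default labels, whereas our dashboard exposes a derived pod_memory_working_set_bytes series):

```promql
# Per-pod working-set memory of the haproxy pods in the argocd namespace.
# Assumes cAdvisor metrics scraped by Prometheus; adjust labels to your setup.
sum by (pod) (
  container_memory_working_set_bytes{namespace="argocd", pod=~"argocd-redis-ha-haproxy-.*", container!=""}
)
```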
To Reproduce
Just running the HA install with this version of Argo CD. There are only 3 small Applications defined and very little web traffic (maybe 2 users accessing it a couple of times a day). From reading other issues and threads, it sounds like this is environment-dependent to a degree, but it definitely affects multiple environments.
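For completeness, the install amounts to applying the HA namespace-install manifest into the argocd namespace (sketch; substitute the tag listed under Version below):

```console
kubectl create namespace argocd
# <version> is the Argo CD tag listed under Version below
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/<version>/manifests/ha/namespace-install.yaml
```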
Expected behavior
Given a constant amount of web traffic, each haproxy pod should stay roughly constant in memory usage, hopefully within the ~300 MB range that the other pods are using.
Screenshots
Version
Related info
1. Haproxy pod keeps OOM crashing on Rocky Linux 9 with kernel 5.14 #12289 was opened earlier this year describing similar memory exhaustion in Argo CD's haproxy pods, but was closed because the issue was deemed environment-specific (Rocky Linux rather than CentOS).
2. [argo-cd] Enabling ha with autoscaling results in redis-ha-haproxy crashing with OOMKilled argo-helm#1958 was opened a little later this year describing similar memory exhaustion in Argo CD's haproxy pods; the reporter closed it when they discovered a workaround they could apply at the K8s node level (editing /usr/lib/systemd/system/containerd.service). Note that this reporter's environment was different from the one above (RHEL 9, I think, rather than Rocky Linux).
3. OOMkilled haproxy/haproxy#2121 was opened by the same reporter as no. 2 and was ultimately closed there with the same K8s node workaround.
4. Memory exhaustion using haproxy image docker-library/haproxy#194 was opened last year describing similar memory exhaustion in the haproxy image, and the issue is still open. A workaround for the Docker engine (docker run --ulimit) is described. This reporter's environment was also different (Fedora).
The K8s node workaround from no. 2 and no. 3 is not a good solution for us: we're using a K8s distribution (TKG) that is fairly dynamic about creating new nodes and doesn't give us a good way to hook in customizations like editing containerd.service. Our TKG clusters use nodes running VMware's Photon OS with containerd.
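For context, if I'm reading those issues right, the containerd.service edit boils down to capping the file-descriptor limit that containerd hands to containers (haproxy sizes its internal tables from that limit). A hypothetical sketch of that kind of change, written as a systemd drop-in rather than editing the unit in place; the value is illustrative only:

```ini
# /etc/systemd/system/containerd.service.d/99-limit-nofile.conf
# Hypothetical: replace containerd's default fd limit (often "infinity") with a finite cap.
[Service]
LimitNOFILE=1048576
```

followed by a systemctl daemon-reload and a containerd restart on each node, which is exactly the kind of per-node intervention we can't easily automate in TKG.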
But no. 3 above also suggests tuning haproxy directly instead, which may be a better solution:
Proposed fix
Could Argo CD's HA manifests follow the suggestion about tuning haproxy with the global parameter fd-hard-limit, e.g. in the haproxy.cfg key of argocd-redis-ha-configmap? A sketch of what that might look like is below.
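A minimal sketch of the idea (not a tested patch; it assumes the existing argocd-redis-ha-configmap layout with a haproxy.cfg key, and the 2500 value is only carried over from the experiment described in the comments, so the right default would need discussion):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-redis-ha-configmap
data:
  haproxy.cfg: |
    global
      # Cap haproxy's file-descriptor usage so it stops sizing itself off the
      # (potentially huge) container rlimit. Value is illustrative only.
      fd-hard-limit 2500
      # ... rest of the existing global section and the remaining
      # haproxy.cfg content stays as it is today ...
```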