HA install's argocd-redis-ha-haproxy pods have runaway memory consumption #15319

Open
jdoylei opened this issue Sep 1, 2023 · 6 comments · May be fixed by #18283
Labels
bug Something isn't working

Comments

jdoylei commented Sep 1, 2023

Describe the bug

We are running the HA install from the argo-cd official manifests (argoproj/argo-cd manifests/ha/namespace-install.yaml).

Over the course of a week, the argocd-redis-ha-haproxy pods grow unbounded in memory until they consume all the memory on the node (we should also set a memory limit, but judging from other posted issues and threads, that would just switch the problem to an OOMKilled CrashLoopBackOff).

E.g. the haproxy pods are taking 10x the memory of anything else, totaling about 7.5 GB after 3 days:

$ kubectl top pod --sort-by=memory
NAME                                               CPU(cores)   MEMORY(bytes)
argocd-redis-ha-haproxy-5b8c745498-q49mz           2m           2500Mi
argocd-redis-ha-haproxy-5b8c745498-sh4bd           1m           2493Mi
argocd-redis-ha-haproxy-5b8c745498-rxkst           2m           2485Mi
argocd-redis-ha-server-2                           381m         292Mi
argocd-redis-ha-server-1                           101m         288Mi
argocd-redis-ha-server-0                           187m         287Mi
argocd-application-controller-0                    246m         275Mi
...

They start small in memory but grow linearly the whole time (see monitoring screenshot below covering 3 days).

To Reproduce

Just running the HA install with this version of Argo CD reproduces it. There are only 3 small Applications defined and very little web traffic - maybe 2 users accessing it a couple of times a day. From reading other issues and threads, it sounds like this is environment-dependent to a degree, but it definitely affects multiple environments.

Expected behavior

Given a constant amount of web traffic, each haproxy pod should stay roughly constant in memory usage, ideally in the ~300 MB range that the other pods are using.

Screenshots

[Screenshot: monitoring graph of argocd-redis-ha-haproxy pod working-set memory growing linearly over 3 days]

Version

$ argocd version
argocd: v2.5.16+84fbc93
  BuildDate: 2023-03-23T15:29:04Z
  GitCommit: 84fbc930161f29ebe45a7da3b2e81ee256d119c2
  GitTreeState: clean
  GoVersion: go1.18.10
  Compiler: gc
  Platform: windows/amd64
argocd-server: v2.5.16+84fbc93
  BuildDate: 2023-03-23T14:57:24Z
  GitCommit: 84fbc930161f29ebe45a7da3b2e81ee256d119c2
  GitTreeState: clean
  GoVersion: go1.18.10
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.5.7 2022-08-02T16:35:54Z
  Helm Version: v3.10.3+g835b733
  Kubectl Version: v0.24.2
  Jsonnet Version: v0.18.0

Related info

  1. Haproxy pod keeps OOM crashing on Rocky Linux 9 with kernel 5.14 #12289 was opened earlier this year describing similar memory exhaustion in Argo CD's haproxy pods, but was closed because the issue appeared to be environment-specific (Rocky Linux rather than CentOS).

  2. [argo-cd] Enabling ha with autoscaling results in redis-ha-haproxy crashing with OOMKilled argo-helm#1958 was opened a little later this year describing similar memory exhaustion in Argo CD's haproxy pods, and the reporter closed it when they discovered a workaround they could apply at the K8s node level (edit /usr/lib/systemd/system/containerd.service). Note that the reporter's environment was different from above (RHEL 9 I think, rather than Rocky Linux).

  3. OOMkilled haproxy/haproxy#2121 was opened by the same reporter as no. 2 above, who ultimately closed it there as well with the same K8s node workaround.

  4. Memory exhaustion using haproxy image docker-library/haproxy#194 was opened last year describing similar memory exhaustion with the haproxy image, and that issue is still open. A workaround for Docker Engine ("docker run --ulimit") is described. That reporter's environment was also different (Fedora).

The K8s node workaround from no. 2 and no. 3 is not a good solution for us, as we're using a K8s distribution (TKG) that is pretty dynamic about creating new nodes and doesn't give a good way to hook in customizations like editing containerd.service. Our TKG K8s clusters use nodes with VMware's Photon OS and containerd.
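
For reference, the node-level workaround from no. 2 and no. 3 is roughly the following - a sketch only, since the exact unit path and a sensible value vary by distribution, and it's exactly the kind of node customization we can't hook into on TKG:

# /usr/lib/systemd/system/containerd.service (or a systemd drop-in override)
[Service]
# On the affected nodes the runtime passes an effectively unlimited
# open-file limit to containers, which haproxy then sizes its connection
# tables from; capping LimitNOFILE (value here is just an example) is
# what the reporters in no. 2 and no. 3 did.
LimitNOFILE=1048576

# then reload systemd and restart containerd on the node:
systemctl daemon-reload
systemctl restart containerd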

But no. 3 above also suggests tuning haproxy directly instead, which may be a better solution:

Or set a hard limit on FDs as documented:
http://docs.haproxy.org/2.6/configuration.html#3.1-fd-hard-limit

Proposed fix

Could Argo CD's HA manifests follow the suggestion to tune haproxy with the global parameter fd-hard-limit, e.g. in the haproxy.cfg in argocd-redis-ha-configmap?
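
For illustration, this is roughly what it could look like in the global section of haproxy.cfg in the argocd-redis-ha-configmap ConfigMap (16384 is an arbitrary example value, not a recommendation; fd-hard-limit requires haproxy 2.5+):

global
    # Cap the number of file descriptors haproxy will use, so it stops
    # deriving its limits from an effectively unlimited RLIMIT_NOFILE
    # inherited from the container runtime.
    fd-hard-limit 16384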

jdoylei added the bug (Something isn't working) label on Sep 1, 2023
jdoylei commented Sep 9, 2023

After doing some testing with haproxy.cfg fd-hard-limit, I think the "Related info" and "Proposed fix" above are the wrong direction. I was able to use fd-hard-limit to control haproxy's session limits and to verify that via haproxy's stats page, e.g. with fd-hard-limit 2500:

[Screenshot: haproxy stats page showing session limits derived from fd-hard-limit 2500]

But the limit changes didn't affect the overall behavior of the haproxy pods' pod_memory_working_set_bytes metric shown above. And I think the other GitHub issues linked above were probably describing different behavior from ours - immediate memory consumption (within minutes), whereas we're seeing steady growth of about 25 MB an hour, so it takes a day or so before we see a real issue.

After also looking at "ps" and "/proc/n/status" in the haproxy containers, I'm questioning whether the haproxy container processes are really consuming memory, or just causing an inaccurate pod_memory_working_set_bytes metric. I'm thinking the root cause of our situation is some disconnect with the pod_memory_working_set_bytes metric in our cluster - but I don't know why an issue like that would impact the haproxy containers as opposed to all the others.
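
A rough sketch of the kind of comparison, for anyone who wants to repeat it (namespace is just an example, and this assumes haproxy runs as PID 1 in the container):

# working set as reported by metrics-server / the kubelet
kubectl -n argocd top pod argocd-redis-ha-haproxy-5b8c745498-q49mz

# resident set size of the haproxy process itself, from inside the container
kubectl -n argocd exec argocd-redis-ha-haproxy-5b8c745498-q49mz -- \
  grep -E 'VmRSS|VmHWM' /proc/1/status

If haproxy's RSS stays flat while the working-set metric keeps climbing, the growth is being charged to the container's cgroup outside the process's RSS (e.g. kernel or cache memory), which would point at an accounting issue rather than a leak in haproxy itself.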

tophercullen commented Mar 6, 2024

Applied the latest release of the Argo CD HA manifest (2.9.7) to a brand-new PoC cluster and found this issue because the haproxy containers were getting OOMKilled immediately on startup.

@timgriffiths

The quickest way I've found to fix this is setting maxconn 4000 in the haproxy config's global section. You can also fix it by changing the max open file limit in containerd, but as this comment points out (docker-library/haproxy#194 (comment)), that only works because haproxy derives its maximum number of connections from the system's max open files - which seems like a bit of a bug, or at least we should set a maximum as part of the config.
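
Concretely, that's just one line in the global section of haproxy.cfg (4000 is the value that worked here, not a tuned recommendation):

global
    # Without an explicit maxconn, haproxy derives it from the process's
    # open-file limit, which can be effectively unlimited under newer
    # containerd/OS combinations and causes huge up-front allocations.
    maxconn 4000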

pre commented Sep 27, 2024

I have verified on our setup that maxconn 4096 allows haproxy to sit in the failure loop peacefully. Without maxconn, haproxy consumes all the memory it can get within seconds when all of the redis backends become unreachable.

pre commented Sep 27, 2024

I found the culprit in our case. In our environment we updated to containerd://1.6.24 in Garden Linux 1592.1.

Previously ArgoCD's plain haproxy manifests worked without issues in containerd://1.6.24 with Garden Linux 1443.10.

The newer OS changed something, and haproxy goes into an OOMKill crash loop if maxconn 4096 is not set in haproxy.cfg.

Defining maxconn should be the default in all manifests delivered by ArgoCD (both the Helm chart and https://raw.githubusercontent.com/argoproj/argo-cd/${VERSION}/manifests/ha/install.yaml).
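
Until that's the default, the setting has to be added to the shipped ConfigMap by hand. A quick way to check that the rendered config actually contains it and to roll the pods (namespace is assumed to be argocd):

kubectl -n argocd get configmap argocd-redis-ha-configmap \
  -o jsonpath='{.data.haproxy\.cfg}' | grep -n maxconn

# haproxy only reads the config at startup, so restart the deployment
kubectl -n argocd rollout restart deployment argocd-redis-ha-haproxy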

jhanbo commented Oct 1, 2024

Hi @pre,

I'm facing the same issue. It seems to be related to the Garden Linux upgrade, since the container runtime changed to containerd://1.7.2 in Garden Linux 1592.1, whereas Garden Linux 1443.10 had containerd://1.6.24 as you mentioned.

OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
Garden Linux 1592.1 6.6.47-cloud-amd64 containerd://1.7.2
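
One quick way to confirm what the haproxy process actually inherits under the newer runtime (deployment name as in the HA manifests, namespace assumed to be argocd):

kubectl -n argocd exec deploy/argocd-redis-ha-haproxy -- cat /proc/1/limits
# the "Max open files" row is what haproxy sizes its connection tables from
# when neither maxconn nor fd-hard-limit is set in haproxy.cfg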

timgriffiths added a commit to timgriffiths/argo-cd that referenced this issue Nov 21, 2024
timgriffiths added a commit to timgriffiths/argo-cd that referenced this issue Nov 24, 2024
timgriffiths added a commit to timgriffiths/argo-cd that referenced this issue Nov 24, 2024