Memory leak in proxy? #388

Open
snickell opened this issue Apr 7, 2022 · 20 comments

@snickell

snickell commented Apr 7, 2022

Aloha, we've been seeing a pattern of growing daily memory usage (followed by increasing sluggishness, then non-responsiveness, above around 1-2GB of RAM) in the 'proxy' pod:
[screenshot: proxy pod memory usage graph; each color is a fresh proxy restart]

The different colors are fresh proxy reboots, which have been required to keep the cluster running.

[screenshot: Screen Shot 2022-04-07 at 5 01 00 AM]

-Seth

@snickell snickell added the bug label Apr 7, 2022
@snickell
Author

snickell commented Apr 7, 2022

Sorry, clipped the units:
[screenshot: the same memory usage graph with the y-axis units visible]

The pattern is nearly identical on the other cluster.

@snickell
Author

snickell commented Apr 7, 2022

We're running z2jh chart version 1.1.3-n354.h751bc313 (I believe that was the latest ~3 weeks ago), but as you can see, this pattern predates this chart version by quite a bit.

@consideRatio consideRatio transferred this issue from jupyterhub/zero-to-jupyterhub-k8s Apr 7, 2022
@snickell
Author

snickell commented Apr 7, 2022

We start seeing serious performance problems at about 1.5GB, which is suspiciously close to the heap limit for node 🤔 So maybe it's a memory leak that then cascade-fails at the heap limit into some sort of... garbage collection nightmare? Or?

@manics
Member

manics commented Apr 7, 2022

Do you happen to know if the memory increases are correlated with particular events, e.g. a user starting a new server, or connecting to a particular service?

@snickell
Author

snickell commented Apr 7, 2022

No, but I'm looking into it. My vague suspicion: websockets? We push them pretty hard, e.g. many users are streaming VNC over websocket. Is there a log mode that has useful stats about e.g. the routing table?
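(For what it's worth, CHP does expose its routing table over its REST API, so something like the rough sketch below can be used to watch how many routes it holds over time. It assumes z2jh's default proxy-api service on port 8001 and the CONFIGPROXY_AUTH_TOKEN available in the hub container; adjust for other setups.)

```python
# Rough sketch: dump configurable-http-proxy's routing table via its REST API.
# Assumptions: run from the hub container, where CONFIGPROXY_AUTH_TOKEN is set
# and the proxy API is reachable at z2jh's default http://proxy-api:8001.
import json
import os

import requests

api_url = "http://proxy-api:8001"  # assumption: z2jh's default proxy-api service
token = os.environ["CONFIGPROXY_AUTH_TOKEN"]

resp = requests.get(f"{api_url}/api/routes", headers={"Authorization": f"token {token}"})
resp.raise_for_status()
routes = resp.json()

print(f"{len(routes)} routes in the table")
print(json.dumps(routes, indent=2, sort_keys=True))
```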

@snickell
Author

OK, so a further development: since high RAM usage correlated with performance problems, I added a k8s memory limit to the pod, thinking it would get killed when it passed 1.4GB of RAM and restart fresh, a decent-ish workaround for now.

Here's what happened instead:
[screenshot: proxy memory usage after adding the memory limit; growth has flattened]

Note that there's one other unusual thing here: I kubectl exec'ed several 200MB "ram balloon" processes to try to push it over the edge faster for testing. They clearly didn't work haha, and I doubt that's why this is not growing at the normal leakage rate, but it's worth mentioning.

Did something else change, or did adding a k8s memory limit suddenly change the behavior?

@snickell
Author

(Note: this otherwise consistent memory growth pattern goes back to January, across a number of z2jh chart upgrades since... this is... weird.)

@consideRatio
Member

Hmmm, so when the pod restarts, is it because it has been evicted from a node, or is it because it has restarted its process within the container, etc.?

Being evicted from a node can happen based on external logic, while managing memory within the container happens based on more internal logic, which can be enabled by setting limits to clarify that it must not surpass certain thresholds.

I need to learn more about OOMKiller behavior within the container vs. by the kubelet etc., but perhaps you ended up helping it avoid getting evicted by surpassing its memory limit. Hmmm..

@rcthomas
Contributor

rcthomas commented Jul 6, 2022

@snickell was what you observed related to load at all? Like, on weekend days do you observe this behavior? We're currently experiencing relatively-speaking high load on our deployment, and I observe something similar. Memory consumption in the proxy will just suddenly shoot up and it becomes non-responsive. Are you still using CHP for your proxy? I am considering swapping it for Traefik in the coming days here.

@consideRatio
Member

consideRatio commented Jul 14, 2022

@snickell have you experienced this with older versions of z2jh -> chp as well?

@marcelofernandez

Still happening on the latest version (v4.5.6).

@shaneknapp

shaneknapp commented Jun 6, 2024

see also #434

i believe the socket leak is the root cause of the memory leak. on our larger, more active hubs we've seen constant spiking of the chp ram under "load", and chp running out of heap space: #434 (comment)

"load" is ~300+ users logging in around the "same time".

"same time" is anywhere from 15m to a couple of hours.

i don't believe that increasing the chp heap size is the correct fix, as the memory/socket leak still needs to be addressed. however, increasing it may help, but that would need some experimentation.
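(A rough way to check the socket-leak theory is to watch the proxy process's open-socket count over time. A sketch, assuming it runs inside the proxy container, e.g. via kubectl exec, and that CHP's node process is PID 1 there:)

```python
# Rough sketch: watch how many sockets the CHP process holds open.
# Assumptions: runs inside the proxy container (e.g. via `kubectl exec`) and
# CHP's node process is PID 1 there; adjust PID for other images.
import os
import time

PID = 1  # assumption: CHP is the container's main process


def open_sockets(pid: int) -> int:
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            if os.readlink(os.path.join(fd_dir, fd)).startswith("socket:"):
                count += 1
        except OSError:
            pass  # fd was closed between listdir() and readlink()
    return count


while True:
    print(time.strftime("%H:%M:%S"), open_sockets(PID), flush=True)
    time.sleep(60)
```

A count that climbs steadily even when users aren't arriving would point at the same leak as #434.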

@marcelofernandez

We finally replaced chp with traefik in our z2jh deployment, and this obviously fixed the problem. 😬

Check out that alternative just in case you are experiencing this.

@shaneknapp

shaneknapp commented Jun 7, 2024 via email

@consideRatio
Member

@marcelofernandez are you able to share config for your setup?

@shaneknapp

We finally replaced chp with traefik in our z2jh deployment, and this obviously fixed the problem. 😬

Check out that alternative just in case you are experiencing this.

echoing @consideRatio -- do you have any relevant traefik config bits you could share? this would be super useful! :)

thanks in advance...

@marcelofernandez

Hey guys, sure!

First and foremost, I'm sorry I can't give you all the details of my company's internal PR because:

  • I don't want to go into any IP issues, and (most importantly),
  • We're still using a very old version of z2jh, so I'm not sure how much of this is still relevant to the latest versions. In a perfect world I'd prepare a PR for z2jh without a hitch.

That said, I can give you an overview of what I did.

The complicated part was that it seemed like nobody had done this before, so I based my work on this (far more ambitious) previous, rejected PR, which was originally aimed at replacing both proxies:

  • The HTTP -> HTTPS one (the TLS-terminating frontend called autohttps), and
  • configurable-http-proxy, but:
    • Also making it HA-ready, supporting more than one instance of the proxy (making it more scalable), and
    • Adding a Consul service to store all the proxies' shared config, etc., which brought more complexity to the PR.

The only thing I did (because I only wanted stability) based on that PR was to:

  • Drop the configurable-http-proxy Pod, and
  • Replace it with just one container of Traefik inside the Hub Pod,
  • Using the JupyterHub Traefik Proxy component (running in the Hub container) to automatically configure the Traefik container.
  • Now both containers (Hub + Traefik) run in the same Pod, still called Hub.

Based on Z2JH's architecture diagram, here are the changes.

Before:
[diagram: current z2jh architecture, with configurable-http-proxy in its own proxy pod]

After:
[diagram: modified architecture, with Traefik as a second container in the hub pod]

Once I had defined what I wanted, I had to drop unneeded code from the PR above, configure the hub to call the proxy in the same pod (http://localhost:8081), and that's it.

I implemented this about a year and a half ago; if you have more questions, just let me know...

Regards
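(For anyone wanting to try something similar, the proxy-class side of this lives in jupyterhub_config.py and looks roughly like the sketch below. This is a hedged sketch, not the config from the internal PR: class and trait names come from the jupyterhub-traefik-proxy docs and have changed between versions, and the URLs, credentials, and file path are placeholders.)

```python
# Sketch of pointing JupyterHub at a Traefik sidecar via jupyterhub-traefik-proxy
# instead of configurable-http-proxy. Fragment of jupyterhub_config.py; `c` is
# provided by JupyterHub. Names may differ across jupyterhub-traefik-proxy
# versions, and all values below are placeholders.

# Use the Traefik file-provider proxy implementation
c.JupyterHub.proxy_class = "traefik_file"

# Traefik runs as its own container in the hub pod, so the hub should not
# launch a proxy process itself
c.TraefikFileProviderProxy.should_start = False

# How the hub reaches Traefik's API, and where dynamic routes are written
# (the rules file must live on a volume shared with the Traefik container)
c.TraefikFileProviderProxy.traefik_api_url = "http://localhost:8099"
c.TraefikFileProviderProxy.traefik_api_username = "jupyterhub"
c.TraefikFileProviderProxy.traefik_api_password = "change-me"
c.TraefikFileProviderProxy.dynamic_config_file = "/etc/traefik/rules/jupyterhub.toml"
```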

@manics
Member

manics commented Aug 13, 2024

4.6.2 was released 2 months ago with a fix for the leaking sockets. Is there still a memory leak or can we close this issue?

@shaneknapp

@manics i don't think we should close this yet... we still saw chp run out of nodejs heap on hubs w/lots of traffic and usage even after we deployed 4.6.2, but since summer is slow it hasn't bitten us yet.

i'm sure that within a few weeks we'll see OOMs/socket leaks once the fall term ramps up.

@minrk
Member

minrk commented Aug 15, 2024

If anyone can make a stress test to provoke this, ideally with just CHP (or the JupyterHub Proxy API, like the traefik proxy benchmarks), I can test whether the migration to http2-proxy will help. I tried a simple local test with a simple backend and apache-bench, but many millions of requests and hundreds of gigabytes later, I see no significant increase in memory or socket consumption (still sub-100MB). So there must be something relevant in typical use (websockets, connections dropped in a particular way, adding/removing routes, etc.) that a naïve benchmark doesn't trigger.
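In case it helps anyone reproduce this, a rough websocket-churn sketch is below. It assumes a local CHP instance on port 8000 proxying to a websocket echo backend and the third-party websockets package; the counts are arbitrary and whether it triggers the leak likely depends on how connections are torn down, per the comment above.

```python
# Rough websocket-churn load generator, to see whether rapid open/close cycles
# leak sockets or memory in CHP. Assumptions: CHP listens on localhost:8000 and
# proxies to a websocket echo backend; `pip install websockets`.
import asyncio

import websockets

URL = "ws://127.0.0.1:8000/"  # placeholder proxy endpoint


async def churn() -> None:
    try:
        async with websockets.connect(URL) as ws:
            await ws.send(b"x" * 65536)            # push a frame through the proxy
            await asyncio.wait_for(ws.recv(), 5)   # expect the echo back
    except Exception:
        pass  # abrupt failures are part of the point: exercise CHP's cleanup paths


async def worker(connections: int) -> None:
    for _ in range(connections):
        await churn()


async def main(workers: int = 200, connections_per_worker: int = 500) -> None:
    # 200 concurrent workers x 500 connections each = 100k short-lived websockets
    await asyncio.gather(*(worker(connections_per_worker) for _ in range(workers)))


if __name__ == "__main__":
    asyncio.run(main())
```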
