
Add missing default values for proxy pod's probes #2423

Conversation

consideRatio (Member) commented Oct 4, 2021

A follow-up on #2421, thanks to @mriedem's input in #2421 (comment).

The Kubernetes native default for a probe's failureThreshold is 3, and the default for timeoutSeconds is 1. I opt to make the Helm chart defaults a bit more tolerant though, waiting a bit longer for the pod to become responsive, as we already do for the hub pod - now also for the proxy pod.
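For context, probe settings in the chart are configured under proxy.chp in values.yaml. The sketch below shows the general shape only - the numbers are illustrative placeholders, not necessarily the exact defaults introduced by this PR:

```yaml
# Illustrative sketch only - numbers are placeholders, not the merged defaults.
proxy:
  chp:
    livenessProbe:
      initialDelaySeconds: 60   # give chp time to start before the first probe
      periodSeconds: 10         # probe interval
      failureThreshold: 30      # consecutive failures tolerated before a restart
      timeoutSeconds: 3         # more tolerant than the k8s default of 1 second
    readinessProbe:
      initialDelaySeconds: 0
      periodSeconds: 2
      failureThreshold: 1000    # readiness failures only unroute traffic, never restart
      timeoutSeconds: 1
```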


@mriedem what do you think?

mriedem (Contributor) commented Oct 6, 2021

I opt to make the Helm chart defaults a bit more tolerant though, waiting a bit longer for the pod to become responsive, as we already do for the hub pod - now also for the proxy pod.

The logic makes sense to me since the default proxy (chp) is also a singleton (like the hub), so we want to be a bit more conservative with the liveness probe so it's not killing the proxy all the time, especially under load - like a lot of users signing on at once at the beginning of some event.

I looked over our proxy liveness probe failure metrics for the last month and there were 3 instances where we would have tripped the liveness probe:

[screenshot: proxy liveness probe failure metrics for the last month]

Those were also during a high-load / user event and there were users complaining at the time about issues with accessing their notebooks (they were getting 5xx errors). As I looked more into that issue it seemed the chp proxy was the problem - response times steadily increased and memory usage grew:

[screenshots: proxy response time and memory usage graphs during the incident]

We're currently using chp version 4.4.0.

In our case the liveness probe restarting the proxy might have helped but it's hard to tell since we're not in the middle of that issue anymore.

With this change the proxy liveness probe will only restart the container after 90 seconds of consecutive failures. Since we hit a max of 6 failures in the graph above, I'm not sure we would have hit the failure threshold of 30 set here, but I think that's a decent default since, like I said, we don't want the opposite problem of restarting the proxy too often given that it's a singleton. So +1 from me.
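As a rough rule of thumb (a general Kubernetes relationship, not a statement of this chart's exact merged values), the time a liveness probe tolerates before the kubelet restarts the container is approximately failureThreshold × periodSeconds:

```yaml
# Hypothetical numbers to illustrate the arithmetic only:
livenessProbe:
  periodSeconds: 3        # probes are spaced 3 seconds apart
  failureThreshold: 30    # 30 consecutive failures are tolerated
# -> roughly 30 x 3 = 90 seconds of consecutive failures before a restart;
#    with periodSeconds: 10 the same threshold stretches to ~300 seconds.
```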

mriedem (Contributor) left a review comment:


I think this is a sane default.

consideRatio merged commit b6688f5 into jupyterhub:main on Oct 6, 2021
mriedem (Contributor) commented Oct 6, 2021

Those were also during a high-load / user event and there were users complaining at the time about issues with accessing their notebooks (they were getting 5xx errors).

For more context, users (and I, when I tried to recreate the issue) were getting 522 errors from Cloudflare, and that was likely a result of the slow responses from the proxy in our case.

consideRatio pushed a commit to jupyterhub/helm-chart that referenced this pull request on Oct 6, 2021:
Merge pull request #2423 from consideRatio/pr/fix-proxy-probes-default-values (jupyterhub/zero-to-jupyterhub-k8s#2423)
consideRatio (Member, Author) commented:
@mriedem wow, thank you for such a thought-out and well-researched response!!!

What would make Cloudflare give a 522 error at all - is it involved in proxying all incoming traffic? I guess it could be doing that as protection from DDoS attacks or similar.

mriedem (Contributor) commented Oct 6, 2021

What would make Cloudflare give a 522 error at all - is it involved in proxying all incoming traffic?

Cloudflare is there for routing public ingress traffic to IBM Cloud and our JupyterHub deployment running on the Kubernetes service there. There were a lot of 503s in the chp logs as well, but those are more or less normal, I've found - I think they also happen if the browser tab is left open but the underlying notebook kernel dies.

My guess was that the load on the proxy (and whatever was going on with the runaway memory there - maybe a giant cache that just needed to be flushed with a restart) was causing slow responses when fetching notebook content, and that probably triggered the Cloudflare timeout. We might have also been having Cloud Object Storage issues that weekend, which matters for us since we use object storage, rather than PVCs, to back the notebook server filesystem.

consideRatio (Member, Author) commented Oct 6, 2021

@mriedem I'd love to learn about the object storage setup. Would you be willing to write something brief on discourse.jupyter.org about that, or point me to some references about backing notebook filesystem data with object storage instead of PVCs?

For your information, regarding the 503 errors, that relates very much to work by @yuvipanda and @minrk here:

mriedem (Contributor) commented Oct 6, 2021

I'd love to learn about the object storage setup. Would you be willing to write something brief on discourse.jupyter.org about that, or point me to some references about backing notebook filesystem data with object storage instead of PVCs?

I'll talk with my team about that. I know the high-level parts of the object storage setup but not all of the low-level details, though those details might not be necessary or useful in a generic high-level post on how we do our setup with object storage. I know we have some other gotchas that we've had to address when running that way, like a custom liveness probe for jupyter-server, some changes in the user image for a custom IPython profile for filesystem performance improvements, and some performance issues with nbconvert.

mriedem (Contributor) commented Oct 6, 2021

At a high level we have a single PVC and all of the singleuser-server pods connect to that at a sub-path which is their user ID, so e.g. pod jupyter-<userid> has a prefix in the COS bucket of <userid>. So the notebook container $HOME at /home/jovyan is really the /<userid> prefix path in the COS bucket.

We (currently) don't purge old user data from that bucket even after we've culled the user due to inactivity from the hub. We've talked about doing that after the user was inactive for something like 1 year, but with object storage being cheap and basically unlimited it's not really a problem for us to just leave the user data there.
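For reference, a hedged sketch of how a shared-PVC, per-user subPath layout like the one described above can be expressed in the chart; the PVC name is hypothetical, and the s3fs/COS mounting behind that PVC is not shown:

```yaml
singleuser:
  storage:
    type: static
    static:
      pvcName: shared-user-storage   # hypothetical pre-created PVC backed by s3fs/COS
      subPath: "{username}"          # each user's $HOME maps to a per-user prefix
```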

mriedem (Contributor) commented Oct 14, 2021

At a high level we have a single PVC and all of the singleuser-server pods connect to that at a sub-path which is their user ID, so e.g. pod jupyter-<userid> has a prefix in the COS bucket of <userid>. So the notebook container $HOME at /home/jovyan is really the /<userid> prefix path in the COS bucket.

I just read through https://zero-to-jupyterhub.readthedocs.io/en/latest/kubernetes/amazon/efs_storage.html#setting-up-efs-storage-on-aws and our setup in IBM Cloud with object storage (s3fs instead of NFS) is very similar.
