
Add missing default values for proxy pod's probes #2423

Conversation

consideRatio (Member) commented Oct 4, 2021

A follow-up on #2421, thanks to @mriedem's input in #2421 (comment).

The Kubernetes native default for a probe's failureThreshold is 3, and the default for timeoutSeconds is 1. I opt to make the Helm chart defaults a bit more tolerant though, waiting a bit longer for the pod to become responsive, as we already do for the hub pod - now also for the proxy pod.
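For context, probe settings in the chart are configured under proxy.chp in values.yaml. The sketch below shows the general shape only - the numbers are illustrative placeholders, not necessarily the exact defaults introduced by this PR:

```yaml
# Illustrative sketch only - numbers are placeholders, not the merged defaults.
proxy:
  chp:
    livenessProbe:
      initialDelaySeconds: 60   # give chp time to start before the first probe
      periodSeconds: 10         # probe interval
      failureThreshold: 30      # consecutive failures tolerated before a restart
      timeoutSeconds: 3         # more tolerant than the k8s default of 1 second
    readinessProbe:
      initialDelaySeconds: 0
      periodSeconds: 2
      failureThreshold: 1000    # readiness failures only unroute traffic, never restart
      timeoutSeconds: 1
```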


@mriedem what do you think?

mriedem (Contributor) commented Oct 6, 2021

I opt to make the Helm chart defaults a bit more tolerant though, waiting a bit longer for the pod to become responsive, as we already do for the hub pod - now also for the proxy pod.

The logic makes sense to me since the default proxy (chp) is also a singleton (like the hub), so we want to be a bit more conservative with the liveness probe so it's not killing the proxy all the time, especially under load - like a lot of users signing on at once at the beginning of some event.

I looked over our proxy liveness probe failure metrics for the last month and there were 3 instances where we would have tripped the liveness probe:

[screenshot: proxy liveness probe failure metrics for the last month]

Those were also during a high-load / user event and there were users complaining at the time about issues with accessing their notebooks (they were getting 5xx errors). As I looked more into that issue it seemed the chp proxy was the problem - response times steadily increased and memory usage grew:

[screenshots: proxy response time and memory usage graphs during the incident]

We're currently using chp version 4.4.0.

In our case the liveness probe restarting the proxy might have helped but it's hard to tell since we're not in the middle of that issue anymore.

With this change the proxy liveness probe will only restart the container after 90 seconds of consecutive failures. Since we hit a max of 6 failures in the graph above, I'm not sure we would have hit the failure threshold of 30 set here, but I think that's a decent default since, like I said, we don't want the opposite problem of restarting the proxy too often given that it's a singleton. So +1 from me.
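As a rough rule of thumb (a general Kubernetes relationship, not a statement of this chart's exact merged values), the time a liveness probe tolerates before the kubelet restarts the container is approximately failureThreshold × periodSeconds:

```yaml
# Hypothetical numbers to illustrate the arithmetic only:
livenessProbe:
  periodSeconds: 3        # probes are spaced 3 seconds apart
  failureThreshold: 30    # 30 consecutive failures are tolerated
# -> roughly 30 x 3 = 90 seconds of consecutive failures before a restart;
#    with periodSeconds: 10 the same threshold stretches to ~300 seconds.
```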

mriedem (Contributor) left a review comment:


I think this is a sane default.

consideRatio merged commit b6688f5 into jupyterhub:main on Oct 6, 2021
mriedem (Contributor) commented Oct 6, 2021

Those were also during a high-load / user event and there were users complaining at the time about issues with accessing their notebooks (they were getting 5xx errors).

For more context, users (and I, when I tried to recreate the issue) were getting 522 errors from Cloudflare, and that was likely a result of the slow responses from the proxy in our case.

consideRatio pushed a commit to jupyterhub/helm-chart that referenced this pull request on Oct 6, 2021:
Merge pull request #2423 from consideRatio/pr/fix-proxy-probes-default-values (jupyterhub/zero-to-jupyterhub-k8s#2423)
consideRatio (Member, Author) commented:
@mriedem wow, thank you for such a thought-out and well-researched response!!!

What would make Cloudflare give a 522 error at all - is it involved in proxying all incoming traffic? I guess it could be doing that as protection from DDoS attacks or similar.

mriedem (Contributor) commented Oct 6, 2021

What would make Cloudflare give a 522 error at all - is it involved in proxying all incoming traffic?

Cloudflare is there for routing public ingress traffic to IBM Cloud and our JupyterHub deployment running on the Kubernetes service there. There were a lot of 503s in the chp logs as well, but those are more or less normal, I've found - I think they also happen if the browser tab is left open but the underlying notebook kernel dies.

My guess was that the load on the proxy (and whatever was going on with the runaway memory there - maybe a giant cache that just needed to be flushed with a restart) was causing slow responses when fetching notebook content, and that probably triggered the Cloudflare timeout. We might have also been having Cloud Object Storage issues that weekend, which matters for us since we use object storage, rather than PVCs, to back the notebook server filesystem.

consideRatio (Member, Author) commented Oct 6, 2021

@mriedem I'd love to learn about the object storage setup. Would you be willing to write something brief on discourse.jupyter.org about that, or point me to some references about backing notebook filesystem data with object storage instead of PVCs?

For your information, regarding the 503 errors, that relates very much to work by @yuvipanda and @minrk here:

mriedem (Contributor) commented Oct 6, 2021

I'd love to learn about the object storage setup. Would you be willing to write something brief on discourse.jupyter.org about that, or point me to some references about backing notebook filesystem data with object storage instead of PVCs?

I'll talk with my team about that. I know the high-level parts of the object storage setup but not all of the low-level details, though those details might not be necessary or useful in a generic high-level post on how we do our setup with object storage. I know we have some other gotchas that we've had to address when running that way, like a custom liveness probe for jupyter-server, some changes in the user image for a custom IPython profile for filesystem performance improvements, and some performance issues with nbconvert.

mriedem (Contributor) commented Oct 6, 2021

At a high level we have a single PVC and all of the singleuser-server pods connect to that at a sub-path which is their user ID, so e.g. pod jupyter-<userid> has a prefix in the COS bucket of <userid>. So the notebook container $HOME at /home/jovyan is really the /<userid> prefix path in the COS bucket.

We (currently) don't purge old user data from that bucket even after we've culled the user due to inactivity from the hub. We've talked about doing that after the user was inactive for something like 1 year, but with object storage being cheap and basically unlimited it's not really a problem for us to just leave the user data there.
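For reference, a hedged sketch of how a shared-PVC, per-user subPath layout like the one described above can be expressed in the chart; the PVC name is hypothetical, and the s3fs/COS mounting behind that PVC is not shown:

```yaml
singleuser:
  storage:
    type: static
    static:
      pvcName: shared-user-storage   # hypothetical pre-created PVC backed by s3fs/COS
      subPath: "{username}"          # each user's $HOME maps to a per-user prefix
```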

mriedem (Contributor) commented Oct 14, 2021

At a high level we have a single PVC and all of the singleuser-server pods connect to that at a sub-path which is their user ID, so e.g. pod jupyter-<userid> has a prefix in the COS bucket of <userid>. So the notebook container $HOME at /home/jovyan is really the /<userid> prefix path in the COS bucket.

I just read through https://zero-to-jupyterhub.readthedocs.io/en/latest/kubernetes/amazon/efs_storage.html#setting-up-efs-storage-on-aws and our setup in IBM Cloud with object storage (s3fs instead of NFS) is very similar.
