Add missing default values for proxy pod's probes #2423
Conversation
The logic makes sense to me since the default proxy (chp) is also a singleton (like the hub), so we want to be a bit more conservative with the liveness probe so it's not killing the proxy all the time, especially under load, like a lot of users signing on at once at the beginning of some event.

I looked over our proxy liveness probe failure metrics for the last month and there were 3 instances where we would have tripped the liveness probe. Those were also during a high-load / user event and there were users complaining at the time about issues with accessing their notebooks (they were getting 5xx errors). As I looked more into that issue it seemed the chp proxy was the problem: response times steadily increased and memory increased. We're currently using chp version 4.4.0. In our case the liveness probe restarting the proxy might have helped, but it's hard to tell since we're not in the middle of that issue anymore.

With this change the proxy liveness probe will only restart the container after 90 seconds of consecutive failures. Since we hit a max of 6 failures in the metrics above, I'm not sure if we would have hit the failure threshold of 30 set here, but I think that's a decent default since, like I said, we don't want the opposite problem of restarting the proxy too often since it's a singleton. So +1 from me.
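To make the "90 seconds of consecutive failures" arithmetic concrete, here is a minimal sketch of a Kubernetes liveness probe block. The numbers and the health endpoint path are assumptions chosen so that `failureThreshold` × `periodSeconds` lands at roughly 90 seconds, as described above; they are not the chart's actual defaults.

```yaml
# Illustrative only: 30 consecutive failures, checked every 3 seconds,
# add up to ~90s before the kubelet restarts the container.
livenessProbe:
  httpGet:
    path: /_chp_healthz   # assumed chp health endpoint, adjust for your proxy
    port: 8000
  periodSeconds: 3        # how often the kubelet probes
  timeoutSeconds: 1       # how long a single probe may take before counting as failed
  failureThreshold: 30    # 30 * 3s = ~90s of consecutive failures before a restart
```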
I think this is a sane default.
For more context, users (and myself when I tried to recreate it) were getting 522 errors from Cloudflare, and that was likely a result of the slow responses from the proxy in our case.
Merge pull request #2423 from consideRatio/pr/fix-proxy-probes-default-values
@mriedem wow, thank you for such a thought-out and researched response! What would make Cloudflare give a 522 error at all? Is it involved in proxying all incoming traffic? I guess it could be doing that as a protection from DDoS attacks or similar.
Cloudflare is there for routing public ingress traffic to IBM Cloud and our JupyterHub deployment running on the Kubernetes service there. There were a lot of 503s in the chp logs as well, but those are more or less normal, I've found; I think those also happen if the browser tab is left open but the underlying notebook kernel dies. My guess was that the load on the proxy (and whatever was going on with runaway memory there, maybe a giant cache that just needed to be flushed with a restart) was causing slow responses when getting notebook content, and that probably triggered the Cloudflare timeout. We might have also been having Cloud Object Storage issues that weekend, which matters for us since we use object storage to back the notebook server filesystem rather than PVCs.
@mriedem I'd love to learn about the object storage setup. Would you be willing to write something brief on discourse.jupyter.org about that, or help me point to some references about backing up filesystem data on PVCs via object storage? For your information, regarding the 503 errors, that relates very much to work by @yuvipanda and @minrk.
I'll talk with my team about that, since I know the high-level parts of the object storage setup but not all of the low-level details, though those low-level details might not be necessary or useful in a generic high-level post on how we do our setup with object storage. I know we have some other gotchas that we've had to address when running that way, like a custom liveness probe for jupyter-server, some changes in the user image for a custom IPython profile for filesystem performance improvements, and some performance issues with nbconvert.
At a high level we have a single PVC, and all of the user data goes into a single object storage bucket. We (currently) don't purge old user data from that bucket even after we've culled the user due to inactivity from the hub. We've talked about doing that after the user was inactive for something like 1 year, but with object storage being cheap and basically unlimited, it's not really a problem for us to just leave the user data there.
I just read through https://zero-to-jupyterhub.readthedocs.io/en/latest/kubernetes/amazon/efs_storage.html#setting-up-efs-storage-on-aws and our setup in IBM Cloud with object storage (s3fs instead of NFS) is very similar.
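For reference, here is a minimal sketch of what the chart-side configuration for a shared, pre-created PVC can look like, assuming the `singleuser.storage.type: static` option that the EFS guide uses. The PVC name is a placeholder, and the s3fs-backed PersistentVolume behind that claim is site-specific and provisioned outside the chart.

```yaml
# Sketch only: all users share one pre-created PVC and are separated by subPath.
singleuser:
  storage:
    type: static
    static:
      pvcName: shared-user-storage      # hypothetical claim name, created outside the chart
      subPath: "home/{username}"        # per-user directory on the shared volume
```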
A follow-up on #2421, thanks to @mriedem's input in #2421 (comment).

The k8s native default value of a probe's `failureThreshold` is 3, and the default value of `timeoutSeconds` is 1. I opt to set the helm chart defaults to be a bit more tolerant though, hoping for a bit longer that the pod becomes responsive, like we do for the hub pod, now also for the proxy pod. @mriedem what do you think?