
[WIP] possible solution to propagate informative spawn failure messages from spawner to bhub ui #819

Closed

Conversation

@bitnik (Collaborator) commented Apr 3, 2019

First of all, I need help here :) I don't fully understand what is happening (especially on the JupyterHub side), but here is what I understand so far:

  1. The slow_spawn_timeout setting is 0 by default.
  2. This means any kind of failure during spawn is caught by JupyterHub as a TimeoutError, and JupyterHub always raises a 500 error (even if the spawner sends, say, a 409 error): https://github.com/jupyterhub/jupyterhub/blob/e89836c035f79a44cb5ebc1126e53c6f605464c1/jupyterhub/handlers/base.py#L887-L924
  3. BinderHub catches this 500 error from the API request and retries the same API request 4 more times (see the sketch after this list):

        if e.code >= 500:
            self.log.error("Error accessing Hub API (using %s): %s", request_url, e)
            if i == self.retries:
                # last api request failed, raise the exception
                raise
            await gen.sleep(retry_delay)
            # exponential backoff for consecutive failures
            retry_delay *= 2
        else:
            raise

  4. After retrying, the launch fails with

        raise web.HTTPError(500, "Failed to launch image %s" % image)

  5. Then BinderHub retries the whole launch process (3 times in total by default).
  6. In total BinderHub makes 12 API requests, and then the user gets the standard error ("Failed to launch image ...") on the UI. During these 12 requests the hub restarts 2 times, because consecutive_failure_limit is 5 by default (https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/8ed2f8111b5575dc5df29afb114a8ee5906f9a96/jupyterhub/values.yaml#L17)
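
For context, a minimal sketch of the retry loop that the fragment in step 3 lives in (a hedged reconstruction, not the actual BinderHub launcher code; the function name and defaults are illustrative): every Hub API error >= 500 is retried with exponential backoff, so an informative failure that surfaces as a 500 is retried just like a transient one.

    # Hedged sketch, not BinderHub's actual launcher code.
    from tornado import gen
    from tornado.httpclient import AsyncHTTPClient, HTTPError

    async def api_request_with_retries(request_url, retries=4, retry_delay=4):
        client = AsyncHTTPClient()
        for i in range(1, retries + 1):  # at most `retries` attempts in total
            try:
                return await client.fetch(request_url)
            except HTTPError as e:
                if e.code >= 500:
                    if i == retries:
                        # last api request failed, raise the exception
                        raise
                    await gen.sleep(retry_delay)
                    # exponential backoff for consecutive failures
                    retry_delay *= 2
                else:
                    # client errors (4xx) are not retried
                    raise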

The same process happens regardless of the type of error from the spawner. To solve this, in this PR slow_spawn_timeout is set to 10 seconds (the JupyterHub default), so BinderHub gets the actual error from the spawner. I also set consecutiveFailureLimit to 0, so the hub doesn't restart after informative spawner failures. But that is actually not good, because the hub then also ignores real errors that it would need a restart to recover from.
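
For clarity, a sketch (in jupyterhub_config.py style) of the two settings mentioned above; how they are exposed here is an assumption (slow_spawn_timeout via tornado_settings, consecutive_failure_limit on the Spawner) and the exact keys may differ by JupyterHub / zero-to-jupyterhub version:

    # Hedged sketch of the two settings discussed above; the exact config
    # keys are an assumption and may differ by JupyterHub / z2jh version.
    c = get_config()  # noqa: provided by JupyterHub when loading the config file

    # Give the spawn API up to 10s (the JupyterHub default) to report a
    # failure directly instead of masking it as a TimeoutError / 500.
    c.JupyterHub.tornado_settings = {"slow_spawn_timeout": 10}

    # 0 disables the "shut down the Hub after N consecutive spawn failures"
    # safeguard (z2jh sets 5 by default via hub.consecutiveFailureLimit).
    c.Spawner.consecutive_failure_limit = 0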

I updated the build and launch code too, so that it propagates 409 error messages from the JupyterHub API to the UI and doesn't retry the launch. As I wrote before, I need help here; this part may be wrong or incomplete, so I don't mind if we change it completely.
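
Roughly, the idea is something like this (a hedged sketch, not the actual diff in this PR; `api_request`, the username, and the URL shape are stand-ins for BinderHub's Hub API helper): a 409 from the Hub API is treated as terminal and its message is surfaced, anything else keeps the old "Failed to launch image" behaviour.

    # Hedged sketch of the launch-side handling, not the actual PR diff.
    import json

    from tornado import web
    from tornado.httpclient import HTTPError

    async def launch(api_request, image, username, server_name=""):
        try:
            # api_request is a stand-in for BinderHub's Hub API helper
            return await api_request(
                "users/%s/servers/%s" % (username, server_name), method="POST"
            )
        except HTTPError as e:
            if e.code == 409 and e.response is not None:
                # informative client error: surface the spawner's own message
                body = e.response.body or b""
                try:
                    message = json.loads(body.decode())["message"]
                except Exception:
                    message = body.decode() or "Launch failed with 409"
                raise web.HTTPError(409, message)
            # everything else keeps the old behaviour
            raise web.HTTPError(500, "Failed to launch image %s" % image)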

related to #712 and #805

@betatim (Member) commented Apr 3, 2019

For a historical perspective it would be good to hear from @yuvipanda

@betatim (Member) commented Apr 3, 2019

I think the default path when spawning (this is pretty tricky code and I've gotten it wrong before :-/) is to go here: https://github.com/jupyterhub/jupyterhub/blob/e89836c035f79a44cb5ebc1126e53c6f605464c1/jupyterhub/handlers/base.py#L894-L902. The slow_spawn_timeout is (if I remember correctly) more about giving the spawner "a second or two" to become fully ready, so users can go straight to their notebook server instead of first being redirected to a "waiting for your spawner" page and then being redirected to their notebook server. So maybe a better name would be something like extra_patience_for_fast_spawners.
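
To illustrate those semantics, a hedged sketch (not JupyterHub's actual handler code, just the shape of the behaviour described above):

    # Hedged sketch of the semantics described above, not JupyterHub's code:
    # wait up to slow_spawn_timeout for the spawn to finish, then either send
    # the user straight to their server or fall back to the pending page.
    import asyncio

    async def handle_spawn(spawn_future, slow_spawn_timeout):
        try:
            await asyncio.wait_for(asyncio.shield(spawn_future), slow_spawn_timeout)
        except asyncio.TimeoutError:
            # still spawning: redirect to the "waiting for your spawner" page
            return "spawn-pending"
        # finished within the timeout: spawn errors surface right here,
        # otherwise the user goes straight to their notebook server
        return "ready"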

A default for the consecutive_failure_limit tuned more to lower traffic hubs than mybinder.org is a good idea. If we change it we need to remember to update the mybinder.org config so that when this is deployed the value doesn't change.

While I read and think: when do spawns fail for reasons that are "interesting" to the user? Also what does a 409 represent?

Should this be > 500 instead of >= 500? That is what the comment says and I wouldn't really expect a status 500 to be something that would get fixed by retrying, but then maybe sometimes it does?

@bitnik (Collaborator, Author) commented Apr 4, 2019

> A default for the consecutive_failure_limit tuned more to lower traffic hubs than mybinder.org is a good idea. If we change it we need to remember to update the mybinder.org config so that when this is deployed the value doesn't change.

Now I think setting consecutive_failure_limit to 0 was a bad idea; I should revert it. And it should be documented, so each deployment can configure it for its own needs. What do you think?

> While I read and think: when do spawns fail for reasons that are "interesting" to the user? Also what does a 409 represent?

For example (a future case): when a user requests more of a custom resource than the limit allows, the spawner can fail and send an informative error message like "Requested amount of resource is not allowed, limit is ...". I think once #712 (#712 (comment)) is implemented, different BinderHub deployments will need this for different reasons; e.g. I need it for #794 to limit the number of projects per user and inform users when they try to have more projects than allowed. Maybe 409 was the wrong choice, or maybe all 4xx client errors should be handled that way. Now I see that, for example, JupyterHub sends a 400 error when a user reaches the named-server limit (https://github.com/jupyterhub/jupyterhub/blob/e89836c035f79a44cb5ebc1126e53c6f605464c1/jupyterhub/apihandlers/users.py#L380-L392).
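
As a hypothetical illustration of the spawner side (the class, limit, and option names are made up; whether the Hub API relays the 409 unchanged is exactly the open question here):

    # Hypothetical spawner-side example; whether JupyterHub relays this 409
    # to BinderHub unchanged is exactly what this PR is trying to sort out.
    from tornado import web
    from jupyterhub.spawner import LocalProcessSpawner

    class LimitedSpawner(LocalProcessSpawner):
        # made-up per-user limit for some custom resource
        resource_limit = 4

        async def start(self):
            requested = int(self.user_options.get("resource", 1))
            if requested > self.resource_limit:
                raise web.HTTPError(
                    409,
                    "Requested amount of resource is not allowed, limit is %i"
                    % self.resource_limit,
                )
            return await super().start()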

@minrk (Member) commented Apr 25, 2019

consecutive_failure_limit was added specifically to solve issues faced on mybinder.org (a self-diagnosis of a truly unhealthy Hub), so I agree that setting it to 0 would not be a good plan.

I think we also do indeed want slow_spawn_timeout to be 0 for BinderHub (and I think it will probably go away as a Hub option in the future, always using the 0 behavior). This ensures prompt replies. Setting it to anything else gives quite nondeterministic behavior, because failures could happen while we're waiting (in which case we get the error) or not (in which case we don't).

What I think we want to do is use the JupyterHub progress API (which is in part inspired by BinderHub), which should let us hook up an event stream and relay those messages up through the BinderHub pipeline. That ought to get us the errors that we want.
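
Roughly like this (a hedged sketch, not what BinderHub eventually implemented in #950; `hub_api_url`, `token`, and `emit()` are assumed inputs, and the URL/payload shape follows the documented JupyterHub progress API for a user's default server):

    # Hedged sketch: follow the JupyterHub progress API (an SSE endpoint)
    # and relay each event, including failure messages, to the caller.
    import json

    from tornado.httpclient import AsyncHTTPClient, HTTPRequest

    async def relay_progress(hub_api_url, token, user, emit):
        """Stream spawn progress events for `user`'s default server."""

        def on_chunk(chunk):
            # simplification: assumes each chunk holds complete SSE lines;
            # a robust client would buffer partial lines across chunks
            for line in chunk.decode("utf8").splitlines():
                if line.startswith("data:"):
                    event = json.loads(line[len("data:"):].strip())
                    # a failed spawn arrives as {"failed": true, "message": ...}
                    emit(event)

        req = HTTPRequest(
            f"{hub_api_url}/users/{user}/server/progress",
            headers={"Authorization": f"token {token}"},
            request_timeout=0,  # 0 disables tornado's request timeout
            streaming_callback=on_chunk,
        )
        await AsyncHTTPClient().fetch(req)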

@bitnik (Collaborator, Author) commented Sep 12, 2019

Closing this PR in favor of #950

@bitnik closed this on Sep 12, 2019