
[WIP] possible solution to propagate informative spawn failure messages from spawner to bhub ui #819

Closed

Conversation

@bitnik (Collaborator) commented Apr 3, 2019

First of all, I need help here :) I don't fully understand what is happening (especially on the JupyterHub side), but here is what I understand so far:

  1. The slow_spawn_timeout setting is 0 by default.
  2. This means any kind of failure during spawn is caught by JupyterHub as a TimeoutError, and JupyterHub always raises a 500 error (even if the spawner sends, say, a 409 error): https://github.com/jupyterhub/jupyterhub/blob/e89836c035f79a44cb5ebc1126e53c6f605464c1/jupyterhub/handlers/base.py#L887-L924
  3. BinderHub catches this 500 error from the API request and retries the same API request 4 more times (see the sketch after this list):

        if e.code >= 500:
            self.log.error("Error accessing Hub API (using %s): %s", request_url, e)
            if i == self.retries:
                # last api request failed, raise the exception
                raise
            await gen.sleep(retry_delay)
            # exponential backoff for consecutive failures
            retry_delay *= 2
        else:
            raise

  4. After retrying, the launch fails with

        raise web.HTTPError(500, "Failed to launch image %s" % image)

  5. Then BinderHub retries the whole launch process (3 times in total by default).
  6. In total BinderHub makes 12 API requests, and then the user gets the standard error ("Failed to launch image ...") on the UI. During these 12 requests the hub restarts 2 times, because consecutive_failure_limit is 5 by default (https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/8ed2f8111b5575dc5df29afb114a8ee5906f9a96/jupyterhub/values.yaml#L17)
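
For context, a minimal sketch of the retry loop that the fragment in step 3 lives in (a hedged reconstruction, not the actual BinderHub launcher code; the function name and defaults are illustrative): every Hub API error >= 500 is retried with exponential backoff, so an informative failure that surfaces as a 500 is retried just like a transient one.

    # Hedged sketch, not BinderHub's actual launcher code.
    from tornado import gen
    from tornado.httpclient import AsyncHTTPClient, HTTPError

    async def api_request_with_retries(request_url, retries=4, retry_delay=4):
        client = AsyncHTTPClient()
        for i in range(1, retries + 1):  # at most `retries` attempts in total
            try:
                return await client.fetch(request_url)
            except HTTPError as e:
                if e.code >= 500:
                    if i == retries:
                        # last api request failed, raise the exception
                        raise
                    await gen.sleep(retry_delay)
                    # exponential backoff for consecutive failures
                    retry_delay *= 2
                else:
                    # client errors (4xx) are not retried
                    raise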

The same process happens regardless of the type of error from the spawner. To solve this, in this PR slow_spawn_timeout is set to 10 seconds (the JupyterHub default), so BinderHub gets the actual error from the spawner. I also set consecutiveFailureLimit to 0, so the hub doesn't restart after informative spawner failures. But that is actually not good, because the hub then also ignores real errors that it would need a restart to recover from.
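
For clarity, a sketch (in jupyterhub_config.py style) of the two settings mentioned above; how they are exposed here is an assumption (slow_spawn_timeout via tornado_settings, consecutive_failure_limit on the Spawner) and the exact keys may differ by JupyterHub / zero-to-jupyterhub version:

    # Hedged sketch of the two settings discussed above; the exact config
    # keys are an assumption and may differ by JupyterHub / z2jh version.
    c = get_config()  # noqa: provided by JupyterHub when loading the config file

    # Give the spawn API up to 10s (the JupyterHub default) to report a
    # failure directly instead of masking it as a TimeoutError / 500.
    c.JupyterHub.tornado_settings = {"slow_spawn_timeout": 10}

    # 0 disables the "shut down the Hub after N consecutive spawn failures"
    # safeguard (z2jh sets 5 by default via hub.consecutiveFailureLimit).
    c.Spawner.consecutive_failure_limit = 0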

I updated the build and launch code too, so that it propagates 409 error messages from the JupyterHub API to the UI and doesn't retry the launch. As I wrote before, I need help here; this part may be wrong or incomplete, so I don't mind if we change it completely.
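
Roughly, the idea is something like this (a hedged sketch, not the actual diff in this PR; `api_request`, the username, and the URL shape are stand-ins for BinderHub's Hub API helper): a 409 from the Hub API is treated as terminal and its message is surfaced, anything else keeps the old "Failed to launch image" behaviour.

    # Hedged sketch of the launch-side handling, not the actual PR diff.
    import json

    from tornado import web
    from tornado.httpclient import HTTPError

    async def launch(api_request, image, username, server_name=""):
        try:
            # api_request is a stand-in for BinderHub's Hub API helper
            return await api_request(
                "users/%s/servers/%s" % (username, server_name), method="POST"
            )
        except HTTPError as e:
            if e.code == 409 and e.response is not None:
                # informative client error: surface the spawner's own message
                body = e.response.body or b""
                try:
                    message = json.loads(body.decode())["message"]
                except Exception:
                    message = body.decode() or "Launch failed with 409"
                raise web.HTTPError(409, message)
            # everything else keeps the old behaviour
            raise web.HTTPError(500, "Failed to launch image %s" % image)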

related to #712 and #805

@betatim (Member) commented Apr 3, 2019

For a historical perspective it would be good to hear from @yuvipanda

@betatim (Member) commented Apr 3, 2019

I think the default path when spawning (this is pretty tricky code and I've gotten it wrong before :-/) is to go here: https://github.com/jupyterhub/jupyterhub/blob/e89836c035f79a44cb5ebc1126e53c6f605464c1/jupyterhub/handlers/base.py#L894-L902. The slow_spawn_timeout is (if I remember correctly) more about giving the spawner "a second or two" to become fully ready, so users can go straight to their notebook server instead of first being redirected to a "waiting for your spawner" page and then being redirected to their notebook server. So maybe a better name would be something like extra_patience_for_fast_spawners.
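
To illustrate those semantics, a hedged sketch (not JupyterHub's actual handler code, just the shape of the behaviour described above):

    # Hedged sketch of the semantics described above, not JupyterHub's code:
    # wait up to slow_spawn_timeout for the spawn to finish, then either send
    # the user straight to their server or fall back to the pending page.
    import asyncio

    async def handle_spawn(spawn_future, slow_spawn_timeout):
        try:
            await asyncio.wait_for(asyncio.shield(spawn_future), slow_spawn_timeout)
        except asyncio.TimeoutError:
            # still spawning: redirect to the "waiting for your spawner" page
            return "spawn-pending"
        # finished within the timeout: spawn errors surface right here,
        # otherwise the user goes straight to their notebook server
        return "ready"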

A default for the consecutive_failure_limit tuned more to lower traffic hubs than mybinder.org is a good idea. If we change it we need to remember to update the mybinder.org config so that when this is deployed the value doesn't change.

While I read and think: when do spawns fail for reasons that are "interesting" to the user? Also what does a 409 represent?

Should this be > 500 instead of >= 500? That is what the comment says and I wouldn't really expect a status 500 to be something that would get fixed by retrying, but then maybe sometimes it does?

@bitnik (Collaborator, Author) commented Apr 4, 2019

> A default for the consecutive_failure_limit tuned more to lower traffic hubs than mybinder.org is a good idea. If we change it we need to remember to update the mybinder.org config so that when this is deployed the value doesn't change.

Now I think setting consecutive_failure_limit to 0 was a bad idea; I should revert it. And it should be documented, so each deployment can configure it for its own needs. What do you think?

> While I read and think: when do spawns fail for reasons that are "interesting" to the user? Also what does a 409 represent?

For example (a future case): when a user requests more of a custom resource than the limit allows, the spawner can fail and send an informative error message like "Requested amount of resource is not allowed, limit is ...". I think once #712 (#712 (comment)) is implemented, different BinderHub deployments will need this for different reasons; e.g. I need it for #794 to limit the number of projects per user and inform users when they try to have more projects than allowed. Maybe 409 was the wrong choice, or maybe all 4xx client errors should be handled that way. Now I see that, for example, JupyterHub sends a 400 error when a user reaches the named-server limit (https://github.com/jupyterhub/jupyterhub/blob/e89836c035f79a44cb5ebc1126e53c6f605464c1/jupyterhub/apihandlers/users.py#L380-L392).
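
As a hypothetical illustration of the spawner side (the class, limit, and option names are made up; whether the Hub API relays the 409 unchanged is exactly the open question here):

    # Hypothetical spawner-side example; whether JupyterHub relays this 409
    # to BinderHub unchanged is exactly what this PR is trying to sort out.
    from tornado import web
    from jupyterhub.spawner import LocalProcessSpawner

    class LimitedSpawner(LocalProcessSpawner):
        # made-up per-user limit for some custom resource
        resource_limit = 4

        async def start(self):
            requested = int(self.user_options.get("resource", 1))
            if requested > self.resource_limit:
                raise web.HTTPError(
                    409,
                    "Requested amount of resource is not allowed, limit is %i"
                    % self.resource_limit,
                )
            return await super().start()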

@minrk (Member) commented Apr 25, 2019

consecutive_failure_limit was added specifically to solve issues faced on mybinder.org (a self-diagnosis of a truly unhealthy Hub), so I agree that setting it to 0 would not be a good plan.

I think we also do indeed want slow_spawn_timeout to be 0 for BinderHub (and I think it will probably go away as a Hub option in the future, always using the 0 behavior). This ensures prompt replies. Setting it to anything else gives quite nondeterministic behavior, because failures could happen while we're waiting (in which case we get the error) or not (in which case we don't).

What I think we want to do is use the JupyterHub progress API (which is in part inspired by BinderHub), which should let us hook up an event stream and relay those messages up through the BinderHub pipeline. That ought to get us the errors that we want.
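
Roughly like this (a hedged sketch, not what BinderHub eventually implemented in #950; `hub_api_url`, `token`, and `emit()` are assumed inputs, and the URL/payload shape follows the documented JupyterHub progress API for a user's default server):

    # Hedged sketch: follow the JupyterHub progress API (an SSE endpoint)
    # and relay each event, including failure messages, to the caller.
    import json

    from tornado.httpclient import AsyncHTTPClient, HTTPRequest

    async def relay_progress(hub_api_url, token, user, emit):
        """Stream spawn progress events for `user`'s default server."""

        def on_chunk(chunk):
            # simplification: assumes each chunk holds complete SSE lines;
            # a robust client would buffer partial lines across chunks
            for line in chunk.decode("utf8").splitlines():
                if line.startswith("data:"):
                    event = json.loads(line[len("data:"):].strip())
                    # a failed spawn arrives as {"failed": true, "message": ...}
                    emit(event)

        req = HTTPRequest(
            f"{hub_api_url}/users/{user}/server/progress",
            headers={"Authorization": f"token {token}"},
            request_timeout=0,  # 0 disables tornado's request timeout
            streaming_callback=on_chunk,
        )
        await AsyncHTTPClient().fetch(req)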

@bitnik (Collaborator, Author) commented Sep 12, 2019

Closing this PR in favor of #950

@bitnik closed this on Sep 12, 2019