-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cull idle kernels #2215
Cull idle kernels #2215
Conversation
@@ -52,6 +58,26 @@ def _update_root_dir(self, proposal): | |||
raise TraitError("kernel root dir %r is not a directory" % value) | |||
return value | |||
|
|||
cull_kernels_after_minutes_env = 'CULL_KERNELS_AFTER_MINUTES' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Providing a way to set traitlets with environment variables is done in the kernel gateway code base, but it's not common in the notebook codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I left a few comments inline. Let me know if you'd like any help finishing it up, or clarifying things.
You can push new commits to this same branch and it will update the PR.
@@ -52,6 +58,26 @@ def _update_root_dir(self, proposal): | |||
raise TraitError("kernel root dir %r is not a directory" % value) | |||
return value | |||
|
|||
cull_kernels_after_minutes_env = 'CULL_KERNELS_AFTER_MINUTES' | |||
cull_kernels_after_minutes_default = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's simplify the trait declarations for now. @parente is right that the environment variable declarations aren't generally used here, so we can have a single assignment:
cull_idle_timeout = Integer(0, config=True,
help="...
)
without the @default
generator or other variables defined.
Plus, I think it might be clearer to use seconds everywhere, rather than a _minutes
timeout here and a _seconds
timeout below. How about we call them:
cull_idle_timeout: timeout (in seconds) after which a kernel is considered idle and ready to be culled.
cull_interval: interval (in seconds) on which to check for idle kernels to cleanup...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you - this has been addressed in the subsequent commit.
for kId, kernel in self._kernels.items(): | ||
self.cull_kernel(kId, kernel) | ||
|
||
def cull_kernel(self, kId, kernel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cull_kernel can just take a kernel_id
argument, and the above can be for kernel_id in list(self._kernels)
. The list
will make a copy of the keys, which may be needed, as shutting down a kernel will remove it from the _kernels
dictionary and you can't iterate over a dictionary while changing it, for example:
for kernel_id in list(self._kernels):
self.cull_kernel_if_idle(kernel_id)
def cull_kernel_if_idle(self, kernel_id):
kernel = self._kernels[kernel_id]
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - thanks.
if self._culler_callback is None: | ||
loop = IOLoop.current() | ||
if self.kernel_culling_interval_seconds <= 0: #handle case where user set invalid value | ||
self.log.warn("Invalid value for 'kernel_culling_interval_seconds' detected (%s) - using default value (%s).", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.warn
is deprecated in Python in favor of log.warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
if activity is not None: | ||
dtNow = utcnow() | ||
#dtActivity = datetime.strptime(activity,'%Y-%m-%dT%H:%M:%S.%f') | ||
dtIdle = dtNow - activity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Python, we try to use _
to separate words, such as kernel_id
, idle_time
instead of camelCase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had noticed the convention previously but am finding this to be a very difficult habit to break - thanks for pointing it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 I think many people will be happy to see this ! It deserved a good note in what's new.
A couple of suggestion, but it's look ok as is for 5.1
if self.cull_interval <= 0: #handle case where user set invalid value | ||
self.log.warning("Invalid value for 'cull_interval' detected (%s) - using default value (%s).", | ||
self.cull_interval, self.cull_interval_default) | ||
self.cull_interval = self.cull_interval_default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also warn for cull_interval < 60s ? it's technically not wrong, but it's likely that user might have thoughs it was minutes. Even anything below 300s seem it can be a mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree. I thought implementing a minimum timeout would be good (now that I've tested it :-)). 300 strikes me as good minimum. If folks are making changes in these areas and need shorter timeouts for testing, they can temporarily adjust the constant.
Positive values less than the minimum would result in a warning and auto-adjustment to the minimum (similar to what happens if the interval is <= 0 and culling is enabled).
So, to your comment above, the description would be...
"Timeout (in seconds) after which a kernel is considered idle and ready to be culled. Values of 0 or lower disable culling. The minimum timeout is 300 seconds. Positive values less than the minimum will be adjusted to the minimum."
(or is that too verbose?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Carreau a warning is a good idea.
@kevin-bates that sounds perfect. You might add (5 minutes)
next to 300 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
@@ -52,6 +58,15 @@ def _update_root_dir(self, proposal): | |||
raise TraitError("kernel root dir %r is not a directory" % value) | |||
return value | |||
|
|||
cull_idle_timeout = Integer(0, config=True, | |||
help="""Timeout (in seconds) after which a kernel is considered idle and ready to be culled.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpicking, I think a "Values of 0 or lower will disable culling" could be nice. But we can improve later in a subsequent PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
def cull_kernels(self): | ||
self.log.debug("Polling every %s seconds for kernels idle > %s seconds...", | ||
self.cull_interval, self.cull_idle_timeout) | ||
for kernel_id in list(self._kernels): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing the list()
here is to reify the things you want to try to cull and make sure you don't modify the list while culling. Is that right ? If so it seem like something easily broken.
What do @minrk and @parente think about building the list of kernel to shutdown and shut them al down at once ?
That should make the flow a little bit easier to follow (and also to change the predicate if necessary).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way is fine. I don't think getting a list of dictionary keys is easily broken.
Would this be clearer?
# identify the idle kernels to be culled
to_cull = [ kid for kid in self._kernels if self.should_cull(kid) ]
# cull them
for kernel_id in to_cull:
self.shutdown_kernel(kernel_id)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do @minrk and @parente think about building the list of kernel to shutdown and shut them al down at once ?
Sounds good to me as long as there's appropriate checks in place so that an problem getting the status of any one kernel (if that's possible) doesn't prevent cleanup of others that should be culled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm not really following the benefit of obtaining the list of cullable kernels, then walking that list to perform the cull when we could just as easily encounter an issue determining the cullable kernels. So, to address the point raised by @parente, what about something like the following? (Is that the right way to handle an unknown set of exceptions in python?)
for kernel_id in list(self._kernels):
try:
self.cull_kernel_if_idle(kernel_id)
except:
self.log.error("The following error was encountered while checking the idle duration of kernel %s: %s",
kernel_id, sys.exc_info()[0])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works, too. There are plenty of valid ways to do it. A few notes:
- a comment about copying the kernel id list so we don't modify the dict during iteration might help clarity
- you can use
log.exception
to automatically log a traceback, so you don't need to useexc_info
. - always use at least
except Exception:
, becauseexcept:
catches things like KeyboardInterrupts that shouldn't be caught.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @minrk - very helpful. Here's what I've got...
"""Create a separate list of kernels to avoid conflicting updates while iterating"""
for kernel_id in list(self._kernels):
try:
self.cull_kernel_if_idle(kernel_id)
except Exception as e:
self.log.exception("The following exception was encountered while checking the idle duration of kernel %s: %s",
kernel_id, e)
@kevin-bates excellent, thanks! We're in the process of finishing up the 5.0 release from the master branch, but this can be merged once that release has been pushed out. |
5.0 is out, long live 5.1! |
After finally getting a test bed setup for this change, I realize that it does not take into account whether the kernel is busy or not, or whether there are connections to the kernel or not. I'm going to submit a follow-on PR that adds settings for whether these attributes should affect the culling decision or not. |
when will 5.1 go release? |
@BigQuant - Sounds like its very soon per the latest Jupyter Weekly Summary (2017, week 29): https://groups.google.com/forum/#!topic/jupyter/NgR4XpUf66E
|
Yup, we're hoping to put the release candidate out any day now. |
This is exactly the feature I wanted! |
(Updated to include review recommendations below...)
This change introduces the ability to cull kernels that have been idle for a specified amount of time. Off by default, it is enabled by setting
--MappingKernelManager.cull_idle_timeout
to a positive value representing the number of seconds a kernel must remain inactive to be culled. This feature leverages activity tracking'slast_activity
value to determine when the last activity for a given kernel occurred. Typical values forcull_idle_timeout
will be on the order of 43200 (12 hours) or 86400 (24 hours) seconds as described in this JKG issue. These changes originally targeted Jupyter Kernel Gateway per issue #226 but really belongs in the Notebook server.An optional property has been introduced to adjust the interval in which idle kernels are culled provided they have exceeded the aforementioned timeout property. This property is adjusted by setting
--MappingKernelManager.cull_interval
to a positive value. If not set or non-positive, the default value of 300 (5 minutes) seconds will be used.JKG Issue 227 has been opened to remove the activity tracking that was also in place in the JKG. Prior to that, there could be false reports of active kernels that have been culled by the underlying MappingKernelManager. If not removed, we should minimally adjust the JKG's activity tracking handler to consult the active kernel manager class on each request.