
subprocesses of kernels are not killed by restart #104

Closed
jankatins opened this issue Dec 10, 2015 · 20 comments · Fixed by ipython/ipykernel#869

@jankatins

Restarting a kernel only kills the direct child, but not any opened subprocesses of that child.

This is a problem if the kernel spawns multiple child processes, like the R kernel (on Windows: R - cmd - R - cmd - Rterm), or if the Python kernel opens a subprocess which sleeps.

Issue on the irkernel repo: IRkernel/IRkernel#226

IMO the kernel manager should kill the complete process group, not only the process it directly spawned.

I use restart kernel when I want back the resources I lost in the kernel (due to the stupid commands I executed). Right now I have 4 processes running (cmd - R - cmd - Rterm) from a kernel I killed 15 min ago, and they are still using 25% CPU and 1 GB RAM. So if I have to wait for the process which actually does the work to finish, the restart command is kind of useless :-(

Reproducible example (at least on Windows): put it into a cell, execute it, and then restart the kernel via the notebook UI.

import platform
import subprocess

if platform.system() == 'Windows':
    # Hide the child's console window on Windows.
    CREATE_NO_WINDOW = 0x08000000
    subprocess_creation_flags = CREATE_NO_WINDOW
else:
    # Apparently None won't work here
    subprocess_creation_flags = 0

# Note: with shell=False, POSIX systems need a command list instead of a
# string, e.g. ['python', '-c', 'import time; time.sleep(50000)'].
exec_line = u"""python -c "import time;time.sleep( 50000 )" """
p = subprocess.Popen(exec_line, shell=False,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                     creationflags=subprocess_creation_flags)
p.communicate()
@takluyver
Member

I suspect that makes sense on Unix, but killpg is only available there, so it's not going to help you on Windows.

The question you linked has a couple of answers that would apply on Windows: one with psutil, and one that involves launching TASKKILL in another subprocess.
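
For reference, a minimal sketch of the two approaches; kill_kernel_tree is a hypothetical helper, not jupyter_client code, and the Unix branch assumes the kernel was launched in its own process group (e.g. with start_new_session=True):

import os
import signal
import subprocess
import sys

def kill_kernel_tree(pid):
    """Kill a kernel process together with all of its descendants."""
    if sys.platform == 'win32':
        # /T kills the whole process tree, /F forces termination.
        subprocess.check_output(['taskkill', '/PID', str(pid), '/T', '/F'])
    else:
        # Assumes the kernel was started in its own process group.
        os.killpg(os.getpgid(pid), signal.SIGKILL)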

@jankatins
Author

ipyparallel has a better way to handle signals on Windows: https://github.com/ipython/ipyparallel/blob/d35b4901adde6e651bb6ffc5c8aa9df0d20f1a73/ipyparallel/apps/launcher.py#L282-L288

Would you take a patch which uses check_output(['taskkill', '-pid', str(self.kernel.pid), '-t', '-f']) on Windows instead of self.kernel.kill() in _kill_kernel() (source)?

@takluyver
Member

In principle I'd be OK with that. Though a note on this page is a bit concerning: "Home editions of Windows do not have TASKKILL, use TSKILL instead."

Maybe that's outdated? Do we have a good way of checking?

@jankatins
Author

Not on this side: only pro systems here :-/

@jankatins
Author

BTW: can someone on Linux check if the problem exists there, too? E.g. run the cell in the initial message, restart the kernel, and see whether the subprocess is also gone...?

@takluyver
Member

Yes, with a bit of adapting (turning the command string into a command list), I can run that on Linux, and when I restart the kernel, the subprocess is orphaned and continues running.

However, thinking about this some more, I'm not sure that the proposed fixes will help. When we restart a kernel, we first ask it nicely to shut down, and only if that doesn't work do we forcibly kill it. So I guess it should be the kernel's own responsibility to deal with child processes in the first instance.

@jankatins
Author

jankatins commented Apr 25, 2016

OK, a bit of googling turned up that taskkill is not available in Win7 Home. I've not found out about Win8 and Win10. So the workaround would be to try taskkill, check whether the command was found, and if not fall back to kernel.kill(), the old subprocess way...

RE "subprocess cleanup": that's true as long as the kernel can answer, but sometimes that's not the case: e.g. if the user starts a http server for displaying a plot and then the plotting itself takes too long, so that the kernel is unreachable and can't answer the "shutdown kernel" message. Or the irkernel call chain mess where the kernel isn't responding as long as it executes something...

Another way might be using psutil: http://stackoverflow.com/questions/4789837/how-to-terminate-a-python-subprocess-launched-with-shell-true/25134985#25134985 https://github.com/giampaolo/psutil/blob/1b71ca531ba377e8c71d2c581debfce7eae35a40/psutil/tests/__init__.py#L326

-> If psutil is importable, use that method to kill all children, otherwise the old one?
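
A minimal sketch of that psutil-based fallback, along the lines of the linked Stack Overflow answer (kill_proc_tree is an illustrative name, not jupyter_client code):

import psutil

def kill_proc_tree(pid, include_parent=True, timeout=5):
    """Kill a process and all of its descendants via psutil."""
    parent = psutil.Process(pid)
    children = parent.children(recursive=True)
    for child in children:
        child.kill()
    psutil.wait_procs(children, timeout=timeout)
    if include_parent:
        parent.kill()
        parent.wait(timeout)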

@jankatins
Author

The "kill all children" might also becomes a necessity if the future kernel nanny becomes unresponsive itself (not sure if this might ever happen...), e.g. how to kill the chain nanny - kernel in that case.

@takluyver
Member

Part of the idea of the kernel nanny is that it should always be responsive, because it's not running user code, or any significant tasks. So the frontend would ask the nanny to shut down, and the nanny would deal with asking the kernel nicely and then forcefully killing it if necessary.

However, since the normal case is a kernel that is responsive, I think the first thing is to work out how kernels should shut themselves down and terminate child processes, before we worry too much about what to do when they're not responsive.

@jankatins
Author

I think the first thing is to work out how kernels should shut themselves down and terminate child processes, before we worry too much about what to do when they're not responsive.

Is that for this issue or the kernel nanny proposal? If the first: that's over my head :-/

-> Should I still submit a PR for the taskkill (or even the psutil) way to kill the kernel and all its children?

@takluyver
Member

That's for this issue. I don't think the kernel nanny thing changes this in any significant way.

I'd rather leave this in a consistent state, i.e. as it is now, than introduce a bunch of extra complexity that fixes it in some situations but not others. Especially if we don't know how to fix it in the most common case, where the kernel is responding.

@jankatins
Author

jankatins commented Apr 25, 2016

So, here is my argument for having such a "kernel massacre" function included in both the nanny and the jupyter_client code. I think there are at least two use cases for such a function:

  • the initial case of the non-responding R kernel which should get restarted / killed (e.g. an endless loop). In this case, the R kernel can't handle its children, because a) it is busy executing the non-responding code and doesn't see the shutdown message, and b) the kernel is not the parent but a child in the call chain: notebook-server - r.exe - cmd - rterm.exe (probably longer when not directly calling the 64-bit version) -> the notebook kills r.exe, but rterm.exe is the actual R kernel.
  • the case simulated by the code above: the kernel opens a subprocess but the user doesn't kill it before killing the kernel. This asks for a solution in the kernel, but that means every kernel must implement a way to kill its own children (I opened this for the R kernel: on shutdown, kill all children IRkernel/IRkernel#307).

But IMO both cases are a nice feature of the kernel nanny (and of the code in jupyter_client, which would play the same role when there is no nanny): cleaning up after the user (sketched below):

  1. receive the shutdown request from the server (or being the server)
  2. register the children of the kernel process
  3. ask the kernel to shutdown itself
  4. wait a few seconds
  5. cleanup: kill every kernel/child process which is still around
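
A rough sketch of steps 2-5, assuming psutil is available; kernel.request_shutdown() is a hypothetical stand-in for the real message-based shutdown:

import psutil

def shutdown_and_reap(kernel, timeout=5.0):
    """Sketch of the cleanup sequence above (steps 2-5)."""
    proc = psutil.Process(kernel.pid)
    # 2. register the kernel's children before they can be orphaned
    children = proc.children(recursive=True)
    # 3. ask the kernel to shut itself down (hypothetical stand-in for
    #    the message-based shutdown request)
    kernel.request_shutdown()
    # 4. wait a few seconds
    _, alive = psutil.wait_procs([proc] + children, timeout=timeout)
    # 5. kill every kernel/child process which is still around
    for p in alive:
        try:
            p.kill()
        except psutil.NoSuchProcess:
            pass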

@takluyver
Member

psutil can find all children of a process, though killpg seems like a better way to do it on Unix. Maybe we just need to depend on psutil on Windows.

@mlucool
Contributor

mlucool commented Feb 1, 2022

Hi,

Adding another simple example of this issue:

Run the following code, then restart your notebook. Then run it again and you'll see the children piling up (the demo assumes you have nothing else sleeping).

import getpass, os, psutil, signal, subprocess

def find_interesting_children():
    user = getpass.getuser()
    ret = []
    for pid in list(psutil.pids()):
        try:
            proc = psutil.Process(pid)
        except Exception:
            # Ignore now-dead processes.
            continue
        if proc.username() != user:
            continue
        if proc.name() != 'sleep':
            continue
        ret.append(pid)
    return ret

def preexec_function():
    signal.signal(signal.SIGINT, signal.SIG_IGN)

print(find_interesting_children())
child = subprocess.Popen(['sleep', '180d'], preexec_fn=preexec_function)
print(find_interesting_children())

If you want to clean up, run this later:

for pid in find_interesting_children():
    os.kill(pid, signal.SIGTERM)

On a high level, this stems from the choice in this toy example to ignore SIGINT. Sometimes a subprocess may do this and it's out of your control, so restarting kernels leaves behind a mess (if you don't ignore SIGINT, things work as expected). Can we add logic to handle this variant of process cleanup?

@kevin-bates
Member

Hi @mlucool, Thanks for the test case!

I've spent the better part of today looking into this and it's quite interesting (IMHO). There are three areas of interest, two of which are still a bit of a mystery to me.

  1. First, I think the LocalProvisioner has a regression issue in that it's issuing Popen kill() and terminate() calls directly, rather than preferring to issue the respective signals (SIGKILL, SIGTERM) against the process group - as was the case in 6.x. This issue, though, does not manifest itself in the current incarnation of things (more below).
  2. It's a mystery to me why ignoring SIGINT has an impact when ignoring any other signal (or keeping the default SIGINT handler) does not!
  3. There are two forms of shutdown that can happen, "immediate" and "graceful". The default (via the now parameter) is "graceful", meaning that a shutdown_request message is sent and, unless there are issues, nothing else occurs (if the message is detected as not having worked [via polling], then SIGTERM/SIGKILL are sent). "Immediate" shutdown issues a kill() call directly. (The regression mentioned in item 1 is that this should use the signal against the process group, before issuing kill().)

What I don't understand is why handling SIGINT makes a difference here. Seems like it, coupled with the message-based shutdown, is an area that should be looked at (at the kernel level I presume). (Does the kernel's message-based shutdown try to send SIGINT to its process group as its form of cleaning up child processes by any chance? Just trying to explain what is seen.)

One thing we could do relative to restarts is perform shutdown in an immediate fashion (now=True). This will then trigger the signal/process-group form of shutdown (provided the LocalProvisioner is fixed).

This would still leave an issue when just terminating a kernel that has set up its child processes to ignore SIGINT since that uses "graceful" termination.

For now, I'd like to fix the regression issue in LocalProvisioner. Of course, this fix won't be "visible" until immediate shutdowns are more prevalent.

Should we make the shutdown issued from restart_kernel use now=True and bypass the message-based shutdown on restarts? Another approach might be to update the handler in jupyter_server to use now=True when restarting the kernel.
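
For context, the now flag is already exposed on KernelManager, so the restart-side change would amount to something like the following usage sketch (exact behavior depends on the LocalProvisioner fix described above):

from jupyter_client import KernelManager

km = KernelManager(kernel_name='python3')
km.start_kernel()
# now=True skips the message-based shutdown_request and goes straight
# to the signal-based path (which, once LocalProvisioner is fixed,
# would target the whole process group).
km.restart_kernel(now=True)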

It would be great to hear from someone with some kernel-side experience and how "shutdown-request" with ignored SIGINT comes into play.

@mlucool
Contributor

mlucool commented Feb 2, 2022

CC @Carreau

I think what's happening is as follows (I am not familiar with the client or kernel code, so this is a guess):

  1. SIGINT is sent to the kernel and it relays it to its children
  2. The kernel is correctly shut down but its children are not
  3. The children's parent is now set to 1 (they are reparented to init), so jupyter can no longer tell that it should be trying to kill them

Proposal (maybe this should move to IPykernel/IPython?):

  1. On SIGINT the kernel does the normal SIGINT things but does not exit until all its children exit first.
  2. Because jupyter_client can send SIGTERM too, the kernel should also wait to exit on SIGTERM until all its children exit first.
  3. The kernel shutdown code should be robust to getting a SIGTERM while blocking on children as a result of a prior SIGINT.
  4. jupyter_client will do its usual escalation from SIGINT to SIGTERM to SIGKILL in a way that causes the whole process group to receive each signal (a Unix-only sketch of this escalation follows the list).
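
A Unix-only sketch of that escalation, assuming the kernel runs in its own process group (signal_group_escalating is an illustrative helper, not existing jupyter_client code):

import os
import signal
import time

def signal_group_escalating(pgid, wait=2.0):
    """Send each signal to the whole process group, escalating while it survives."""
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(pgid, sig)
        except ProcessLookupError:
            return  # the whole group is already gone
        time.sleep(wait)
        try:
            os.killpg(pgid, 0)  # signal 0 probes whether anything survived
        except ProcessLookupError:
            return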

@kevin-bates
Member

On SIGINT the kernel does the normal SIGINT things but does not exit until all its children exit first.

The SIGINT issued during shutdown_kernel() is merely an attempt to have any current processing interrupted prior to the process's termination (via "shutdown_request" and, possibly, SIGTERM/SIGKILL). It was (relatively) recently added and is literally the interrupt_kernel() method itself. It is not meant to terminate anything.

I think the parent process should be responsible for the termination of its immediate children (and so on). We have the ability to cheat when the kernel is a local process and send a signal to the process group but (per my previous comment) that only occurs in "fallback" situations when the message-based shutdown request has proven ineffective. (Note: this was one of the primary reasons the interrupt was introduced in the shutdown sequence, so the kernel could respond to "shutdown_request" messages.)

I would agree that when receiving a "shutdown_request" message, the kernel process should do whatever it can to terminate its children. This could probably be something like os.killpg(os.getpgrp(), SIGTERM) (and the equivalent windows code) as its last act.
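
A minimal Unix-only sketch of such a last act; the kernel ignores SIGTERM itself so it can still finish its own orderly exit (the Windows equivalent would need different machinery):

import os
import signal

def kill_children_last_act():
    """On shutdown_request, take the rest of the process group down with us."""
    # Ignore SIGTERM ourselves so the kernel survives long enough
    # to finish its own shutdown sequence.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    try:
        os.killpg(os.getpgrp(), signal.SIGTERM)
    except ProcessLookupError:
        pass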

Carreau added a commit to Carreau/ipykernel that referenced this issue Feb 17, 2022
Fixes jupyter/jupyter_client#104

This should make sure we properly cull all subprocesses at shutdown.
It changes one of the private methods from sync to async in order to
not use time.sleep or a thread, so this may affect subclasses, though I
doubt it.

It's also not completely clear to me whether this works on Windows, as
SIGINT I believe is not a thing there.

Regardless, as this affects things like dask and others that are mostly
on Unix, it should be an improvement.

It does the following, stopping as soon as it does not find any more
children of the current process:

 - Send SIGINT to everything.
 - Immediately send SIGTERM in a loop with an exponential backoff from
   0.01 to 1 second, roughly multiplying the delay until the next send
   by 3 each time.
 - Switch to sending SIGKILL with the same backoff.

There is no delay after SIGINT, as this is just a courtesy.
The backoff delays are not configurable. I can imagine that on slow
systems it may make sense to make them configurable.
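
For illustration, a rough reconstruction of the loop this commit message describes, using psutil to enumerate descendants; this is not the actual patch, and as the message notes, the SIGINT step is Unix-oriented:

import signal
import time

import psutil

def cull_children(max_delay=1.0):
    """Escalate SIGINT -> SIGTERM -> SIGKILL to all descendants, with backoff."""
    me = psutil.Process()
    # SIGINT first, as a courtesy; no delay afterwards.
    for child in me.children(recursive=True):
        child.send_signal(signal.SIGINT)
    for sig in (signal.SIGTERM, signal.SIGKILL):
        delay = 0.01
        while delay <= max_delay:
            children = me.children(recursive=True)
            if not children:
                return  # no descendants left: done
            for child in children:
                try:
                    child.send_signal(sig)
                except psutil.NoSuchProcess:
                    pass
            time.sleep(delay)
            delay *= 3  # exponential backoff, roughly 0.01 -> 1 second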
@mlucool
Contributor

mlucool commented Mar 14, 2022

FTR, ipykernel 6.9.2 fixes #104 (comment).

@alita-moore

Is there a way to make jupyter NOT stop sub-processes?

@kevin-bates
Member

Hi @alita-moore. Life-cycle management of the kernel process (i.e., the sub-process) is the responsibility of the kernel provisioner. This gives one the ability to implement whatever kind of process management they'd like, beyond what the local provisioner provides by default.
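
For example, a rough sketch of such a provisioner, assuming the LocalProvisioner API in recent jupyter_client (method names may vary across versions); it would still need to be registered as a kernel provisioner entry point and referenced from the kernelspec:

from jupyter_client.provisioning import LocalProvisioner

class KeepChildrenProvisioner(LocalProvisioner):
    """Hypothetical provisioner that signals only the kernel process
    itself, deliberately leaving its sub-processes running."""

    async def kill(self, restart=False):
        if self.process:
            self.process.kill()  # immediate kernel process only, no group kill

    async def terminate(self, restart=False):
        if self.process:
            self.process.terminate()  # likewise, no process-group signalling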
