
subprocesses of kernels are not killed by restart #104

Closed
jankatins opened this issue Dec 10, 2015 · 20 comments · Fixed by ipython/ipykernel#869

@jankatins

Restarting a kernel only kills the direct child, but not any opened subprocesses of that child.

This is a problem if the kernel spawns multiple child processes, like the R kernel (on Windows: R - cmd - R - cmd - Rterm), or if the Python kernel opens a subprocess which sleeps.

Issue on the irkernel repo: IRkernel/IRkernel#226

IMO the kernel manager should kill the complete process group, not only the process it directly spawned.

I use restart kernel when I want back the resources I lost in the kernel (due to the stupid commands I executed). Right now I have 4 processes running (cmd - R - cmd - Rterm) from a kernel I killed 15 min ago, and they are still using 25% CPU and 1 GB RAM. So if I have to wait for the process which actually does the work to finish, the restart command is kind of useless :-(

Reproducible example (at least on Windows): put it into a cell, execute it, and then restart the kernel via the notebook UI.

import platform
import subprocess

if platform.system() == 'Windows':
    # Hide the child's console window on Windows.
    CREATE_NO_WINDOW = 0x08000000
    subprocess_creation_flags = CREATE_NO_WINDOW
else:
    # Apparently None won't work here
    subprocess_creation_flags = 0

# Note: with shell=False, POSIX systems need a command list instead of a
# string, e.g. ['python', '-c', 'import time; time.sleep(50000)'].
exec_line = u"""python -c "import time;time.sleep( 50000 )" """
p = subprocess.Popen(exec_line, shell=False,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                     creationflags=subprocess_creation_flags)
p.communicate()
@takluyver
Member

I suspect that makes sense on Unix, but killpg is only available there, so it's not going to help you on Windows.

The question you linked has a couple of answers that would apply on Windows: one with psutil, and one that involves launching TASKKILL in another subprocess.
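
For reference, a minimal sketch of the two approaches; kill_kernel_tree is a hypothetical helper, not jupyter_client code, and the Unix branch assumes the kernel was launched in its own process group (e.g. with start_new_session=True):

import os
import signal
import subprocess
import sys

def kill_kernel_tree(pid):
    """Kill a kernel process together with all of its descendants."""
    if sys.platform == 'win32':
        # /T kills the whole process tree, /F forces termination.
        subprocess.check_output(['taskkill', '/PID', str(pid), '/T', '/F'])
    else:
        # Assumes the kernel was started in its own process group.
        os.killpg(os.getpgid(pid), signal.SIGKILL)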

@jankatins
Author

ipyparallel has a better way to handle signals on Windows: https://github.com/ipython/ipyparallel/blob/d35b4901adde6e651bb6ffc5c8aa9df0d20f1a73/ipyparallel/apps/launcher.py#L282-L288

Would you take a patch which uses check_output(['taskkill', '-pid', str(self.kernel.pid), '-t', '-f']) on Windows instead of self.kernel.kill() in _kill_kernel() (source)?

@takluyver
Member

In principle I'd be OK with that. Though a note on this page is a bit concerning: "Home editions of Windows do not have TASKKILL, use TSKILL instead."

Maybe that's outdated? Do we have a good way of checking?

@jankatins
Author

Not on this side: only pro systems here :-/

@jankatins
Author

BTW: can someone on Linux check if the problem exists there, too? E.g. run the cell in the initial message, restart the kernel, and see whether the subprocess is also gone...?

@takluyver
Member

Yes, with a bit of adapting (turning the command string into a command list), I can run that on Linux, and when I restart the kernel, the subprocess is orphaned and continues running.

However, thinking about this some more, I'm not sure that the proposed fixes will help. When we restart a kernel, we first ask it nicely to shut down, and only if that doesn't work do we forcibly kill it. So I guess it should be the kernel's own responsibility to deal with child processes in the first instance.

@jankatins
Author

jankatins commented Apr 25, 2016

OK, a bit of googling turned up that taskkill is not available in Win7 Home. I've not found out about Win8 and Win10. So the workaround would be to try taskkill, check whether the command was found, and if not fall back to kernel.kill(), the old subprocess way...

RE "subprocess cleanup": that's true as long as the kernel can answer, but sometimes that's not the case: e.g. if the user starts a http server for displaying a plot and then the plotting itself takes too long, so that the kernel is unreachable and can't answer the "shutdown kernel" message. Or the irkernel call chain mess where the kernel isn't responding as long as it executes something...

Another way might be using psutil: http://stackoverflow.com/questions/4789837/how-to-terminate-a-python-subprocess-launched-with-shell-true/25134985#25134985 https://github.com/giampaolo/psutil/blob/1b71ca531ba377e8c71d2c581debfce7eae35a40/psutil/tests/__init__.py#L326

-> If psutil is importable, use that method to kill all children, otherwise the old one?
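
A minimal sketch of that psutil-based fallback, along the lines of the linked Stack Overflow answer (kill_proc_tree is an illustrative name, not jupyter_client code):

import psutil

def kill_proc_tree(pid, include_parent=True, timeout=5):
    """Kill a process and all of its descendants via psutil."""
    parent = psutil.Process(pid)
    children = parent.children(recursive=True)
    for child in children:
        child.kill()
    psutil.wait_procs(children, timeout=timeout)
    if include_parent:
        parent.kill()
        parent.wait(timeout)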

@jankatins
Author

The "kill all children" might also becomes a necessity if the future kernel nanny becomes unresponsive itself (not sure if this might ever happen...), e.g. how to kill the chain nanny - kernel in that case.

@takluyver
Member

Part of the idea of the kernel nanny is that it should always be responsive, because it's not running user code, or any significant tasks. So the frontend would ask the nanny to shut down, and the nanny would deal with asking the kernel nicely and then forcefully killing it if necessary.

However, since the normal case is a kernel that is responsive, I think the first thing is to work out how kernels should shut themselves down and terminate child processes, before we worry too much about what to do when they're not responsive.

@jankatins
Author

I think the first thing is to work out how kernels should shut themselves down and terminate child processes, before we worry too much about what to do when they're not responsive.

Is that for this issue or the kernel nanny proposal? If the first: that's over my head :-/

-> Should I still submit a PR for the taskkill (or even the psutil) way to kill the kernel and all its children?

@takluyver
Member

That's for this issue. I don't think the kernel nanny thing changes this in any significant way.

I'd rather leave this in a consistent state, i.e. as it is now, than introduce a bunch of extra complexity that fixes it in some situations but not others. Especially if we don't know how to fix it in the most common case, where the kernel is responding.

@jankatins
Author

jankatins commented Apr 25, 2016

So, here is my argument for having such a "kernel massacre" function included in both the nanny and the jupyter_client code. I think there are at least two use cases for such a function:

  • the initial case of the non-responding R kernel which should get restarted / killed (e.g. an endless loop). In this case, the R kernel can't handle its children, because a) it is busy executing the non-responding code and doesn't see the shutdown message, and b) the kernel is not the parent but a child in the call chain: notebook-server - r.exe - cmd - rterm.exe (probably longer when not directly calling the 64-bit version) -> the notebook kills r.exe, but rterm.exe is the actual R kernel.
  • the case simulated by the code above: the kernel opens a subprocess but the user doesn't kill it before killing the kernel. This asks for a solution in the kernel, but that means every kernel must implement a way to kill its own children (I opened this for the R kernel: on shutdown, kill all children IRkernel/IRkernel#307).

But IMO both cases are a nice feature of the kernel nanny (and of the code in jupyter_client, which would play the same role when there is no nanny): cleaning up after the user (sketched below):

  1. receive the shutdown request from the server (or being the server)
  2. register the children of the kernel process
  3. ask the kernel to shutdown itself
  4. wait a few seconds
  5. cleanup: kill every kernel/child process which is still around
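
A rough sketch of steps 2-5, assuming psutil is available; kernel.request_shutdown() is a hypothetical stand-in for the real message-based shutdown:

import psutil

def shutdown_and_reap(kernel, timeout=5.0):
    """Sketch of the cleanup sequence above (steps 2-5)."""
    proc = psutil.Process(kernel.pid)
    # 2. register the kernel's children before they can be orphaned
    children = proc.children(recursive=True)
    # 3. ask the kernel to shut itself down (hypothetical stand-in for
    #    the message-based shutdown request)
    kernel.request_shutdown()
    # 4. wait a few seconds
    _, alive = psutil.wait_procs([proc] + children, timeout=timeout)
    # 5. kill every kernel/child process which is still around
    for p in alive:
        try:
            p.kill()
        except psutil.NoSuchProcess:
            pass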

@takluyver
Member

psutil can find all children of a process, though killpg seems like a better way to do it on Unix. Maybe we just need to depend on psutil on Windows.

@mlucool
Contributor

mlucool commented Feb 1, 2022

Hi,

Adding another simple example of this issue:

Run the following code, then restart your notebook. Then run it again and you'll see the children piling up (the demo assumes you have nothing else sleeping).

import getpass, os, psutil, signal, subprocess

def find_interesting_children():
    user = getpass.getuser()
    ret = []
    for pid in list(psutil.pids()):
        try:
            proc = psutil.Process(pid)
        except Exception:
            # Ignore now-dead processes.
            continue
        if proc.username() != user:
            continue
        if proc.name() != 'sleep':
            continue
        ret.append(pid)
    return ret

def preexec_function():
    signal.signal(signal.SIGINT, signal.SIG_IGN)

print(find_interesting_children())
child = subprocess.Popen(['sleep', '180d'], preexec_fn=preexec_function)
print(find_interesting_children())

If you want to clean up, run this later:

for pid in find_interesting_children():
    os.kill(pid, signal.SIGTERM)

On a high level, this stems from the choice in this toy example to ignore SIGINT. Sometimes a subprocess may do this and it's out of your control, so restarting kernels leaves behind a mess (if you don't ignore SIGINT, things work as expected). Can we add logic to handle this variant of process cleanup?

@kevin-bates
Member

Hi @mlucool, Thanks for the test case!

I've spent the better part of today looking into this and it's quite interesting (IMHO). There are three areas of interest, two of which are still a bit of a mystery to me.

  1. First, I think the LocalProvisioner has a regression issue in that it's issuing Popen kill() and terminate() calls directly, rather than preferring to issue the respective signals (SIGKILL, SIGTERM) against the process group - as was the case in 6.x. This issue, though, does not manifest itself in the current incarnation of things (more below).
  2. It's a mystery to me why ignoring SIGINT has an impact when ignoring any other signal (or keeping the default SIGINT handler) does not!
  3. There are two forms of shutdown that can happen, "immediate" and "graceful". The default (via the now parameter) is "graceful", meaning that a shutdown_request message is sent and, unless there are issues, nothing else occurs (if the message is detected as not having worked [via polling], then SIGTERM/SIGKILL are sent). "Immediate" shutdown issues a kill() call directly. (The regression mentioned in item 1 is that this should use the signal against the process group, before issuing kill().)

What I don't understand is why handling SIGINT makes a difference here. Seems like it, coupled with the message-based shutdown, is an area that should be looked at (at the kernel level I presume). (Does the kernel's message-based shutdown try to send SIGINT to its process group as its form of cleaning up child processes by any chance? Just trying to explain what is seen.)

One thing we could do relative to restarts is perform shutdown in an immediate fashion (now=True). This will then trigger the signal/process-group form of shutdown (provided the LocalProvisioner is fixed).

This would still leave an issue when just terminating a kernel that has set up its child processes to ignore SIGINT since that uses "graceful" termination.

For now, I'd like to fix the regression issue in LocalProvisioner. Of course, this fix won't be "visible" until immediate shutdowns are more prevalent.

Should we make the shutdown issued from restart_kernel use now=True and bypass the message-based shutdown on restarts? Another approach might be to update the handler in jupyter_server to use now=True when restarting the kernel.
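
For context, the now flag is already exposed on KernelManager, so the restart-side change would amount to something like the following usage sketch (exact behavior depends on the LocalProvisioner fix described above):

from jupyter_client import KernelManager

km = KernelManager(kernel_name='python3')
km.start_kernel()
# now=True skips the message-based shutdown_request and goes straight
# to the signal-based path (which, once LocalProvisioner is fixed,
# would target the whole process group).
km.restart_kernel(now=True)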

It would be great to hear from someone with some kernel-side experience and how "shutdown-request" with ignored SIGINT comes into play.

@mlucool
Contributor

mlucool commented Feb 2, 2022

CC @Carreau

I think what's happening is as follows (I am not familiar with the client or kernel code, so this is a guess):

  1. SIGINT is sent to the kernel and it relays it to its children
  2. The kernel is correctly shut down but its children are not
  3. The children's parent is now set to 1 (they are reparented to init), so jupyter can no longer tell that it should be trying to kill them

Proposal (maybe this should move to IPykernel/IPython?):

  1. On SIGINT the kernel does the normal SIGINT things but does not exit until all its children exit first.
  2. Because jupyter_client can send SIGTERM too, the kernel should also wait to exit on SIGTERM until all its children exit first.
  3. The kernel shutdown code should be robust to getting a SIGTERM while blocking on children as a result of a prior SIGINT.
  4. jupyter_client will do its usual escalation from SIGINT to SIGTERM to SIGKILL in a way that causes the whole process group to receive each signal (a Unix-only sketch of this escalation follows the list).
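
A Unix-only sketch of that escalation, assuming the kernel runs in its own process group (signal_group_escalating is an illustrative helper, not existing jupyter_client code):

import os
import signal
import time

def signal_group_escalating(pgid, wait=2.0):
    """Send each signal to the whole process group, escalating while it survives."""
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(pgid, sig)
        except ProcessLookupError:
            return  # the whole group is already gone
        time.sleep(wait)
        try:
            os.killpg(pgid, 0)  # signal 0 probes whether anything survived
        except ProcessLookupError:
            return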

@kevin-bates
Member

On SIGINT the kernel does the normal SIGINT things but does not exit until all its children exit first.

The SIGINT issued during shutdown_kernel() is merely an attempt to have any current processing interrupted prior to the process's termination (via "shutdown_request" and, possibly, SIGTERM/SIGKILL). It was (relatively) recently added and is literally the interrupt_kernel() method itself. It is not meant to terminate anything.

I think the parent process should be responsible for the termination of its immediate children (and so on). We have the ability to cheat when the kernel is a local process and send a signal to the process group but (per my previous comment) that only occurs in "fallback" situations when the message-based shutdown request has proven ineffective. (Note: this was one of the primary reasons the interrupt was introduced in the shutdown sequence, so the kernel could respond to "shutdown_request" messages.)

I would agree that when receiving a "shutdown_request" message, the kernel process should do whatever it can to terminate its children. This could probably be something like os.killpg(os.getpgrp(), SIGTERM) (and the equivalent windows code) as its last act.
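
A minimal Unix-only sketch of such a last act; the kernel ignores SIGTERM itself so it can still finish its own orderly exit (the Windows equivalent would need different machinery):

import os
import signal

def kill_children_last_act():
    """On shutdown_request, take the rest of the process group down with us."""
    # Ignore SIGTERM ourselves so the kernel survives long enough
    # to finish its own shutdown sequence.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    try:
        os.killpg(os.getpgrp(), signal.SIGTERM)
    except ProcessLookupError:
        pass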

Carreau added a commit to Carreau/ipykernel that referenced this issue Feb 17, 2022
Fixes jupyter/jupyter_client#104

This should make sure we properly cull all subprocesses at shutdown.
It changes one of the private methods from sync to async in order to
not use time.sleep or a thread, so this may affect subclasses, though I
doubt it.

It's also not completely clear to me whether this works on Windows, as
SIGINT I believe is not a thing there.

Regardless, as this affects things like dask and others that are mostly
on Unix, it should be an improvement.

It does the following, stopping as soon as it does not find any more
children of the current process:

 - Send SIGINT to everything.
 - Immediately send SIGTERM in a loop with an exponential backoff from
   0.01 to 1 second, roughly multiplying the delay until the next send
   by 3 each time.
 - Switch to sending SIGKILL with the same backoff.

There is no delay after SIGINT, as this is just a courtesy.
The backoff delays are not configurable. I can imagine that on slow
systems it may make sense to make them configurable.
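
For illustration, a rough reconstruction of the loop this commit message describes, using psutil to enumerate descendants; this is not the actual patch, and as the message notes, the SIGINT step is Unix-oriented:

import signal
import time

import psutil

def cull_children(max_delay=1.0):
    """Escalate SIGINT -> SIGTERM -> SIGKILL to all descendants, with backoff."""
    me = psutil.Process()
    # SIGINT first, as a courtesy; no delay afterwards.
    for child in me.children(recursive=True):
        child.send_signal(signal.SIGINT)
    for sig in (signal.SIGTERM, signal.SIGKILL):
        delay = 0.01
        while delay <= max_delay:
            children = me.children(recursive=True)
            if not children:
                return  # no descendants left: done
            for child in children:
                try:
                    child.send_signal(sig)
                except psutil.NoSuchProcess:
                    pass
            time.sleep(delay)
            delay *= 3  # exponential backoff, roughly 0.01 -> 1 second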
@mlucool
Contributor

mlucool commented Mar 14, 2022

FTR, ipykernel 6.9.2 fixes #104 (comment).

@alita-moore

Is there a way to make jupyter NOT stop sub-processes?

@kevin-bates
Member

Hi @alita-moore. Life-cycle management of the kernel process (i.e., the sub-process) is the responsibility of the kernel provisioner. This gives one the ability to implement whatever kind of process management they'd like, beyond what the local provisioner provides by default.
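
For example, a rough sketch of such a provisioner, assuming the LocalProvisioner API in recent jupyter_client (method names may vary across versions); it would still need to be registered as a kernel provisioner entry point and referenced from the kernelspec:

from jupyter_client.provisioning import LocalProvisioner

class KeepChildrenProvisioner(LocalProvisioner):
    """Hypothetical provisioner that signals only the kernel process
    itself, deliberately leaving its sub-processes running."""

    async def kill(self, restart=False):
        if self.process:
            self.process.kill()  # immediate kernel process only, no group kill

    async def terminate(self, restart=False):
        if self.process:
            self.process.terminate()  # likewise, no process-group signalling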
