
Remove dead workers immediately after killing. #1084

Closed
rwe wants to merge 8 commits from the cleanup-async-reaping branch

Conversation

rwe
Contributor

@rwe rwe commented Jul 13, 2015

This also organizes the shared "worker cleanup" operations like closing worker.tmp and calling self.cfg.worker_exit.

This means that self.WORKERS will always be updated after spawning/killing, and may only be out of sync due to unknown dead processes (handled in reap_workers).

This was done based on conversation in #1078. The original suggestion was to maintain a dead-worker list; I initially felt this approach was a little simpler and accomplished the same thing, but I eventually came to the same conclusion.

if not worker:
    continue
worker.tmp.close()
self.reap_dead_worker(wpid)
Contributor Author

Note that this did not call self.cfg.worker_exit(self, worker) before, but does now. I believe that was an oversight and this is a correct change, but it could use a double-check in case it was intentional.

@rwe rwe force-pushed the cleanup-async-reaping branch from dc3f522 to de31a19 on July 13, 2015 at 21:51
extra={"metric": "gunicorn.workers",
"value": len(workers),
"value": len(self.WORKERS),
Contributor Author

Before, workers and self.WORKERS could be out of sync here if workers were killed, because they would be reaped asynchronously. Because both their insertion (via spawn_worker) and removal (via kill_worker) are now synchronous, self.WORKERS is the correct value here.

@tilgovi
Collaborator

tilgovi commented Jul 13, 2015

I think reap_dead_worker can just be inlined in cleanup_worker and everything can be done when reap_workers happens on SIGCHLD. I think it's confusing to have several places where cleanup is called.

@tilgovi
Collaborator

tilgovi commented Jul 13, 2015

Oh, I see. But if we do that, then we can't pop synchronously, because by the time we get the SIGCHLD we don't have the worker object.

@tilgovi
Collaborator

tilgovi commented Jul 13, 2015

I'll wait for others to comment and take a look again tomorrow.

@tilgovi
Collaborator

tilgovi commented Jul 13, 2015

Thanks for doing this.

@rwe
Contributor Author

rwe commented Jul 13, 2015

No problem @tilgovi. If I read your comments correctly, you arrived at the same conclusion I did here, in terms of cleaning up immediately versus needing to keep the worker object around in a separate non-WORKERS list. Is that right?

@rwe
Contributor Author

rwe commented Jul 14, 2015

As I look more into this, I'm starting to think that the original suggestion of keeping a 'killed worker' list is correct. Otherwise killing and waiting for ECHILD must be done synchronously to avoid prematurely signaling the child's exit before it's handled the kill signal.
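
For illustration, a minimal sketch of what the fully synchronous alternative would look like, with a hypothetical helper name (this is not what the PR ends up doing):

import errno
import os
import signal

def kill_and_reap(pid, sig=signal.SIGTERM):
    # Send the signal, then block until the child has actually exited, so
    # that its death is never announced before it has handled the signal.
    os.kill(pid, sig)
    try:
        os.waitpid(pid, 0)
    except OSError as e:
        # ECHILD: the child was already reaped elsewhere; nothing to wait for.
        if e.errno != errno.ECHILD:
            raise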

rwe added 3 commits July 13, 2015 20:08
This also organizes the shared "worker cleanup" operations like closing
`worker.tmp` and calling `self.cfg.worker_exit`.

This means that `self.WORKERS` will always be updated after
spawning/killing, and may only be out of sync due to unknown dead
processes (handled in `reap_workers`).
@rwe rwe force-pushed the cleanup-async-reaping branch from de31a19 to 15c43f0 on July 14, 2015 at 00:12
    worker.tmp.close()
    self.cfg.worker_exit(self, worker)
except:
    pass
Contributor Author

Removed because this portion of spawn_worker executes in the worker process itself, and we let the parent handle its exit in reap_workers.

@rwe
Contributor Author

rwe commented Jul 14, 2015

OK, I believe this is correct now. It removes them from self.WORKERS immediately but still cleans them up properly when SIGCHLD comes in.

@rwe
Contributor Author

rwe commented Jul 16, 2015

This is ready for review, if anyone's available to take a look. My main interest in this PR is to unblock #1078

continue
worker.tmp.close()

worker = (self.WORKERS.pop(wpid, None)
Collaborator

I would add a comment here to explain that this function might run before the SIGCHLD handler for this pid.

Collaborator

I assume that's why this is like this.

Contributor Author

No, reap_workers is only called in response to SIGCHLD, as before.

self.DYING_WORKERS exists so that kill_worker can remove pids from self.WORKERS while we're waiting for the SIGCHLD to come in and be processed.

Contributor Author

That means that a SIGCHLD can be caused either by the child exiting unexpectedly or by us killing it intentionally. In the first case, the worker would still have been in self.WORKERS; in the second, we would have moved it to self.DYING_WORKERS.
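
For illustration, a condensed sketch of that flow as arbiter methods (simplified; the names follow this discussion, but the exact code in the diff differs):

import errno
import os

def kill_worker(self, pid, sig):
    # Intentional kill: remove the pid from WORKERS right away so worker
    # counts stay accurate, but keep a reference in DYING_WORKERS until the
    # SIGCHLD for it has been processed.
    worker = self.WORKERS.pop(pid, None)
    if worker is not None:
        self.DYING_WORKERS.setdefault(pid, worker)
    os.kill(pid, sig)

def reap_workers(self):
    # Runs in response to SIGCHLD. The pid is either still in WORKERS (the
    # child died unexpectedly) or in DYING_WORKERS (we killed it on purpose).
    try:
        while True:
            wpid, status = os.waitpid(-1, os.WNOHANG)
            if not wpid:
                break
            worker = (self.WORKERS.pop(wpid, None) or
                      self.DYING_WORKERS.pop(wpid, None))
            if worker is None:
                continue
            worker.tmp.close()
            self.cfg.worker_exit(self, worker)
    except OSError as e:
        if e.errno != errno.ECHILD:
            raise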

@tilgovi
Collaborator

tilgovi commented Jul 16, 2015

Two small comments but otherwise I think it looks fine.

@tilgovi
Collaborator

tilgovi commented Jul 16, 2015

I'd like @berkerpeksag or @benoitc to read this over, but I feel pretty good about it.

@rwe
Contributor Author

rwe commented Jul 16, 2015

Awesome. I made one additional commit to make sure that dying workers are included in stop and murder_workers to preserve that original behavior. Pending further comments, that's all I have.

@@ -424,7 +425,7 @@ def murder_workers(self):
"""
if not self.timeout:
return
workers = list(self.WORKERS.items())
workers = list(self.WORKERS.items() + self.DYING_WORKERS.items())
Collaborator

Unfortunately, this will raise a TypeError in Python 3. I'd write it like:

workers = list(self.WORKERS.items()) + list(self.DYING_WORKERS.items())
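
For context, a quick illustration of the difference (not part of the diff):

a = {"w1": 1}
b = {"w2": 2}

# Python 2: dict.items() returns plain lists, so "+" happens to work.
# Python 3: dict.items() returns view objects, and "a.items() + b.items()"
# raises TypeError: unsupported operand type(s) for +.
workers = list(a.items()) + list(b.items())  # works on both 2 and 3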

Contributor Author

Fixed, thank you!

@berkerpeksag
Collaborator

Looks good to me (except my comment about Python 3). Could you also squash the commits into one? Thanks!

(Should we get this in for the next release or wait for 20.0?)

try:
    worker = self.WORKERS.pop(pid)
    worker.tmp.close()
    self.cfg.worker_exit(self, worker)
Owner

This hook disappeared in the change. It should probably come back.

Collaborator

It's in reap_workers on line 469.

@benoitc
Owner

benoitc commented Jul 17, 2015

I will take the time to read that patch over the weekend. To be honest, I am quite uncomfortable maintaining two lists there, and I am not sure that all race conditions are solved right now. For example, what happens if I try to kill a worker that already died but whose signal we haven't received yet?

We should probably start to add some tests there (but that's outside the scope of this PR).

Note: I still don't understand why moving the debug log to the end of the if block, as suggested, wasn't enough to solve #1078.

@tilgovi
Collaborator

tilgovi commented Jul 17, 2015

I reviewed it, and we discussed the two places where races can occur. That is why there are two places where we pop from both lists.

Is there a reason you are uncomfortable with two lists? In my opinion, it makes reasoning about the concurrency easier.

@benoitc
Owner

benoitc commented Jul 22, 2015

@tilgovi I am worried about a race condition when you remove a socket from SOCKETS and pour it into DYING_WORKERS. I may just be too cautious there.

Anyway, I like the idea of marking a socket as dead ASAP and removing it from the count. I wonder if we can then just keep a reference to them in DYING_WORKERS and, when it's time, garbage-collect them and finally remove them from WORKERS in one place. That way the change would be really atomic and not lock the VM too much. Thoughts?

@tilgovi
Collaborator

tilgovi commented Jul 22, 2015

I hear you. Signals are tricky places for concurrency problems.

I'll double-check this and see if there's anything I'd change.

@rwe
Contributor Author

rwe commented Jul 22, 2015

@benoitc @tilgovi Are you referring to WORKERS when you say SOCKETS? I'm not finding that var anywhere.

Sounds like you might be referring to this?

worker = self.WORKERS.pop(pid, None)
# <-- If `reap_workers` were run here, we would not clean up the worker
#     until it later timed out and got collected in `murder_workers()`.
#     That would be annoying (though probably not problematic).
#
#     However, that can't happen: the signal handling just adds to SIG_QUEUE,
#     which is processed in the main run loop. `reap_workers` will never
#     ever be run behind our backs.
if worker is not None:
    self.DYING_WORKERS.setdefault(pid, worker)

@rwe
Contributor Author

rwe commented Jul 22, 2015

Nope, I'm wrong about that, and I'm not sure how I overlooked it: self.handle_chld is registered directly for SIGCHLD, unlike all the other signals, which go to self.signal, which queues them for handling in the main loop.

So that is a potential problem. Signal handlers are supposed to do very little, so I'm not sure what the rationale is for calling reap_workers() there versus adding the signal to be handled in the main run loop.

My guess is that it was done that way to keep self.WORKERS matching the workers' actual liveness as closely as possible, a behaviour which this PR obsoletes. If so, the right move would be to remove that specialized handling and queue the signal like normal.

Does that make sense?
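
For illustration, a minimal sketch of the queue-then-handle pattern being described here, with hypothetical names (this is not gunicorn's actual handler):

import signal
import time

SIG_QUEUE = []

def queue_signal(sig, frame):
    # The handler does as little as possible: record the signal and return.
    SIG_QUEUE.append(sig)

def reap_workers():
    # Placeholder for the actual reaping (waitpid + bookkeeping), which now
    # runs in the main loop rather than inside the signal handler.
    pass

signal.signal(signal.SIGCHLD, queue_signal)

def main_loop():
    while True:
        while SIG_QUEUE:
            sig = SIG_QUEUE.pop(0)
            if sig == signal.SIGCHLD:
                reap_workers()
        # ... manage workers, handle other queued signals, etc. ...
        time.sleep(1.0)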

@rwe
Contributor Author

rwe commented Jul 22, 2015

It seems that the original code calling reap_workers() directly in a signal handler was likely a mistake to begin with: Calling os.waitpid(), calling self.cfg.worker_exit(), or modifying self.WORKERS in that signal handler were all potentially unsafe. It could also raise HaltServer there, which wouldn't be caught and turned into self.halt in the main loop as intended.

@tilgovi
Collaborator

tilgovi commented Jul 22, 2015

Yes, let's queue the call and then only modify the lists outside the signal frame, and I think this will be wonderful.

@rwe rwe force-pushed the cleanup-async-reaping branch from 4e349e4 to 22662c9 on July 22, 2015 at 16:30
The new `DYING_WORKERS` list means that we don't need to run
`reap_workers` immediately on SIG_CHLD; and in fact, we shouldn't.

The original code calling `reap_workers()` directly in a signal handler
at any time was likely unsafe to begin with: `reap_workers` calls
`os.waitpid()`, calls `self.cfg.worker_exit()`, and modifies
`self.WORKERS`. All of those are potentially unsafe in a signal handler.
It can also raise `HaltServer`, which wouldn't be caught and turned into `self.halt` in the main loop as intended.
@rwe rwe force-pushed the cleanup-async-reaping branch from 22662c9 to 180fa0e on July 22, 2015 at 16:32
After going through git blame, there was no rationale recorded for the
very low limit: it was present in the first commit introducing the
signal queue. Since SIG_CHLD gets queued now, it would be entirely
possible for the previous limit of 5 queued signals to be exceeded
during `manage_workers`.
@rwe rwe force-pushed the cleanup-async-reaping branch from 110654b to 989a423 on July 22, 2015 at 16:42
@benoitc
Owner

benoitc commented Jul 23, 2015

No, please don't. The reason we are handling SIGCHLD differently is portability.

Also, it is handled outside of the queue because you really want to handle it ASAP, not somewhere in the signal queue. The signal queue is here mostly to reduce signal noise, so you don't want it too large, since we put the priority on the monitoring. I would like to keep that behaviour for now.

@erydo yes I was speaking about WORKERS.

There are really good points in this PR that we should keep anyway. In kill_worker and spawn_worker we really should not be removing the worker from WORKERS there. But I think we could do better than juggling between WORKERS and DYING_WORKERS. I would instead just keep a reference in DYING_WORKERS without touching WORKERS, keep the logic of lazy garbage collection in reap_workers, and have another routine in the main loop that would do the garbage collection.

Thoughts?

@rwe
Contributor Author

rwe commented Jul 23, 2015

@benoitc

  1. I'm having trouble understanding the portability element you're referring to in that link. I see that SIGCHLD cannot be set to SIG_IGN, but that isn't done here. Is there another issue I'm not parsing out?
  2. The short signal queue means that signals can be dropped silently, even if they're important, and by default the arbiter handles a bunch of signal types. For example, something sending signals to increment/decrement the worker count could easily drown out HUP/QUIT/INT etc. signals.
  3. I believe there's a semantic distinction between "workers that we expect to be alive" and "workers that we expect to be dying" that was ignored before and that this preserves. Previously, manage_workers, spawn_workers, etc. would include dying workers when verifying whether there were enough active ones. They would also potentially send duplicate SIGTERMs to the same worker over subsequent calls of manage_workers, which could be interleaved with SIGKILL from murder_workers (see the sketch below).
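
A simplified sketch of that last point, assuming the attributes discussed in this thread (WORKERS, DYING_WORKERS, num_workers) and `signal` imported at module level; this approximates the arbiter method, the exact diff differs:

def manage_workers(self):
    # Only workers expected to be alive are counted; anything already moved
    # to DYING_WORKERS is neither counted as active nor signaled again.
    if len(self.WORKERS) < self.num_workers:
        self.spawn_workers()

    workers = sorted(self.WORKERS.items(), key=lambda w: w[1].age)
    while len(workers) > self.num_workers:
        (pid, _) = workers.pop(0)
        self.kill_worker(pid, signal.SIGTERM)  # moves pid into DYING_WORKERS,
                                               # so the next pass won't re-send
                                               # SIGTERM to the same worker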

@benoitc
Owner

benoitc commented Jul 24, 2015

@erydo there are different concerns there, and I would like to separate them.

  1. About handling the reaping of dead workers separately: this is a design choice. Dying workers are likely to happen, and you don't want to handle them only on each pass of the main loop; you want to handle them ASAP to keep the number of workers up.
  2. Queue size: you're right, there is a possibility of discarding some signals there, and we should fix it.
  3. Marking a worker as inactive or dying should indeed be done ASAP. But I don't think that juggling between two structures is OK (the `or` thing and moving from one to the other). Instead we should find a way to mark them as dying more atomically, or simply maintain a counter, since we don't really need to know which one is about to die; the system is doing that for us anyway. Just knowing that the number of active processes has changed should already improve the loop a lot.

About 1 and 2: I reread the code and looked at your changes, and I think we could reconcile both our concerns by implementing a priority queue. All signals would be queued there and we would assign a priority to them; the queue would then be handled in the main loop. As for the waitpid, I would like to keep it the way it is. This is a common pattern, and we can discuss changes to it later. Would that work for you?
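
For illustration, a rough sketch of that kind of prioritized signal queue (purely hypothetical names and priorities, not an actual gunicorn API):

import signal

# Lower number = higher priority; SIGCHLD (worker monitoring) goes first,
# everything else defaults to normal priority.
SIG_PRIORITY = {signal.SIGCHLD: 0}

SIG_QUEUE = []

def queue_signal(sig, frame=None):
    # The signal handler stays tiny: just record the signal.
    SIG_QUEUE.append(sig)

# The arbiter would register this for every signal it cares about, e.g.:
signal.signal(signal.SIGCHLD, queue_signal)

def drain_signals():
    # Called from the main loop: return queued signals highest-priority
    # first, preserving arrival order within the same priority.
    pending = sorted(enumerate(SIG_QUEUE),
                     key=lambda item: (SIG_PRIORITY.get(item[1], 1), item[0]))
    del SIG_QUEUE[:]
    return [sig for _, sig in pending]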

About 3, I would also note that I am working on a patch implementing the IMSG framework from the OpenBSD project to improve communication with the workers and remove the temporary-file trick. I will have a working patch over the weekend, and this is also why I want to keep the changes simple for now, though I agree we are doing too much work to handle race conditions.

@benoitc
Owner

benoitc commented Feb 1, 2017

Closing the issue. I'm uncomfortable with this change, as noted above. Thanks for the patch anyway.

@benoitc benoitc closed this Feb 1, 2017