Conversation
The Server has a global stop channel which is used both internally (by Monitor) and externally (by the Stop method) to shut down the server. This is bad: invoking the method from multiple goroutines simultaneously is not safe (as seen in coreos#1044, it can cause a panic due to a doubly-closed channel). This change centralises the shutdown procedure through the Monitor, so that when an external user of the Server wants to shut it down, an error is propagated up from the monitor. There is then only a single path on which the stop channel (which terminates all other Server goroutines) can be closed.
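A minimal sketch of the single-close pattern described above (not fleet's actual API -- the names `killc`, `stopc` and the `healthc` parameter are illustrative): external callers only signal the supervisor, and the supervisor is the only goroutine that ever closes the stop channel.

```go
package sketch

import (
	"errors"
	"sync"
)

type Server struct {
	killc chan struct{} // signals the supervisor that an external kill was requested
	stopc chan struct{} // closed exactly once, by the supervisor, to stop all components
	once  sync.Once
}

func NewServer() *Server {
	return &Server{
		killc: make(chan struct{}),
		stopc: make(chan struct{}),
	}
}

// Kill may be called from any number of goroutines; it never touches stopc directly.
func (s *Server) Kill() {
	s.once.Do(func() { close(s.killc) })
}

// Supervise is the single place where stopc is closed.
func (s *Server) Supervise(healthc <-chan error) error {
	var err error
	select {
	case <-s.killc:
		err = errors.New("server was killed")
	case err = <-healthc:
		// the monitor reported a failed health check
	}
	close(s.stopc) // the only close, on the only path
	return err
}
```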
There are three different paths in the main fleetd goroutine that can access the global `srv` Server -- reconfigurations, shutdowns and state dumps. Right now there's nothing preventing racy access to this instance, so introduce a mutex to protect it. One potential issue is that a reconfigure or state dump can then "block" a shutdown, but IMHO if this occurs it exposes behaviour that is broken and needs to be fixed anyway.
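Roughly the kind of locking meant here, with illustrative names only (`Srv` stands in for fleet's Server type): all three paths take the same mutex before touching the shared instance.

```go
package sketch

import "sync"

// Srv stands in for fleet's Server type in this sketch.
type Srv struct{ running bool }

var (
	srvMu sync.Mutex
	srv   *Srv
)

// reconfigure replaces the shared instance under the lock.
func reconfigure(newSrv *Srv) {
	srvMu.Lock()
	defer srvMu.Unlock()
	srv = newSrv
}

// dumpState reads the shared instance under the same lock.
func dumpState() bool {
	srvMu.Lock()
	defer srvMu.Unlock()
	return srv != nil && srv.running
}

// shutdown also goes through the lock, so it cannot race with the paths above.
func shutdown() {
	srvMu.Lock()
	defer srvMu.Unlock()
	if srv != nil {
		srv.running = false
	}
}
```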
- add all background server components to a WaitGroup
- when shutting down the server, wait on this group or until a timeout (defaulting to one minute) before restarting or exiting (see the sketch below)
- if timeout occurs, shut down hard and let a
- move Monitor into server package
- Server.Monitor -> Server.Supervise to remove ambiguity/duplication
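An illustrative-only sketch of "wait on this group or until a timeout"; the real code lives in fleet's server package, and the helper name below is made up:

```go
package sketch

import (
	"sync"
	"time"
)

// waitWithTimeout returns true if wg finished before the timeout elapsed,
// false if the deadline was hit and the caller should shut down hard.
func waitWithTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}
```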
Channels that are just used to "broadcast" messages (e.g. they are only ever closed) never carry any data, so it is better to be explicit about this by using struct{} as the element type. Similarly, such channels can be receive-only on the consumer side.
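A small illustration of that idiom (generic Go, not fleet code): the broadcast is the close itself, and handing consumers a receive-only view keeps them from sending on or closing the channel.

```go
package sketch

// worker only ever receives; the <-chan type makes that explicit.
func worker(stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return // stop was closed: every receiver unblocks at once
		default:
			// ... do a unit of work; real code would block on a work queue ...
		}
	}
}

func run() {
	stop := make(chan struct{})
	go worker(stop)
	// closing the channel is the broadcast; no value is ever sent
	close(stop)
}
```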
To make things a little clearer for ol' man Crawford, rename the "Stop" function to "Kill" to align better with the channel names and be a little more explicit that it is invoked in response to a kill signal.
Meh, the new test seems to fail intermittently now and then :-(
How do they fail?
@jonboulle in order to set the stage for triggering the condition we want to test, we first need to cut the network connection to etcd and wait for the "Monitor triggered" message to show up in the journal. Sometimes this just waits forever. I don't know yet why this happens: whether we just miss the message for some reason, or whether fleet sometimes behaves differently and the message is indeed not generated at all...
I should probably note that this seems to happen occasionally both with the original and with the reworked code.
Before the monitor/shutdown handling rework, fleetd would panic with a channel double-close if it was shut down while a restart (triggered by the monitor after a failed health check) was in progress. Added a functional test for this situation -- and also for regular shutdown as a baseline.
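For context, the underlying failure mode is Go's runtime panic on a second close of the same channel; a trivial stand-alone illustration (not fleet code):

```go
package sketch

// doubleClose reproduces the kind of crash the test guards against:
// two shutdown paths closing the same channel.
func doubleClose() {
	stop := make(chan struct{})
	close(stop)
	close(stop) // panic: close of closed channel
}
```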
The default agent_ttl setting of 30s makes TestMonitorVsShutdown run pretty long (and probably also other tests related to connectivity loss that will be added in the future). With no actual network delays involved, making the timeout much shorter shouldn't cause any trouble -- so we can simply set it globally, rather than trying to override it only for the specific test in question.
Force-pushed from 1cd11d2 to 671d9bd
OK, it turns out it was only my method of monitoring the journal that was unreliable -- with a different approach it appears to work better. While I was at it, I added two extra commits to this PR that are not strictly necessary, but related: one reduces the agent_ttl for fleet to speed up the new test; the other is a minor cleanup to the fleet code changed in this PR, which I stumbled upon while working on this stuff.
@antrik could you please re-push again so Semaphore runs again?
Force-pushed from 671d9bd to 92fc7ed
The function moved to a different package, but otherwise can be used as before -- I presume inlining it was some kind of accident...
Force-pushed from 92fc7ed to e9b6108
I guess Semaphore hates this PR :-( Let's try a new one...
OK, this one was closed in favor of #1496
Rebased from #1067, and added a test case to verify the fix.
Should fix #1369