Conversation
@jonboulle I'm not sure I buy this at a conceptual level. The HeartMonitor does not have sole reign over the Server, it's simply another actor watching for inputs and reacting by asking the Server to stop and start.
@bcwaldon that's what the monitor was, I'm proposing a conceptual change. I can probably tweak some naming to make it a bit easier to swallow. But I am pretty sure this is the right way to go. An external actor saying "stop" is just another signal for it to act on.
@jonboulle talked about this in person - just going to address the naming, but conceptually this makes sense.
FWIW in Aurora we had a [0] well actually just a single
@jonboulle I'd like to keep the
...
@jonboulle any movement here?
Got stuck in a major yak shave. Will try to extricate myself.
@bcwaldon is this getting better or worse?
LGTM
// beats successfully. If the heartbeat check fails for any
// reason, an error is returned. If the supplied channel is
// closed, Monitor returns ErrShutdown.
func (m *Monitor) Monitor(hrt heart.Heart, sdc <-chan bool) error {
If the values in sdc don't matter, a channel of struct{} would be more appropriate.
He's got a good point.
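A minimal sketch of the suggested signature, assuming the loop shape implied by the doc comment above; the check function and interval are stand-ins, not fleet's actual API:

package example

import (
	"errors"
	"time"
)

var ErrShutdown = errors.New("monitor: shutdown requested")

// monitor treats a close of sdc purely as a signal; struct{} documents
// that the channel carries no data, only the fact that it was closed.
func monitor(check func() error, sdc <-chan struct{}) error {
	ticker := time.NewTicker(time.Second) // interval is illustrative
	defer ticker.Stop()
	for {
		select {
		case <-sdc:
			return ErrShutdown
		case <-ticker.C:
			if err := check(); err != nil {
				return err
			}
		}
	}
}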
LGTM
@jonboulle rebase and merge it
I haven't really tested this to a satisfactory degree yet. For example, ee33bbe (golang is hard)
The Server has a global stop channel which is used both internally (by Monitor) and externally (by the Stop method) to shut down the server. This is bad; invoking the method simultaneously from multiple goroutines is not safe (and, as seen in coreos#1044, can cause panics due to a doubly-closed channel). This change centralises the shutdown procedure through the Monitor, so that when an external user of the Server wants to shut it down, it triggers an error that propagates up from the monitor. Hence there is only a single path in which the stop channel (which terminates all other Server goroutines) can be closed.
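A rough sketch of that single-path idea, with illustrative names rather than fleet's actual types (the health check, the interval, and the Kill/Supervise split as written here are assumptions):

package example

import (
	"errors"
	"time"
)

var errShutdown = errors.New("server: shutdown requested")

type server struct {
	stop  chan struct{} // terminates the other Server goroutines
	killc chan struct{} // closed externally to request shutdown
}

// Supervise is the only place that ever closes stop; both an external
// kill and a failed health check funnel through the same return path.
func (s *server) Supervise(check func() error) error {
	defer close(s.stop)
	for {
		select {
		case <-s.killc:
			return errShutdown
		case <-time.After(5 * time.Second): // interval is illustrative
			if err := check(); err != nil {
				return err
			}
		}
	}
}

// Kill asks Supervise to shut the server down; it never touches stop.
// (A real caller would guard this with sync.Once or invoke it from a
// single place, e.g. a signal handler.)
func (s *server) Kill() { close(s.killc) }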
There are three different paths in the main fleetd goroutine that can access the global `srv` Server - reconfigurations, shutdowns and state dumps. Right now there's nothing preventing racy access to this instance, so introduce a mutex to protect it. One potential issue with this is that it means that a reconfigure or state dump can "block" a shutdown, but IMHO if this occurs it will expose behaviour that is broken and needs to be fixed anyway.
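Roughly the shape being described, with placeholder functions standing in for fleetd's actual reconfigure/dump/shutdown code paths:

package example

import "sync"

type server struct{} // stand-in for fleet's Server

func (s *server) Kill() {}

var (
	srvMutex sync.Mutex
	srv      *server
)

// All three paths that touch the global srv take the same lock, so a
// reconfigure or state dump can no longer race with a shutdown.
func reconfigure(build func() *server) {
	srvMutex.Lock()
	defer srvMutex.Unlock()
	srv.Kill()
	srv = build()
}

func dumpState(write func(*server)) {
	srvMutex.Lock()
	defer srvMutex.Unlock()
	write(srv)
}

func shutdown() {
	srvMutex.Lock()
	defer srvMutex.Unlock()
	srv.Kill()
}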
- add all background server components to a WaitGroup
- when shutting down the server, wait on this group or until a timeout (defaulting to one minute) before restarting or exiting (see the sketch below)
- if timeout occurs, shut down hard and let a
- move Monitor into server package
- Server.Monitor -> Server.Supervise to remove ambiguity/duplication
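A small sketch of the wait-with-timeout step; the one-minute default comes from the description above, everything else is illustrative:

package example

import (
	"sync"
	"time"
)

// waitWithTimeout returns true if every component in wg finished before
// the deadline, and false if the caller should shut down hard instead.
func waitWithTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

Called as waitWithTimeout(&wg, time.Minute) just before restarting or exiting.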
Channels that are just used to "broadcast" messages (e.g. they are only ever closed) do not need a meaningful element type; it is better to be explicit about this by using a struct{}. Similarly, the channels can be receive-only.
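For illustration, closing a chan struct{} wakes every receiver at once, and handing it out as a receive-only <-chan struct{} keeps consumers from closing or sending on it (the worker count and output here are made up):

package main

import (
	"fmt"
	"sync"
)

func main() {
	stop := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int, stop <-chan struct{}) {
			defer wg.Done()
			<-stop // unblocks when the channel is closed
			fmt.Println("worker", i, "stopping")
		}(i, stop)
	}
	close(stop) // one close broadcasts to all three workers
	wg.Wait()
}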
To make things a little clearer for ol' man Crawford, rename the "Stop" function to "Kill" to align better with the channel names and be a little more explicit that it is invoked in response to a kill signal.
This should be code complete, but requires a rebase and testing. I'd consider it blocked on #1403.
Rebased series (also squashing the fixup commit along the way): https://github.com/endocode/fleet/tree/antrik/fix-shutdown-rebased (I also looked through the changes. They all look reasonable to me -- but I guess that's not very relevant, considering this has been reviewed before...)
Now that we have functional tests running (no regressions here), is it time to make an updated PR and get it merged? Or shall I spend some time trying to come up with additional unit tests (and possibly functional tests) checking the specific issue this addresses, and/or any code paths that seem most likely to experience regressions?
@antrik if you can think of how to devise a test for this, that would be fantastic. In any case it would be great to have a new PR to move forward with this.
@jonboulle well, a functional test might be tricky. I believe I understand more or less how the race can happen -- but whether I can find a way to trigger it on purpose, I am not sure. (Plus I don't know whether it is likely enough to actually hit it when running repeatedly in a loop for just a couple of seconds...) Triggering it with a unit test is probably way easier -- but also less meaningful... In any case, delving into this might take a couple of days -- so the question is whether you consider that worthwhile? If so, I'll get on it; otherwise, I'll just make a new PR from the rebased branch without any new tests.
This seems easy enough to check, maybe worth a quick experiment? Please put up another PR for merging.
So it turns out this is actually pretty easy to reproduce: we just need to make sure that etcd stops responding (which we can do for example by sending it SIGSTOP) -- once the timeout passes and the monitor triggers, fleet will indefinitely hang in a state of limbo (as long as etcd remains unavailable), where initiating a shutdown reliably triggers the crash. And this patch series indeed fixes the problem :-)
Now I "just" need to find a way to turn this into an automated test -- which might be more tricky, as the tests currently rely on a system-provided etcd we have no control over, rather than launching a private one... Any suggestions?
@antrik it is not a problem to create a test which will use etcd inside the systemd-nspawn container.
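If such a test gets written, the reproduction above boils down to pausing the container's etcd long enough for fleet's heartbeat to fail; a hypothetical helper (not part of the existing test harness, and Unix-only) might look like this:

package example

import (
	"syscall"
	"time"
)

// pauseProcess freezes a process (e.g. the container's etcd) with
// SIGSTOP for the given duration, then resumes it with SIGCONT.
func pauseProcess(pid int, d time.Duration) error {
	if err := syscall.Kill(pid, syscall.SIGSTOP); err != nil {
		return err
	}
	time.Sleep(d)
	return syscall.Kill(pid, syscall.SIGCONT)
}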
looks like it is related to #715
From #1044: I'm not sure what's going on here, but clearly fleet should never try to close an already-closed channel.
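As a standalone illustration of the failure mode (not the trace from #1044): closing a channel twice panics, and a guard such as sync.Once makes the close safe to reach from more than one path.

package main

import (
	"fmt"
	"sync"
)

func main() {
	c := make(chan struct{})
	var once sync.Once
	stop := func() { once.Do(func() { close(c) }) }

	stop()
	stop() // harmless: the second call is a no-op

	defer func() {
		// without the guard, a second close panics with
		// "close of closed channel"
		fmt.Println("recovered:", recover())
	}()
	close(c)
}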