This repository has been archived by the owner on Jan 30, 2020. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 302
server: fix panic on graceful shutdown #1496
Merged
jonboulle
merged 10 commits into
coreos:master
from
endocode:antrik/fix-shutdown-rebased
Mar 10, 2016
Merged
server: fix panic on graceful shutdown #1496
jonboulle
merged 10 commits into
coreos:master
from
endocode:antrik/fix-shutdown-rebased
Mar 10, 2016
Commits on Mar 10, 2016
-
server: centralise shutdown through Monitor
The Server has a global stop channel which is used both internally (by Monitor) and externally (by the Stop method) to shut down the server. This is bad; invoking the method simultaneously by multiple goroutines is not safe (and as seen in coreos#1044 can cause panics due to a doubly-closed channel). This change centralises the shutdown procedure through the Monitor, so that when an external user of the Server wants to shut it down, it triggers an error to propagate up from the monitor. Hence there is only a single path in which the stopchannel (which terminates all other Server goroutines) can be called.
Configuration menu - View commit details
-
Copy full SHA for ac30f61 - Browse repository at this point
Copy the full SHA ac30f61View commit details -
fleetd: protect access to global Server instance
There are three different paths in the main fleetd goroutine that can access the global `srv` Server - reconfigurations, shutdowns and statedumps. Right now there's nothing preventing racy access to this instance, so introduce a mutex to protect it. One potential issue with this is that it means that a reconfigure or state dump can "block" a shutdown, but IMHO if this occurs it will expose behaviour that is broken and needs to be fixed anyway.
Configuration menu - View commit details
-
Copy full SHA for 2af2610 - Browse repository at this point
Copy the full SHA 2af2610View commit details -
server: introduce shutdown timeout
- add all background server components to a WaitGroup - when shutting down the server, wait on this group or until a timeout (defaulting to one minute) before restarting or exiting. - if timeout occurs, shut down hard and let a - move Monitor into server package - Server.Monitor -> Server.Supervise to remove ambiguity/duplication
Configuration menu - View commit details
-
Copy full SHA for 997bc47 - Browse repository at this point
Copy the full SHA 997bc47View commit details -
*: change kill/stop channels to struct{}
Channels that are just used to "broadcast" messages (e.g. they are only ever closed) do not need a type; it is better to be more explicit about this by using a struct{}. Similarly, the channels can be receive-only.
Configuration menu - View commit details
-
Copy full SHA for bcb9b9c - Browse repository at this point
Copy the full SHA bcb9b9cView commit details -
Configuration menu - View commit details
-
Copy full SHA for e449439 - Browse repository at this point
Copy the full SHA e449439View commit details -
Configuration menu - View commit details
-
Copy full SHA for 59f172a - Browse repository at this point
Copy the full SHA 59f172aView commit details -
To make things a little clearer for ol' man Crawford, rename the "Stop" function to "Kill" to align better with the channel names and be a little more explicit that it is invoked in response to a kill signal.
Configuration menu - View commit details
-
Copy full SHA for db39286 - Browse repository at this point
Copy the full SHA db39286View commit details -
Add testcase for panic on shutdown
Before the monitor/shutdown handling rework, fleetd would panic with a channel double-close if it was shut down while a restart (triggered by the monitor on a failed health check) was is progress. Added a functional test for this situation -- and also for regular shutdown as a baseline.
Configuration menu - View commit details
-
Copy full SHA for b358ee4 - Browse repository at this point
Copy the full SHA b358ee4View commit details -
Functional tests: Reduce fleet agent timeout
The default agent_ttl setting of 30s makes TestMonitorVsShutdown run pretty long. (And probably also other tests related to connectivity loss, that will be added in the future...) With no actual network delays involved, making the timeout much shorter shouldn't cause any trouble -- so we can simply set it globally, rather than trying to override it only for the specific test in question.
Configuration menu - View commit details
-
Copy full SHA for b43353d - Browse repository at this point
Copy the full SHA b43353dView commit details -
Cleanup: Restore usage of NewMonitor() instead of inlining the code
The function moved to a different package, but otherwise can be used as before -- I presume inlining it was some kind of accident...
Configuration menu - View commit details
-
Copy full SHA for 40691b5 - Browse repository at this point
Copy the full SHA 40691b5View commit details
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.