Skip to content
This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

server: fix panic on graceful shutdown #1496

Merged
merged 10 commits into from
Mar 10, 2016

Commits on Mar 10, 2016

  1. server: centralise shutdown through Monitor

    The Server has a global stop channel which is used both internally (by
    Monitor) and externally (by the Stop method) to shut down the server.
    This is bad; invoking the method simultaneously by multiple goroutines
    is not safe (and as seen in coreos#1044 can cause panics due to a
    doubly-closed channel).
    
    This change centralises the shutdown procedure through the Monitor, so
    that when an external user of the Server wants to shut it down, it
    triggers an error to propagate up from the monitor. Hence there is only
    a single path in which the stopchannel (which terminates all other
    Server goroutines) can be called.
    jonboulle authored and antrik committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    ac30f61 View commit details
    Browse the repository at this point in the history
  2. fleetd: protect access to global Server instance

    There are three different paths in the main fleetd goroutine that can
    access the global `srv` Server - reconfigurations, shutdowns and
    statedumps. Right now there's nothing preventing racy access to this
    instance, so introduce a mutex to protect it.
    
    One potential issue with this is that it means that a reconfigure or
    state dump can "block" a shutdown, but IMHO if this occurs it will
    expose behaviour that is broken and needs to be fixed anyway.
    jonboulle authored and antrik committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    2af2610 View commit details
    Browse the repository at this point in the history
  3. server: introduce shutdown timeout

    - add all background server components to a WaitGroup
    - when shutting down the server, wait on this group or until a timeout
      (defaulting to one minute) before restarting or exiting.
    - if timeout occurs, shut down hard and let a
    - move Monitor into server package
    - Server.Monitor -> Server.Supervise to remove ambiguity/duplication
    jonboulle authored and antrik committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    997bc47 View commit details
    Browse the repository at this point in the history
  4. *: change kill/stop channels to struct{}

    Channels that are just used to "broadcast" messages (e.g. they are only
    ever closed) do not need a type; it is better to be more explicit about
    this by using a struct{}.
    
    Similarly, the channels can be receive-only.
    jonboulle authored and antrik committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    bcb9b9c View commit details
    Browse the repository at this point in the history
  5. server: wg.Done -> wg.Wait (doh)

    jonboulle authored and antrik committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    e449439 View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    59f172a View commit details
    Browse the repository at this point in the history
  7. server: Stop -> Kill

    To make things a little clearer for ol' man Crawford, rename the "Stop"
    function to "Kill" to align better with the channel names and be a
    little more explicit that it is invoked in response to a kill signal.
    jonboulle authored and antrik committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    db39286 View commit details
    Browse the repository at this point in the history
  8. Add testcase for panic on shutdown

    Before the monitor/shutdown handling rework, fleetd would panic with a
    channel double-close if it was shut down while a restart (triggered by
    the monitor on a failed health check) was is progress.
    
    Added a functional test for this situation -- and also for regular
    shutdown as a baseline.
    antrik committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    b358ee4 View commit details
    Browse the repository at this point in the history
  9. Functional tests: Reduce fleet agent timeout

    The default agent_ttl setting of 30s makes TestMonitorVsShutdown run
    pretty long. (And probably also other tests related to connectivity
    loss, that will be added in the future...)
    
    With no actual network delays involved, making the timeout much shorter
    shouldn't cause any trouble -- so we can simply set it globally, rather
    than trying to override it only for the specific test in question.
    antrik committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    b43353d View commit details
    Browse the repository at this point in the history
  10. Cleanup: Restore usage of NewMonitor() instead of inlining the code

    The function moved to a different package, but otherwise can be used as
    before -- I presume inlining it was some kind of accident...
    antrik committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    40691b5 View commit details
    Browse the repository at this point in the history