Conversation
@jonboulle I'm not sure I buy this at a conceptual level. The HeartMonitor does not have sole reign over the Server, it's simply another actor watching for inputs and reacting by asking the Server to stop and start.
@bcwaldon that's what the monitor was, I'm proposing a conceptual change. I can probably tweak some naming to make it a bit easier to swallow. But I am pretty sure this is the right way to go. An external actor saying "stop" is just another signal for it to act on.
@jonboulle talked about this in person - just going to address the naming, but conceptually this makes sense.
FWIW in Aurora we had a [0] well actually just a single
@jonboulle I'd like to keep the
...
@jonboulle any movement here?
Got stuck in a major yak shave. Will try to extricate myself.
@bcwaldon is this getting better or worse?
LGTM
// beats successfully. If the heartbeat check fails for any
// reason, an error is returned. If the supplied channel is
// closed, Monitor returns ErrShutdown.
func (m *Monitor) Monitor(hrt heart.Heart, sdc <-chan bool) error {
If the values in sdc don't matter, a channel of struct{} would be more appropriate.
He's got a good point.
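A minimal sketch of the suggested signature, assuming the loop shape implied by the doc comment above; the check function and interval are stand-ins, not fleet's actual API:

package example

import (
	"errors"
	"time"
)

var ErrShutdown = errors.New("monitor: shutdown requested")

// monitor treats a close of sdc purely as a signal; struct{} documents
// that the channel carries no data, only the fact that it was closed.
func monitor(check func() error, sdc <-chan struct{}) error {
	ticker := time.NewTicker(time.Second) // interval is illustrative
	defer ticker.Stop()
	for {
		select {
		case <-sdc:
			return ErrShutdown
		case <-ticker.C:
			if err := check(); err != nil {
				return err
			}
		}
	}
}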
LGTM
@jonboulle rebase and merge it
I haven't really tested this to a satisfactory degree yet. For example, ee33bbe (golang is hard)
The Server has a global stop channel which is used both internally (by Monitor) and externally (by the Stop method) to shut down the server. This is bad; invoking the method simultaneously from multiple goroutines is not safe (and, as seen in coreos#1044, can cause panics due to a doubly-closed channel). This change centralises the shutdown procedure through the Monitor, so that when an external user of the Server wants to shut it down, it triggers an error that propagates up from the monitor. Hence there is only a single path in which the stop channel (which terminates all other Server goroutines) can be closed.
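A rough sketch of that single-path idea, with illustrative names rather than fleet's actual types (the health check, the interval, and the Kill/Supervise split as written here are assumptions):

package example

import (
	"errors"
	"time"
)

var errShutdown = errors.New("server: shutdown requested")

type server struct {
	stop  chan struct{} // terminates the other Server goroutines
	killc chan struct{} // closed externally to request shutdown
}

// Supervise is the only place that ever closes stop; both an external
// kill and a failed health check funnel through the same return path.
func (s *server) Supervise(check func() error) error {
	defer close(s.stop)
	for {
		select {
		case <-s.killc:
			return errShutdown
		case <-time.After(5 * time.Second): // interval is illustrative
			if err := check(); err != nil {
				return err
			}
		}
	}
}

// Kill asks Supervise to shut the server down; it never touches stop.
// (A real caller would guard this with sync.Once or invoke it from a
// single place, e.g. a signal handler.)
func (s *server) Kill() { close(s.killc) }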
There are three different paths in the main fleetd goroutine that can access the global `srv` Server - reconfigurations, shutdowns and state dumps. Right now there's nothing preventing racy access to this instance, so introduce a mutex to protect it. One potential issue with this is that it means that a reconfigure or state dump can "block" a shutdown, but IMHO if this occurs it will expose behaviour that is broken and needs to be fixed anyway.
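Roughly the shape being described, with placeholder functions standing in for fleetd's actual reconfigure/dump/shutdown code paths:

package example

import "sync"

type server struct{} // stand-in for fleet's Server

func (s *server) Kill() {}

var (
	srvMutex sync.Mutex
	srv      *server
)

// All three paths that touch the global srv take the same lock, so a
// reconfigure or state dump can no longer race with a shutdown.
func reconfigure(build func() *server) {
	srvMutex.Lock()
	defer srvMutex.Unlock()
	srv.Kill()
	srv = build()
}

func dumpState(write func(*server)) {
	srvMutex.Lock()
	defer srvMutex.Unlock()
	write(srv)
}

func shutdown() {
	srvMutex.Lock()
	defer srvMutex.Unlock()
	srv.Kill()
}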
- add all background server components to a WaitGroup
- when shutting down the server, wait on this group or until a timeout (defaulting to one minute) before restarting or exiting (see the sketch below)
- if timeout occurs, shut down hard and let a
- move Monitor into server package
- Server.Monitor -> Server.Supervise to remove ambiguity/duplication
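A small sketch of the wait-with-timeout step; the one-minute default comes from the description above, everything else is illustrative:

package example

import (
	"sync"
	"time"
)

// waitWithTimeout returns true if every component in wg finished before
// the deadline, and false if the caller should shut down hard instead.
func waitWithTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

Called as waitWithTimeout(&wg, time.Minute) just before restarting or exiting.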
Channels that are just used to "broadcast" messages (e.g. they are only ever closed) do not need a meaningful element type; it is better to be explicit about this by using a struct{}. Similarly, the channels can be receive-only.
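For illustration, closing a chan struct{} wakes every receiver at once, and handing it out as a receive-only <-chan struct{} keeps consumers from closing or sending on it (the worker count and output here are made up):

package main

import (
	"fmt"
	"sync"
)

func main() {
	stop := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int, stop <-chan struct{}) {
			defer wg.Done()
			<-stop // unblocks when the channel is closed
			fmt.Println("worker", i, "stopping")
		}(i, stop)
	}
	close(stop) // one close broadcasts to all three workers
	wg.Wait()
}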
To make things a little clearer for ol' man Crawford, rename the "Stop" function to "Kill" to align better with the channel names and be a little more explicit that it is invoked in response to a kill signal.
This should be code complete, but requires a rebase and testing. I'd consider it blocked on #1403.
Rebased series (also squashing the fixup commit along the way): https://github.com/endocode/fleet/tree/antrik/fix-shutdown-rebased (I also looked through the changes. They all look reasonable to me -- but I guess that's not very relevant, considering this has been reviewed before...)
Now that we have functional tests running (no regressions here), is it time to make an updated PR and get it merged? Or shall I spend some time trying to come up with additional unit tests (and possibly functional tests) checking the specific issue this addresses, and/or any code paths that seem most likely to experience regressions?
@antrik if you can think of how to devise a test for this, that would be fantastic. In any case it would be great to have a new PR to move forward with this.
@jonboulle well, a functional test might be tricky. I believe I understand more or less how the race can happen -- but whether I can find a way to trigger it on purpose, I am not sure. (Plus I don't know whether it is likely enough to actually hit it when running repeatedly in a loop for just a couple of seconds...) Triggering it with a unit test is probably way easier -- but also less meaningful... In any case, delving into this might take a couple of days -- so the question is whether you consider that worthwhile? If so, I'll get on it; otherwise, I'll just make a new PR from the rebased branch without any new tests.
This seems easy enough to check, maybe worth a quick experiment? Please put up another PR for merging.
So it turns out this is actually pretty easy to reproduce: we just need to make sure that etcd stops responding (which we can do for example by sending it SIGSTOP) -- once the timeout passes and the monitor triggers, fleet will indefinitely hang in a state of limbo (as long as etcd remains unavailable), where initiating a shutdown reliably triggers the crash. And this patch series indeed fixes the problem :-)
Now I "just" need to find a way to turn this into an automated test -- which might be more tricky, as the tests currently rely on a system-provided etcd we have no control over, rather than launching a private one... Any suggestions?
@antrik it is not a problem to create a test which will use etcd inside the systemd-nspawn container.
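If such a test gets written, the reproduction above boils down to pausing the container's etcd long enough for fleet's heartbeat to fail; a hypothetical helper (not part of the existing test harness, and Unix-only) might look like this:

package example

import (
	"syscall"
	"time"
)

// pauseProcess freezes a process (e.g. the container's etcd) with
// SIGSTOP for the given duration, then resumes it with SIGCONT.
func pauseProcess(pid int, d time.Duration) error {
	if err := syscall.Kill(pid, syscall.SIGSTOP); err != nil {
		return err
	}
	time.Sleep(d)
	return syscall.Kill(pid, syscall.SIGCONT)
}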
looks like it is related to #715
From #1044: I'm not sure what's going on here, but clearly fleet should never try to close an already-closed channel.
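As a standalone illustration of the failure mode (not the trace from #1044): closing a channel twice panics, and a guard such as sync.Once makes the close safe to reach from more than one path.

package main

import (
	"fmt"
	"sync"
)

func main() {
	c := make(chan struct{})
	var once sync.Once
	stop := func() { once.Do(func() { close(c) }) }

	stop()
	stop() // harmless: the second call is a no-op

	defer func() {
		// without the guard, a second close panics with
		// "close of closed channel"
		fmt.Println("recovered:", recover())
	}()
	close(c)
}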