This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

panic due to double-close of channel #1067

Closed
wants to merge 8 commits into from

Conversation

jonboulle
Contributor

I'm not sure what's going on here, but clearly fleet should never try to close an already-closed channel. From #1044:

Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: ERROR server.go:169: Server monitor triggered: Monitor timed out before successful heartbeat
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: panic: runtime error: close of closed channel
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: goroutine 11048 [running]:
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: runtime.panic(0x77e400, 0x9bf155)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/runtime/panic.c:279 +0xf5
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/server.(*Server).Monitor(0xc208455560)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/server/server.go:171 +0xfb
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: created by github.com/coreos/fleet/server.(*Server).Run
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/server/server.go:152 +0x10e
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: goroutine 16 [chan receive]:
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: main.listenForSignals(0xc2080a5650)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/fleetd/fleet.go:189 +0x16d
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: main.main()
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/fleetd/fleet.go:121 +0xf2b
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: goroutine 19 [finalizer wait]:
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: runtime.park(0x416b60, 0x9c3dd0, 0x9c2029)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/runtime/proc.c:1369 +0x89
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: runtime.parkunlock(0x9c3dd0, 0x9c2029)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/runtime/proc.c:1385 +0x3b
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: runfinq()
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/runtime/mgc0.c:2644 +0xcf
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: runtime.goexit()
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/runtime/proc.c:1445
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: goroutine 20 [syscall]:
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: os/signal.loop()
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/os/signal/signal_unix.go:21 +0x1e
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: created by os/signal.init·1
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/os/signal/signal_unix.go:27 +0x32
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: goroutine 21 [IO wait, 17 minutes]:
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.runtime_pollWait(0x7fef8134ac10, 0x72, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/runtime/netpoll.goc:146 +0x66
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*pollDesc).Wait(0xc208037f00, 0x72, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*pollDesc).WaitRead(0xc208037f00, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*netFD).readMsg(0xc208037ea0, 0xc2083a0570, 0x10, 0x10, 0xc20809f220, 0x1000, 0x1000, 0xffffffffffffffff, 0x0, 0x0, ...)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/fd_unix.go:296 +0x47f
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*UnixConn).ReadMsgUnix(0xc2080480c0, 0xc2083a0570, 0x10, 0x10, 0xc20809f220, 0x1000, 0x1000, 0xc208234000, 0xb97a, 0xb97a, ...)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/unixsock_posix.go:154 +0x16c
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*oobReader).Read(0xc20809f200, 0xc2083a0570, 0x10, 0x10, 0x1, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/transport_unix.go:21 +0xc9
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: io.ReadAtLeast(0x7fef8134adf8, 0xc20809f200, 0xc2083a0570, 0x10, 0x10, 0x10, 0x0, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/io/io.go:289 +0xf7
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: io.ReadFull(0x7fef8134adf8, 0xc20809f200, 0xc2083a0570, 0x10, 0x10, 0xb97a, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/io/io.go:307 +0x71
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*unixTransport).ReadMessage(0xc208001730, 0xc20800f470, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/transport_unix.go:85 +0x198
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*Conn).inWorker(0xc208003e60)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/conn.go:241 +0x57
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: created by github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*Conn).Auth
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/auth.go:118 +0xd2a
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: goroutine 22 [chan receive, 17 minutes]:
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*Conn).outWorker(0xc208003e60)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/conn.go:363 +0x54
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: created by github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*Conn).Auth
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/auth.go:119 +0xd45
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: goroutine 23 [IO wait]:
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.runtime_pollWait(0x7fef8134ab60, 0x72, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/runtime/netpoll.goc:146 +0x66
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*pollDesc).Wait(0xc208036ae0, 0x72, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*pollDesc).WaitRead(0xc208036ae0, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*netFD).readMsg(0xc208036a80, 0xc2082bcd00, 0x10, 0x10, 0xc20813f620, 0x1000, 0x1000, 0xffffffffffffffff, 0x0, 0x0, ...)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/fd_unix.go:296 +0x47f
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*UnixConn).ReadMsgUnix(0xc208048018, 0xc2082bcd00, 0x10, 0x10, 0xc20813f620, 0x1000, 0x1000, 0xc2081acb90, 0x49, 0x49, ...)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/unixsock_posix.go:154 +0x16c
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*oobReader).Read(0xc20813f600, 0xc2082bcd00, 0x10, 0x10, 0x1, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/transport_unix.go:21 +0xc9
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: io.ReadAtLeast(0x7fef8134adf8, 0xc20813f600, 0xc2082bcd00, 0x10, 0x10, 0x10, 0x0, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/io/io.go:289 +0xf7
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: io.ReadFull(0x7fef8134adf8, 0xc20813f600, 0xc2082bcd00, 0x10, 0x10, 0x49, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/io/io.go:307 +0x71
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*unixTransport).ReadMessage(0xc2080015f0, 0xc2080a6000, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/transport_unix.go:85 +0x198
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*Conn).inWorker(0xc20808e120)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/conn.go:241 +0x57
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: created by github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*Conn).Auth
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/auth.go:118 +0xd2a
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: goroutine 24 [chan receive, 19 minutes]:
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*Conn).outWorker(0xc20808e120)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/conn.go:363 +0x54
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: created by github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*Conn).Auth
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/auth.go:119 +0xd45
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: goroutine 25 [chan receive]:
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/Godeps/_workspace/src/github.com/coreos/go-systemd/dbus.func·001()
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/coreos/go-systemd/dbus/subscription.go:66 +0x60
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: created by github.com/coreos/fleet/Godeps/_workspace/src/github.com/coreos/go-systemd/dbus.(*Conn).dispatch
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/coreos/go-systemd/dbus/subscription.go:98 +0xc3
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: goroutine 26 [IO wait, 19 minutes]:
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.runtime_pollWait(0x7fef8134aab0, 0x72, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/runtime/netpoll.goc:146 +0x66
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*pollDesc).Wait(0xc208037b80, 0x72, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*pollDesc).WaitRead(0xc208037b80, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*netFD).accept(0xc208037b20, 0x87c188, 0x0, 0x7fef81349440, 0xb)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/fd_unix.go:419 +0x343
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*UnixListener).AcceptUnix(0xc2080b24a0, 0x18, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/unixsock_posix.go:293 +0x73
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*UnixListener).Accept(0xc2080b24a0, 0x0, 0x0, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/unixsock_posix.go:304 +0x4b
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net/http.(*Server).Serve(0xc208004300, 0x7fef8134b2b0, 0xc2080b24a0, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/http/server.go:1698 +0x91
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net/http.Serve(0x7fef8134b2b0, 0xc2080b24a0, 0x7fef8134c5b0, 0xc20804a000, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/http/server.go:1576 +0x7c
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/api.func·001()
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/api/server.go:35 +0x78
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: created by github.com/coreos/fleet/api.(*Server).Serve
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/api/server.go:39 +0xf5
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: goroutine 11408 [select]:
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/agent.(*UnitStatePublisher).Run(0xc2084ed580, 0xc2084543c0, 0xc208454360)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/agent/unit_state.go:105 +0x287
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: created by github.com/coreos/fleet/server.(*Server).Run
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/server/server.go:161 +0x25f
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: goroutine 1463 [IO wait, 15 minutes]:
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.runtime_pollWait(0x7fef81354a28, 0x72, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/runtime/netpoll.goc:146 +0x66
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*pollDesc).Wait(0xc208587950, 0x72, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*pollDesc).WaitRead(0xc208587950, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*netFD).readMsg(0xc2085878f0, 0xc20851aae0, 0x10, 0x10, 0xc20837e020, 0x1000, 0x1000, 0xffffffffffffffff, 0x0, 0x0, ...)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/fd_unix.go:296 +0x47f
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: net.(*UnixConn).ReadMsgUnix(0xc20858c180, 0xc20851aae0, 0x10, 0x10, 0xc20837e020, 0x1000, 0x1000, 0xc2085d6000, 0x1, 0x30000000000b95e, ...)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /usr/lib/go/src/pkg/net/unixsock_posix.go:154 +0x16c
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus.(*oobReader).Read(0xc20837e000, 0xc20851aae0, 0x10, 0x10, 0x1, 0x0, 0x0)
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: /build/amd64-usr/var/tmp/portage/app-admin/fleet-0.8.3/work/fleet-0.8.3/gopath/src/github.com/coreos/fleet/Godeps/_workspace/src/github.com/godbus/dbus/transport_unix.go:21 +0xc9
Dec 02 18:26:05 ip-172-31-24-115.eu-west-1.compute.internal fleetd[570]: io.ReadAtLeast(0x7fef8134adf8, 0xc20837e000, 0xc20851aae0, 0x10, 0x10, 0x10, 0x0, 0x0, 0x0)

...
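
For anyone unfamiliar with this failure mode, here is a minimal sketch of it outside fleet. It is not fleet code: `closeTwice` and the `stop` channel are made-up names, and all it shows is that a second `close` on the same channel raises a runtime panic like the one in the trace above.

```go
package main

import "fmt"

// closeTwice closes the same channel twice and reports whether the second
// close panicked, along with the recovered panic message.
// Hypothetical helper for illustration; not part of fleet.
func closeTwice() (panicked bool, msg string) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
			msg = fmt.Sprint(r)
		}
	}()
	stop := make(chan bool)
	close(stop) // first close succeeds
	close(stop) // second close triggers a runtime panic
	return false, ""
}

func main() {
	panicked, msg := closeTwice()
	fmt.Println("panicked:", panicked, "message:", msg)
}
```

The exact wording of the recovered message may differ between Go releases, so the sketch only relies on the fact that a panic occurs.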

@bcwaldon
Contributor Author

@jonboulle I'm not sure I buy this at a conceptual level. The HeartMonitor does not have sole reign over the Server, it's simply another actor watching for inputs and reacting by asking the Server to stop and start.

@jonboulle
Contributor

@bcwaldon that's what the monitor was, I'm proposing a conceptual change. I can probably tweak some naming to make it a bit easier to swallow. But I am pretty sure this is the right way to go. An external actor saying "stop" is just another signal for it to act on

@bcwaldon
Contributor Author

@jonboulle talked about this in person - just going to address the naming, but conceptually this makes sense

@jonboulle
Contributor

FWIW in Aurora we had a StatusManager which contained any number of StatusCheckers [0], and would shut down the executor when a StatusChecker reported unhealthy. Then KillManager is just another StatusChecker that the StatusManager is monitoring (and in our case, server.Kill() would just trigger the KillManager)

[0] well actually just a single ChainedStatusChecker encapsulating multiple StatusCheckers, but that's just a detail since Python don't know how to goroutine
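
A rough Go sketch of that shape, for illustration only. This is not Aurora's API (Aurora is Python) and not fleet code; `StatusChecker`, `chanChecker`, `waitUnhealthy` and `errKilled` are hypothetical names. The idea is that one place waits on every checker, and a kill request is delivered as just another unhealthy report.

```go
package main

import (
	"errors"
	"fmt"
)

// StatusChecker is a hypothetical interface: Status returns a channel that
// yields an error once the checker considers the process unhealthy.
type StatusChecker interface {
	Status() <-chan error
}

// chanChecker is a trivial checker backed by a buffered channel.
type chanChecker struct{ c chan error }

func newChanChecker() *chanChecker { return &chanChecker{c: make(chan error, 1)} }

func (c *chanChecker) Status() <-chan error { return c.c }

// Fail reports unhealthy; repeated reports are dropped rather than blocking.
func (c *chanChecker) Fail(err error) {
	select {
	case c.c <- err:
	default:
	}
}

// waitUnhealthy blocks until any checker reports, then returns that error.
// It is the only place where a shutdown decision is taken.
func waitUnhealthy(checkers ...StatusChecker) error {
	out := make(chan error, len(checkers))
	for _, ch := range checkers {
		go func(ch StatusChecker) { out <- <-ch.Status() }(ch)
	}
	return <-out
}

var errKilled = errors.New("kill requested")

func main() {
	heartbeat, kill := newChanChecker(), newChanChecker()
	kill.Fail(errKilled) // an external "stop" is just another unhealthy report
	fmt.Println("shutting down:", waitUnhealthy(heartbeat, kill))
}
```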

@bcwaldon
Contributor Author

@jonboulle I'd like to keep the *Checker/Status*/*Manager naming to a minimum, but other than that, just do what you feel is right.

@jonboulle
Contributor

...

@bcwaldon
Contributor Author

@jonboulle any movement here?

@jonboulle
Contributor

Got stuck in a major yak shave. Will try to extricate myself.

@jonboulle
Contributor

@bcwaldon is this getting better or worse?

@bcwaldon
Contributor Author

LGTM

// beats successfully. If the heartbeat check fails for any
// reason, an error is returned. If the supplied channel is
// closed, Monitor returns ErrShutdown.
func (m *Monitor) Monitor(hrt heart.Heart, sdc <-chan bool) error {
Contributor

If the values in sdc don't matter, a channel of struct{} would be more appropriate.

Contributor

He's got a good point.

@crawford
Contributor

LGTM

@bcwaldon
Contributor Author

@jonboulle rebase and merge it

@jonboulle
Contributor

I haven't really tested this to a satisfactory degree yet. For example, ee33bbe (golang is hard)

The Server has a global stop channel which is used both internally (by
Monitor) and externally (by the Stop method) to shut down the server.
This is bad; invoking the method simultaneously by multiple goroutines
is not safe (and as seen in coreos#1044 can cause panics due to a
doubly-closed channel).

This change centralises the shutdown procedure through the Monitor, so
that when an external user of the Server wants to shut it down, it
triggers an error to propagate up from the monitor. Hence there is only
a single path in which the stop channel (which terminates all other
Server goroutines) can be closed.
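
As a sketch of the idea in this commit message (not the actual patch): an external kill only signals the supervisor, and the supervisor is the single place that closes the stop channel. The names `server`, `killc`, `stopc`, `supervise` and `errShutdown` are illustrative, and the `sync.Once` guarding `Kill` is an assumption of this sketch rather than something taken from the change.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errShutdown = errors.New("told to shut down")

// server is a stand-in for the real Server; all names are illustrative.
type server struct {
	killc    chan struct{} // closed by Kill to request a shutdown
	stopc    chan struct{} // closed only by supervise; stops the workers
	killOnce sync.Once     // assumption: makes Kill safe to call repeatedly
}

func newServer() *server {
	return &server{killc: make(chan struct{}), stopc: make(chan struct{})}
}

// Kill never touches stopc; it only asks the supervisor to shut down.
func (s *server) Kill() {
	s.killOnce.Do(func() { close(s.killc) })
}

// supervise waits for a failed health check or a kill request, then closes
// stopc. It must be called once and is the only code path closing stopc.
func (s *server) supervise(health <-chan error) error {
	var err error
	select {
	case err = <-health:
	case <-s.killc:
		err = errShutdown
	}
	close(s.stopc)
	return err
}

func main() {
	s := newServer()
	health := make(chan error, 1)
	health <- errors.New("monitor timed out before successful heartbeat")
	s.Kill() // racing with the failed heartbeat is now harmless
	s.Kill()
	fmt.Println("supervisor returned:", s.supervise(health))
	<-s.stopc // already closed, so this does not block
}
```

In `main` both the heartbeat failure and the kill request are pending, so either error may be reported; what matters is that `stopc` is closed exactly once.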

There are three different paths in the main fleetd goroutine that can
access the global `srv` Server - reconfigurations, shutdowns and
statedumps. Right now there's nothing preventing racy access to this
instance, so introduce a mutex to protect it.

One potential issue with this is that it means that a reconfigure or
state dump can "block" a shutdown, but IMHO if this occurs it will
expose behaviour that is broken and needs to be fixed anyway.

- add all background server components to a WaitGroup
- when shutting down the server, wait on this group or until a timeout
  (defaulting to one minute) before restarting or exiting.
- if timeout occurs, shut down hard and let a
- move Monitor into server package
- Server.Monitor -> Server.Supervise to remove ambiguity/duplication

Channels that are only used to "broadcast" a signal (i.e. they are only
ever closed, never sent on) do not need a meaningful element type; it is
better to be explicit about this by using struct{}.

Similarly, the channels can be receive-only.
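
A small sketch of both points, with made-up names (`waitForStop`, `stop`, `done`): the stop channel carries no data, and the worker only receives a receive-only view of it, so it cannot close or send on it by mistake.

```go
package main

import "fmt"

// waitForStop blocks until stop is closed, then reports its id on done.
// stop is receive-only here, so this function cannot close it.
func waitForStop(stop <-chan struct{}, id int, done chan<- int) {
	<-stop
	done <- id
}

func main() {
	stop := make(chan struct{}) // carries no data; closing it is the message
	done := make(chan int)
	for i := 0; i < 3; i++ {
		go waitForStop(stop, i, done)
	}
	close(stop) // a single close wakes every receiver
	for i := 0; i < 3; i++ {
		fmt.Println("worker stopped:", <-done)
	}
}
```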

To make things a little clearer for ol' man Crawford, rename the "Stop"
function to "Kill" to align better with the channel names and be a
little more explicit that it is invoked in response to a kill signal.

@jonboulle
Contributor

This should be code complete, but requires a rebase and testing. I'd consider it blocked on #1403

@antrik
Contributor

antrik commented Feb 19, 2016

Rebased series (also squashing the fixup commit along the way): https://github.com/endocode/fleet/tree/antrik/fix-shutdown-rebased

(I also looked through the changes. They all look reasonable to me -- but I guess that's not very relevant, considering this has been reviewed before...)

Now that we have functional tests running (no regressions here), is it time to make an updated PR and get it merged? Or shall I spend some time trying to come up with additional unit tests (and possibly functional tests) checking the specific issue this addresses, and/or any code paths that seem most likely to experience regressions?

@jonboulle
Contributor

@antrik if you can think of how to devise a test for this, that would be fantastic. In any case it would be great to have a new PR to move forward with this.

@antrik
Contributor

antrik commented Feb 23, 2016

@jonboulle well, a functional test might be tricky. I believe I understand more or less how the race can happen -- but whether I can find a way to trigger it on purpose, I am not sure. (Plus I don't know whether it is likely enough actually to hit it when running repeatedly in a loop for just a couple of seconds...)

Triggering it with a unit test is probably way easier -- but also less meaningful...

In any case, delving into this might take a couple of days -- so the question is whether you consider that worthwhile? If so, I'll get on it; otherwise, I'll just make a new PR from the rebased branch without any new tests.

@jonboulle
Contributor

@antrik

(Plus I don't know whether it is likely enough actually to hit it when running repeatedly in a loop for just a couple of seconds...)

This seems easy enough to check, maybe worth a quick experiment?

please put up another PR for merging

@antrik
Contributor

antrik commented Feb 25, 2016

So it turns out this is actually pretty easy to reproduce: we just need to make sure that etcd stops responding (which we can do for example by sending it SIGSTOP) -- once the timeout passes and the monitor triggers, fleet will indefinitely hang in a state of limbo (as long as etcd remains unavailable), where initiating a shutdown reliably triggers the crash.

And this patch series indeed fixes the problem :-)

Now I "just" need to find a way to turn this into an automated test -- which might be more tricky, as the tests currently rely on a system-provided etcd we have no control over, rather than launching a private one... Any suggestions?

@kayrus
Contributor

kayrus commented Feb 25, 2016

@antrik it is not a problem to create a test which will use etcd inside the systemd-nspawn container.

@kayrus
Contributor

kayrus commented Feb 26, 2016

looks like it is related to #715

@jonboulle
Contributor

#1496
