internal/poll: deadlock in Read on arm64 when an FD is closed #45211
I think this is a
One goroutine is blocked in
The other goroutine is blocked in
Maybe unrelated, but we're actively debugging an issue (with Go 1.16.2) where some users get into a state where it seems
There's no evidence (panics in logs) that we hit
@bradfitz what architecture?
@neild What is supposed to happen there is that
Looking at the netpoll code, it does seem possible that
@bradfitz That seems like a different problem to me. Any idea what is holding the write lock? Do we even need to hold a write lock during
@rsc, at least
@ianlancetaylor, it's not obvious from the stack traces. Nothing stuck in a system call doing a write, for instance. I can email it to you privately.
@ianlancetaylor I think the
@neild Thanks. Now I wonder whether
@bradfitz Sure, I can take a look though I don't suppose that I'll see anything. But something must be holding the lock, somehow.
@bradfitz sent a stack trace offline. The stack trace showed a goroutine holding the lock that all the other goroutines are waiting for:
This means that the
Normally
How is this UDP socket created? Is it remotely plausible that the ephemeral ports were exhausted on this system?
I just recalled from the distant past (#18541) that Little Snitch running on macOS can cause EAGAIN in unexpected places. It is possible that the user those stack traces are from is running Little Snitch (or some other network filter extension). Possibly related, I just encountered some runtime poller failures on a new M1, and I also run Little Snitch. I will write them up for you tomorrow, or you can ask Brad for details if you want them sooner.
Also #18751. And FWIW,
@ianlancetaylor, forgot to tell you in the email: those stacks were from macOS, not Linux. (not super obvious, but some have e.g. The UDP socket is created from: ... which is effectively
That's quite likely actually. |
FWIW, the
@bradfitz @josharian If the UDP socket is being created using
Unfortunately I'm not sure how to verify that that is the problem, other than observing that I don't see how it could be anything else. The goroutine is clearly blocking, waiting for
Even more unfortunately, I'm not sure how to fix this. We could reduce the severity by changing
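For readers less familiar with internal/poll, here is a rough, self-contained sketch of the read-loop shape being described; the FD and pollWaiter types below are stand-ins, not the actual internal/poll source. The point is that a non-blocking read that gets EAGAIN parks in the runtime poller and retries, so a descriptor that keeps producing spurious readiness plus EAGAIN keeps the goroutine waiting in the poller.

```go
package main

import "syscall"

// pollWaiter stands in for the runtime poller hook; the real version parks
// the goroutine until the descriptor is ready, closed, or past its deadline.
type pollWaiter struct{}

func (pd *pollWaiter) waitRead() error { return nil }

// FD is a stand-in for internal/poll.FD, reduced to what the loop needs.
type FD struct {
	Sysfd int
	pd    pollWaiter
}

// Read shows the loop shape: a non-blocking read that gets EAGAIN waits on
// the poller and retries once readiness is reported.
func (fd *FD) Read(p []byte) (int, error) {
	for {
		n, err := syscall.Read(fd.Sysfd, p)
		if err != syscall.EAGAIN {
			if n < 0 {
				n = 0
			}
			return n, err
		}
		// Nothing to read yet: park in the poller until it reports
		// readiness, a deadline, or a close, then try the read again.
		// A filter that produces spurious readiness plus EAGAIN just
		// sends the goroutine around this loop again.
		if err := fd.pd.waitRead(); err != nil {
			return 0, err
		}
	}
}

func main() {}
```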
By the way I think we're pretty clearly tracking two independent problems in this one issue. |
This issue is not only regularly occurring on the builders, but also highly reproducible. To my mind, that makes it a release blocker via #11811. |
We are only seeing a problem on arm64. The problem seems to occur when a call to
Schematically, the code looks like:
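The schematic snippet did not survive above; what follows is a rough reconstruction of the pattern under discussion, written as a simplified model against sync/atomic. The names (closing, rg, pdWait) mirror runtime/netpoll.go, but this is not the runtime source, and it will not reliably reproduce the failure; it only shows the shape of the race.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

const pdWait uintptr = 2 // "a reader is about to park"

// pollDesc models only the two fields that matter here.
type pollDesc struct {
	closing bool    // written with a plain, non-atomic store by the closer
	rg      uintptr // 0 (idle), pdWait, or a parked goroutine in the real code
}

// closeFD models runtime_pollUnblock: mark the descriptor closing, perform a
// "barrier" by atomically storing to an unrelated local, then look in rg for
// a waiter to wake. If rg is still 0, it wakes nobody.
func closeFD(pd *pollDesc) (wokeWaiter bool) {
	pd.closing = true
	var g unsafe.Pointer
	atomic.StorePointer(&g, nil) // stands in for atomic.StorepNoWB(&rg, nil): a store to an address unrelated to pd
	if pd.rg == 0 {
		return false // no reader recorded yet; nothing to wake
	}
	return atomic.SwapUintptr(&pd.rg, 0) == pdWait
}

// waitRead models netpollblock + netpollcheckerr: announce intent to park by
// CASing rg, then re-check closing before committing to the park.
func waitRead(pd *pollDesc) (parked bool) {
	if !atomic.CompareAndSwapUintptr(&pd.rg, 0, pdWait) {
		return false
	}
	if pd.closing { // plain load; nothing orders it after the closer's plain store
		return false
	}
	return true // the real code parks here, waiting for a wakeup through rg
}

func main() {
	pd := &pollDesc{}
	wokeCh := make(chan bool, 1)
	parkedCh := make(chan bool, 1)
	go func() { wokeCh <- closeFD(pd) }()
	go func() { parkedCh <- waitRead(pd) }()
	woke, parked := <-wokeCh, <-parkedCh
	// The hang corresponds to parked == true && woke == false: the reader
	// committed to parking, but the closer saw rg == 0 and woke nobody.
	// That interleaving depends on weak hardware ordering and is not
	// something this model will reliably reproduce.
	fmt.Println("woke a waiter:", woke, "reader parked:", parked)
}
```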
For this particular issue, I think the key point is: will the call to
On arm64 the atomic operations are implemented as follows:
For example, consider this sequence:
If that sequence can happen, the program can hang. The synchronization point here is
The
It's clear that the
In the C++ memory model, if
But
By the way, the code contains:
atomic.StorepNoWB(noescape(unsafe.Pointer(&rg)), nil) // full memory barrier between store to closing and read of rg/wg in netpollunblock
So it's clear that we expect a full memory barrier at that point, which is indeed what is required to make this work. So, does the
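For reference, the question being asked here has the classic store-buffering shape: each side stores to one location and then loads the other, and only a full (StoreLoad) barrier between the two rules out both loads observing stale zeros. Below is a deliberately racy, purely illustrative rendering in Go, with the correspondence to the netpoll fields noted in comments.

```go
package main

import "fmt"

// x/r2 play the role of pd.closing and y/r1 the role of pd.rg: the closer
// stores closing and then loads rg; the blocker stores rg (via its CAS) and
// then loads closing. The plain variables make this a data race on purpose.
var x, y, r1, r2 int

func closerLike() {
	x = 1
	// A full (StoreLoad) barrier is needed here for correctness.
	r1 = y
}

func blockerLike() {
	y = 1
	// And here.
	r2 = x
}

func main() {
	done := make(chan struct{}, 2)
	go func() { closerLike(); done <- struct{}{} }()
	go func() { blockerLike(); done <- struct{}{} }()
	<-done
	<-done
	// Without full barriers on both sides, arm64 permits r1 == 0 && r2 == 0:
	// each goroutine misses the other's store, which is exactly the missed
	// wakeup that leaves the reader parked forever.
	fmt.Println(r1, r2)
}
```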
Long-standing issue, okay after beta. |
If anybody can suggest a way to reproduce this problem reliably, that would help. The command in #45211 (comment) does not fail for me. In fact, I don't think these tests ever failed when I ran them. Thanks. |
Checking on this as a release blocker. Has anyone been able to reproduce this? |
FWIW, the failure rate in the builders seems to be highest these days on the
Another one today:
We don't have a concrete action item and this is a release old, so it can't be a release blocker. |
Here is a question I asked in #45211 (comment):
If anybody knows someone familiar with the arm64 hardware memory model, that would help. I do not know anyone. Thanks. |
I don't believe the non-atomic accesses to closing are guaranteed to be observed by the Go memory model. The pollUnblock can reduce to
The only atomic store here is in StorepNoWB on &rg. But then netpollblock is looking at pd.closing and &pd.rg, which are both different addresses and cache lines from &rg. So I don't believe there is any requirement that the barrier in pollUnblock and the atomics in netpollblock see each other. It seems to me that either pd.closing should be made into its own atomic, or the "grab the last pd.rg and wake it up" in pollUnblock should leave pd.rg = pdClosed (a new sentinel value) instead of leaving it pd.rg = 0. Then if there is a racing netpollblock, it is definitely guaranteed to see the pdClosed sentinel from pd.rg. pd.closing becomes an optimization that no longer contributes at all to correctness.
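A minimal sketch of the second alternative (the pdClosed sentinel), with illustrative names; as the following comments show, this is not the approach that was ultimately taken.

```go
// Package sentinel sketches the idea: the close path installs a pdClosed
// sentinel in rg instead of leaving it 0, so a racing blocker learns about
// the close through the same atomic word it already uses, and correctness no
// longer depends on a separate closing flag at all.
package sentinel

import "sync/atomic"

const (
	pdWait   uintptr = 2
	pdClosed uintptr = 3 // hypothetical sentinel value
)

// unblock is the close path: it unconditionally moves rg to pdClosed and
// reports whether a parked waiter was displaced (the caller would wake it).
func unblock(rg *uintptr) (hadWaiter bool) {
	for {
		old := atomic.LoadUintptr(rg)
		if atomic.CompareAndSwapUintptr(rg, old, pdClosed) {
			return old == pdWait
		}
	}
}

// block is the wait path: if its CAS from 0 fails, rg already holds pdClosed
// (or a readiness marker), so it must not park.
func block(rg *uintptr) (mayPark bool) {
	return atomic.CompareAndSwapUintptr(rg, 0, pdWait)
}
```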
Looking more at this, there are only two un-locked accesses to pd.closing and neither seems performance critical enough to justify a race instead of an atomic. If we make pd.closing an atomic bool then the questions all go away. I will send a CL for that. |
There were other racy fields too but I did basically what I said I'd do - move all the relevant bits into a single atomic word. |
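A simplified sketch of that approach, using illustrative bit and method names that may not match the runtime source exactly: updaters hold the pollDesc lock, recompute a summary word, and store it atomically, while the lock-free checker does a single atomic load.

```go
// Package summary sketches folding the racy bits into one atomic word.
package summary

import "sync/atomic"

const (
	pollClosing = 1 << iota
	pollEventErr
	pollExpiredReadDeadline
	pollExpiredWriteDeadline
)

type pollDesc struct {
	// The fields below are only written with the pollDesc lock held.
	closing bool
	everr   bool
	rd, wd  int64 // read/write deadlines; < 0 means already expired

	info uint32 // atomic summary of the fields above; read without the lock
}

// publishInfo recomputes the summary. Callers hold the lock, so the plain
// reads of the protected fields are safe; the atomic store is what makes the
// new state visible to lock-free readers on weakly ordered machines.
func (pd *pollDesc) publishInfo() {
	var info uint32
	if pd.closing {
		info |= pollClosing
	}
	if pd.everr {
		info |= pollEventErr
	}
	if pd.rd < 0 {
		info |= pollExpiredReadDeadline
	}
	if pd.wd < 0 {
		info |= pollExpiredWriteDeadline
	}
	atomic.StoreUint32(&pd.info, info)
}

// checkErr is the lock-free reader, standing in for netpollcheckerr: one
// atomic load replaces the old plain reads of closing and the deadlines.
func (pd *pollDesc) checkErr(mode int32) bool {
	info := atomic.LoadUint32(&pd.info)
	if info&pollClosing != 0 {
		return false // closing: fail the wait instead of parking
	}
	if (mode == 'r' && info&pollExpiredReadDeadline != 0) ||
		(mode == 'w' && info&pollExpiredWriteDeadline != 0) {
		return false // deadline already expired
	}
	return true
}
```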
The bot seems to have gone to sleep - the CL is https://go-review.googlesource.com/c/go/+/378234. |
Change https://golang.org/cl/378234 mentions this issue: |
@gopherbot, please backport to 1.17 and 1.16: this is a race condition that can result in deadlocks on ARM64 in any process that relies on Close unblocking a read. It is tricky to diagnose (and hard to reproduce reliably), and there is no apparent workaround for users on ARM64 machines. |
Backport issue(s) opened: #50610 (for 1.16), #50611 (for 1.17). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases. |
Change https://go.dev/cl/392576 mentions this issue: |
Change https://go.dev/cl/392714 mentions this issue: |
The netpoll code was written long ago, when the only multiprocessors that Go ran on were x86. It assumed that an atomic store would trigger a full memory barrier and then used that barrier to order otherwise racy access to a handful of fields, including pollDesc.closing.

On ARM64, this code has finally failed, because the atomic store is on a value completely unrelated to any of the racily-accessed fields, and the ARMv8 hardware, unlike x86, is clever enough not to do a full memory barrier for a simple atomic store. We are seeing a constant background rate of trybot failures where the net/http tests deadlock - a netpollblock has clearly happened after the pollDesc has begun to close.

The code that does the racy reads is netpollcheckerr, which needs to be able to run without acquiring a lock. This CL fixes the race, without introducing unnecessary inefficiency or deadlock, by arranging for every updater of the relevant fields to publish a summary as a single atomic uint32, and then having netpollcheckerr use a single atomic load to fetch the relevant bits and then proceed as before.

For #45211
Fixes #50611

Change-Id: Ib6788c8da4d00b7bda84d55ca3fdffb5a64c1a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/378234
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
(cherry picked from commit 17b2fb1)
Reviewed-on: https://go-review.googlesource.com/c/go/+/392714
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Emmanuel Odeke <emmanuel@orijtech.com>
The netpoll code was written long ago, when the only multiprocessors that Go ran on were x86. It assumed that an atomic store would trigger a full memory barrier and then used that barrier to order otherwise racy access to a handful of fields, including pollDesc.closing.

On ARM64, this code has finally failed, because the atomic store is on a value completely unrelated to any of the racily-accessed fields, and the ARMv8 hardware, unlike x86, is clever enough not to do a full memory barrier for a simple atomic store. We are seeing a constant background rate of trybot failures where the net/http tests deadlock - a netpollblock has clearly happened after the pollDesc has begun to close.

The code that does the racy reads is netpollcheckerr, which needs to be able to run without acquiring a lock. This CL fixes the race, without introducing unnecessary inefficiency or deadlock, by arranging for every updater of the relevant fields to publish a summary as a single atomic uint32, and then having netpollcheckerr use a single atomic load to fetch the relevant bits and then proceed as before.

Fixes golang#45211 (until proven otherwise!).

Change-Id: Ib6788c8da4d00b7bda84d55ca3fdffb5a64c1a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/378234
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
mark |
It's not obvious to me whether this deadlock is a bug in the specific test function, the net/http package, or *httptest.Server in particular.
2021-03-24T14:20:32-747f426/linux-arm64-aws
CC @neild @bradfitz @empijei