Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/poll: deadlock in Read on arm64 when an FD is closed #45211

Closed
bcmills opened this issue Mar 24, 2021 · 36 comments
Closed

internal/poll: deadlock in Read on arm64 when an FD is closed #45211

bcmills opened this issue Mar 24, 2021 · 36 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Mar 24, 2021

It's not obvious to me whether this deadlock is a bug in the specific test function, the net/http package, or *httptest.Server in particular.

2021-03-24T14:20:32-747f426/linux-arm64-aws

…

goroutine 3475 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0xffff94117d10, 0x72, 0xffffffffffffffff)
	/workdir/go/src/runtime/netpoll.go:229 +0xa4
internal/poll.(*pollDesc).wait(0x400030a898, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x38
internal/poll.(*pollDesc).waitRead(...)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0x400030a880, 0x400048b000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/workdir/go/src/internal/poll/fd_unix.go:167 +0x1ec
net.(*netFD).Read(0x400030a880, 0x400048b000, 0x1000, 0x1000, 0xffff94e94f01, 0x40003bd680, 0x4000432668)
	/workdir/go/src/net/fd_posix.go:56 +0x44
net.(*conn).Read(0x400037c168, 0x400048b000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/workdir/go/src/net/net.go:183 +0x4c
net/http.(*connReader).Read(0x40000a1b60, 0x400048b000, 0x1000, 0x1000, 0xffff94275900, 0x1, 0x4000432788)
	/workdir/go/src/net/http/server.go:800 +0x268
bufio.(*Reader).fill(0x40005e6300)
	/workdir/go/src/bufio/bufio.go:101 +0x10c
bufio.(*Reader).ReadSlice(0x40005e6300, 0x1df0a, 0x40004327e8, 0x1000101d7b4, 0xffff94279340, 0x0, 0xf0)
	/workdir/go/src/bufio/bufio.go:360 +0x38
bufio.(*Reader).ReadLine(0x40005e6300, 0x40004ea000, 0x1000, 0x4000352b40, 0x77, 0xf0, 0x4000432898)
	/workdir/go/src/bufio/bufio.go:389 +0x30
net/textproto.(*Reader).readLineSlice(0x40000a1440, 0x40003a5200, 0x4000432968, 0xab164, 0x400015cc20, 0x0)
	/workdir/go/src/net/textproto/reader.go:57 +0x84
net/textproto.(*Reader).ReadLine(...)
	/workdir/go/src/net/textproto/reader.go:38
net/http.readRequest(0x40005e6300, 0x0, 0x40003a5200, 0x0, 0x0)
	/workdir/go/src/net/http/request.go:1028 +0x74
net/http.(*conn).readRequest(0x400011f900, 0x539c98, 0x40003bd640, 0x0, 0x0, 0x0)
	/workdir/go/src/net/http/server.go:986 +0x21c
net/http.(*conn).serve(0x400011f900, 0x539d40, 0x40003bd640)
	/workdir/go/src/net/http/server.go:1878 +0x844
created by net/http.(*Server).Serve
	/workdir/go/src/net/http/server.go:3013 +0x4b0

…

goroutine 3439 [semacquire, 2 minutes]:
internal/poll.runtime_Semacquire(0x400015ce28)
	/workdir/go/src/runtime/sema.go:61 +0x38
internal/poll.(*FD).Close(0x400015ce00, 0x400015ce00, 0x0)
	/workdir/go/src/internal/poll/fd_unix.go:116 +0x80
net.(*netFD).Close(0x400015ce00, 0x0, 0x40000ca1c8)
	/workdir/go/src/net/fd_posix.go:38 +0x48
net.(*TCPListener).close(...)
	/workdir/go/src/net/tcpsock_posix.go:157
net.(*TCPListener).Close(0x40003a1308, 0x4, 0x40004213a0)
	/workdir/go/src/net/tcpsock.go:275 +0x34
net/http/httptest.(*Server).Close(0x40004f6000)
	/workdir/go/src/net/http/httptest/server.go:204 +0xd0
net/http_test.TestTransportDialCancelRace(0x40004ec180)
	/workdir/go/src/net/http/transport_test.go:4015 +0x308
testing.tRunner(0x40004ec180, 0x4b91d8)
	/workdir/go/src/testing/testing.go:1235 +0x100
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1280 +0x350

…

goroutine 3490 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0xffff941180b0, 0x72, 0xffffffffffffffff)
	/workdir/go/src/runtime/netpoll.go:229 +0xa4
internal/poll.(*pollDesc).wait(0x400015cf18, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x38
internal/poll.(*pollDesc).waitRead(...)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0x400015cf00, 0x40004df000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/workdir/go/src/internal/poll/fd_unix.go:167 +0x1ec
net.(*netFD).Read(0x400015cf00, 0x40004df000, 0x1000, 0x1000, 0x400035cc68, 0xe53, 0x0)
	/workdir/go/src/net/fd_posix.go:56 +0x44
net.(*conn).Read(0x40006a61f0, 0x40004df000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/workdir/go/src/net/net.go:183 +0x4c
net/http.(*persistConn).Read(0x40000b0900, 0x40004df000, 0x1000, 0x1000, 0x60, 0x0, 0x101801)
	/workdir/go/src/net/http/transport.go:1922 +0x64
bufio.(*Reader).fill(0x40004c68a0)
	/workdir/go/src/bufio/bufio.go:101 +0x10c
bufio.(*Reader).Peek(0x40004c68a0, 0x1, 0x40003fb620, 0x483939, 0x3, 0x486b25, 0xb)
	/workdir/go/src/bufio/bufio.go:139 +0x74
net/http.(*persistConn).readLoop(0x40000b0900)
	/workdir/go/src/net/http/transport.go:2083 +0x170
created by net/http.(*Transport).dialConn
	/workdir/go/src/net/http/transport.go:1743 +0x18b0

…

goroutine 3440 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0xffff94117c28, 0x72, 0x0)
	/workdir/go/src/runtime/netpoll.go:229 +0xa4
internal/poll.(*pollDesc).wait(0x400015ce18, 0x72, 0x0, 0x0, 0x484ab9)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x38
internal/poll.(*pollDesc).waitRead(...)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0x400015ce00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/workdir/go/src/internal/poll/fd_unix.go:402 +0x204
net.(*netFD).accept(0x400015ce00, 0x4000182cb8, 0x3755834c00, 0x12b9526b)
	/workdir/go/src/net/fd_unix.go:173 +0x2c
net.(*TCPListener).accept(0x40003a1308, 0x400037c168, 0x40002a1e08, 0x53a74)
	/workdir/go/src/net/tcpsock_posix.go:140 +0x2c
net.(*TCPListener).Accept(0x40003a1308, 0x40002a1e70, 0x40002a1e78, 0x18, 0x40001f1680)
	/workdir/go/src/net/tcpsock.go:262 +0x34
net/http.(*Server).Serve(0x400067eee0, 0x539100, 0x40003a1308, 0x0, 0x0)
	/workdir/go/src/net/http/server.go:2981 +0x364
net/http/httptest.(*Server).goServe.func1(0x40004f6000)
	/workdir/go/src/net/http/httptest/server.go:308 +0x68
created by net/http/httptest.(*Server).goServe
	/workdir/go/src/net/http/httptest/server.go:306 +0x58

…

CC @neild @bradfitz @empijei

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 24, 2021
@bcmills bcmills added this to the Backlog milestone Mar 24, 2021
@neild
Copy link
Contributor

neild commented Mar 24, 2021

I think this is a internal/poll race, although I might be missing something:

goroutine 3439 [semacquire, 2 minutes]:
internal/poll.runtime_Semacquire(0x400015ce28)
	/workdir/go/src/runtime/sema.go:61 +0x38
internal/poll.(*FD).Close(0x400015ce00, 0x400015ce00, 0x0)
	/workdir/go/src/internal/poll/fd_unix.go:116 +0x80
net.(*netFD).Close(0x400015ce00, 0x0, 0x40000ca1c8)
	/workdir/go/src/net/fd_posix.go:38 +0x48
net.(*TCPListener).close(...)
	/workdir/go/src/net/tcpsock_posix.go:157
net.(*TCPListener).Close(0x40003a1308, 0x4, 0x40004213a0)
	/workdir/go/src/net/tcpsock.go:275 +0x34
net/http/httptest.(*Server).Close(0x40004f6000)
	/workdir/go/src/net/http/httptest/server.go:204 +0xd0
net/http_test.TestTransportDialCancelRace(0x40004ec180)
	/workdir/go/src/net/http/transport_test.go:4015 +0x308
testing.tRunner(0x40004ec180, 0x4b91d8)
	/workdir/go/src/testing/testing.go:1235 +0x100
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1280 +0x350

goroutine 3440 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0xffff94117c28, 0x72, 0x0)
	/workdir/go/src/runtime/netpoll.go:229 +0xa4
internal/poll.(*pollDesc).wait(0x400015ce18, 0x72, 0x0, 0x0, 0x484ab9)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x38
internal/poll.(*pollDesc).waitRead(...)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0x400015ce00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/workdir/go/src/internal/poll/fd_unix.go:402 +0x204
net.(*netFD).accept(0x400015ce00, 0x4000182cb8, 0x3755834c00, 0x12b9526b)
	/workdir/go/src/net/fd_unix.go:173 +0x2c
net.(*TCPListener).accept(0x40003a1308, 0x400037c168, 0x40002a1e08, 0x53a74)
	/workdir/go/src/net/tcpsock_posix.go:140 +0x2c
net.(*TCPListener).Accept(0x40003a1308, 0x40002a1e70, 0x40002a1e78, 0x18, 0x40001f1680)
	/workdir/go/src/net/tcpsock.go:262 +0x34
net/http.(*Server).Serve(0x400067eee0, 0x539100, 0x40003a1308, 0x0, 0x0)
	/workdir/go/src/net/http/server.go:2981 +0x364
net/http/httptest.(*Server).goServe.func1(0x40004f6000)
	/workdir/go/src/net/http/httptest/server.go:308 +0x68
created by net/http/httptest.(*Server).goServe
	/workdir/go/src/net/http/httptest/server.go:306 +0x58

One goroutine is blocked in internal/poll.(*FD).Close waiting to acquire fd.csema:
https://go.googlesource.com/go/+/747f426944/src/internal/poll/fd_unix.go#116

The other goroutine is blocked in internal/poll.(*FD).Accept waiting for the FD to become readable:
https://go.googlesource.com/go/+/747f426944/src/internal/poll/fd_unix.go#402

@bradfitz
Copy link
Contributor

bradfitz commented Mar 24, 2021

Maybe unrelated, but we're actively debugging an issue (with Go 1.16.2) where some users get into a state where it seems internal/poll is forever broken with tens of thousands of UDP writes stuck in goroutines like:

goroutine 181363 [semacquire, 152 minutes]:
internal/poll.runtime_Semacquire(0xc00011028c)
	runtime/sema.go:61 +0x45
internal/poll.(*fdMutex).rwlock(0xc000110280, 0x10d599400, 0x4)
	internal/poll/fd_mutex.go:154 +0xb3
internal/poll.(*FD).writeLock(...)
	internal/poll/fd_mutex.go:239
internal/poll.(*FD).WriteTo(0xc000110280, 0xc0007809f0, 0x28, 0x28, 0x10d7080c8, 0xc000333d80, 0x0, 0x0, 0x0)
	internal/poll/fd_unix.go:331 +0x5a
net.(*netFD).writeTo(0xc000110280, 0xc0007809f0, 0x28, 0x28, 0x10d7080c8, 0xc000333d80, 0x10, 0x0, 0x4)
	net/fd_posix.go:79 +0x6f
net.(*UDPConn).writeTo(0xc00057e080, 0xc0007809f0, 0x28, 0x28, 0xc00071cff0, 0xc008948768, 0xc0089489a0, 0x3)
	net/udpsock_posix.go:79 +0xd6
net.(*UDPConn).WriteTo(0xc00057e080, 0xc0007809f0, 0x28, 0x28, 0x10d70a4a0, 0xc00071cff0, 0xad0000c005f028c0, 0x10d1f65b6, 0xadab3d94fe203440)
	net/udpsock.go:176 +0x86
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).WriteTo(0xc000214120, 0xc0007809f0, 0x28, 0x28, 0x10d70a4a0, 0xc00071cff0, 0xc0003c4880, 0xc000679f01, 0x10d506c05)
	tailscale.com@v0.0.0-00010101000000-000000000000/wgengine/magicsock/magicsock.go:2947 +0xbe

There's no evidence (panics in logs) that we hit internal/poll's one million concurrent user limit of that UDP socket, though.

@rsc
Copy link
Contributor

rsc commented Mar 24, 2021

@bradfitz what architecture?

@ianlancetaylor
Copy link
Member

@neild What is supposed to happen there is that (*FD).Close earlier called fd.pd.evict which calls runtime_pollUnblock, and that is supposed to wake up the call sleeping in runtime_pollWait.

Looking at the netpoll code, it does seem possible that poll_runtime_pollUnblock sets fd.closing just before a different goroutine calls poll_runtime_pollWait. I'm not immediately sure what prevents that.

@ianlancetaylor
Copy link
Member

@bradfitz That seems like a different problem to me. Any idea what is holding the write lock? Do we even need to hold a write lock during (*FD).WriteTo and (*FD).WriteMsg?

@bradfitz
Copy link
Contributor

@rsc, at least amd64 for the user we're debugging.

@ianlancetaylor, it's not obvious from the stack traces. Nothing stuck in a system call doing a write, for instance. I can email it to you privately.

@neild
Copy link
Contributor

neild commented Mar 24, 2021

@ianlancetaylor I think the atomic.StorepNoWB in poll_runtime_Unblock is supposed to create a memory barrier ensuring that the write to pb.closing is observable in netpollblock after the atomic.Casuintptr call.

@ianlancetaylor
Copy link
Member

@neild Thanks. Now I wonder whether runtime.arm64HasATOMICS gets set to true on this system.

@ianlancetaylor
Copy link
Member

@bradfitz Sure, I can take a look though I don't suppose that I'll see anything. But something must be holding the lock, somehow.

@ianlancetaylor
Copy link
Member

@bradfitz sent a stack trace offline. The stack trace showed a goroutine holding the lock that all the other goroutines are waiting for:

goroutine 180969 [IO wait, 150 minutes]:
internal/poll.runtime_pollWait(0x10df1bb40, 0x77, 0x10d708d08)
	runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc000110298, 0x77, 0x0, 0x200, 0x0)
	internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitWrite(...)
	internal/poll/fd_poll_runtime.go:96
internal/poll.(*FD).WriteTo(0xc000110280, 0xc00058e200, 0x5c, 0x200, 0x10d7080c8, 0xc0001547e0, 0x0, 0x0, 0x0)
	internal/poll/fd_unix.go:344 +0x1b5
net.(*netFD).writeTo(0xc000110280, 0xc00058e200, 0x5c, 0x200, 0x10d7080c8, 0xc0001547e0, 0xc0005a6001, 0x1da0acc6ae8f8, 0x11606b1c8)
	net/fd_posix.go:79 +0x6f
net.(*UDPConn).writeTo(0xc00057e080, 0xc00058e200, 0x5c, 0x200, 0xc0007d6090, 0x0, 0xc0005e9aa8, 0x10d1f74c8)
	net/udpsock_posix.go:79 +0xd6
net.(*UDPConn).WriteTo(0xc00057e080, 0xc00058e200, 0x5c, 0x200, 0x10d70a4a0, 0xc0007d6090, 0x0, 0xc0005e9b48, 0x10d1f6c8e)
	net/udpsock.go:176 +0x86

This means that the sendto system call returned EAGAIN, and the goroutine is waiting for epoll to report that the socket is available for writing.

Normally EAGAIN means that the socket buffer is full. However, looking at the man page for sendto, it appears that EAGAIN can be returned if the socket was not bound to a local port, and there are no ports available in the ephemeral range. Could that possibly be the case here? If so, that could be a problem, because we use epoll in edge-triggered mode, so if the socket was ready for writing although it did not have an ephemeral port, epoll would never tell the program to try the sendto again.

How is this UDP socket created? It is remotely plausible that the ephemeral ports were exhausted on this system?

@josharian
Copy link
Contributor

I just recalled from the distant past (#18541) that Little Snitch running on macOS can cause EAGAIN in unexpected places. It is possible that the user that those stack traces are from is running Little Snitch (or some other network filter extension).

Possibly related, I just encountered some runtime poller failures on a new M1, and I also run Little Snitch. I will write them up for you tomorrow, or you can ask Brad for details if you want them sooner.

@josharian
Copy link
Contributor

Also #18751. And FWIW, runtime.arm64HasATOMICS is true on my machine.

@bradfitz
Copy link
Contributor

bradfitz commented Mar 24, 2021

@ianlancetaylor, forgot to tell you in the email: those stacks were from macOS, not Linux. (not super obvious, but some have e.g. "golang.org/x/sys@v0.0.0-20210301091718-77cc2087c03b/unix/zsyscall_darwin_amd64.go:1691 +0x66").

The UDP socket is created from:

https://github.com/tailscale/tailscale/blob/c99f260e403f8aa6665c54bf9a7f25ca696fa480/wgengine/magicsock/magicsock.go#L2669

... which is effectively net.Listen("udp4", ":41641") in this case, which I just confirmed from the user logs corresponding to those stacks.

It is remotely plausible that the ephemeral ports were exhausted on this system?

That's quite likely actually.

@neild
Copy link
Contributor

neild commented Mar 24, 2021

FWIW, the net/http flake in TestTransportDialCancelRace is very reproducible on the linux-arm64-aws builder. This reproduces it for me every time:

gomote run $GOMOTE go/bin/go test net/http -run=DialCancelRace -cpu=4  -count=10000 -v -timeout=20s

@ianlancetaylor
Copy link
Member

@bradfitz @josharian If the UDP socket is being created using net.ListenPacket then it's not an ephemeral port problem. But if Little Snitch can cause EAGAIN errors then it could be a kqueue problem, as our use of kqueue is also edge-triggered rather than level-triggered. If sendto can return an EAGAIN error when the UDP socket is writable, we can indeed block forever waiting for the socket to become writable when it already is.

Unfortunately I'm not sure how to verify that that is the problem, other than observing that I don't see how it could be anything else. The goroutine is clearly blocking waiting for kqueue to say that the socket is writable, and that clearly isn't happening.

Even more unfortunately I'm not sure how to fix this. We could reduce the severity by changing (*FD).WriteTo and (*FD).WriteMsg to attempt the call to syscall.Sendto and syscall.SendmsgN before acquiring the write lock. There is really no need to acquire the write lock to write a single packet. But if the system call returns EAGAIN, then we more or less do have to take the write lock, and then try the system call again, and if it returns EAGAIN again, we have to call into the netpoll system. And if the problem is indeed a spurious EAGAIN, then we can wait there forever.

@ianlancetaylor
Copy link
Member

By the way I think we're pretty clearly tracking two independent problems in this one issue.

@bcmills
Copy link
Contributor Author

bcmills commented Nov 12, 2021

greplogs --dashboard -md -l -e '(?m)goroutine \d+ \[.+, \d+ minutes\]:\n(?:.+\n\t.+\n)*net\.\(\*netFD\)\.Close' --since=2021-01-01

2021-11-12T20:20:57-76fbd61/windows-arm64-10
2021-10-15T18:07:49-8331f25/windows-arm64-10
2021-10-15T03:49:17-8dab959/linux-arm64-packet
2021-10-13T18:15:48-40f82f8/windows-arm64-10
2021-10-13T18:15:16-4efa216/windows-arm64-10
2021-10-11T22:17:47-b41030e/windows-arm64-10
2021-10-11T22:17:41-662c5ee/windows-arm64-10
2021-10-08T16:26:20-99c1b24/windows-arm64-10
2021-10-05T15:58:05-f1f626d/linux-arm64-aws
2021-10-01T17:27:29-3357624/windows-arm64-10
2021-09-30T18:10:18-e81a3d9-205640e/linux-arm64-packet
2021-09-29T18:36:18-e48cf0d/android-arm64-corellium
2021-09-28T18:58:50-7d9f5e0-9c43872/windows-arm64-10
2021-09-28T17:19:08-1c6e50a/android-arm64-corellium
2021-09-27T22:22:04-52b23a5/android-arm64-corellium
2021-09-27T19:42:34-4e4d966-315cec2/linux-arm64-packet
2021-09-24T18:21:14-f6b5ffb/android-arm64-corellium
2021-09-22T20:53:48-41bb744/linux-arm64-packet
2021-09-18T06:04:41-c084706-c894b44/windows-arm64-10
2021-09-17T18:33:15-70493b3/ios-arm64-corellium
2021-09-16T23:13:58-e09dcc2/android-arm64-corellium
2021-09-09T21:28:56-c163c31-c981874/linux-arm64-aws
2021-09-07T23:27:08-3fff213/windows-arm64-10
2021-09-03T01:38:54-9f69a44/ios-arm64-corellium
2021-09-02T19:36:22-a8aa6cf/windows-arm64-10
2021-09-02T19:16:19-9633195/ios-arm64-corellium
2021-09-02T19:16:19-9633195/linux-arm64-packet
2021-09-02T16:59:57-2a463a2/android-arm64-corellium
2021-09-02T14:02:42-b8420ba/ios-arm64-corellium
2021-09-02T06:43:50-9afbf82/linux-arm64-aws
2021-09-01T22:28:21-0bfd6fc/android-arm64-corellium
2021-08-28T04:46:13-0108177/windows-arm64-10
2021-08-24T15:55:38-60bc85c-d70c69d/windows-arm64-10
2021-08-23T12:26:30-c7e354d/linux-arm64-packet
2021-08-21T00:55:22-e17439e/windows-arm64-10
2021-08-17T15:00:04-3001b0a/windows-arm64-10
2021-08-05T20:22:31-fd45e26/android-arm64-corellium
2021-07-31T23:59:40-c6fcb2d-b8ca6e5/windows-arm64-10
2021-07-26T22:15:24-bfbb288/ios-arm64-corellium
2021-07-20T19:54:36-d568e6e/ios-arm64-corellium
2021-07-16T02:43:48-aa4e0f5/linux-arm64-aws
2021-06-30T01:29:49-c45e800/windows-arm64-10
2021-05-04T14:38:36-5e4f9b0/linux-arm64-aws
2021-05-03T16:23:09-7b768d4/linux-arm64-aws
2021-04-29T17:08:05-e03cca6/linux-arm64-aws
2021-04-29T11:37:05-c4c68fb/linux-arm64-aws
2021-04-22T17:47:59-537cde0/linux-arm64-aws
2021-04-16T16:36:57-2f0e5bf/linux-arm64-packet
2021-04-14T03:15:00-2d4ba26/linux-arm64-packet
2021-04-13T21:13:09-e512bc2/ios-arm64-corellium
2021-04-09T18:49:05-c3faff7/linux-arm64-aws
2021-04-09T18:49:05-c3faff7/linux-arm64-packet
2021-04-09T14:33:44-a690a5d/android-arm64-corellium
2021-04-08T18:47:17-96a6745/linux-arm64-packet
2021-03-29T21:50:16-4e1bf8e/linux-arm64-packet
2021-03-24T14:20:32-747f426/linux-arm64-aws
2021-03-15T23:47:56-661f3f1/linux-arm64-aws
2021-03-12T21:52:25-5ea612d-f39c4de/linux-arm64-aws
2021-03-05T22:14:48-a829114/linux-arm64-aws
2021-01-25T22:54:28-e6b6d10/android-arm64-corellium

@bcmills
Copy link
Contributor Author

bcmills commented Nov 12, 2021

This issue is not only regularly occurring on the builders, but also highly reproducible. To my mind, that makes it a release blocker via #11811.

@bcmills bcmills modified the milestones: Backlog, Go1.18 Nov 12, 2021
@ianlancetaylor
Copy link
Member

We are only seeing a problem on arm64. The problem seems to occur when a call to interna/poll.(*FD).Close hangs waiting for all uses of FD to be released, while simultaneously internal/poll.(*FD).Read hangs waiting for some data to arrive (which will never happen). What is supposed to happen here is that Close will wake up any readers, and the readers won't hang if the FD is being closed.

Schematically the code looks like:

FD.Close -> fd.pd.evict -> runtime_pollUnblock:

lock(&pd.lock)
pd.closing = true
atomic.StorepNoWB() // full memory barrier?
for {
    old := atomic.Loaduintptr(&pd.rg)
    if old == pdReady { rg = nil; break }
    if old == 0 { rg = nil; break }
    if atomic.Casuintptr(&pd.rg, old, 0) {
        if old == pdWait { old = 0 }
        rg = old
        break
    }
}
unlock(&pd.lock)
if rg != nil {
    netpollgoready(rg)
}

FD.Read -> fd.pd.waitRead -> runtime_pollWait:

if pd.closing {
    return pollErrClosing
}
for !netpollblock() {
    if pd.closing { return pollErrClosing }
}

netpollblock:

for {
    if atomic.Casuintptr(&pd.rg, pdReady, 0) { return true }
    if atomic.Casuintptr(&pd.rg, 0, pdWait) { break }
    if v := atomic.Loaduintptr(&pd.rg); v != pdReady && v != 0 { throw() }
}
if !pd.closing { gopark(if atomic.Casuintptr(&pd.rg, pdWait, g)) }
old := atomic.Xchguintptr(&pd.rg, 0)
if old > pdWait { throw() }
return old == pdReady

For this particular issue, I think the key point is: will the call to runtime_pollWait reliably see that pd.closing is set by runtime_pollUnblock? If it doesn't see that, then if the reader calls netpollblock before the close completes, which is possible, then the reader can park itself after the last attempt to wake it up. I'm focusing on pd.closing because it is not accessed atomically.

On arm64 the atomic operations are implemented as follows:

Go ASM
atomic.StorepNoWB stlr
atomic.Loaduintptr ldar
atomic.Casuintptr casal
atomic.Xchguintptr swpal

For example, consider this sequence:

  • pollUnblock sets pd.closing, finds that pd.rg is nil, and returns
  • pollWait sets pd.rg to pdWait, checks pd.closing, sets pd.rg to g, parks

If that sequence can happen, the program can hang. The synchronization point here is pd.rg. I think the minimal set of relevant operations is:

  • pollUnblock: strb to pd.closing, stlr to the stack, ldar of pd.rg
  • pollWait: casal to pd.rg, another casal to pd.rg, ldrb of pd.closing, a third casal to pd.rg

The ldar and stlr instructions are described as one-way barriers. An ldar, which is a load-acquire, means that all loads and stores that appear later in the program will at least see the memory as of the ldar instruction. An stlr, which is a store-release, means that all loads and stores that appear earlier in the program must be completed as of the stlr instruction. A casal instruction is both a load-acquire and a store-release.

It's clear that the stlr to the stack ensures that the strb to pd.closing is visible. The ldar after that on pollUnblock doesn't have any relevant effect. The question is whether the casal in pollWait is enough to ensure that when the pollWait thread does an ldrb of pd.closing, it will see true. On amd64 this is guaranteed by the total store ordering: if the Loaduintptr by pollUnblock saw nil, then the store of the Casuintptr must follow the store to pd.closing. But I don't know if the stlr to the stack is enough to ensure that the store to pd.closing is seen after the casal instruction of a different memory location.

In the C++ memory model, if stlr is store-release and casal is both load-acquire and store-release, I don't think that this is guaranteed to work, because the stlr and the casal are to different memory locations. And the fact that there is a ldar of that same memory location doesn't provide the necessary association. Or so it seems to me.

But stlr is not a pure store-release, so this may be sufficient.

By the way, the stlr to the stack appears in the source code as

	atomic.StorepNoWB(noescape(unsafe.Pointer(&rg)), nil) // full memory barrier between store to closing and read of rg/wg in netpollunblock

So it's clear that we expect a full memory barrier at that point, which is indeed what is required to make this work.

So, does the stlr provide that memory barrier? I think so but I'm not sure.

@heschi heschi added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Nov 24, 2021
@heschi
Copy link
Contributor

heschi commented Nov 24, 2021

Long-standing issue, okay after beta.

@ianlancetaylor
Copy link
Member

ianlancetaylor commented Nov 29, 2021

If anybody can suggest a way to reproduce this problem reliably, that would help. The command in #45211 (comment) does not fail for me. In fact, I don't think these tests ever failed when I ran them. Thanks.

@toothrot
Copy link
Contributor

toothrot commented Dec 8, 2021

Checking on this as a release blocker. Has anyone been able to reproduce this?

@bcmills
Copy link
Contributor Author

bcmills commented Dec 8, 2021

FWIW, the failure rate in the builders seems to be highest these days on the windows-arm64-10 builder.

greplogs --dashboard -md -l -e '(?m)goroutine \d+ \[.+, \d+ minutes\]:\n(?:.+\n\t.+\n)*net\.\(\*netFD\)\.Close' --since=2021-11-13

2021-11-17T17:04:16-54b9cb8/windows-arm64-10

@bcmills
Copy link
Contributor Author

bcmills commented Dec 9, 2021

Another one today:

greplogs --dashboard -md -l -e '(?m)goroutine \d+ \[.+, \d+ minutes\]:\n(?:.+\n\t.+\n)*net\.\(\*netFD\)\.Close' --since=2021-12-09

2021-12-09T17:16:12-78b4518/windows-arm64-10

@cherrymui cherrymui removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Dec 14, 2021
@heschi
Copy link
Contributor

heschi commented Jan 12, 2022

We don't have a concrete action item and this is a release old, so it can't be a release blocker.

@ianlancetaylor
Copy link
Member

Here is a question I asked in #45211 (comment):

So, does the stlr provide that memory barrier? I think so but I'm not sure.

If anybody knows someone familiar with the arm64 hardware memory model, that would help. I do not know anyone. Thanks.

@rsc
Copy link
Contributor

rsc commented Jan 12, 2022

I don't believe the non-atomic accesses to closing are guaranteed to be observed by the Go memory model.
I don't believe they are guaranteed by the underlying ARM64 memory model either.

The pollUnblock can reduce to

lock(&pd.lock)
pd.closing = true
atomic.StorepNoWB() // full memory barrier?
old := atomic.Loaduintptr(&pd.rg) // == 0
unlock(&pd.lock)

The only atomic store here is in StorepNoWB on &rg. But then netpollblock is looking at pd.closing and &pd.rg, which are both different addreses and cache lines from &rg. So I don't believe there is any requirement that the barrier in pollUnblock and the atomics in netpollblock see each other.

It seems to me that either pd.closing should be made into its own atomic, or the "grab the last pd.rg and wake it up" in pollUnblock should leave pd.rg = pdClosed (a new sentinel value) instead of leaving it pd.rg = 0. Then if there is a racing netpollblock, it is definitely guaranteed to see the pdClosed sentinel from pd.rg. pd.closing becomes an optimization that no longer contributes at all to correctness.

@rsc
Copy link
Contributor

rsc commented Jan 12, 2022

Looking more at this, there are only two un-locked accesses to pd.closing and neither seems performance critical enough to justify a race instead of an atomic. If we make pd.closing an atomic bool then the questions all go away. I will send a CL for that.

@rsc
Copy link
Contributor

rsc commented Jan 13, 2022

There were other racy fields too but I did basically what I said I'd do - move all the relevant bits into a single atomic word.

@rsc
Copy link
Contributor

rsc commented Jan 13, 2022

The bot seems to have gone to sleep - the CL is https://go-review.googlesource.com/c/go/+/378234.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/378234 mentions this issue: runtime: fix net poll races

@bcmills bcmills changed the title net/http: apparent deadlock in TestTransportDialCancelRace internal/poll: deadlock in Read on arm64 when an FD is closed Jan 14, 2022
@bcmills
Copy link
Contributor Author

bcmills commented Jan 14, 2022

@gopherbot, please backport to 1.17 and 1.16: this is a race condition that can result in deadlocks on ARM64 in any process that relies on Close unblocking a read. It is tricky to diagnose (and hard to reproduce reliably), and there is no apparent workaround for users on ARM64 machines.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #50610 (for 1.16), #50611 (for 1.17).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/392576 mentions this issue: [release-branch.go1.16] runtime: fix net poll races

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/392714 mentions this issue: [release-branch.go1.17] runtime: fix net poll races

gopherbot pushed a commit that referenced this issue Mar 28, 2022
The netpoll code was written long ago, when the
only multiprocessors that Go ran on were x86.
It assumed that an atomic store would trigger a
full memory barrier and then used that barrier
to order otherwise racy access to a handful of fields,
including pollDesc.closing.

On ARM64, this code has finally failed, because
the atomic store is on a value completely unrelated
to any of the racily-accessed fields, and the ARMv8
hardware, unlike x86, is clever enough not to do a
full memory barrier for a simple atomic store.
We are seeing a constant background rate of trybot
failures where the net/http tests deadlock - a netpollblock
has clearly happened after the pollDesc has begun to close.

The code that does the racy reads is netpollcheckerr,
which needs to be able to run without acquiring a lock.
This CL fixes the race, without introducing unnecessary
inefficiency or deadlock, by arranging for every updater
of the relevant fields to publish a summary as a single
atomic uint32, and then having netpollcheckerr use a
single atomic load to fetch the relevant bits and then
proceed as before.

For #45211
Fixes #50611

Change-Id: Ib6788c8da4d00b7bda84d55ca3fdffb5a64c1a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/378234
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
(cherry picked from commit 17b2fb1)
Reviewed-on: https://go-review.googlesource.com/c/go/+/392714
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Emmanuel Odeke <emmanuel@orijtech.com>
jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
The netpoll code was written long ago, when the
only multiprocessors that Go ran on were x86.
It assumed that an atomic store would trigger a
full memory barrier and then used that barrier
to order otherwise racy access to a handful of fields,
including pollDesc.closing.

On ARM64, this code has finally failed, because
the atomic store is on a value completely unrelated
to any of the racily-accessed fields, and the ARMv8
hardware, unlike x86, is clever enough not to do a
full memory barrier for a simple atomic store.
We are seeing a constant background rate of trybot
failures where the net/http tests deadlock - a netpollblock
has clearly happened after the pollDesc has begun to close.

The code that does the racy reads is netpollcheckerr,
which needs to be able to run without acquiring a lock.
This CL fixes the race, without introducing unnecessary
inefficiency or deadlock, by arranging for every updater
of the relevant fields to publish a summary as a single
atomic uint32, and then having netpollcheckerr use a
single atomic load to fetch the relevant bits and then
proceed as before.

Fixes golang#45211 (until proven otherwise!).

Change-Id: Ib6788c8da4d00b7bda84d55ca3fdffb5a64c1a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/378234
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@wangfakang
Copy link

mark

@golang golang locked and limited conversation to collaborators Sep 2, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests