internal/poll: deadlock in Read on arm64 when an FD is closed #45211
I think this is a
One goroutine is blocked in
The other goroutine is blocked in
Maybe unrelated, but we're actively debugging an issue (with Go 1.16.2) where some users get into a state where it seems
There's no evidence (panics in logs) that we hit
@bradfitz what architecture?
@neild What is supposed to happen there is that
Looking at the netpoll code, it does seem possible that
@bradfitz That seems like a different problem to me. Any idea what is holding the write lock? Do we even need to hold a write lock during
@rsc, at least
@ianlancetaylor, it's not obvious from the stack traces. Nothing stuck in a system call doing a write, for instance. I can email it to you privately.
@ianlancetaylor I think the
@neild Thanks. Now I wonder whether
@bradfitz Sure, I can take a look though I don't suppose that I'll see anything. But something must be holding the lock, somehow.
@bradfitz sent a stack trace offline. The stack trace showed a goroutine holding the lock that all the other goroutines are waiting for:
This means that the
Normally
How is this UDP socket created? Is it remotely plausible that the ephemeral ports were exhausted on this system?
I just recalled from the distant past (#18541) that Little Snitch running on macOS can cause EAGAIN in unexpected places. It is possible that the user those stack traces are from is running Little Snitch (or some other network filter extension). Possibly related, I just encountered some runtime poller failures on a new M1, and I also run Little Snitch. I will write them up for you tomorrow, or you can ask Brad for details if you want them sooner.
Also #18751. And FWIW,
@ianlancetaylor, forgot to tell you in the email: those stacks were from macOS, not Linux. (not super obvious, but some have e.g. The UDP socket is created from: ... which is effectively
That's quite likely actually. |
FWIW, the
@bradfitz @josharian If the UDP socket is being created using
Unfortunately I'm not sure how to verify that that is the problem, other than observing that I don't see how it could be anything else. The goroutine is clearly blocking, waiting for
Even more unfortunately, I'm not sure how to fix this. We could reduce the severity by changing
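For readers less familiar with internal/poll, here is a rough, self-contained sketch of the read-loop shape being described; the FD and pollWaiter types below are stand-ins, not the actual internal/poll source. The point is that a non-blocking read that gets EAGAIN parks in the runtime poller and retries, so a descriptor that keeps producing spurious readiness plus EAGAIN keeps the goroutine waiting in the poller.

```go
package main

import "syscall"

// pollWaiter stands in for the runtime poller hook; the real version parks
// the goroutine until the descriptor is ready, closed, or past its deadline.
type pollWaiter struct{}

func (pd *pollWaiter) waitRead() error { return nil }

// FD is a stand-in for internal/poll.FD, reduced to what the loop needs.
type FD struct {
	Sysfd int
	pd    pollWaiter
}

// Read shows the loop shape: a non-blocking read that gets EAGAIN waits on
// the poller and retries once readiness is reported.
func (fd *FD) Read(p []byte) (int, error) {
	for {
		n, err := syscall.Read(fd.Sysfd, p)
		if err != syscall.EAGAIN {
			if n < 0 {
				n = 0
			}
			return n, err
		}
		// Nothing to read yet: park in the poller until it reports
		// readiness, a deadline, or a close, then try the read again.
		// A filter that produces spurious readiness plus EAGAIN just
		// sends the goroutine around this loop again.
		if err := fd.pd.waitRead(); err != nil {
			return 0, err
		}
	}
}

func main() {}
```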
By the way I think we're pretty clearly tracking two independent problems in this one issue. |
This issue is not only regularly occurring on the builders, but also highly reproducible. To my mind, that makes it a release blocker via #11811. |
We are only seeing a problem on arm64. The problem seems to occur when a call to
Schematically, the code looks like:
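The schematic snippet did not survive above; what follows is a rough reconstruction of the pattern under discussion, written as a simplified model against sync/atomic. The names (closing, rg, pdWait) mirror runtime/netpoll.go, but this is not the runtime source, and it will not reliably reproduce the failure; it only shows the shape of the race.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

const pdWait uintptr = 2 // "a reader is about to park"

// pollDesc models only the two fields that matter here.
type pollDesc struct {
	closing bool    // written with a plain, non-atomic store by the closer
	rg      uintptr // 0 (idle), pdWait, or a parked goroutine in the real code
}

// closeFD models runtime_pollUnblock: mark the descriptor closing, perform a
// "barrier" by atomically storing to an unrelated local, then look in rg for
// a waiter to wake. If rg is still 0, it wakes nobody.
func closeFD(pd *pollDesc) (wokeWaiter bool) {
	pd.closing = true
	var g unsafe.Pointer
	atomic.StorePointer(&g, nil) // stands in for atomic.StorepNoWB(&rg, nil): a store to an address unrelated to pd
	if pd.rg == 0 {
		return false // no reader recorded yet; nothing to wake
	}
	return atomic.SwapUintptr(&pd.rg, 0) == pdWait
}

// waitRead models netpollblock + netpollcheckerr: announce intent to park by
// CASing rg, then re-check closing before committing to the park.
func waitRead(pd *pollDesc) (parked bool) {
	if !atomic.CompareAndSwapUintptr(&pd.rg, 0, pdWait) {
		return false
	}
	if pd.closing { // plain load; nothing orders it after the closer's plain store
		return false
	}
	return true // the real code parks here, waiting for a wakeup through rg
}

func main() {
	pd := &pollDesc{}
	wokeCh := make(chan bool, 1)
	parkedCh := make(chan bool, 1)
	go func() { wokeCh <- closeFD(pd) }()
	go func() { parkedCh <- waitRead(pd) }()
	woke, parked := <-wokeCh, <-parkedCh
	// The hang corresponds to parked == true && woke == false: the reader
	// committed to parking, but the closer saw rg == 0 and woke nobody.
	// That interleaving depends on weak hardware ordering and is not
	// something this model will reliably reproduce.
	fmt.Println("woke a waiter:", woke, "reader parked:", parked)
}
```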
For this particular issue, I think the key point is: will the call to
On arm64 the atomic operations are implemented as follows:
For example, consider this sequence:
If that sequence can happen, the program can hang. The synchronization point here is
The
It's clear that the
In the C++ memory model, if
But
By the way, the code contains:
atomic.StorepNoWB(noescape(unsafe.Pointer(&rg)), nil) // full memory barrier between store to closing and read of rg/wg in netpollunblock
So it's clear that we expect a full memory barrier at that point, which is indeed what is required to make this work. So, does the
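For reference, the question being asked here has the classic store-buffering shape: each side stores to one location and then loads the other, and only a full (StoreLoad) barrier between the two rules out both loads observing stale zeros. Below is a deliberately racy, purely illustrative rendering in Go, with the correspondence to the netpoll fields noted in comments.

```go
package main

import "fmt"

// x/r2 play the role of pd.closing and y/r1 the role of pd.rg: the closer
// stores closing and then loads rg; the blocker stores rg (via its CAS) and
// then loads closing. The plain variables make this a data race on purpose.
var x, y, r1, r2 int

func closerLike() {
	x = 1
	// A full (StoreLoad) barrier is needed here for correctness.
	r1 = y
}

func blockerLike() {
	y = 1
	// And here.
	r2 = x
}

func main() {
	done := make(chan struct{}, 2)
	go func() { closerLike(); done <- struct{}{} }()
	go func() { blockerLike(); done <- struct{}{} }()
	<-done
	<-done
	// Without full barriers on both sides, arm64 permits r1 == 0 && r2 == 0:
	// each goroutine misses the other's store, which is exactly the missed
	// wakeup that leaves the reader parked forever.
	fmt.Println(r1, r2)
}
```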
Long-standing issue, okay after beta. |
If anybody can suggest a way to reproduce this problem reliably, that would help. The command in #45211 (comment) does not fail for me. In fact, I don't think these tests ever failed when I ran them. Thanks. |
Checking on this as a release blocker. Has anyone been able to reproduce this? |
FWIW, the failure rate in the builders seems to be highest these days on the
Another one today:
We don't have a concrete action item and this is a release old, so it can't be a release blocker. |
Here is a question I asked in #45211 (comment):
If anybody knows someone familiar with the arm64 hardware memory model, that would help. I do not know anyone. Thanks. |
I don't believe the non-atomic accesses to closing are guaranteed to be observed by the Go memory model. The pollUnblock can reduce to
The only atomic store here is in StorepNoWB on &rg. But then netpollblock is looking at pd.closing and &pd.rg, which are both different addresses and cache lines from &rg. So I don't believe there is any requirement that the barrier in pollUnblock and the atomics in netpollblock see each other. It seems to me that either pd.closing should be made into its own atomic, or the "grab the last pd.rg and wake it up" in pollUnblock should leave pd.rg = pdClosed (a new sentinel value) instead of leaving it pd.rg = 0. Then if there is a racing netpollblock, it is definitely guaranteed to see the pdClosed sentinel from pd.rg. pd.closing becomes an optimization that no longer contributes at all to correctness.
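A minimal sketch of the second alternative (the pdClosed sentinel), with illustrative names; as the following comments show, this is not the approach that was ultimately taken.

```go
// Package sentinel sketches the idea: the close path installs a pdClosed
// sentinel in rg instead of leaving it 0, so a racing blocker learns about
// the close through the same atomic word it already uses, and correctness no
// longer depends on a separate closing flag at all.
package sentinel

import "sync/atomic"

const (
	pdWait   uintptr = 2
	pdClosed uintptr = 3 // hypothetical sentinel value
)

// unblock is the close path: it unconditionally moves rg to pdClosed and
// reports whether a parked waiter was displaced (the caller would wake it).
func unblock(rg *uintptr) (hadWaiter bool) {
	for {
		old := atomic.LoadUintptr(rg)
		if atomic.CompareAndSwapUintptr(rg, old, pdClosed) {
			return old == pdWait
		}
	}
}

// block is the wait path: if its CAS from 0 fails, rg already holds pdClosed
// (or a readiness marker), so it must not park.
func block(rg *uintptr) (mayPark bool) {
	return atomic.CompareAndSwapUintptr(rg, 0, pdWait)
}
```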
Looking more at this, there are only two un-locked accesses to pd.closing and neither seems performance critical enough to justify a race instead of an atomic. If we make pd.closing an atomic bool then the questions all go away. I will send a CL for that. |
There were other racy fields too but I did basically what I said I'd do - move all the relevant bits into a single atomic word. |
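A simplified sketch of that approach, using illustrative bit and method names that may not match the runtime source exactly: updaters hold the pollDesc lock, recompute a summary word, and store it atomically, while the lock-free checker does a single atomic load.

```go
// Package summary sketches folding the racy bits into one atomic word.
package summary

import "sync/atomic"

const (
	pollClosing = 1 << iota
	pollEventErr
	pollExpiredReadDeadline
	pollExpiredWriteDeadline
)

type pollDesc struct {
	// The fields below are only written with the pollDesc lock held.
	closing bool
	everr   bool
	rd, wd  int64 // read/write deadlines; < 0 means already expired

	info uint32 // atomic summary of the fields above; read without the lock
}

// publishInfo recomputes the summary. Callers hold the lock, so the plain
// reads of the protected fields are safe; the atomic store is what makes the
// new state visible to lock-free readers on weakly ordered machines.
func (pd *pollDesc) publishInfo() {
	var info uint32
	if pd.closing {
		info |= pollClosing
	}
	if pd.everr {
		info |= pollEventErr
	}
	if pd.rd < 0 {
		info |= pollExpiredReadDeadline
	}
	if pd.wd < 0 {
		info |= pollExpiredWriteDeadline
	}
	atomic.StoreUint32(&pd.info, info)
}

// checkErr is the lock-free reader, standing in for netpollcheckerr: one
// atomic load replaces the old plain reads of closing and the deadlines.
func (pd *pollDesc) checkErr(mode int32) bool {
	info := atomic.LoadUint32(&pd.info)
	if info&pollClosing != 0 {
		return false // closing: fail the wait instead of parking
	}
	if (mode == 'r' && info&pollExpiredReadDeadline != 0) ||
		(mode == 'w' && info&pollExpiredWriteDeadline != 0) {
		return false // deadline already expired
	}
	return true
}
```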
The bot seems to have gone to sleep - the CL is https://go-review.googlesource.com/c/go/+/378234. |
Change https://golang.org/cl/378234 mentions this issue: |
@gopherbot, please backport to 1.17 and 1.16: this is a race condition that can result in deadlocks on ARM64 in any process that relies on Close unblocking a read. It is tricky to diagnose (and hard to reproduce reliably), and there is no apparent workaround for users on ARM64 machines. |
Backport issue(s) opened: #50610 (for 1.16), #50611 (for 1.17). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases. |
Change https://go.dev/cl/392576 mentions this issue: |
Change https://go.dev/cl/392714 mentions this issue: |
The netpoll code was written long ago, when the only multiprocessors that Go ran on were x86. It assumed that an atomic store would trigger a full memory barrier and then used that barrier to order otherwise racy access to a handful of fields, including pollDesc.closing.

On ARM64, this code has finally failed, because the atomic store is on a value completely unrelated to any of the racily-accessed fields, and the ARMv8 hardware, unlike x86, is clever enough not to do a full memory barrier for a simple atomic store. We are seeing a constant background rate of trybot failures where the net/http tests deadlock - a netpollblock has clearly happened after the pollDesc has begun to close.

The code that does the racy reads is netpollcheckerr, which needs to be able to run without acquiring a lock. This CL fixes the race, without introducing unnecessary inefficiency or deadlock, by arranging for every updater of the relevant fields to publish a summary as a single atomic uint32, and then having netpollcheckerr use a single atomic load to fetch the relevant bits and then proceed as before.

For #45211
Fixes #50611

Change-Id: Ib6788c8da4d00b7bda84d55ca3fdffb5a64c1a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/378234
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
(cherry picked from commit 17b2fb1)
Reviewed-on: https://go-review.googlesource.com/c/go/+/392714
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Emmanuel Odeke <emmanuel@orijtech.com>
The netpoll code was written long ago, when the only multiprocessors that Go ran on were x86. It assumed that an atomic store would trigger a full memory barrier and then used that barrier to order otherwise racy access to a handful of fields, including pollDesc.closing.

On ARM64, this code has finally failed, because the atomic store is on a value completely unrelated to any of the racily-accessed fields, and the ARMv8 hardware, unlike x86, is clever enough not to do a full memory barrier for a simple atomic store. We are seeing a constant background rate of trybot failures where the net/http tests deadlock - a netpollblock has clearly happened after the pollDesc has begun to close.

The code that does the racy reads is netpollcheckerr, which needs to be able to run without acquiring a lock. This CL fixes the race, without introducing unnecessary inefficiency or deadlock, by arranging for every updater of the relevant fields to publish a summary as a single atomic uint32, and then having netpollcheckerr use a single atomic load to fetch the relevant bits and then proceed as before.

Fixes golang#45211 (until proven otherwise!).

Change-Id: Ib6788c8da4d00b7bda84d55ca3fdffb5a64c1a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/378234
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
mark |
It's not obvious to me whether this deadlock is a bug in the specific test function, the net/http package, or *httptest.Server in particular.
2021-03-24T14:20:32-747f426/linux-arm64-aws
CC @neild @bradfitz @empijei