runtime: memory corruption on Linux 5.2+ #35777
Comments
@aclements for your records, #35328 and #35776 might be related as well. Those two were on the same Linux 5.3.x machine of mine. |
Thanks @mvdan. I've folded those into the list above. |
#35621 from me. One time, no repro. |
@aclements just saw #35783 for the record. If you think we have enough "evidence" please say and I'll stop creating issues for now 😄 |
Have we roughly bisected which Linux versions are affected? Looking at the kernel changes in that region might yield a clue about where and whose the bug is. 5.3 = bad. |
In #35326 (comment), I used Arch's 4.19 LTS and could not reproduce the bexport corruption. However, the kernel configuration differs between 4.19 and 5.3, so that may be unscientific. ( What set of kernels do the current Linux builders use? That might provide a lower bound, as I've never seen the issue there. (I'd bring up #9505 to advocate for an Arch builder, but that issue is more about everything but the kernel version. I feel like there should be some builder which is at the latest Linux kernel, whatever that may be.) |
The existing Go Linux builders use Container Optimized OS with a Linux kernel |
Thanks @myitcv, I think we have enough reports. If you do happen to find another one that's reproducible, that would be very helpful, though. |
To recap experiments last Friday (and I rechecked the test for the more mystifying of these Sunday afternoon), Cherry and I tried the following:
Doubled the size of the sigaltstack, just in case. Also sanity-checked the bounds within gdb; they were okay.
Modified the definition of fpstate to conform to what is defined in the Linux header files.
Wrote a method to allow us to store the ymm registers that were supplied (as registers) to the signal handler,
I spent some time Saturday looking for "interesting" comments in the Linux git log; I have some to review. What I am wondering is whether there was some attempt to optimize saving of the ymm registers that got fouled up. I also wonder a little about what they are doing for power management with AVX use; I saw some mention of that. |
An update from over in #35326: I've bisected the issue to kernel commit torvalds/linux@d9c9ce3, which happened between v5.1 and v5.2. It also requires the kernel to be built with GCC 9 (GCC 8 does not reproduce the issue). |
Not sure where Austin's reporting this or if he had time today, but:
All of the progress updates have been going on #35326. (Most recently, #35326 (comment).) |
There is this commit that claims to be fixing something in the culprit commit: |
I think that commit is already included in the 5.2 and 5.3 kernels, which still have the problem. |
Thanks @dvyukov. I just re-confirmed that I can still reproduce it in the same way on 5.3, which includes that commit. I'll double check that I can still reproduce right at that commit, just in case it was somehow re-introduced later. |
Reproduced at torvalds/linux@b81ff10, as well as v5.4, which was just released. I've filed the upstream kernel bug here: https://bugzilla.kernel.org/show_bug.cgi?id=205663 |
You can disable preemption by setting the environment variable But the key point here is that that doesn't avoid random corruption. The random corruption can occur with any program in any language. Using async preemption does make the random corruption more likely. But it can happen regardless. Therefore, since the |
We put off moving to go1.14+ to give things time to settle with respect to the (largely patched) kernel bug that go1.14 tickles more often because of the signals generated by the preemptive scheduler (see golang/go#35777). There is a small risk of unpatched kernels out there. Also, go1.15 comes out in roughly a month and we'll need to move to at least go1.14 by then to continue to get security updates (since go1.13.x will no longer be maintained). We've watched the ecosystem and waited for large infrastructure products to move to go1.14. Kubernetes and etcd, among others, have taken the plunge. Now feels like a good time. Signed-off-by: Andrew Harding <andrew.harding@hpe.com>
Change https://golang.org/cl/243658 mentions this issue: |
Change https://golang.org/cl/244059 mentions this issue: |
For #35777 For #37436 Fixes #40184 Change-Id: I68561497d9258e994d1c6c48d4fb41ac6130ee3a Reviewed-on: https://go-review.googlesource.com/c/go/+/244059 Run-TryBot: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Austin Clements <austin@google.com>
Change https://golang.org/cl/246200 mentions this issue: |
Go 1.14 included a (rather awful) workaround for a Linux kernel bug that corrupted vector registers on x86 CPUs during signal delivery (https://bugzilla.kernel.org/show_bug.cgi?id=205663). This bug was introduced in Linux 5.2 and fixed in 5.3.15, 5.4.2 and all 5.5 and later kernels. The fix was also back-ported by major distros. This workaround was necessary, but had unfortunate downsides, including causing Go programs to exceed the mlock ulimit in many configurations (#37436). We're reasonably confident that by the Go 1.16 release, the number of systems running affected kernels will be vanishingly small. Hence, this CL removes this workaround. This effectively reverts CLs 209597 (version parser), 209899 (mlock top of signal stack), 210299 (better failure message), 223121 (soft mlock failure handling), and 244059 (special-case patched Ubuntu kernels). The one thing we keep is the osArchInit function. It's empty everywhere now, but is a reasonable hook to have. Updates #35326, #35777 (the original register corruption bugs). Updates #40184 (request to revert in 1.15). Fixes #35979. Change-Id: Ie213270837095576f1f3ef46bf3de187dc486c50 Reviewed-on: https://go-review.googlesource.com/c/go/+/246200 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>
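The version ranges quoted in that commit message can be turned into a rough check. The following is only a sketch, not the runtime's removed version parser: `kernelVersion`, `parseKernelVersion`, and `possiblyAffected` are hypothetical names, and treating every 5.2.x release as affected is an assumption, since the message names no fixed 5.2.x release. A version check also over-approximates, because it cannot see distro back-ports of the fix or which compiler built the kernel.

```go
package main

import "fmt"

// kernelVersion is a hypothetical helper type for this sketch.
type kernelVersion struct{ major, minor, patch int }

// parseKernelVersion reads the leading "major.minor[.patch]" from a
// release string such as "5.3.0-24-generic". A missing patch level is
// treated as 0.
func parseKernelVersion(release string) (kernelVersion, error) {
	var v kernelVersion
	// Sscanf stops at the first mismatch; n reports how many fields were set.
	n, _ := fmt.Sscanf(release, "%d.%d.%d", &v.major, &v.minor, &v.patch)
	if n < 2 {
		return v, fmt.Errorf("cannot parse kernel release %q", release)
	}
	return v, nil
}

// possiblyAffected reports whether v falls in the range described above:
// introduced in 5.2, fixed in 5.3.15, 5.4.2, and 5.5 and later.
// Assumption: all 5.2.x releases are treated as affected.
func possiblyAffected(v kernelVersion) bool {
	if v.major != 5 {
		return false // 4.x predates the bug; later majors include the fix
	}
	switch v.minor {
	case 2:
		return true
	case 3:
		return v.patch < 15
	case 4:
		return v.patch < 2
	default:
		return false // 5.0, 5.1, and 5.5 or later
	}
}

func main() {
	for _, r := range []string{"4.19.84", "5.2.9", "5.3.0-24-generic", "5.3.15", "5.4.2", "5.5.0"} {
		v, err := parseKernelVersion(r)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s possibly affected: %v\n", r, possiblyAffected(v))
	}
}
```

On Linux the release string can be taken from `uname -r`.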
We've had several reports of memory corruption on Linux 5.3.x (or later) kernels from people running tip since asynchronous preemption was committed. This is a super-bug to track these issues. I suspect they all have one root cause.
Typically these are "runtime error: invalid memory address or nil pointer dereference" or "runtime: unexpected return pc" or "segmentation violation" panics. They can also appear as self-detected data corruption.
If you encounter a crash that could be random memory corruption, are running Linux 5.3.x or later, and are running a recent tip Go (after commit 62e53b7), please file a new issue and add a comment here. If you can reproduce it, please try setting "GODEBUG=asyncpreemptoff=1" in your environment and seeing if you can still reproduce it.
Duplicate issues (I'll edit this comment to keep this up-to-date):
runtime: corrupt binary export data seen after signal preemption CL (#35326): Corruption in file version header observed by vet. Medium reproducible. Strong leads.
cmd/compile: panic during early copyelim crash (#35658): Invalid memory address in cmd/compile/internal/ssa.copyelim. Not reproducible. Nothing obvious in stack trace. Haven't dug into assembly.
runtime: SIGSEGV in mapassign_fast64 during cmd/vet (#35689): Invalid memory address in runtime.mapassign_fast64 in vet. Stack trace includes random pointers. Some assembly decoding work.
runtime: unexpected return pc for runtime.(*mheap).alloc (#35328): Unexpected return pc. Stack trace includes random pointers. Not reproducible.
cmd/dist: I/O error: read src/xxx.go: is a directory (#35776): Random misbehavior. Not reproducible.
runtime: "fatal error: mSpanList.insertBack" in mallocgc (#35771): Bad mspan next pointer (random and unaligned). Not reproducible.
cmd/compile: invalid memory address or nil pointer dereference in gc.convlit1 (#35621): Invalid memory address in cmd/compile/internal/gc.convlit1. Evidence of memory corruption, though no obvious random pointers. Not reproducible.
cmd/go: unexpected signal during runtime execution (#35783): Corruption in file version header observed by vet. Not reproducible.
runtime: unexpected return pc for runtime.systemstack_switch (#35592): Unexpected return pc. Stack trace includes random pointers. Not reproducible.
cmd/compile: random compile error running tests (#35760): Compiler data corruption. Not reproducible.