runtime,cmd/compile: apparent memory corruption in compress/flate #54596
All three failures seem to have in common:
One possibility for the concurrent write of 0 is allocation. It may be that the GC scans the object when (or before) Why the GC sees the object before The span is from In the third failure https://build.golang.org/log/4c7618e90e934e0158912f4fb949875d171bd521 it is complaining about the first word of the object. If the object is really a In Go 1.19, where Maybe we should increment I haven't reproduced the failure myself (didn't try very hard either). This is just a hypothesis. I may have missed something... |
One issue with this is while we're still in some runtime frame a conservative scan shouldn't happen. Asynchronous preemption backs out, I think. But I agree with the rest of your thinking, i.e. the fact that it's a large object and that it seems like a pointer might somehow be getting published early (while freeindex is still unset). Either that, or there's a pointer into that space that was for some old object before the span was freed, and then materialized after the space for it had died (and it's just likely to land on the large object because it's large). I think you're also onto something with the fact that the bitmap isn't cleared. Maybe there's yet-another place somewhere that we might hold onto a stale pointer that's otherwise safe because we assume the bitmap is going to get cleared (this was the case with noscan objects and cgo not long ago)? We can try always zeroing the bitmap (unfortunate...) and see if the problem still shows up. That's data, but it would take a while to be sure given how rare this is. |
Oh, I didn't mean conservatively scanning the runtime frame (sorry for being unclear). I think it is possible that a conservative scan of a (completely separate) user frame, on a different thread, happens to see a value (possibly a dead pointer) that is within the range of the newly allocated span. That said, I don't really know that it was conservative scan. There might be other possibilities for a GC to see such pointer that I couldn't think of.
I can try that. Last night I ran Among the failures, the object hex dumps are not always all 0. It could be that it is not (always) allocation. Or it may still be allocation, just that by the time it throws, the allocating goroutine has already moved on, returned from mallocgc, and started filling in the content.
Yeah, this is unfortunate. But I'm also not really sure how safe it is for the "all pointer" case. If the above is possible, I guess it is still possible for the GC to see the newly allocated but not yet zeroed pointer word? Just much harder as it is just one word? |
Ohhhhhhh, I see. That's a really good point. It didn't occur to me that pointers could in effect get "published" this way. We probably do need to be defensive about it, as you suggest. |
Change https://go.dev/cl/426834 mentions this issue: |
With the CL above, before the gomote was killed by coordinator restart, I got
compared to previously
It seems positive that zeroing heap bitmap helps. Independently I'm going to try if delaying incrementing |
Change https://go.dev/cl/427619 mentions this issue: |
With CL 427619 (without CL 426834), it also doesn't fail running overnight ( As another data point, disabling async preemption (without either CL above) also seems to stop the failure ( I think these all support the hypothesis above, i.e. conservative scan, GC seeing the newly allocated object before zeroing, and heap bitmap not set. |
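To make the hypothesized window concrete, here is a minimal single-threaded sketch. It is a toy model, not runtime code: the `span` type, its fields, and the `reserve`/`initialize` split are hypothetical names invented for illustration, and the `ready` slice merely stands in for "object zeroed and heap bitmap set". It assumes, as discussed above, that the scanner treats any slot below `freeindex` as allocated.

```go
package main

import "fmt"

// span is a toy stand-in for a heap span; it is NOT the runtime's mspan.
type span struct {
	base      uintptr // address of slot 0
	elemSize  uintptr // size of each slot in bytes
	nelems    uintptr // number of slots
	freeIndex uintptr // slots below this index are treated as allocated
	ready     []bool  // ready[i]: slot i is zeroed and its heap bits are set
}

// looksAllocated mimics a scanner that trusts freeIndex alone when it
// finds an arbitrary word (for example during a conservative stack scan).
func (s *span) looksAllocated(p uintptr) bool {
	if p < s.base || p >= s.base+s.nelems*s.elemSize {
		return false
	}
	return (p-s.base)/s.elemSize < s.freeIndex
}

// reserve is the first half of allocation: pick a slot and bump freeIndex.
func (s *span) reserve() uintptr {
	i := s.freeIndex
	s.freeIndex++
	return i
}

// initialize is the second half: zero the memory and set the heap bits.
func (s *span) initialize(i uintptr) { s.ready[i] = true }

func main() {
	s := &span{base: 0x1000, elemSize: 64, nelems: 4, ready: make([]bool, 4)}
	stale := uintptr(0x1008) // dead value in an unrelated frame, inside slot 0

	fmt.Println(s.looksAllocated(stale)) // false: slot 0 is still free

	i := s.reserve()
	// Window: a scanner running here sees an "allocated" slot whose
	// contents and heap bits have not been initialized yet.
	fmt.Println(s.looksAllocated(stale), s.ready[i]) // true false

	s.initialize(i)
	fmt.Println(s.looksAllocated(stale), s.ready[i]) // true true
}
```

In the real runtime the two halves run on one thread while the scan runs on another, so the middle state is only observable under unlucky timing, which would be consistent with how rare the failure is.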
Found new dashboard test flakes for:
2022-08-17 20:19 freebsd-amd64-12_3 go@9c2b481b compress/flate.TestDeflateInflate (log)
2022-08-26 17:48 darwin-amd64-10_14 go@296c40db compress/flate.TestBestSpeed (log)
2022-09-07 06:18 freebsd-amd64-13_0 go@6375f508 compress/flate.TestDeflateInflate (log)
2022-09-16 16:32 linux-386-buster go@686b38b5 compress/flate.TestDeflateInflate (log)
2022-09-29 19:35 linux-386-softfloat go@545adcfe compress/flate.TestDeflateInflate (log)
Found new dashboard test flakes for:
2022-11-03 15:15 windows-amd64-2008 go@582a6c2d compress/flate.TestDeflateInflate (log)
Just hit on a trybot: https://storage.googleapis.com/go-build-log/65e11c86/windows-amd64-2016_3020f2c6.log Marking release-blocker, since this can cause random crashes on any platform. It sounds like we have a pretty good hypothesis as to the cause. We should finish the fix and land it. |
Yeah, I'm working on the fix. Hope to land soon-ish. |
Change https://go.dev/cl/449017 mentions this issue: |
@aclements @mknyszek CL 427619 and CL 449017 are two possible fixes. I stress-tested both on gomotes; each ran for 11+ hours and 300000+ runs with no failure, so both seem to fix the issue. Let me know which one you think is better. Thanks. (We probably still need some benchmarks to make sure the fast path performance doesn't regress. Will do.)
I accidentally clicked submit in the wrong Gerrit tab. I sent a revert: https://go.dev/cl/449501. |
@gopherbot please backport this to previous releases. This may cause (rare) GC crashes or memory corruptions. The large object case is new for Go 1.20, but it could still happen with small objects in previous releases. Thanks. |
Backport issue(s) opened: #56750 (for 1.20). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Well, gopherbot picked up the "Go 1.20" from my previous comment and opened a 1.20 backport, which is not what we want... I'll open the 1.18 and 1.19 ones manually...
Change https://go.dev/cl/453235 mentions this issue: |
Change https://go.dev/cl/453255 mentions this issue: |
…r it is initialized

When the GC is scanning some memory (possibly conservatively), finding a pointer, while concurrently another goroutine is allocating an object at the same address as the found pointer, the GC may see the pointer before the object and/or the heap bits are initialized. This may cause the GC to see bad pointers and possibly crash.

To prevent this, we make it that the scanner can only see the object as allocated after the object and the heap bits are initialized. Currently the allocator uses freeindex to find the next available slot, and that code is coupled with updating the free index to a new slot past it. The scanner also uses the freeindex to determine if an object is allocated. This is somewhat racy. This CL makes the scanner use a different field, which is only updated after the object initialization (and a memory barrier).

Updates #54596.
Fixes #56751.

Change-Id: I2a57a226369926e7192c253dd0d21d3faf22297c
Reviewed-on: https://go-review.googlesource.com/c/go/+/449017
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
(cherry picked from commit febe7b8)
Reviewed-on: https://go-review.googlesource.com/c/go/+/453255
…r it is initialized

When the GC is scanning some memory (possibly conservatively), finding a pointer, while concurrently another goroutine is allocating an object at the same address as the found pointer, the GC may see the pointer before the object and/or the heap bits are initialized. This may cause the GC to see bad pointers and possibly crash.

To prevent this, we make it that the scanner can only see the object as allocated after the object and the heap bits are initialized. Currently the allocator uses freeindex to find the next available slot, and that code is coupled with updating the free index to a new slot past it. The scanner also uses the freeindex to determine if an object is allocated. This is somewhat racy. This CL makes the scanner use a different field, which is only updated after the object initialization (and a memory barrier).

Updates #54596.
Fixes #56752.

Change-Id: I2a57a226369926e7192c253dd0d21d3faf22297c
Reviewed-on: https://go-review.googlesource.com/c/go/+/449017
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
(cherry picked from commit febe7b8)
Reviewed-on: https://go-review.googlesource.com/c/go/+/453235
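The approach described in the commit message (the scanner reads a separate field that is updated only after initialization) can be sketched roughly as follows. This is an illustrative toy, not the code from the CL: the `span` type and the `published` field are hypothetical names, `heapBits` stands in for zeroing the object and writing its heap bitmap, and it assumes a single allocating goroutine per span. The atomic store plays the role of the memory barrier mentioned above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// span is a toy model; field names are hypothetical, not the runtime's.
type span struct {
	base, elemSize, nelems uintptr
	freeIndex              uintptr        // allocator-private: next slot to hand out
	published              atomic.Uintptr // scanner-visible: slots below this are initialized
	heapBits               []bool         // stand-in for zeroed memory + heap bitmap
}

// alloc reserves a slot, initializes it, and only then publishes it.
func (s *span) alloc() uintptr {
	i := s.freeIndex
	s.freeIndex++ // the scanner no longer looks at this field

	s.heapBits[i] = true // initialize the object and its heap bits first

	// Publish last. The atomic store orders the initialization above
	// before the index becomes visible to a concurrent scanner.
	s.published.Store(s.freeIndex)
	return s.base + i*s.elemSize
}

// scannerSeesAllocated is the scanner-side check: it consults only the
// published index, never the allocator's freeIndex.
func (s *span) scannerSeesAllocated(p uintptr) bool {
	if p < s.base || p >= s.base+s.nelems*s.elemSize {
		return false
	}
	return (p-s.base)/s.elemSize < s.published.Load()
}

func main() {
	s := &span{base: 0x1000, elemSize: 64, nelems: 4, heapBits: make([]bool, 4)}
	stale := uintptr(0x1008) // stray word that happens to point into slot 0

	fmt.Println(s.scannerSeesAllocated(stale)) // false
	p := s.alloc()
	fmt.Println(p == 0x1000, s.scannerSeesAllocated(stale)) // true true
}
```

The design point is that the allocator can keep bumping its own index early, as it does today, while the scanner's view lags until the slot is safe to inspect.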
Found new dashboard test flakes for:
2022-11-07 19:46 windows-amd64-2012 go@601ad2e4 compress/flate.TestWriteError (log)
Failure is before the fix. |
greplogs -l -e \(\?ms\)runtime\\.throw.\*FAIL\\s+compress/flate --since=2021-10-09
2022-08-17T20:19:28-9c2b481/freebsd-amd64-12_3
2022-08-17T03:15:44-e1b62ef/netbsd-arm64-bsiegert
A very similar failure mode was reported in #48846 and thought to have been fixed by @cherrymui.
(attn @golang/runtime)