runtime: concurrent GC causes SIGBUS, SIGSEGV #10984
Comments
Have you run your application under the race detector? |
Does the program use cgo? unsafe? Custom assembly functions?
Have you tried running it with the race detector?
What's the full stack trace?
|
We haven't run it with the race detector. It does not use cgo. The only use of unsafe is via github.com/oschwald/maxminddb-golang (https://github.com/oschwald/maxminddb-golang/blob/master/key_other.go). There's no assembler in there.
|
|
Can you provide a minimal reproducible snippet for us? The crash output clearly shows that it went wrong in (*campaign).loop(); your code, the runtime, or something else touched a non-permitted or unmapped part of the address space. |
On Fri, May 29, 2015 at 5:53 AM, Robert notifications@github.com wrote:
I don't know why the kernel would send a synchronous signal like SIGSEGV. |
Perhaps 0x80 comes from:
|
I'll try boiling it down. Note that the crash is on linux/amd64, it's just that the first crashing version was cross-compiled from darwin/amd64. The second one (with |
There's a rather reduced test case in https://github.com/robx/gocrash. It reliably crashes for me when running
on OS X, go version devel +8b186df Thu May 28 02:30:26 2015 +0000 darwin/amd64. It's likely that more can be removed, but the few things I tried removing towards the end seemed to be necessary to cause the problems. |
This part of the test makes no sense
reps := make(chan Response, 1)
r.rep <- reps
this will always block the goroutine with no possibility of waking it up.
|
Thanks for the gocrash. I just saw the crash w/ tip on freebsd/amd64.
and
I guess this is a memory corruption issue caused by the new garbage collector. That's likely if you never see this sort of crash when you run your software with GOGC=off. |
@davecheney I don't think so? I agree that the code is a bit weird right now, it used to do more (particularly collect responses from multiple handlers). |
I am sorry, I confused reading from the channel and placing that value on r.resp with placing the channel itself on r.resp. |
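For readers following the exchange above, here is a minimal sketch of the pattern being discussed: a buffered reply channel is created and the channel itself is sent over another channel, so the send unblocks as soon as a handler receives it. Only `Response`, `reps` and `r.rep` come from the thread; the `Request` type, the `ask` and `serve` functions, and the `Body` field are hypothetical names for illustration, not the code from robx/gocrash.

```go
package main

import "fmt"

// Response is the reply payload. The Body field is a hypothetical placeholder.
type Response struct{ Body string }

// Request carries a channel over which reply channels are handed to a handler.
// This type is an assumption made for the sketch.
type Request struct {
	rep chan chan Response
}

// ask creates a buffered reply channel, sends the channel itself to the
// handler, and then waits for the handler's answer on it.
func ask(r Request) Response {
	reps := make(chan Response, 1)
	r.rep <- reps // blocks only until a handler receives the reply channel
	return <-reps
}

// serve is a hypothetical handler: it receives a reply channel and answers once.
func serve(r Request, body string) {
	reps := <-r.rep
	reps <- Response{Body: body}
}

func main() {
	r := Request{rep: make(chan chan Response)}
	go serve(r, "pong")
	fmt.Println(ask(r).Body)
}
```

As long as some goroutine receives from `r.rep`, the sender is woken up, which matches the correction in the comment above.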
I've reduced it a little further, though removing that reps business makes the bug go away. And I can confirm that it doesn't crash with GOGC=off. |
Some further observations:
|
Very rarely, I get a different crash:
|
Any bisects? On Monday, June 1, 2015, Robert notifications@github.com wrote:
|
I haven't bisected further than to know that it crashes with |
I failed to bisect:
Clearly, that change doesn't introduce the problem. However, for weird reasons, the crash triggers so rarely before this that it's hard to avoid false negatives. |
This one seems more likely:
|
We sometimes run with GOGC=10 or even GOGC=1 to increase the chances of failure related to the GC. |
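The GOGC settings mentioned in this thread (off, 10, 1) can also be applied from inside a test program. The sketch below uses `debug.SetGCPercent`, where a negative value corresponds to GOGC=off, and counts completed collections via `runtime.MemStats.NumGC`. The helper `gcCyclesDuring` and the `churn` workload are hypothetical names for illustration and are not part of gocrash.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// sink keeps each allocation reachable from a global so it goes on the heap.
var sink []byte

// churn is a hypothetical workload that produces about 64 MB of short-lived garbage.
func churn() {
	for i := 0; i < 1024; i++ {
		sink = make([]byte, 64<<10)
	}
}

// gcCyclesDuring runs work with the given GC percentage (negative disables
// automatic collection, like GOGC=off) and reports how many collections
// completed while work ran. The previous setting is restored afterwards.
func gcCyclesDuring(percent int, work func()) uint32 {
	prev := debug.SetGCPercent(percent)
	defer debug.SetGCPercent(prev)

	// Finish any in-flight cycle so it is not attributed to work.
	runtime.GC()

	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	work()
	runtime.ReadMemStats(&after)
	return after.NumGC - before.NumGC
}

func main() {
	fmt.Println("GOGC=off:", gcCyclesDuring(-1, churn))
	fmt.Println("GOGC=1:  ", gcCyclesDuring(1, churn))
	fmt.Println("GOGC=100:", gcCyclesDuring(100, churn))
}
```

A lower percentage makes the collector run more often for the same allocation volume, which is why it is used to flush out GC-related failures. As the following comment shows, it did not make this particular crash more reliable.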
Nope, got another crash, with the supposedly-good version before that commit... Low GOGC doesn't seem to make the test fail more reliably, unfortunately. |
Third time's a charm?
Interestingly, with the previous commit, the test seems to get stuck occasionally. |
We simplified the testcase in robx/gocrash a bit. It now uses one channel less. |
@RLH, Looks like db7fd1c is the culprit. On db7fd1c, gocrash crashes with:
after 5254b7e, with reorganized gc code, gocrash crashes with:
PS: I used the following script and didn't use "git bisect run" because the single-train source tree contains a few pitfalls such as "scheduler deadlock issues."
|
@rsc and I may know what's going on here. Channel send uses typedmemmove to store the value. In this case, it's storing directly to the receiving G's stack, so the typedmemmove does not generate write barriers. If that send happens between the scan and mark termination phases, and the receiver G doesn't get scheduled between these phases, mark termination won't rescan the receiver G's stack and won't discover the sent pointer. If that's the only remaining reference to the object, it will be incorrectly collected. |
If our diagnosis is correct, this is related to issue #11084, though the fix we've been entertaining for that issue isn't quite general enough to fix this issue, since it only fixes writes within the same stack. |
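To make the diagnosis above concrete, here is a minimal sketch of the program shape it describes: a receiver blocked on an unbuffered channel, a sender handing over a pointer to a freshly allocated object, and the receiver's copy being the only remaining reference while a collection runs. This is not the gocrash reproducer. The `payload` type and the `roundTrip` function are hypothetical, and on a runtime with the fix the object is expected to survive every time.

```go
package main

import (
	"fmt"
	"runtime"
)

// payload is a hypothetical heap object passed between goroutines.
type payload struct{ data [64]int }

// roundTrip sends n freshly allocated pointers over an unbuffered channel.
// When the receiver is already blocked in the receive, the runtime can copy
// the pointer straight into the receiver's stack. The receiver then forces a
// collection before touching the object, so its stack slot must keep the
// object alive.
func roundTrip(n int) int {
	ch := make(chan *payload)
	done := make(chan int)

	go func() {
		sum := 0
		for p := range ch {
			runtime.GC()     // collect while p may be the only reference
			sum += p.data[0] // would read freed memory if p had been missed
		}
		done <- sum
	}()

	for i := 0; i < n; i++ {
		p := &payload{}
		p.data[0] = 1
		ch <- p // the sender drops its reference after this iteration
	}
	close(ch)
	return <-done
}

func main() {
	fmt.Println(roundTrip(50))
}
```

The original failure depended on the send landing between the scan and mark termination phases, so a sketch like this only illustrates the shape of the problem and should not be expected to reproduce the crash deterministically on the affected commits.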
I saw some related changes, so tried current master; it's still crashing:
( |
@robx, can you give current master another try? The problem I mentioned in #10984 (comment) was fixed by 80ec711 just yesterday. |
gocrash doesn't crash for me any more with current master (7768296). Thank you!
|
It sounds like this is fixed. If not, please reopen. Thanks! |
We experimented with current master, and are experiencing regular crashes under reasonably high load. I haven't been able to reduce this yet, but maybe it's helpful as is? Let me know if there's something else I can provide.
Cross-compiled to linux/amd64.
The relevant code: