Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: concurrent GC causes SIGBUS, SIGSEGV #10984

Closed
robx opened this issue May 29, 2015 · 31 comments
Closed

runtime: concurrent GC causes SIGBUS, SIGSEGV #10984

robx opened this issue May 29, 2015 · 31 comments
Milestone

Comments

@robx
Copy link

robx commented May 29, 2015

We experimented with current master, and are experiencing regular crashes under reasonably high load. I haven't been able to reduce this yet, but maybe it's helpful as is? Let me know if there's something else I can provide.

$ go version
go version devel +8b186df Thu May 28 02:30:26 2015 +0000 darwin/amd64

Cross-compiled to linux/amd64.

unexpected fault address 0xc2126257a0
fatal error: fault
[signal 0xb code=0x2 addr=0xc2126257a0 pc=0xc2126257a0]

goroutine 31 [running]:
runtime.throw(0xad7840, 0x5)
        /s/go/src/runtime/panic.go:527 +0x96 fp=0xc20f3eb8c0 sp=0xc20f3eb8a8
runtime.sigpanic()
        /s/go/src/runtime/sigpanic_unix.go:27 +0x2ae fp=0xc20f3eb910 sp=0xc20f3eb8c0
created by realzeit.io/rtb/decision-engine/server.NewCampaign
        /go/src/realzeit.io/rtb/decision-engine/server/campaign.go:40 +0x15e

goroutine 1 [select, 1 minutes]:
main.main()
        /go/src/realzeit.io/rtb/one/main.go:117 +0x10e0

...

The relevant code:

    28  func NewCampaign(de *bidding.DecisionEngine) *campaign {
    29      c := campaign{
    30          de:         de,
    31          requests:   make(chan crequest),
    32          updCamp:    make(chan rzcampaign.Config),
    33          updBrain:   make(chan rzcampaign.BrainConfig),
    34          debugStats: make(chan chan DebugStats),
    35      }
    36      c.wg.Add(1)
    37      go func() {
    38          c.loop()
    39          c.wg.Done()
    40      }()
    41      return &c
    42  }
@davecheney
Copy link
Contributor

Have you run your application under the replace detector?

@minux
Copy link
Member

minux commented May 29, 2015 via email

@robx
Copy link
Author

robx commented May 29, 2015

We haven't run it with -race under proper load, will do that once I get it compiled (currently stuck on #9918). No complaints in tests. (I assume "replace" above means "race".)

It does not use cgo. The only use of unsafe is via github.com/oschwald/maxminddb-golang (https://github.com/oschwald/maxminddb-golang/blob/master/key_other.go). There's no assembler in there.

The full stack trace is at https://gist.github.com/robx/ee71a5868ff136f04aab

@robx
Copy link
Author

robx commented May 29, 2015

-race doesn't complain, here's three more stack traces, thanks @alicebob.

go version devel +80864cf Fri May 29 04:33:15 2015 +0000 linux/amd64
unexpected fault address 0x0
fatal error: fault
[signal 0xb code=0x80 addr=0x0 pc=0x589e61]

goroutine 132 [running]:
runtime.throw(0xcb4a60, 0x5)
        /home/harmen/src/go/src/runtime/panic.go:527 +0x96 fp=0xc20f4d58c8 sp=0xc20f4d58b0
runtime.sigpanic()
        /home/harmen/src/go/src/runtime/sigpanic_unix.go:27 +0x2ae fp=0xc20f4d5918 sp=0xc20f4d58c8
realzeit.io/rtb/decision-engine/server.(*campaign).loop(0xc2082caa80)
        /home/harmen/play/src/realzeit.io/rtb/decision-engine/server/campaign.go:51 +0x231 fp=0xc20f4d5fc8 sp=0
xc20f4d5918
realzeit.io/rtb/decision-engine/server.NewCampaign.func1(0xc2082caa80)
        /home/harmen/play/src/realzeit.io/rtb/decision-engine/server/campaign.go:38 +0x36 fp=0xc20f4d5fd8 sp=0x
c20f4d5fc8
runtime.goexit()
        /home/harmen/src/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc20f4d5fe0 sp=0xc20f4d5fd8
created by realzeit.io/rtb/decision-engine/server.NewCampaign
        /home/harmen/play/src/realzeit.io/rtb/decision-engine/server/campaign.go:40 +0x183

goroutine 1 [select]:
main.main()
        /home/harmen/play/src/realzeit.io/rtb/one/main.go:117 +0x1404
unexpected fault address 0xc2132024e0
fatal error: fault
[signal 0xb code=0x2 addr=0xc2132024e0 pc=0xc2132024e0]

goroutine 57 [running]:
runtime.throw(0xcb4a60, 0x5)
        /home/harmen/src/go/src/runtime/panic.go:527 +0x96 fp=0xc20e5478c0 sp=0xc20e5478a8
runtime.sigpanic()
        /home/harmen/src/go/src/runtime/sigpanic_unix.go:27 +0x2ae fp=0xc20e547910 sp=0xc20e5478c0
created by realzeit.io/rtb/decision-engine/server.NewCampaign
        /home/harmen/play/src/realzeit.io/rtb/decision-engine/server/campaign.go:40 +0x183

goroutine 1 [select]:
main.main()
        /home/harmen/play/src/realzeit.io/rtb/one/main.go:117 +0x1404

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
        /home/harmen/src/go/src/runtime/asm_amd64.s:1670 +0x1
unexpected fault address 0xc20edef460
fatal error: fault
[signal 0xb code=0x2 addr=0xc20edef460 pc=0xc20edef460]

goroutine 197 [running]:
runtime.throw(0xcb4a60, 0x5)
        /home/harmen/src/go/src/runtime/panic.go:527 +0x96 fp=0xc2119518c0 sp=0xc2119518a8
runtime.sigpanic()
        /home/harmen/src/go/src/runtime/sigpanic_unix.go:27 +0x2ae fp=0xc211951910 sp=0xc2119518c0
created by realzeit.io/rtb/decision-engine/server.NewCampaign
        /home/harmen/play/src/realzeit.io/rtb/decision-engine/server/campaign.go:40 +0x183

goroutine 1 [select, 1 minutes]:
main.main()
        /home/harmen/play/src/realzeit.io/rtb/one/main.go:117 +0x1404

@mikioh mikioh changed the title crash, fatal error: fault runtime: SIGSEGV, SEGV_ACCERR [signal 0xb code=0x2] on darwin/amd64 May 29, 2015
@mikioh
Copy link
Contributor

mikioh commented May 29, 2015

Can you provide a minimal reproducible snippet for us? The dying messages clearly said that it went wrong in (*campaign).loop(); your code, runtime or whatever touched non-permitted/unmapped stuff on addressing space.

@minux
Copy link
Member

minux commented May 30, 2015

On Fri, May 29, 2015 at 5:53 AM, Robert notifications@github.com wrote:

go version devel +80864cf Fri May 29 04:33:15 2015 +0000 linux/amd64

unexpected fault address 0x0
fatal error: fault
[signal 0xb code=0x80 addr=0x0 pc=0x589e61]

This fault is interesting. Note the code is 0x80, normally, for SIGSEGV,
the code
should be 0/1/2 for user triggered fault. 0x80 means it's sent from kernel.

I don't know why would kernel send a synchronous signal like SIGSEGV.

@mikioh
Copy link
Contributor

mikioh commented May 30, 2015

Perhaps 0x80 comes from:

/usr/include/sys/wait.h:
#define WCOREFLAG   0200

@robx
Copy link
Author

robx commented May 30, 2015

I'll try boiling it down.

Note that the crash is on linux/amd64, it's just that the first crashing version was cross-compiled from darwin/amd64. The second one (with -race) was built on linux/amd64 directly.

@robx
Copy link
Author

robx commented May 30, 2015

There's rather reduced test case in https://github.com/robx/gocrash. It reliably crashes for me when running

go test -bench=. -cpu=2

on OS X, go version devel +8b186df Thu May 28 02:30:26 2015 +0000 darwin/amd64.

It's likely that there's more that can be removed, but I've tried a couple of things that I tried towards the end seemed to be necessary to cause the problems.

@davecheney
Copy link
Contributor

davecheney commented May 30, 2015 via email

@mikioh mikioh changed the title runtime: SIGSEGV, SEGV_ACCERR [signal 0xb code=0x2] on darwin/amd64 runtime: SIGSEGV, SEGV_ACCERR [signal 0xb code=0x2] May 31, 2015
@mikioh
Copy link
Contributor

mikioh commented May 31, 2015

Thanks for the gocrash. I just saw the crash w/ tip on freebsd/amd64.

fatal error: fault
[signal 0xa code=0xc addr=0xc20a610a40 pc=0xc20a610a40]

goroutine 16371 [running]:
runtime.throw(0x583510, 0x5)
    /go/src/runtime/panic.go:527 +0x96 fp=0xc2083d9e28 sp=0xc2083d9e10
runtime.sigpanic()
    /go/src/runtime/sigpanic_unix.go:21 +0x1e7 fp=0xc2083d9e78 sp=0xc2083d9e28
created by _/home/mikioh/gocrash.NewHandler
    /gocrash/handler.go:25 +0xd5

and

TEXT _/gocrash.NewHandler(SB) /gocrash/handler.go
        handler.go:17   0x476d80        64488b0c25f8ffffff      FS MOVQ FS:0xfffffff8, CX
        handler.go:17   0x476d89        483b6110                CMPQ 0x10(CX), SP
(snip)
        handler.go:25   0x476e4b        4889442408              MOVQ AX, 0x8(SP)
        handler.go:25   0x476e50        e8bb82fbff              CALL runtime.newproc(SB)
        handler.go:26   0x476e55        488b5c2418              MOVQ 0x18(SP), BX
        handler.go:26   0x476e5a        48895c2448              MOVQ BX, 0x48(SP)

I guess this is a memory corruption issue by new garbage collector. If you never see this sort of crash when you run your software with GOGC=off, it's likely.

@mikioh mikioh added this to the Go1.5 milestone May 31, 2015
@robx
Copy link
Author

robx commented May 31, 2015

@davecheney I don't think so? manager.Handle receives from r.rep as far as I can tell. But probably I misunderstand.

I agree that the code is a bit weird right now, it used to do more (particularly collect responses from multiple handlers).

@davecheney
Copy link
Contributor

I am sorry, I mistook reading from the channel and placing that value on r.resp with placing the channel on r.resp itself.

@robx
Copy link
Author

robx commented May 31, 2015

I've reduced it a little further, though removing that reps business makes the bug go away.

And I can confirm that it doesn't crash with GOGC=off.

@robx
Copy link
Author

robx commented May 31, 2015

Some further observations:

  • The reduced version seems to be crashing less reliably than the initial one... Probably timing issues.
  • As I reduce the number of fields in Response, the bug gets harder to trigger.
  • If I don't return the string from run(), the bug hardly triggers any more, but I've still seen it.
  • If I replace the done callback by a simple channel, the bug disappears.
  • The string building in the test seems to be necessary, but just to get garbage collection going.

@robx
Copy link
Author

robx commented Jun 1, 2015

Very rarely, I get a different crash:

Benchmark100-2  panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x0]

goroutine 10129 [running]:
github.com/robx/gocrash.(*handler).loop(0x2087bc008)
        /go/src/github.com/robx/gocrash/handler.go:21 +0x10b
created by github.com/robx/gocrash.NewHandler
        /go/src/github.com/robx/gocrash/handler.go:16 +0x8c

goroutine 1 [chan receive]:
testing.(*B).run(0x20833e000, 0x0, 0x0, 0x0, 0x0, 0x0)
        /s/go/src/testing/benchmark.go:180 +0x7e
testing.RunBenchmarks(0x1d0950, 0x250390, 0x1, 0x1)
        /s/go/src/testing/benchmark.go:332 +0x762
testing.(*M).Run(0x2082eff38, 0x2082bc630)
        /s/go/src/testing/testing.go:501 +0x1bb
main.main()
        github.com/robx/gocrash/_test/_testmain.go:56 +0x119

@RLH
Copy link
Contributor

RLH commented Jun 1, 2015

Any bisects?

On Monday, June 1, 2015, Robert notifications@github.com wrote:

Very rarely, I get a different crash:

Benchmark100-2 panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x0]

goroutine 10129 [running]:github.com/robx/gocrash.(*handler).loop(0x2087bc008)
/go/src/github.com/robx/gocrash/handler.go:21 +0x10b
created by github.com/robx/gocrash.NewHandler
/go/src/github.com/robx/gocrash/handler.go:16 +0x8c

goroutine 1 [chan receive]:
testing.(_B).run(0x20833e000, 0x0, 0x0, 0x0, 0x0, 0x0)
/s/go/src/testing/benchmark.go:180 +0x7e
testing.RunBenchmarks(0x1d0950, 0x250390, 0x1, 0x1)
/s/go/src/testing/benchmark.go:332 +0x762
testing.(_M).Run(0x2082eff38, 0x2082bc630)
/s/go/src/testing/testing.go:501 +0x1bb
main.main()
github.com/robx/gocrash/_test/_testmain.go:56 +0x119


Reply to this email directly or view it on GitHub
#10984 (comment).

@robx
Copy link
Author

robx commented Jun 1, 2015

I haven't bisected further than to know that it crashes with devel +8b186df and doesn't with go1.4.2.

@robx
Copy link
Author

robx commented Jun 1, 2015

I failed to bisect:

$ git bisect bad
85a15778005c8412005ef7366cf40b0a50ace5ba is the first bad commit
commit 85a15778005c8412005ef7366cf40b0a50ace5ba
Author: Shenghou Ma <minux@golang.org>
Date:   Wed May 13 20:28:05 2015 -0400

    math/big, cmd/internal/gc/big: fix vet detected printf problem

    Change-Id: I54425d8cbe0277d7a0c9d66c37f2128a0dfa6441
    Reviewed-on: https://go-review.googlesource.com/10041
    Run-TryBot: Minux Ma <minux@golang.org>
    Reviewed-by: Robert Griesemer <gri@golang.org>

:040000 040000 df5254b3ce30ab97eb0f00c3b13f5de0c343a749 287517adef20d9e05aa96a9233b622d234564486 M  src

Clearly, that change doesn't introduce the problem. However, for weird reasons, the crash triggers so rarely before this that it's hard to avoid false negatives.

@robx
Copy link
Author

robx commented Jun 1, 2015

This one seems more likely:

6ad33be2d9d6b24aa741b3007a4bcd52db222c41 is the first bad commit
commit 6ad33be2d9d6b24aa741b3007a4bcd52db222c41
Author: Srdjan Petrovic <spetrovic@google.com>
Date:   Thu Apr 16 14:32:18 2015 -0700

    runtime: implement xadduintptr and update system mstats using it

    The motivation is that sysAlloc/Free() currently aren't safe to be
    called without a valid G, because arm's xadd64() uses locks that require
    a valid G.

    The solution here was proposed by Dmitry Vyukov: use xadduintptr()
    instead of xadd64(), until arm can support xadd64 on all of its
    architectures (not a trivial task for arm).

    Change-Id: I250252079357ea2e4360e1235958b1c22051498f
    Reviewed-on: https://go-review.googlesource.com/9002
    Reviewed-by: Dmitry Vyukov <dvyukov@google.com>

:040000 040000 ddacd34d33aa42b134646c7065c036d40d39e9e2 91ff003353f0e2f68e2eaba52afafb6212af024a M  src

@RLH
Copy link
Contributor

RLH commented Jun 1, 2015

We sometimes run with GOGC=10 or even GOGC=1 to increase the chances of failure related to the GC.

@robx
Copy link
Author

robx commented Jun 1, 2015

Nope, got another crash, with the supposedly-good version before that commit... Low GOGC doesn't seem to make the test fail more reliably, unfortunately.

@robx
Copy link
Author

robx commented Jun 1, 2015

Third time's a charm?

e870f06c3f49ed63960a2575e330c2c75fc54a34 is the first bad commit
commit e870f06c3f49ed63960a2575e330c2c75fc54a34
Author: Austin Clements <austin@google.com>
Date:   Wed Apr 22 14:42:26 2015 -0400

    runtime: yield time slice to most recently readied G

    Currently, when the runtime ready()s a G, it adds it to the end of the
    current P's run queue and continues running. If there are many other
    things in the run queue, this can result in a significant delay before
    the ready()d G actually runs and can hurt fairness when other Gs in
    the run queue are CPU hogs. For example, if there are three Gs sharing
    a P, one of which is a CPU hog that never voluntarily gives up the P
    and the other two of which are doing small amounts of work and
    communicating back and forth on an unbuffered channel, the two
    communicating Gs will get very little CPU time.

    Change this so that when G1 ready()s G2 and then blocks, the scheduler
    immediately hands off the remainder of G1's time slice to G2. In the
    above example, the two communicating Gs will now act as a unit and
    together get half of the CPU time, while the CPU hog gets the other
    half of the CPU time.

    This fixes the problem demonstrated by the ping-pong benchmark added
    in the previous commit:

    benchmark                old ns/op     new ns/op     delta
    BenchmarkPingPongHog     684287        825           -99.88%

    On the x/benchmarks suite, this change improves the performance of
    garbage by ~6% (for GOMAXPROCS=1 and 4), and json by 28% and 36% for
    GOMAXPROCS=1 and 4. It has negligible effect on heap size.

    This has no effect on the go1 benchmark suite since those benchmarks
    are mostly single-threaded.

    Change-Id: I858a08eaa78f702ea98a5fac99d28a4ac91d339f
    Reviewed-on: https://go-review.googlesource.com/9289
    Reviewed-by: Rick Hudson <rlh@golang.org>
    Reviewed-by: Russ Cox <rsc@golang.org>

:040000 040000 bbd15006954b05b00cf4c75514209b3153455458 413aa33afef655aca1494a3116850a2a3713f4ce M  src

Interestingly, with the previous commit, the test seems get stuck occasionally.

@alicebob
Copy link
Contributor

alicebob commented Jun 1, 2015

We simplified the testcase in robx/gocrash a bit. It now uses one channel less.

@mikioh
Copy link
Contributor

mikioh commented Jun 2, 2015

@RLH,

Looks like db7fd1c is the culprit. On db7fd1c, gocrash crashes with:

runtime:greyobject: checkmarks finds unexpected unmarked object obj=0x2082ec070, mbits->bits=0x1 *mbits->bitp=0xb9
runtime: found obj at *(0x2083021e0+0x8)
runtime:greyobject Span: obj=0x2082ec070 k=0x104176 s.start=0x2082ec000 s.limit=0x2082ee000 s.sizeclass=1 s.elemsize=8
 *(obj+0) = 0x208310d80
fatal error: checkmark found unmarked object

after 5254b7e, with reorganized gc code, gocrash crashes with:

unexpected fault address 0x2082df7b0
fatal error: fault
[signal 0xa code=0x2 addr=0x2082df7b0 pc=0x2082df7b0]

PS: I used the following script and didn't use "git bisect run" because the single-train source tree contains a few pitfalls such as "scheduler deadlock issues."

#!/usr/bin/env bash

cd $GOROOT/src
./make.bash
cd /path/to/gocrash
go test -c

for i in {1..10}; do
        GODEBUG=gccheckmark=1 GOGC=1 ./gocrash.test -test.bench=. -test.cpu=2,4,8 || exit 1
done

exit 0

@mikioh mikioh changed the title runtime: SIGSEGV, SEGV_ACCERR [signal 0xb code=0x2] runtime: concurrent GC causes SIGBUS, SIGSEGV Jun 2, 2015
@aclements
Copy link
Member

@rsc and I may know what's going on here. Channel send uses typedmemmove to store the value. In this case, it's storing to directly to the receiving G's stack, so the typedmemmove does not generate write barriers. If that send happens between the scan and mark termination phases, and the receiver G doesn't get scheduled between these phases, mark termination won't rescan the receiver G's stack and won't discover the sent pointer. If that's the only remaining reference to the object, it will be incorrectly collected.

@aclements
Copy link
Member

If our diagnosis is correct, this is related to issue #11084, though the fix we've been entertaining for that issue isn't quite general enough to fix this issue, since it only fixes writes within the same stack.

@robx
Copy link
Author

robx commented Jun 9, 2015

I saw some related changes, so tried current master; it's still crashing:

$ go version
go version devel +aff7a57 Tue Jun 9 08:21:21 2015 +0000 darwin/amd64
$ go test -bench=. -cpu=4 2>&1 
PASS
Benchmark100-4  unexpected fault address 0xb01dfacedebac1e
fatal error: fault
[signal 0xb code=0x1 addr=0xb01dfacedebac1e pc=0x7bc44]

goroutine 9742 [running]:
runtime.throw(0x1888b0, 0x5)
    /s/go/src/runtime/panic.go:527 +0x96 fp=0x2082df630 sp=0x2082df618
runtime.sigpanic()
    /s/go/src/runtime/sigpanic_unix.go:27 +0x2ba fp=0x2082df680 sp=0x2082df630
github.com/robx/gocrash.handler.loop(0x2083d60c0)
    /go/src/github.com/robx/gocrash/manager.go:52 +0x144 fp=0x2082df7b8 sp=0x2082df680
runtime.goexit()
    /s/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0x2082df7c0 sp=0x2082df7b8
created by github.com/robx/gocrash.NewHandler
    /go/src/github.com/robx/gocrash/manager.go:42 +0x53

...

(signal=0xb code=0x1 seems to be the exception, most of the time it's signal=0xa code=0x2).

@aclements
Copy link
Member

@robx, can you give current master another try? The problem I mentioned in #10984 (comment) was fixed by 80ec711 just yesterday.

@robx
Copy link
Author

robx commented Jun 17, 2015

gocrash doesn't crash for me any more with current master (7768296). Thank you!

I'll test the original under load in a bit. It also runs stably in production. Fixed, from my perspective.

@aclements
Copy link
Member

It sounds like this is fixed. If not, please reopen. Thanks!

@golang golang locked and limited conversation to collaborators Jun 25, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants