
runtime: lock ordering problem between trace and wbufSpans #56554

Closed
bcmills opened this issue Nov 3, 2022 · 12 comments

bcmills (Contributor) commented Nov 3, 2022

#!watchflakes
post <- `fatal error: lock ordering problem` && `0 : trace` && `1 : wbufSpans`

https://build.golang.org/log/af083d40156b011db2e0cadd6040692cf579967c:

99928  ======
0 : trace 42 0x657b20
1 : wbufSpans 39 0x651030
fatal error: lock ordering problem

runtime stack:
runtime.throw({0x553dbd?, 0x0?})
	/workdir/go/src/runtime/panic.go:1047 +0x5d fp=0x7f153fffeb58 sp=0x7f153fffeb28 pc=0x4392fd
runtime.checkRanks(0xc0001021a0, 0xc000067000?, 0x4cb503?)
	/workdir/go/src/runtime/lockrank_on.go:156 +0x22d fp=0x7f153fffebb8 sp=0x7f153fffeb58 pc=0x40de8d
runtime.lockWithRankMayAcquire.func1()
	/workdir/go/src/runtime/lockrank_on.go:229 +0x89 fp=0x7f153fffebe8 sp=0x7f153fffebb8 pc=0x40e5c9
runtime.lockWithRankMayAcquire(0x40e0cc?, 0xc0001021a0?)
	/workdir/go/src/runtime/lockrank_on.go:218 +0x65 fp=0x7f153fffec20 sp=0x7f153fffebe8 pc=0x40e505
runtime.getempty()
	/workdir/go/src/runtime/mgcwork.go:362 +0x3f fp=0x7f153fffec80 sp=0x7f153fffec20 pc=0x42881f
runtime.(*gcWork).init(0xc000030638)
	/workdir/go/src/runtime/mgcwork.go:101 +0x1c fp=0x7f153fffec90 sp=0x7f153fffec80 pc=0x4280dc
runtime.(*gcWork).putBatch(0xc000030638, {0xc000030670, 0x1, 0x200})
	/workdir/go/src/runtime/mgcwork.go:175 +0x46 fp=0x7f153fffece8 sp=0x7f153fffec90 pc=0x4282c6
runtime.wbBufFlush1(0xc00002f400)
	/workdir/go/src/runtime/mwbbuf.go:287 +0x24f fp=0x7f153fffed30 sp=0x7f153fffece8 pc=0x4348ef
runtime.wbBufFlush.func1()
	/workdir/go/src/runtime/mwbbuf.go:201 +0x25 fp=0x7f153fffed48 sp=0x7f153fffed30 pc=0x4671c5
runtime.wbBufFlush(0x40e280?, 0xc0001021a0?)
	/workdir/go/src/runtime/mwbbuf.go:200 +0x6c fp=0x7f153fffed68 sp=0x7f153fffed48 pc=0x43464c
runtime.wbBufFlush(0x657b38, 0x0)
	<autogenerated>:1 +0x2a fp=0x7f153fffed88 sp=0x7f153fffed68 pc=0x47052a
runtime.gcWriteBarrier()
	/workdir/go/src/runtime/asm_amd64.s:1692 +0xa3 fp=0x7f153fffee08 sp=0x7f153fffed88 pc=0x46dfc3
runtime.gcWriteBarrierDX()
	/workdir/go/src/runtime/asm_amd64.s:1720 +0x7 fp=0x7f153fffee10 sp=0x7f153fffee08 pc=0x46e047
runtime.readTrace0()
	/workdir/go/src/runtime/trace.go:535 +0x49a fp=0x7f153fffee50 sp=0x7f153fffee10 pc=0x45c59a
runtime.ReadTrace.func1()
	/workdir/go/src/runtime/trace.go:439 +0x2a fp=0x7f153fffee70 sp=0x7f153fffee50 pc=0x45c0aa
runtime.systemstack()
	/workdir/go/src/runtime/asm_amd64.s:492 +0x49 fp=0x7f153fffee78 sp=0x7f153fffee70 pc=0x46bd29

goroutine 188 [running]:
runtime.systemstack_switch()
	/workdir/go/src/runtime/asm_amd64.s:459 fp=0xc0001f2738 sp=0xc0001f2730 pc=0x46bcc0
runtime.ReadTrace()
	/workdir/go/src/runtime/trace.go:438 +0x88 fp=0xc0001f27a0 sp=0xc0001f2738 pc=0x45c028
runtime/trace.Start.func1()
	/workdir/go/src/runtime/trace/trace.go:130 +0x47 fp=0xc0001f27e0 sp=0xc0001f27a0 pc=0x4beb67
runtime.goexit()
	/workdir/go/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0001f27e8 sp=0xc0001f27e0 pc=0x46dee1
created by runtime/trace.Start
	/workdir/go/src/runtime/trace/trace.go:128 +0xef

(attn @golang/runtime)
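
For readers unfamiliar with the staticlockranking builders: the "0 : trace 42 / 1 : wbufSpans 39" lines above mean the runtime's lock-rank checker saw wbufSpans (rank 39) being acquired while trace (rank 42) was already held, which violates the declared ordering. The following toy sketch is hypothetical code, not the runtime's actual lockrank implementation, but it illustrates the kind of check that fires:

package main

import "fmt"

// rankedMutex is a toy stand-in for the runtime's rank-annotated mutex.
type rankedMutex struct {
	name string
	rank int
}

// held records the locks currently held, in acquisition order
// (single goroutine only; the real checker tracks this per M).
var held []*rankedMutex

// lock enforces the toy ordering rule: a lock may only be acquired if its
// rank is strictly greater than the rank of every lock already held.
func lock(m *rankedMutex) {
	for _, h := range held {
		if m.rank <= h.rank {
			panic(fmt.Sprintf("lock ordering problem: acquiring %s (rank %d) while holding %s (rank %d)",
				m.name, m.rank, h.name, h.rank))
		}
	}
	held = append(held, m)
}

func main() {
	trace := &rankedMutex{name: "trace", rank: 42}
	wbufSpans := &rankedMutex{name: "wbufSpans", rank: 39}

	lock(trace)
	lock(wbufSpans) // panics: wbufSpans ranks below trace, mirroring the failure above
}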

@bcmills bcmills added this to the Go1.20 milestone Nov 3, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Nov 3, 2022
@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 3, 2022
@mknyszek mknyszek self-assigned this Nov 9, 2022
mknyszek (Contributor) commented:

I still plan to look into this.

@gopherbot gopherbot modified the milestones: Go1.20, Go1.21 Feb 1, 2023
mknyszek (Contributor) commented Jun 8, 2023

Oh no. This is real. You can't actually have a write barrier while holding trace.lock. Thing is, I don't see where there would be a write barrier by just looking at the source.

cherrymui (Member) commented:

It seems the write barrier is at https://cs.opensource.google/go/go/+/release-branch.go1.20:src/runtime/trace.go;l=535 (note: I used the 1.20 release branch to point to the old code). That code is gone with your CL https://golang.org/cl/496296, so it is probably fixed.
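
For context, a minimal hypothetical sketch (not the code that was at trace.go:535): the compiler instruments stores of pointers into GC-visible memory with a write barrier, so even an innocuous-looking assignment can end up flushing the per-P write barrier buffer and locking wbufSpans, which is exactly the path in the stack trace above (gcWriteBarrier -> wbBufFlush -> gcWork.putBatch -> getempty).

package main

// node is a hypothetical type with a pointer field; while the GC is active,
// stores to link through a pointer are instrumented with a write barrier.
type node struct {
	link *node
	data [64]byte
}

// prepend performs two pointer stores into memory the GC may observe. Each
// may call runtime.gcWriteBarrier, which can flush the write barrier buffer
// and take GC locks such as wbufSpans.
func prepend(head **node, n *node) {
	n.link = *head // pointer store: write barrier candidate
	*head = n      // pointer store: write barrier candidate
}

func main() {
	var head *node
	prepend(&head, &node{})
	prepend(&head, &node{})
	_ = head
}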

mknyszek (Contributor) commented Jun 8, 2023

Thanks for digging it up. I tried the 1.20 release branch but couldn't find it; probably a misclick and I was on the wrong branch in the end.

Anyway, yes! Indeed, it looks like this problem is solved on tip.

gopherbot (Contributor) commented:

Found new dashboard test flakes for:

#!watchflakes
post <- `fatal error: lock ordering problem` && `0 : trace` && `1 : wbufSpans`
2023-10-25 19:47 linux-amd64-staticlockranking go@a57c5736 runtime/trace.TestTraceFutileWakeup (log)
84254  ======
0 : traceStackTab 43 0x7265e0
1 : wbufSpans 38 0x71f0b0
fatal error: lock ordering problem

runtime stack:
runtime.throw({0x5ce82a?, 0x0?})
	/workdir/go/src/runtime/panic.go:1016 +0x5c fp=0x7fd1da7fb850 sp=0x7fd1da7fb820 pc=0x43b4bc
runtime.checkRanks(0xc000105520, 0x0?, 0x0?)
	/workdir/go/src/runtime/lockrank_on.go:162 +0x236 fp=0x7fd1da7fb8b0 sp=0x7fd1da7fb850 pc=0x40e7b6
...
	/workdir/go/src/runtime/proc.go:406
runtime.semacquire1(0xc0000781e8, 0x0, 0x1, 0x0, 0x12)
	/workdir/go/src/runtime/sema.go:160 +0x22f fp=0xc000067d60 sp=0xc000067d00 pc=0x44fbcf
sync.runtime_Semacquire(0xc00008c060?)
	/workdir/go/src/runtime/sema.go:62 +0x25 fp=0xc000067d98 sp=0xc000067d60 pc=0x46c4e5
sync.(*WaitGroup).Wait(0x8?)
	/workdir/go/src/sync/waitgroup.go:116 +0x48 fp=0xc000067dc0 sp=0xc000067d98 pc=0x4782c8
runtime/trace_test.TestTraceFutileWakeup(0xc000324680)
	/workdir/go/src/runtime/trace/trace_test.go:558 +0x372 fp=0xc000067f70 sp=0xc000067dc0 pc=0x574e92
testing.tRunner(0xc000324680, 0x5d9648)

watchflakes

@gopherbot gopherbot reopened this Oct 25, 2023
@mknyszek mknyszek modified the milestones: Go1.21, Go1.22 Nov 1, 2023
mknyszek (Contributor) commented:

I think this is fixed as of the new tracer, and a bunch of the recent staticlockranking fixes for it.

@github-project-automation github-project-automation bot moved this from Todo to Done in Go Compiler / Runtime Nov 28, 2023
gopherbot (Contributor) commented:

Found new dashboard test flakes for:

#!watchflakes
post <- `fatal error: lock ordering problem` && `0 : trace` && `1 : wbufSpans`
2024-01-23 00:14 linux-amd64-staticlockranking go@4605ce2d os/signal (log)
76557  ======
0 : traceStackTab 48 0x6aa2f0
1 : wbufSpans 43 0x689270
fatal error: lock ordering problem

runtime stack:
runtime.throw({0x56b26a?, 0x2c000360eb?})
	/workdir/go/src/runtime/panic.go:1011 +0x5c fp=0x7f01f77fdaa0 sp=0x7f01f77fda70 pc=0x43c73c
runtime.checkRanks(0xc0001021c0, 0x5adc39?, 0x1f?)
	/workdir/go/src/runtime/lockrank_on.go:162 +0x236 fp=0x7f01f77fdb00 sp=0x7f01f77fdaa0 pc=0x40f076
...
	/workdir/go/src/runtime/trace2.go:885 +0x53 fp=0xc0001b7fe0 sp=0xc0001b7fb0 pc=0x462093
runtime.goexit({})
	/workdir/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0001b7fe8 sp=0xc0001b7fe0 pc=0x4765e1
created by runtime.(*traceAdvancerState).start in goroutine 119
	/workdir/go/src/runtime/trace2.go:877 +0x168

goroutine 95 gp=0xc0000d8a80 m=nil [running]:
	goroutine running on other thread; stack unavailable
created by runtime/trace.Start in goroutine 119
	/workdir/go/src/runtime/trace/trace.go:128 +0xd0

watchflakes

@gopherbot gopherbot reopened this Jan 23, 2024
@github-project-automation github-project-automation bot moved this from Done to In Progress in Go Compiler / Runtime Jan 23, 2024
gopherbot (Contributor) commented:

Found new dashboard test flakes for:

#!watchflakes
post <- `fatal error: lock ordering problem` && `0 : trace` && `1 : wbufSpans`
2024-01-25 09:18 gotip-linux-amd64-staticlockranking go@cad66291 os/signal.TestSignalTrace (log)
=== RUN   TestSignalTrace
748108  ======
0 : traceStackTab 48 0x6ad2f0
1 : wbufSpans 43 0x68c270
fatal error: lock ordering problem

runtime stack:
runtime.throw({0x56b26a?, 0x6990000d5ac?})
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/panic.go:1011 +0x5c fp=0x7f58bfd7eaa0 sp=0x7f58bfd7ea70 pc=0x43c73c
runtime.checkRanks(0xc000006fc0, 0x45da6b?, 0x4bfc1?)
...
runtime.chansend(0xc00022c310, 0xc0003c8fc7, 0x1, 0xc0003c8fd0?)
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/chan.go:259 +0x38d fp=0xc0003c8f80 sp=0xc0003c8f10 pc=0x40800d
runtime.chansend1(0x4478bd?, 0xc0002b0380?)
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/chan.go:145 +0x17 fp=0xc0003c8fb0 sp=0xc0003c8f80 pc=0x407c77
runtime.(*traceAdvancerState).start.func1()
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/trace2.go:885 +0x53 fp=0xc0003c8fe0 sp=0xc0003c8fb0 pc=0x462093
runtime.goexit({})
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0003c8fe8 sp=0xc0003c8fe0 pc=0x4765e1
created by runtime.(*traceAdvancerState).start in goroutine 56
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/trace2.go:877 +0x168

watchflakes

mknyszek (Contributor) commented:

I think we just need to be more careful about write barriers deep down in this code. I don't think that should be too hard to do. I'll look into it.

gopherbot (Contributor) commented:

Change https://go.dev/cl/560216 mentions this issue: runtime: clear trace map without write barriers

@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Compiler / Runtime Feb 1, 2024
gopherbot (Contributor) commented:

Change https://go.dev/cl/559957 mentions this issue: [release-branch.go1.22] runtime: clear trace map without write barriers

gopherbot pushed a commit that referenced this issue Feb 1, 2024
Currently the trace map is cleared with an assignment, but this ends up
invoking write barriers. Theoretically, write barriers could try to
write a trace event and eventually try to acquire the same lock. The
static lock ranking expresses this constraint.

This change replaces the assignment with a call to memclrNoHeapPointers
to clear the map, removing the write barriers.

Note that technically this problem is purely theoretical. The way the
trace maps are used today is such that reset is only ever called when
the tracer is no longer writing events that could emit data into a map.
Furthermore, reset is never called from an event-writing context.

Therefore another way to resolve this is to simply not hold the trace
map lock over the reset operation. However, this makes the trace map
implementation less robust because it needs to be used in a very
specific way. Furthermore, the rest of the trace map code avoids write
barriers already since its internal structures are all notinheap, so
it's actually more consistent to just avoid write barriers in the reset
method.

Fixes #56554.

Change-Id: Icd86472e75e25161b2c10c1c8aaae2c2fed4f67f
Reviewed-on: https://go-review.googlesource.com/c/go/+/560216
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
(cherry picked from commit 829f2ce)
Reviewed-on: https://go-review.googlesource.com/c/go/+/559957
Auto-Submit: Michael Knyszek <mknyszek@google.com>
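
As a rough user-level analogy of the fix (hypothetical types and helpers, not the runtime's actual trace map code): zeroing a pointer-bearing struct with an ordinary assignment compiles to pointer stores, each a write barrier candidate, whereas clearing the same memory as raw bytes, which is what runtime.memclrNoHeapPointers does internally, emits none. That is only safe when the GC does not need to see the overwritten pointers, which holds for the trace map because its nodes live in not-in-heap memory.

package main

import (
	"fmt"
	"unsafe"
)

// mapNode is a hypothetical stand-in for a trace map node: it contains
// pointers, so zeroing it with an assignment involves pointer stores.
type mapNode struct {
	children [4]*mapNode
	id       uint64
}

// resetWithAssignment clears n with a plain assignment. Inside the runtime,
// this is the pattern that could invoke write barriers while the trace map's
// lock is held.
func resetWithAssignment(n *mapNode) {
	*n = mapNode{}
}

// resetAsBytes clears the same memory byte by byte, in the spirit of
// memclrNoHeapPointers: no pointer stores, hence no write barriers. This is
// only legitimate when the GC never scans the memory (e.g. not-in-heap
// allocations), as is the case for the trace map's internals.
func resetAsBytes(n *mapNode) {
	p := unsafe.Pointer(n)
	for i := uintptr(0); i < unsafe.Sizeof(*n); i++ {
		*(*byte)(unsafe.Add(p, i)) = 0
	}
}

func main() {
	n := &mapNode{id: 56554}
	resetWithAssignment(n)
	n.id = 56554
	resetAsBytes(n)
	fmt.Println(n.id) // 0
}
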
ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024