runtime: ThreadSanitizer CHECK failed #27660
And here's a different tsan failure I just observed with the same command:
And another slightly different presentation:
CC @dvyukov
These instructions fail for me:
This is on commit:
Yes, I’m afraid you’ll need to run the `make test` command I posted at
least once to generate some Go code and build some C/C++ dependencies.
After that, you can run just the `go test` command.
I apologize that the surface area of this reproduction is so large, but I
really don’t know how to whittle it down. We run the race detector on every
other CockroachDB package with no problems. Let me know if there’s anything
I can do to assist in tracking this down!
On Fri, Sep 14, 2018 at 6:00 AM Dmitry Vyukov wrote:
These instructions fail for me:
```
cockroach$ go test -race -tags ' make x86_64_linux_gnu' -ldflags '-X github.com/cockroachdb/cockroach/pkg/build.typ=development -extldflags "" -X "github.com/cockroachdb/cockroach/pkg/build.tag=v2.2.0-alpha.00000000-793-g33c7d27d82-dirty" -X "github.com/cockroachdb/cockroach/pkg/build.rev=33c7d27d8216b543ac77f0fe39d440ebebfa9e70" -X "github.com/cockroachdb/cockroach/pkg/build.cgoTargetTriple=x86_64-linux-gnu" ' -run "TestLogic" -timeout 25m ./pkg/sql/logictest
# github.com/cockroachdb/cockroach/pkg/sql/lex
pkg/sql/lex/predicates.go:60:14: undefined: reservedKeywords
FAIL github.com/cockroachdb/cockroach/pkg/sql/logictest [build failed]
```
This is on commit:
```
cockroach$ git show HEAD
commit 418b9169b547069cd43252055196204af8f18040 (HEAD -> master, origin/staging, origin/master, origin/HEAD)
Merge: a9c7295dd6 5504e3f9ae
Author: craig[bot]
Date:   Fri Sep 14 01:24:32 2018 +0000

    Merge #29163
```
I did that. The error I posted happens during go test.
Are you sure the […]?
Right, it failed somewhere in the beginning:
Oh, yeah, you’ll need autoconf and CMake installed. It looks like you have
CMake but not autoconf.
On Fri, Sep 14, 2018 at 9:43 AM Dmitry Vyukov wrote:
Right, it failed somewhere in the beginning:
```
cockroach$ IGNORE_GOVERS=1 make testrace PKG=./pkg/sql/logictest TESTS=TestLogic IGNORE_GOVERS=1
Running make with -j4
GOPATH set to /go
Detected change in build system. Rebooting Make.
Running make with -j4
GOPATH set to /go
cd /go/src/github.com/cockroachdb/cockroach/c-deps/jemalloc && autoconf
rm -rf /go/native/x86_64-linux-gnu/protobuf
bash: autoconf: command not found
Makefile:511: recipe for target '/go/src/github.com/cockroachdb/cockroach/c-deps/jemalloc/configure' failed
make: *** [/go/src/github.com/cockroachdb/cockroach/c-deps/jemalloc/configure] Error 127
make: *** Waiting for unfinished jobs....
go install -v uptodate
bin/prereqs ./pkg/cmd/uptodate > bin/uptodate.d.tmp
mkdir -p /go/native/x86_64-linux-gnu/protobuf
cd /go/native/x86_64-linux-gnu/protobuf && cmake -DCMAKE_TARGET_MESSAGES=OFF -Dprotobuf_BUILD_TESTS=OFF /go/src/github.com/cockroachdb/cockroach/c-deps/protobuf/cmake \
-DCMAKE_BUILD_TYPE=Release
mv -f bin/uptodate.d.tmp bin/uptodate.dgit.luolix.top/cockroachdb/cockroach/vendor/github.com/MichaelTJones/walk
-- The C compiler identification is GNU 7.3.0git.luolix.top/cockroachdb/cockroach/vendor/github.com/spf13/pflaggit.luolix.top/cockroachdb/cockroach/pkg/cmd/uptodate
-- The CXX compiler identification is GNU 7.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.8")
-- Configuring done
-- Generating done
-- Build files have been written to: /go/native/x86_64-linux-gnu/protobuf
```
How long does it generally take to crash? I've run it twice and both times got:
It's been reproducing for me in less than a minute. Oddly enough, it's failing to reproduce for me on the latest master. Could you try commit 33c7d27d8216b543ac77f0fe39d440ebebfa9e70? And just to double check, you are running go1.11, right? I'm also having trouble reproducing on go1.10.3. In the meantime I'm going to bisect to see if there's something obviously wrong that we fixed.
Bisecting proved challenging thanks to some actual races that were introduced and fixed within the range of commits I was bisecting. Regardless, I can semi-reliably repro on 33c7d27d8216b543ac77f0fe39d440ebebfa9e70 on a fresh Ubuntu machine with Go 1.11. The problem seems to occur early or not at all. In five runs, this is what I observed:
I could have sworn this reproduced more reliably when I originally filed the bug. Still, I'm hopeful that you might be able to reproduce if you run a few more trials with a short timeout. The magic incantation to thread a timeout through the Makefile, if that's easier, is: `$ make testrace PKG=./pkg/sql/logictest TESTS=TestLogic TESTTIMEOUT=2m`
I have had no luck reproducing this so far:
p.s. stress is: […]
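Assuming the stress tool mentioned above is golang.org/x/tools/cmd/stress, the workflow amounts to re-running the compiled test binary until it fails. Below is a minimal Go sketch of that loop; the test binary name and flags are illustrative placeholders, not the actual reproduction command from this issue.

```go
// Minimal repeat-until-failure runner in the spirit of
// golang.org/x/tools/cmd/stress. The binary name and flags below are
// placeholders for illustration only.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	for i := 1; ; i++ {
		cmd := exec.Command("./logictest.test", "-test.run", "TestLogic", "-test.timeout", "2m")
		out, err := cmd.CombinedOutput()
		if err != nil {
			fmt.Printf("run %d failed: %v\n%s\n", i, err, out)
			os.Exit(1)
		}
		fmt.Printf("run %d passed\n", i)
	}
}
```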
My bets would be on:
Oh, cool, I didn't know about the stress tool. Anyway, thanks again for looking into this. We upgraded our CI environment to go1.11 today and various other tests started producing assertions like this under the race detector. We also started seeing outright segfaults:
I bisected this down to a commit in Go itself: 8994444, titled "runtime/race: update most syso files to compiler-rt". That sure looks like our culprit, though I'm willing to believe we're still doing something wrong (e.g., corrupting memory, like you suggest) and the update to the race detector just happened to tickle this preexisting condition. I'm going to try to get somewhere by bisecting compiler-rt itself. I may need your help in producing those syso files.
Whew, that was an exercise in patience. Bisecting was complicated by the fact that a number of revisions of compiler-rt between 68e15324 and fe2c72c5 did not compile with my GCC toolchain. Regardless, I managed to pinpoint the problem to this commit: llvm-mirror/compiler-rt@12d1690. It's still unclear what the actual problem is; I'm not in a position to understand that tsan code, I'm afraid. By the way, I'm now using this to repro:
The […]. I am once again feeling that this is more likely to be a bug in tsan than in CockroachDB. @dvyukov, let me know if you disagree. My offer of providing you a VM that can reproduce this issue out of the box still stands.
Maybe. Hard to say. What's strange is that nobody reported any corruptions before and that I can't reproduce this on my machine. Can anybody else reproduce this?
I'm going to check whether any of my fellow Linux-using engineers can reproduce this issue on their laptops. Everyone else uses the same VM configuration (Ubuntu 18.04 on Google Compute Engine), so it's not all that interesting that they can reproduce. Here's the output of […]:
Are you willing to share the output of […]?
Ok, @mberhault was able to reproduce on his Linux desktop running Ubuntu 16.04 (kernel 4.4.0-135-generic) using the […].
Running […]:
@benesch, do you know if the native part of cockroachdb was tested with asan/tsan/msan? If not, that would be the first thing I would do.
Ok, great, you can reproduce! Do you know how to read that stack trace you posted?
I don't understand how we go from stopper.go:192 (which launches a goroutine) to […].
It's not tested with any of the sanitizers, unfortunately. Most of our serious testing is done in Go, so we exercise the C/C++ code via cgo. That makes it tricky to use any of the sanitizers. I've tried to get msan to work, but it triggers a number of false positives due to structs with padding: #26167. I suppose it should be possible to use asan and tsan on our native code by plumbing in […]. That said, there are only four C/C++ libraries that we link: snappy, protobuf, RocksDB, and jemalloc. I've verified that this reproduces without jemalloc, so I think we can rule that one out. Snappy and protobuf are both used extensively at Google, so I'd be surprised if they were to blame. RocksDB is extremely complicated, but they test upstream with asan/tsan/ubsan. So that leaves our first-party C/C++ code that glues everything together. It's certainly possible that we've had a lurking memory bug there, but a) the code's not that complicated, and b) I feel like we'd have seen it crop up somewhere before. This is all speculation, of course.
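A rough sketch of what that plumbing could look like, assuming the standard #cgo directives; the -fsanitize flags here are illustrative, and whether sanitizer-instrumented C code then coexists cleanly with Go's own race runtime is exactly the unresolved question above.

```go
// Sketch only: passing sanitizer flags to cgo-built C code via #cgo
// directives. Treat this as a starting point, not a recipe; mixing an
// instrumented C runtime with Go's race runtime has its own pitfalls.
package main

/*
#cgo CFLAGS: -fsanitize=address -fno-omit-frame-pointer
#cgo LDFLAGS: -fsanitize=address

#include <stdlib.h>
#include <string.h>

static char *greeting(void) {
    char *s = malloc(6);
    strcpy(s, "hello");
    return s;
}
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	s := C.greeting()
	fmt.Println(C.GoString(s))
	C.free(unsafe.Pointer(s))
}
```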
Hi @dvyukov, any update on this issue? We hit a similar problem in our project after we updated Go to version 1.11, and I just confirmed that this issue exists in 1.11.2 too.
Indeed, this seems to be a straightforward double free. I'm going to try to write up a targeted test case and figure out how to submit a patch to LLVM.
You debugged the root cause? That was brave! You already know how to build it, right? There is probably lots of irrelevant stuff. The main part is to send a Phabricator review.
And a test would be good.
I think I spoke too soon. I thought the bug was a simple reference counting error with shared SyncClocks, but the problem appears to be deeper than that. In the course of my investigation I added a bit of logging to the DenseSlabAllocator. Here's a snapshot of the period before an assertion failure:
Notice how slot 4441 is allocated twice without an intervening free! This is a guaranteed disaster. The problem seems to be that the proc struct is being shared across multiple threads. So is it possible that the Go runtime is sometimes failing to update the proc struct when it moves a goroutine between pthreads? That's my best guess as to what's happening. It explains why this badness has only been observed in Go programs, and not any C++ programs. Disclaimer: I'm not 100% confident in my instrumentation, so I might be a little bit off here.
I gathered some more evidence for the theory that the Go runtime is sharing processor data across multiple threads. I started by annotating several of the calls into tsan. Here's a snippet:
```
diff --git a/lib/tsan/rtl/tsan_rtl.h b/lib/tsan/rtl/tsan_rtl.h
index 5e2a745c9..130b33609 100644
--- a/lib/tsan/rtl/tsan_rtl.h
+++ b/lib/tsan/rtl/tsan_rtl.h
@@ -356,6 +356,7 @@ struct Processor {
DenseSlabAllocCache sync_cache;
DenseSlabAllocCache clock_cache;
DDPhysicalThread *dd_pt;
+ bool inuse;
};
#if !SANITIZER_GO
diff --git a/lib/tsan/go/tsan_go.cc b/lib/tsan/go/tsan_go.cc
index 71a660683..ba0dd4a08 100644
--- a/lib/tsan/go/tsan_go.cc
+++ b/lib/tsan/go/tsan_go.cc
@@ -231,48 +231,84 @@ void __tsan_proc_destroy(Processor *proc) {
}
void __tsan_acquire(ThreadState *thr, void *addr) {
+ CHECK(!thr->proc()->inuse);
+ thr->proc()->inuse = true;
Acquire(thr, 0, (uptr)addr);
+ CHECK(thr->proc()->inuse);
+ thr->proc()->inuse = false;
}
```
The full diff is here: https://gist.github.com/benesch/cb3258a3eb4b573b5a3b7891db000e66. The reproductions of the original bug now fail with […] instead of the various assertion errors they would previously fail with. That strongly implies that the root cause of all the assertion failures is the incorrect sharing of Processors across threads.
I've started poking around the Go scheduler but I'm way out of my depth here. I'll note that this diff
```
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f82014eb92..5993e2a6da 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -1081,7 +1081,7 @@ func stopTheWorldWithSema() {
// try to retake all P's in Psyscall status
for _, p := range allp {
s := p.status
- if s == _Psyscall && atomic.Cas(&p.status, s, _Pgcstop) {
+ if s == _Psyscall && false && atomic.Cas(&p.status, s, _Pgcstop) {
if trace.enabled {
traceGoSysBlock(p)
traceProcStop(p)
@@ -1455,7 +1455,7 @@ func forEachP(fn func(*p)) {
// off to induce safe point function execution.
for _, p := range allp {
s := p.status
- if s == _Psyscall && p.runSafePointFn == 1 && atomic.Cas(&p.status, s, _Pidle) {
+ if s == _Psyscall && false && p.runSafePointFn == 1 && atomic.Cas(&p.status, s, _Pidle) {
if trace.enabled {
traceGoSysBlock(p)
traceProcStop(p)
@@ -4457,7 +4457,7 @@ func retake(now int64) uint32 {
}
pd := &_p_.sysmontick
s := _p_.status
- if s == _Psyscall {
+ if s == _Psyscall && false {
// Retake P from syscall if it's there for more than 1 sysmon tick (at least 20us).
t := int64(_p_.syscalltick)
if int64(pd.syscalltick) != t {
```
vastly decreases the probability that xenomint/sqlite will hit the tsan assertion. (I initially thought it had fixed the problem, but alas.) With the tsan patch but not the Go scheduler patch, running xenomint/sqlite's tests under race reliably hits the tsan assertion on every run. With both patches, I managed to get 187 runs before it failed. I'm not sure what to make of that.
If 2 threads use the same Processor, it would explain all kinds of weird memory corruptions in the tsan runtime. thr->proc() obtains the current Processor from the Go runtime as g.m.p.racectx. The firing CHECK means that either 2 goroutines run with the same g.m, or 2 m's run with the same m.p, both of which look badly wrong. I am thinking of some way to check this condition reliably in the Go runtime and catch it as early as possible, but so far I have not come up with anything.
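The inuse flag from the earlier patch is, in effect, an exclusive-ownership assertion. As a standalone illustration of that technique (unrelated to the runtime's real data structures, and using an atomic compare-and-swap where the C++ patch used a plain flag), a Go sketch might look like this:

```go
// Sketch: each worker owns its own resource, and acquire/release panic
// if a resource is ever touched by two owners at once. This mirrors the
// idea of the inuse check added to tsan's Processor above.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type exclusive struct {
	owner int64 // 0 = unowned, otherwise the id of the current owner
}

func (e *exclusive) acquire(id int64) {
	if !atomic.CompareAndSwapInt64(&e.owner, 0, id) {
		panic(fmt.Sprintf("owner %d found resource already in use", id))
	}
}

func (e *exclusive) release(id int64) {
	if !atomic.CompareAndSwapInt64(&e.owner, id, 0) {
		panic(fmt.Sprintf("owner %d released a resource it did not hold", id))
	}
}

func main() {
	const workers = 4
	resources := make([]exclusive, workers) // one per worker, like one Processor per thread
	var wg sync.WaitGroup
	for id := int64(1); id <= workers; id++ {
		wg.Add(1)
		go func(id int64, res *exclusive) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				res.acquire(id)
				// simulated critical section: only the owner runs here
				res.release(id)
			}
		}(id, &resources[id-1])
	}
	wg.Wait()
	fmt.Println("no sharing detected")
}
```

If two goroutines were ever handed the same resource, the CAS would fail and panic immediately, which is the condition the tsan patch is designed to surface.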
This does not happen with Go1.10, right? Then one option is to bisect Go from Go1.10 to Go1.11 to find the root cause.
Actually, I think this does still happen with Go 1.10, just far, far less frequently. I checked my recollection with @petermattis today, and we remembered seeing spurious failures from race-enabled CockroachDB builds at least as far back as early 2017. The difference was that those failures would only show up once every few weeks, back when we used to run a several-node CockroachDB cluster with race binaries. As far as I can remember, the symptoms were the same: various assertion failures inside tsan that were rather inexplicable. I'll double-check now.
Yep, xenomint/sqlite blows up on go1.10 if you add the additional assertions to the race detector. Looks like this has been a latent bug in the Go runtime.
Perhaps this isn't so bad. On a hunch, I added a bit more information to the new assertions:
```
diff --git a/lib/tsan/go/tsan_go.cc b/lib/tsan/go/tsan_go.cc
index 5f2507b7d..d3345c6b7 100644
--- a/lib/tsan/go/tsan_go.cc
+++ b/lib/tsan/go/tsan_go.cc
void __tsan_acquire(ThreadState *thr, void *addr) {
+ auto inuse_goid = thr->proc()->inuse_goid;
+ if (inuse_goid != 0) {
+ __sanitizer::Printf("tsan_acquire(%p): proc in use by %d (this goid: %d)\n", addr, inuse_goid, thr->goid);
+ __sanitizer::Die();
+ }
+ thr->proc()->inuse_goid = thr->goid;
Acquire(thr, 0, (uptr)addr);
+ CHECK_EQ(thr->proc()->inuse_goid, thr->goid);
+ thr->proc()->inuse_goid = 0;
}
```
Then I noticed that all of the failures seemed to be triggered by a call to __tsan_acquire:
Even more suspiciously, every failure was an acquisition of exactly the same memory address: […]
Looks like cgocall is just calling […]. I'm hopeful that I've actually found the root cause this time.
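For context, the code path in question is exercised by workloads with many concurrent, slow-ish cgo calls, during which sysmon can retake a P while the call is still in flight (see the retake hunk in the scheduler diff above). Here is a hedged sketch of such a workload, purely illustrative and not the reproduction used in this thread; it is not guaranteed to hit the assertion.

```go
// Sketch: lots of concurrent cgo calls that outlast a sysmon tick, so Ps
// are routinely retaken while calls are in progress. Build with -race to
// exercise the cgocall -> entersyscall -> C -> exitsyscall -> race-acquire
// path discussed above.
package main

/*
#include <unistd.h>
static void slowcall(void) { usleep(100); }
*/
import "C"

import "sync"

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 32; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 5000; j++ {
				C.slowcall()
			}
		}()
	}
	wg.Wait()
}
```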
Just a minute ago I started looking at:
:) |
It needs to do raceacquire after exitsyscall. I think we are being bitten by the dangling g.m.p pointer left in reentersyscall.
cgocallbackg is also badly broken.
@benesch, do you want to fix it yourself? Since you've already done so much work on this, I don't want to steal it from you. I think we also need to change reentersyscall to leave the dangling p pointer in some shadow variable instead of in g.m.p; having g.m.p set gives the impression that the m and p are wired when they are actually not. Then only exitsyscallfast and retake would use the shadow p variable. That would have caught this bug.
Whoops, sorry, missed your messages! About to send a CL. I like your idea of a shadow variable but I didn't want to embark on that refactor myself, so I took a less intrusive approach. Now that I have your blessing for the refactor I'm happy to do that instead since it will prevent future mistakes of this kind—I agree that it is extremely confusing that an M can point to a P that it doesn't currently own. But I'll send the CL for what I have in the meantime to get the discussion started.
Change https://golang.org/cl/148717 mentions this issue.
Ok, updated the CL to take this approach.
Nice sleuthing! Gentle plea for this to be back-ported to go1.11, as this bug is preventing CockroachDB from upgrading to go1.11 (yes, it only affects race builds, but we run […]).
@gopherbot Please open an issue to backport to 1.11
Backport issue(s) opened: #28690 (for 1.11). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.
Change https://golang.org/cl/148899 mentions this issue.
Change https://golang.org/cl/148823 mentions this issue.
When a goroutine enters a syscall, its M unwires from its P to allow the P to be retaken by another M if the syscall is slow. The M retains a reference to its old P, however, so that if its old P has not been retaken when the syscall returns, it can quickly reacquire that P. The implementation, however, was confusing, as it left the reference to the potentially-retaken P in m.p, which implied that the P was still wired.

Make the code clearer by enforcing the invariant that m.p is never stale. entersyscall now moves m.p to m.oldp and sets m.p to 0; exitsyscall does the reverse, provided m.oldp has not been retaken.

With this scheme in place, the issue described in #27660 (assertion failures in the race detector) would have resulted in a clean segfault instead of silently corrupting memory.

Change-Id: Ib3e03623ebed4f410e852a716919fe4538858f0a
Reviewed-on: https://go-review.googlesource.com/c/148899
Run-TryBot: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
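The invariant described in that commit message can be pictured with a toy model; the types and methods below are illustrative stand-ins, not the actual runtime code, and the slow path is omitted.

```go
// Toy model of the entersyscall/exitsyscall handoff described above.
// These are plain illustrative structs, not the runtime's real m and p.
package main

import "fmt"

type p struct{ id int }

type m struct {
	p    *p // the P this M currently owns; nil while in a syscall
	oldp *p // the P parked on syscall entry, which another M may retake
}

func (mp *m) entersyscall() {
	mp.oldp = mp.p
	mp.p = nil // invariant: m.p never points at a P this M does not own
}

// exitsyscall reclaims the parked P on the fast path; retaken reports
// whether sysmon (or a GC stop) already handed that P to someone else.
func (mp *m) exitsyscall(retaken bool) {
	if !retaken {
		mp.p = mp.oldp // fast path: re-wire the same P
	}
	mp.oldp = nil
	// slow path (acquire some other idle P, or park the M) omitted here
}

func main() {
	mp := &m{p: &p{id: 1}}
	mp.entersyscall()
	fmt.Println("in syscall, wired P:", mp.p) // <nil>: not wired to any P
	mp.exitsyscall(false)
	fmt.Println("after syscall, wired P id:", mp.p.id) // back on P 1
}
```

The point of the change is that any code inspecting m.p can trust it: a P parked for the duration of a syscall lives only in m.oldp, so it can no longer be mistaken for a P the M still owns.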
related issues: golang/go#23899, golang/go#28458, golang/go#27660 Update #3
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (`go env`)?

What did you do?
This builds some C and C++ dependencies and ultimately runs:
What did you expect to see?
Either a successful test run or an actual race.
What did you see instead?
I tried to dig into this, but gotsan.cc is apparently generated deep inside the build process for the race detector and it wasn't clear to me how to generate a gotsan.cc that matched my Go version.
Totally possible that this is our (CockroachDB's) bug, but there's not a lot here for me to go on, so I figured I'd file here and see if someone with more expertise in the race detector had any insight.