Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: ThreadSanitizer CHECK failed #27660

Closed
benesch opened this issue Sep 13, 2018 · 54 comments
Closed

runtime: ThreadSanitizer CHECK failed #27660

benesch opened this issue Sep 13, 2018 · 54 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. RaceDetector
Milestone

Comments

@benesch
Copy link
Contributor

benesch commented Sep 13, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

$ go version
go version go1.11 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/benesch/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/benesch/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build445193843=/tmp/go-build -gno-record-gcc-switches"

What did you do?

go get -u github.com/cockroachdb/cockroach
cd $(go env GOPATH)/src/github.com/cockroachdb/cockroach
make testrace PKG=./pkg/sql/logictest TESTS=TestLogic

This builds some C and C++ dependencies and ultimately runs:

go test -race -tags ' make x86_64_linux_gnu' -ldflags '-X github.com/cockroachdb/cockroach/pkg/build.typ=development -extldflags "" -X "github.com/cockroachdb/cockroach/pkg/build.tag=v2.2.0-alpha.00000000-793-g33c7d27d82-dirty" -X "github.com/cockroachdb/cockroach/pkg/build.rev=33c7d27d8216b543ac77f0fe39d440ebebfa9e70" -X "github.com/cockroachdb/cockroach/pkg/build.cgoTargetTriple=x86_64-linux-gnu"  ' -run "TestLogic"  -timeout 25m ./pkg/sql/logictest 

What did you expect to see?

Either a successful test run or an actual race.

What did you see instead?

FATAL: ThreadSanitizer CHECK failed: ./gotsan.cc:380 "((v)) > ((0))" (0x0, 0x0)

I tried to dig into this, but gotsan.cc is apparently generated deep inside the build process for the race detector and it wasn't clear to me how to generate a gotsan.cc that matched my Go version.

Totally possible that this our (CockroachDB's) bug, but there's not a lot here for me to go on, so figured I'd file here and see if someone with more expertise in the race detector had any insight.

@benesch
Copy link
Contributor Author

benesch commented Sep 13, 2018

And here's a different tsan failure I just observed with the same command:

FATAL: ThreadSanitizer CHECK failed: ./gotsan.cc:828 "((get_block(bi))) == ((0))" (0x603f, 0x0)

@benesch
Copy link
Contributor Author

benesch commented Sep 13, 2018

And another slightly different presentation:

0x00000000026cb4e6 in __tsan::SyncClock::Unshare(__tsan::DenseSlabAllocCache*) ()
(gdb) bt
#0  0x00000000026cb4e6 in __tsan::SyncClock::Unshare(__tsan::DenseSlabAllocCache*) ()
#1  0x00000000026cb773 in __tsan::ThreadClock::ReleaseStore(__tsan::DenseSlabAllocCache*, __tsan::SyncClock*) ()
#2  0x00000000026dd43a in __tsan_go_atomic32_compare_exchange ()
#3  0x000000000062cfc3 in racecall () at /usr/local/go/src/runtime/race_amd64.s:381
#4  0x00007ffff6a906d0 in ?? ()

@ianlancetaylor ianlancetaylor added RaceDetector NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 13, 2018
@ianlancetaylor ianlancetaylor added this to the Go1.12 milestone Sep 13, 2018
@ianlancetaylor
Copy link
Member

CC @dvyukov

@dvyukov
Copy link
Member

dvyukov commented Sep 14, 2018

These instructions fail for me:

cockroach$ go test -race -tags ' make x86_64_linux_gnu' -ldflags '-X github.com/cockroachdb/cockroach/pkg/build.typ=development -extldflags "" -X "github.com/cockroachdb/cockroach/pkg/build.tag=v2.2.0-alpha.00000000-793-g33c7d27d82-dirty" -X "github.com/cockroachdb/cockroach/pkg/build.rev=33c7d27d8216b543ac77f0fe39d440ebebfa9e70" -X "github.com/cockroachdb/cockroach/pkg/build.cgoTargetTriple=x86_64-linux-gnu"  ' -run "TestLogic"  -timeout 25m ./pkg/sql/logictest 
# github.com/cockroachdb/cockroach/pkg/sql/lex
pkg/sql/lex/predicates.go:60:14: undefined: reservedKeywords
FAIL	github.com/cockroachdb/cockroach/pkg/sql/logictest [build failed]

This is on commit:

cockroach$ git show HEAD
commit 418b9169b547069cd43252055196204af8f18040 (HEAD -> master, origin/staging, origin/master, origin/HEAD)
Merge: a9c7295dd6 5504e3f9ae
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Fri Sep 14 01:24:32 2018 +0000

    Merge #29163

@benesch
Copy link
Contributor Author

benesch commented Sep 14, 2018 via email

@dvyukov
Copy link
Member

dvyukov commented Sep 14, 2018

I did that. The error I posted happens during go test.

@benesch
Copy link
Contributor Author

benesch commented Sep 14, 2018

Are you sure the make testrace command succeeded? It spews a lot of output from parallel commands so the error can really get buried. I just created a fresh VM and verified that the instructions I posted work with commit 418b9169, so I'm not sure what else it could be. I can give you access to this VM if that would be helpful.

@dvyukov
Copy link
Member

dvyukov commented Sep 14, 2018

Right, it failed somewhere in the beginning:

cockroach$ IGNORE_GOVERS=1 make testrace PKG=./pkg/sql/logictest TESTS=TestLogic IGNORE_GOVERS=1
Running make with -j4
GOPATH set to /go
Detected change in build system. Rebooting Make.
Running make with -j4
GOPATH set to /go
cd /go/src/github.com/cockroachdb/cockroach/c-deps/jemalloc && autoconf
rm -rf /go/native/x86_64-linux-gnu/protobuf
bash: autoconf: command not found
Makefile:511: recipe for target '/go/src/github.com/cockroachdb/cockroach/c-deps/jemalloc/configure' failed
make: *** [/go/src/github.com/cockroachdb/cockroach/c-deps/jemalloc/configure] Error 127
make: *** Waiting for unfinished jobs....
go install -v uptodate
bin/prereqs ./pkg/cmd/uptodate > bin/uptodate.d.tmp
mkdir -p /go/native/x86_64-linux-gnu/protobuf
cd /go/native/x86_64-linux-gnu/protobuf && cmake -DCMAKE_TARGET_MESSAGES=OFF   -Dprotobuf_BUILD_TESTS=OFF /go/src/github.com/cockroachdb/cockroach/c-deps/protobuf/cmake \
  -DCMAKE_BUILD_TYPE=Release
mv -f bin/uptodate.d.tmp bin/uptodate.d
github.com/cockroachdb/cockroach/vendor/github.com/MichaelTJones/walk
-- The C compiler identification is GNU 7.3.0
github.com/cockroachdb/cockroach/vendor/github.com/spf13/pflag
github.com/cockroachdb/cockroach/pkg/cmd/uptodate
-- The CXX compiler identification is GNU 7.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.8") 
-- Configuring done
-- Generating done
-- Build files have been written to: /go/native/x86_64-linux-gnu/protobuf

@benesch
Copy link
Contributor Author

benesch commented Sep 14, 2018 via email

@dvyukov
Copy link
Member

dvyukov commented Sep 14, 2018

How long does it generally take to crash? I've run it twice and both times got:

cockroach$ go test -race -tags ' make x86_64_linux_gnu' -ldflags '-X github.com/cockroachdb/cockroach/pkg/build.typ=development -extldflags "" -X "github.com/cockroachdb/cockroach/pkg/build.tag=v2.2.0-alpha.00000000-793-g33c7d27d82-dirty" -X "github.com/cockroachdb/cockroach/pkg/build.rev=33c7d27d8216b543ac77f0fe39d440ebebfa9e70" -X "github.com/cockroachdb/cockroach/pkg/build.cgoTargetTriple=x86_64-linux-gnu"  ' -run "TestLogic"  -timeout 25m ./pkg/sql/logictest 
I180914 14:16:05.303611 1 rand.go:75  Random seed: -5069250136795247745
E180914 14:19:23.006360 77 sql/logictest/logic.go:2228  

testdata/logic_test/crdb_internal:295: SELECT node_id, network, regexp_replace(address, '\d+$', '<port>') as address, attrs, locality, regexp_replace(server_version, '^\d+\.\d+(-\d+)?$', '<server_version>') as server_version, regexp_replace(go_version, '^go.+$', '<go_version>') as go_version
FROM crdb_internal.kv_node_status WHERE node_id = 1
expected:
    node_id  network  address           attrs  locality            server_version    go_version
    1        tcp      127.0.0.1:<port>  []     {"region": "test"}  <server_version>  <go_version>
    
but found (query options: "colnames") :
    node_id  network  address           attrs  locality            server_version    go_version
    1        tcp      127.0.0.1:<port>  []     {"region": "test"}  <server_version>  devel +9f59918cae Thu Sep 13 21:34:56 2018 +0000
    
E180914 14:19:23.006776 77 sql/logictest/logic.go:2259  testdata/logic_test/crdb_internal:302: too many errors encountered, skipping the rest of the input
E180914 14:19:32.896862 50624 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:49210,user=root,scExec] error setting up flow: column "d" does not exist
E180914 14:19:34.520305 50624 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:49210,user=root,scExec] error setting up flow: error type checking constant value: could not parse "a" as type int: strconv.ParseInt: parsing "a": invalid syntax
E180914 14:19:35.740206 50624 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:49210,user=root,scExec] error setting up flow: could not parse "a" as type int: strconv.ParseInt: parsing "a": invalid syntax
E180914 14:19:36.830273 50624 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:49210,user=root,scExec] error setting up flow: unsupported binary operator: <int> / <int> (desired <int>)
E180914 14:24:01.186527 68088 storage/consistency_queue.go:128  [n1,consistencyChecker,s1,r1/1:/M{in-ax}] computing own checksum: could not dial node ID 1: initial connection heartbeat failed: rpc error: code = Canceled desc = context canceled
E180914 14:24:01.186855 68088 storage/queue.go:788  [n1,consistencyChecker,s1,r1/1:/M{in-ax}] computing own checksum: could not dial node ID 1: initial connection heartbeat failed: rpc error: code = Canceled desc = context canceled
E180914 14:26:08.285725 96089 storage/queue.go:788  [n1,raftlog,s1,r1/1:/M{in-ax}] result is ambiguous (server shutdown)
E180914 14:26:15.409641 64257 sql/logictest/logic.go:2228  

testdata/logic_test/crdb_internal:295: SELECT node_id, network, regexp_replace(address, '\d+$', '<port>') as address, attrs, locality, regexp_replace(server_version, '^\d+\.\d+(-\d+)?$', '<server_version>') as server_version, regexp_replace(go_version, '^go.+$', '<go_version>') as go_version
FROM crdb_internal.kv_node_status WHERE node_id = 1
expected:
    node_id  network  address           attrs  locality            server_version    go_version
    1        tcp      127.0.0.1:<port>  []     {"region": "test"}  <server_version>  <go_version>
    
but found (query options: "colnames") :
    node_id  network  address           attrs  locality            server_version    go_version
    1        tcp      127.0.0.1:<port>  []     {"region": "test"}  <server_version>  devel +9f59918cae Thu Sep 13 21:34:56 2018 +0000
    
E180914 14:26:15.410048 64257 sql/logictest/logic.go:2259  testdata/logic_test/crdb_internal:302: too many errors encountered, skipping the rest of the input
E180914 14:26:27.787026 99569 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:36516,user=root,scExec] error setting up flow: column "d" does not exist
E180914 14:26:29.171953 99569 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:36516,user=root,scExec] error setting up flow: error type checking constant value: could not parse "a" as type int: strconv.ParseInt: parsing "a": invalid syntax
E180914 14:26:30.247536 99569 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:36516,user=root,scExec] error setting up flow: could not parse "a" as type int: strconv.ParseInt: parsing "a": invalid syntax
E180914 14:26:31.426094 99569 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:36516,user=root,scExec] error setting up flow: unsupported binary operator: <int> / <int> (desired <int>)
E180914 14:37:20.722139 157227 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:35082,user=root,scExec] error setting up flow: column "d" does not exist
E180914 14:37:22.977964 157227 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:35082,user=root,scExec] error setting up flow: error type checking constant value: could not parse "a" as type int: strconv.ParseInt: parsing "a": invalid syntax
E180914 14:37:26.350066 157227 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:35082,user=root,scExec] error setting up flow: could not parse "a" as type int: strconv.ParseInt: parsing "a": invalid syntax
E180914 14:37:28.390702 157227 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:35082,user=root,scExec] error setting up flow: unsupported binary operator: <int> / <int> (desired <int>)
panic: test timed out after 25m0s

goroutine 175652 [running]:
testing.(*M).startAlarm.func1()
	/src/go/src/testing/testing.go:1303 +0x13a
created by time.goFunc
	/src/go/src/time/sleep.go:172 +0x52

goroutine 1 [chan receive, 25 minutes]:
testing.(*T).Run(0xc00034e300, 0x2ee939f, 0x9, 0x3010de0, 0xc0006b7bd0)
	/src/go/src/testing/testing.go:886 +0x689
testing.runTests.func1(0xc00034e300)
	/src/go/src/testing/testing.go:1126 +0xa9
testing.tRunner(0xc00034e300, 0xc0006b7d40)
	/src/go/src/testing/testing.go:834 +0x163
testing.runTests(0xc000299ba0, 0x4f9b7a0, 0x5, 0x5, 0x0)
	/src/go/src/testing/testing.go:1124 +0x4ef
testing.(*M).Run(0xc0004c8580, 0x0)
	/src/go/src/testing/testing.go:1041 +0x2ef
github.com/cockroachdb/cockroach/pkg/sql/logictest.TestMain(0xc0004c8580)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/main_test.go:37 +0x15b
main.main()
	_testmain.go:48 +0x222

@benesch
Copy link
Contributor Author

benesch commented Sep 17, 2018

It's been reproducing for me in less than a minute. Oddly enough, it's failing to reproduce for me on the latest master. Could you try commit 33c7d27d8216b543ac77f0fe39d440ebebfa9e70? And just to double check, you are running go1.11, right? I'm also having trouble reproducing on go1.10.3.

In the meantime I'm going to bisect to see if there's something obviously wrong that we fixed.

@benesch
Copy link
Contributor Author

benesch commented Sep 17, 2018

Bisecting proved challenging thanks to some actual races that were introduced and fixed within the range of commits I was bisecting. Regardless, I can semi-reliably repro on 33c7d27d8216b543ac77f0fe39d440ebebfa9e70 on a fresh Ubuntu machine with Go 1.11.

The problem seems to occur early or not at all. In five runs, this is what I observed:

  1. Test timed out after 2m.
  2. ThreadSanitizer CHECK failed: ./gotsan.cc:746 "((v)) > ((0))" (0x0, 0x0) after 54.42s.
  3. Test timed out after 2m.
  4. ThreadSanitizer CHECK failed: ./gotsan.cc:380 "((v)) > ((0))" (0x0, 0x0) after 113.751s.
  5. Test timed out after 2m.

I could have sworn this reproduced more reliably when I original filed the bug. Still, I'm hopeful that you might be able to reproduce if you run a few more trials with a short timeout. The magic incantation to thread in a timeout via the Makefile, if that's easier, is:

$ make testrace PKG=./pkg/sql/logictest TESTS=TestLogic TESTTIMEOUT=2m

@dvyukov
Copy link
Member

dvyukov commented Sep 17, 2018

I have no luck reproducing this so far:

logictest$ go version
go version go1.11 linux/amd64
logictest$ stress -failure="ThreadSanitizer" -p=10 -timeout=2m ./logictest.test -test.run=TestLogic -test.timeout=2m
...
70 runs so far, 0 failures

p.s. stress is:
https://github.com/golang/tools/blob/master/cmd/stress/stress.go

@dvyukov
Copy link
Member

dvyukov commented Sep 17, 2018

My bets would be on:

  1. Memory corruption in native code (there is some, right). Can make sense to test with ASan/TSan/MSan.
  2. Kernel bug.

@benesch
Copy link
Contributor Author

benesch commented Sep 17, 2018

Oh, cool, I didn't know about the -failure flag to stress.

Anyway, thanks again for looking into this. We upgrade our CI environment to go1.11 today and various other tests started producing assertions like this under the race detector. We also started seeing outright segfaults:

unexpected fault address 0x27400
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x27400 pc=0x2bf0df6]

goroutine 1487 [running]:
runtime: unexpected return pc for runtime.throw called from 0x361d16b
stack: frame={sp:0x7fc94e9fe990, fp:0x7fc94e9fe9c0} stack=[0xc000ee3000,0xc000ee3800)

I bisected this down to a commit in Go itself: 8994444, titled "runtime/race: update most syso files to compiler-rt". That sure looks like our culprit, though I'm willing to believe we're still doing something wrong (e.g., corrupting memory, like you suggest) and the update to the race detector just happened to tickle this preexisting condition.

I'm going to try to get somewhere by bisecting compiler-rt itself. I may need your help in producing those syso files.

@benesch
Copy link
Contributor Author

benesch commented Sep 18, 2018

Whew, that was an exercise in patience. Bisecting was complicated by the fact that a number of revisions of compiler-rt between 68e15324 and fe2c72c5 did not compile with my GCC toolchain. Regardless, I managed to pinpoint the problem to this commit: llvm-mirror/compiler-rt@12d1690. Still unclear what the actual problem is. I'm not in the position to understand that tsan code, I'm afraid.

By the way, I'm now using this to repro:

make stressrace PKG=./pkg/ccl/backupccl TESTS=TestBackupRestoreResume STRESSFLAGS="-p 8" IGNORE_GOVERS=1 GO=~/dev/go/bin/go

The -p 8 avoids overloading my 24 CPU VM; each test execution seems to require about 3 CPUs. The IGNORE_GOVERS=1 GO=~/dev/go/bin/go is only necessary because I'm pointing at a custom version of Go. I've found that this reliably repros in under 5m.

I am once again feeling that this is more likely to be a bug in tsan than in CockroachDB. @dvyukov let me know if you disagree. My offer of providing you a VM that can reproduce this issue out of the box still stands.

@dvyukov
Copy link
Member

dvyukov commented Sep 18, 2018

I am once again feeling that this is more likely to be a bug in tsan than in CockroachDB.

Maybe. Hard to say. What's strange is that nobody reported any corruptions before and that I can't reproduce this on my machine. Can anybody else reproduce this?

@benesch
Copy link
Contributor Author

benesch commented Sep 18, 2018

Can anybody else reproduce this?

I'm going to check if any of my fellow Linux-using engineers can reproduce this issue on their laptop. Everyone else uses the same VM configuration (Ubuntu 18.04 on Google Compute Engine), so it's not all that interesting that they can reproduce. Here's the output of uname -r:

Linux gceworker-benesch 4.15.0-1018-gcp #19-Ubuntu SMP Thu Aug 16 13:38:55 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Are you willing to share the output of uname -r on your machine?

@benesch
Copy link
Contributor Author

benesch commented Sep 18, 2018

Ok, @mberhault was able to reproduce on his Linux desktop running Ubuntu 16.04 (kernel 4.4.0-135-generic) using the make stressrace PKG=./pkg/ccl/backupccl... with go1.11 and the latest CockroachDB master.

@dvyukov
Copy link
Member

dvyukov commented Sep 19, 2018

Running make stressrace PKG=./pkg/ccl/backupccl... 3 times I got:

unexpected fault address 0x2600
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x2600 pc=0x2bff9a6]

goroutine 5587 [running]:
runtime: unexpected return pc for runtime.throw called from 0x360c2ed
stack: frame={sp:0x7f7e059ec9b0, fp:0x7f7e059ec9e0} stack=[0xc0048c4000,0xc0048cc000)

runtime.throw(0x3, 0x3)
	src/runtime/panic.go:608 +0x72 fp=0x7f7e059ec9e0 sp=0x7f7e059ec9b0 pc=0x697d72
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	gopath/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xc4

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000244200, 0x36521a6, 0x23, 0x374daf0, 0xc0007c1b01)
	src/testing/testing.go:879 +0x689
testing.runTests.func1(0xc000244200)
	src/testing/testing.go:1119 +0xa9
testing.tRunner(0xc000244200, 0xc0007c1d40)
	src/testing/testing.go:827 +0x163
testing.runTests(0xc0004de4e0, 0x5bacac0, 0x2c, 0x2c, 0x0)
	src/testing/testing.go:1117 +0x4ef
testing.(*M).Run(0xc0007de000, 0x0)
	src/testing/testing.go:1034 +0x2ef
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.TestMain(0xc0007de000)
	gopath/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/main_test.go:31 +0x17d
main.main()
	_testmain.go:138 +0x222



unexpected fault address 0xe400
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0xe400 pc=0x2bff519]

goroutine 7540 [running]:
runtime: unexpected return pc for runtime.throw called from 0x4
stack: frame={sp:0x7f46ad9fe930, fp:0x7f46ad9fe960} stack=[0xc0036da000,0xc0036fa000)

runtime.throw(0x4, 0x7f46ad9fe9a0)
	src/runtime/panic.go:608 +0x72 fp=0x7f46ad9fe960 sp=0x7f46ad9fe930 pc=0x697d72
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
	gopath/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:267 +0x13f4

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000232100, 0x36521a6, 0x23, 0x374daf0, 0xc0007f1b01)
	src/testing/testing.go:879 +0x689
testing.runTests.func1(0xc000232100)
	src/testing/testing.go:1119 +0xa9
testing.tRunner(0xc000232100, 0xc0007f1d40)
	src/testing/testing.go:827 +0x163
testing.runTests(0xc0002af640, 0x5bacac0, 0x2c, 0x2c, 0x0)
	src/testing/testing.go:1117 +0x4ef
testing.(*M).Run(0xc00058e000, 0x0)
	src/testing/testing.go:1034 +0x2ef
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.TestMain(0xc00058e000)
	gopath/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/main_test.go:31 +0x17d
main.main()
	_testmain.go:138 +0x222



unexpected fault address 0x26a00
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x26a00 pc=0x2bff519]

goroutine 6520 [running]:
runtime: unexpected return pc for runtime.throw called from 0x4
stack: frame={sp:0x7fec833fe930, fp:0x7fec833fe960} stack=[0xc002736000,0xc002738000)

runtime.throw(0x4, 0x7fec833fe9a0)
	src/runtime/panic.go:608 +0x72 fp=0x7fec833fe960 sp=0x7fec833fe930 pc=0x697d72
created by github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).handleRawConn
	gopath/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:638 +0x7c2

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000754000, 0x36521a6, 0x23, 0x374daf0, 0xc0007cdb01)
	src/testing/testing.go:879 +0x689
testing.runTests.func1(0xc000754000)
	src/testing/testing.go:1119 +0xa9
testing.tRunner(0xc000754000, 0xc0007cdd40)
	src/testing/testing.go:827 +0x163
testing.runTests(0xc00031b840, 0x5bacac0, 0x2c, 0x2c, 0x0)
	src/testing/testing.go:1117 +0x4ef
testing.(*M).Run(0xc00074a100, 0x0)
	src/testing/testing.go:1034 +0x2ef
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.TestMain(0xc00074a100)
	gopath/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/main_test.go:31 +0x17d
main.main()
	_testmain.go:138 +0x222

@dvyukov
Copy link
Member

dvyukov commented Sep 19, 2018

@benesch do you know if native part of cockroachdb was testing with asan/tsan/msan? If not, that would be the first thing I would do.

@benesch
Copy link
Contributor Author

benesch commented Sep 19, 2018

Ok, great, you can reproduce!

Do you know how to read that stack trace you posted?

unexpected fault address 0x2600
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x2600 pc=0x2bff9a6]

goroutine 5587 [running]:
runtime: unexpected return pc for runtime.throw called from 0x360c2ed
stack: frame={sp:0x7f7e059ec9b0, fp:0x7f7e059ec9e0} stack=[0xc0048c4000,0xc0048cc000)

runtime.throw(0x3, 0x3)
	src/runtime/panic.go:608 +0x72 fp=0x7f7e059ec9e0 sp=0x7f7e059ec9b0 pc=0x697d72
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	gopath/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xc4

I don't understand how we go from stopper.go:192 (which launches a goroutine) to runtime.throw. (Other segfaults I've triggered in Go are understandable, by contrast.) I suppose the answer is probably memory corruption?

@benesch do you know if native part of cockroachdb was testing with asan/tsan/msan? If not, that would be the first thing I would do.

It's not tested with any of the sanitizers, unfortunately. Most of our serious testing is done in Go, so we exercise the C/C++ code via cgo. That makes it tricky to use any of the sanitizers. I've tried to get msan to work but it triggers a number of false positives due to structs with padding: #26167. I suppose it should be possible to use asan and tsan on our native code by plumbing in -fsanitize during compilation. I'll see if I can find some time to do that today.

That said, there are only four C/C++ libraries that we link: snappy, protobuf, RocksDB, and jemalloc. I've verified that this reproduces without jemalloc, so I think we can rule that one out. Snappy and protobuf are both used extensively at Google, so I'd be surprised if they were to blame. RocksDB is extremely complicated, but they test upstream with asan/tsan/ubsan. So that leaves our first-party C/C++ that glues everything together. It's certainly possible that we've had a lurking memory bug there, but a) the code's not that complicated, and b) I feel like we'd have seen it crop up somewhere before. This is all speculation, of course.

@leventeliu
Copy link

Hi, @dvyukov . Any update on this issue? We got a similar problem in our project after we updated go to version 1.11. And I just confirm that this issue exists in 1.11.2 too.

@benesch
Copy link
Contributor Author

benesch commented Nov 7, 2018

Indeed, this seems to be a straightforward double free. I'm going to try to write up a targeted test case and figure out how to submit a patch to LLVM.

@dvyukov
Copy link
Member

dvyukov commented Nov 7, 2018

You debugged the root cause? That was brave!

You already know how to build it, right?
Here are some docs on contributing:
https://github.com/google/sanitizers/wiki/AddressSanitizerHowToContribute
https://llvm.org/docs/Contributing.html
https://llvm.org/docs/Phabricator.html

There is probably lots of irrelevant stuff. The main part is to send a Phabricator review.

@dvyukov
Copy link
Member

dvyukov commented Nov 7, 2018

And a test would be good.

@benesch
Copy link
Contributor Author

benesch commented Nov 7, 2018

I think I spoke too soon. I thought the bug was a simple reference counting error with shared SyncClocks, but the problem appears to be deeper than that. In the course of my investigation I added a bit of logging to the DenseSlabAllocator. Here's a snapshot of the period before an assertion failure:

clock allocator: alloc 2522 (curproc 0x00000279c798, tid e7fc700, pos 60)
clock allocator: free  4442 (curproc 0x00000279c798, tid a1ffb700, pos 60)
clock allocator: free  4441 (curproc 0x00000279c798, tid a1ffb700, pos 61)
clock allocator: free  4440 (curproc 0x00000279c798, tid a1ffb700, pos 62)
clock allocator: free  4439 (curproc 0x00000279c798, tid a1ffb700, pos 63)
clock allocator: alloc 4441 (curproc 0x00000279c798, tid e7fc700, pos 61)
clock allocator: free  4443 (curproc 0x00000279c798, tid a1ffb700, pos 64)
clock allocator: alloc 4443 (curproc 0x00000279c798, tid e7fc700, pos 64)
clock allocator: alloc 4439 (curproc 0x00000279c798, tid e7fc700, pos 63)
clock allocator: alloc 4440 (curproc 0x00000279c798, tid e7fc700, pos 62)
clock allocator: alloc 4441 (curproc 0x00000279c798, tid e7fc700, pos 61)
clock allocator: alloc 5257 (curproc 0x00000279da58, tid 42ffd700, pos 9)
FATAL: ThreadSanitizer CHECK failed: ../rtl/tsan_dense_alloc.h:86 "((idx)) != ((0))" (0x0, 0x0)

Notice how slot 4441 is allocated twice without an intervening free! This is a guaranteed disaster.
curproc is the address of the proc struct returned by get_cur_proc(), while tid is the opaque ID reported pthread_self().

The problem seems to be that the proc struct 0x0279c798, which contains a non-threadsafe allocator cache, is simultaneously in use by goroutines e7fc700 and a1ffb700. (It's possible that those two goroutines were quickly scheduled and rescheduled on the same pthread, but I doubt it.)

So is it possible that the Go runtime is sometimes failing to update the proc struct when it moves a goroutine between pthreads? That's my best guess as to what's happening. It explains why this badness has only been observed in Go programs, and not any C++ programs.

Disclaimer that I'm not 100% confident in my instrumentation, so I might be a little bit off here.

@benesch
Copy link
Contributor Author

benesch commented Nov 8, 2018

I gathered some more evidence for the theory that the Go runtime is sharing processor data across multiple threads. I started by annotating several of the calls into tsan. Here's a snippet.

diff --git a/lib/tsan/rtl/tsan_rtl.h b/lib/tsan/rtl/tsan_rtl.h
index 5e2a745c9..130b33609 100644
--- a/lib/tsan/rtl/tsan_rtl.h
+++ b/lib/tsan/rtl/tsan_rtl.h
@@ -356,6 +356,7 @@ struct Processor {
   DenseSlabAllocCache sync_cache;
   DenseSlabAllocCache clock_cache;
   DDPhysicalThread *dd_pt;
+  bool inuse;
 };
 
 #if !SANITIZER_GO

diff --git a/lib/tsan/go/tsan_go.cc b/lib/tsan/go/tsan_go.cc
index 71a660683..ba0dd4a08 100644
--- a/lib/tsan/go/tsan_go.cc
+++ b/lib/tsan/go/tsan_go.cc
@@ -231,48 +231,84 @@ void __tsan_proc_destroy(Processor *proc) {
 }
 
 void __tsan_acquire(ThreadState *thr, void *addr) {
+  CHECK(!thr->proc()->inuse);
+  thr->proc()->inuse = true;
   Acquire(thr, 0, (uptr)addr);
+  CHECK(thr->proc()->inuse);
+  thr->proc()->inuse = false;
 }

The full diff is here: https://gist.github.com/benesch/cb3258a3eb4b573b5a3b7891db000e66

The reproductions of the original bug now fail with

FATAL: ThreadSanitizer CHECK failed: ./gotsan.cc:234 "((!thr->proc()->inuse)) != (0)" (0x0, 0x0)

instead of the various assertion errors they would previously fail with. That strongly implies that the root cause of all the assertion failures is the incorrect sharing of thr->proc(). And programs which did not trigger the previous assertions do not trigger this new assertion.

I've started poking around the Go scheduler but I'm way out of my depth here. I'll note that this diff
to disable stealing of Ps in syscalls

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f82014eb92..5993e2a6da 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -1081,7 +1081,7 @@ func stopTheWorldWithSema() {
 	// try to retake all P's in Psyscall status
 	for _, p := range allp {
 		s := p.status
-		if s == _Psyscall && atomic.Cas(&p.status, s, _Pgcstop) {
+		if s == _Psyscall && false && atomic.Cas(&p.status, s, _Pgcstop) {
 			if trace.enabled {
 				traceGoSysBlock(p)
 				traceProcStop(p)
@@ -1455,7 +1455,7 @@ func forEachP(fn func(*p)) {
 	// off to induce safe point function execution.
 	for _, p := range allp {
 		s := p.status
-		if s == _Psyscall && p.runSafePointFn == 1 && atomic.Cas(&p.status, s, _Pidle) {
+		if s == _Psyscall && false && p.runSafePointFn == 1 && atomic.Cas(&p.status, s, _Pidle) {
 			if trace.enabled {
 				traceGoSysBlock(p)
 				traceProcStop(p)
@@ -4457,7 +4457,7 @@ func retake(now int64) uint32 {
 		}
 		pd := &_p_.sysmontick
 		s := _p_.status
-		if s == _Psyscall {
+		if s == _Psyscall && false {
 			// Retake P from syscall if it's there for more than 1 sysmon tick (at least 20us).
 			t := int64(_p_.syscalltick)
 			if int64(pd.syscalltick) != t {

vastly decreases the probability that xenomint/sqlite will hit the tsan assertion. (I initially thought it had fixed the problem, but alas.) With the tsan patch but not the Go scheduler patch, running xenomint/sqlite's tests under race reliably hits the tsan assertion on every run. With both patches, I managed to get 187 runs before it failed. I'm not sure what to make of that.

@dvyukov
Copy link
Member

dvyukov commented Nov 8, 2018

If 2 threads use the same Processor, it would explain all kinds of weird memory corruptions in tsan runtime.

thr->proc() obtains current Processor from Go runtime as g.m.p.racectx. The firing CHECK means that either 2 goroutines run with the same g.m, or 2 m's runtime with the same m.p. Both of which looks badly wrong.

I am thinking of some way to check this condition reliably in Go runtime and catch it es early as possible. But so far have not come up with anything.

@dvyukov
Copy link
Member

dvyukov commented Nov 8, 2018

This does not happen with Go1.10, right? Then one option is to bisect Go from Go1.10 to Go1.11 to find the root cause.

@benesch
Copy link
Contributor Author

benesch commented Nov 8, 2018

This does not happen with Go1.10, right? Then one option is to bisect Go from Go1.10 to Go1.11 to find the root cause.

Actually, I think this does still happen with Go 1.10, just far, far less frequently. I checked my recollection with @petermattis today, and we remembered seeing spurious failures from race-enabled CockroachDB builds at least as far back as early 2017. The difference was that those failures would only show up once every few weeks, back when we used to run a several-node CockroachDB cluster with race binaries. As far as I can remember, the symptoms were the same: various assertion failures inside tsan that were rather inexplicable.

I'll double-check now.

@benesch
Copy link
Contributor Author

benesch commented Nov 8, 2018

Yep, xenomint/sqlite blows up on go1.10 if you add the additional assertions to the race detector. Looks like this has been a latent bug in the Go runtime.

@benesch
Copy link
Contributor Author

benesch commented Nov 9, 2018

Perhaps this isn't so bad. On a hunch, I added a bit more information to the new assertions:

diff --git a/lib/tsan/go/tsan_go.cc b/lib/tsan/go/tsan_go.cc
index 5f2507b7d..d3345c6b7 100644
--- a/lib/tsan/go/tsan_go.cc
+++ b/lib/tsan/go/tsan_go.cc
 void __tsan_acquire(ThreadState *thr, void *addr) {
+  auto inuse_goid = thr->proc()->inuse_goid;
+  if (inuse_goid != 0) {
+    __sanitizer::Printf("tsan_acquire(%p): proc in use by %d (this goid: %d)\n", addr, inuse_goid, thr->goid);
+    __sanitizer::Die();
+  }
+  thr->proc()->inuse_goid = thr->goid;
   Acquire(thr, 0, (uptr)addr);
+  CHECK_EQ(thr->proc()->inuse_goid, thr->goid);
+  thr->proc()->inuse_goid = 0;
 }

Then I noticed that all of the failures seemed to be triggered by a call to __tsan_acquire:

tsan_acquire(0x000005035be8): proc in use by 48 (this goid: 49)

Even more suspiciously, every failure was an acquisition of the exactly the same memory address, 0x5035be8. Kind of seems like a global variable...

$ nm sqlite.test | grep 000005035be8
0000000005035be8 s runtime.racecgosync

Looks like cgocall is just calling raceacquire(&racecgosync) before it allocates a P (via exitsyscall), and that's not valid.

I'm hopeful that I've actually found the root cause this time.

@dvyukov
Copy link
Member

dvyukov commented Nov 9, 2018

Just a minute ago I started looking at:

func endcgo(mp *m) {
	mp.incgo = false
	mp.ncgo--

	if raceenabled {
		raceacquire(unsafe.Pointer(&racecgosync))
	}
}

:)

@dvyukov
Copy link
Member

dvyukov commented Nov 9, 2018

It needs to do raceacquire after exitsyscall. I think we beaten by the dangling g.m.p pointer left in reentersyscall.

@dvyukov
Copy link
Member

dvyukov commented Nov 9, 2018

cgocallbackg is also badly broken.

@dvyukov
Copy link
Member

dvyukov commented Nov 9, 2018

@benesch do you want to fix yourself? Since you already done so much work on this, don't want to steal it from you.

I think we also need to change reentersyscall to leave the dangling p pointer in some shadow variable instead of the g.m.p. g.m.p set makes impression that m and p are wired when they are actually not. Then only exitsyscallfast and retake will use the shadow p variable. This should have been caught this bug.

@benesch
Copy link
Contributor Author

benesch commented Nov 9, 2018

Whoops, sorry, missed your messages! About to send a CL.

I like your idea of a shadow variable but I didn't want to embark on that refactor myself, so I took a less intrusive approach. Now that I have your blessing for the refactor I'm happy to do that instead since it will prevent future mistakes of this kind—I agree that it is extremely confusing that an M can point to a P that it doesn't currently own. But I'll send the CL for what I have in the meantime to get the discussion started.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/148717 mentions this issue: runtime: never call into race detector without runnable P

@benesch
Copy link
Contributor Author

benesch commented Nov 9, 2018

I think we also need to change reentersyscall to leave the dangling p pointer in some shadow variable instead of the g.m.p. g.m.p set makes impression that m and p are wired when they are actually not. Then only exitsyscallfast and retake will use the shadow p variable. This should have been caught this bug.

Ok, updated the CL to take this approach.

@petermattis
Copy link

Nice sleuthing! Gentle plea for this to be back-ported to go1.11 as this bug is preventing CockroachDB from upgrading to go1.11 (yes, it only affects race builds, but we run go test -race during CI).

@ianlancetaylor
Copy link
Member

@gopherbot Please open an issue to backport to 1.11

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #28690 (for 1.11).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/148899 mentions this issue: runtime: ensure m.p is never stale

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/148823 mentions this issue: runtime: wire g and p

gopherbot pushed a commit that referenced this issue Nov 10, 2018
When a goroutine enters a syscall, its M unwires from its P to allow
the P to be retaken by another M if the syscall is slow. The M retains a
reference to its old P, however, so that if its old P has not been
retaken when the syscall returns, it can quickly reacquire that P.

The implementation, however, was confusing, as it left the reference to
the potentially-retaken P in m.p, which implied that the P was still
wired.

Make the code clearer by enforcing the invariant that m.p is never
stale. entersyscall now moves m.p to m.oldp and sets m.p to 0;
exitsyscall does the reverse, provided m.oldp has not been retaken.

With this scheme in place, the issue described in #27660 (assertion
failures in the race detector) would have resulted in a clean segfault
instead of silently corrupting memory.

Change-Id: Ib3e03623ebed4f410e852a716919fe4538858f0a
Reviewed-on: https://go-review.googlesource.com/c/148899
Run-TryBot: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
changkun added a commit to golang-design/under-the-hood that referenced this issue Jan 27, 2019
@golang golang locked and limited conversation to collaborators Nov 10, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. RaceDetector
Projects
None yet
Development

No branches or pull requests

6 participants