roachtest: ycsb/F/nodes=3/cpu=32 failed #70019
Comments
Moving to SQL-Queries for the same reasons as the master sibling; see #69545 (comment).
This does not seem like a blocker, but I'm also not sure this is a SQL Queries issue. @rafiss you seem to have the most familiarity with this issue -- can you comment on this?
@rafiss did you get a chance to take a look at this at all?
I did look for a few hours and couldn't repro or find anything helpful yet.
Have you tried reproducing via something like
I ran with roachstress locally and with a smaller count; I didn't know I could use a GCE project. I'll try that.
I spent a couple of hours poring over the artifacts but couldn't find anything. The only idea I had was to come up with some way for the client to signal all the nodes in the database to panic all at once when an unexpected error like this occurs, with the understanding that panic will dump a core when GOTRACEBACK=crash is specified.
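For reference, a minimal sketch of the GOTRACEBACK=crash behavior that idea relies on; the program and error below are made up for illustration:

```go
// Run with: GOTRACEBACK=crash go run main.go
// With GOTRACEBACK=crash, a panic prints every goroutine's stack and then the
// runtime aborts the process (SIGABRT on Unix), which produces a core file if
// the OS allows it (e.g. ulimit -c unlimited).
package main

import (
	"errors"
	"fmt"
)

func doWork() error {
	// Stand-in for the real workload query that fails unexpectedly.
	return errors.New("driver: bad connection")
}

func main() {
	if err := doWork(); err != nil {
		// Hypothetical: crash deliberately on the unexpected error so a core
		// dump is available for post-mortem debugging.
		panic(fmt.Sprintf("unexpected error, crashing for core dump: %v", err))
	}
}
```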
FYI, I'm using this test as a guinea pig for stressing a roachtest in CI: #70435 (comment). Will let you know if that turns up a repro. @cucaroach, when the roachtest times out, the test harness sends SIGSEGV to ~everything to get stack traces, and we already collect the cores (though not as artifacts; with
@tbg, thanks, I didn't know that. I have some changes I'm playing with to re-enable core gathering and enable gzip compression of cores in order to streamline OOM investigations (#69317). Maybe --debug could drive core gathering as well if we don't want to do it all the time? They usually compress very well. I'm assuming that in cases like this, even though we got a "bad connection", the node is still running, because there doesn't appear to be anything amiss in the logs. Any hints about why the connection died could be long gone by the time the core is generated, but who knows, we might get lucky.
With |
roachtest.ycsb/F/nodes=3 failed with artifacts on release-21.2 @ c46f5a5a098577b936e56f03d20c97300b4cce61:
Same failure on other branches
Something I noticed the other day while using the debugger on a local unit test was that if you pause the process for a while, you can also get a "bad connection" error.
Worth looking into -- though to clarify, our |
roachtest.ycsb/F/nodes=3 failed with artifacts on release-21.2 @ 24021ba163e4ac438b169d575cf1527a4aae394d:
Same failure on other branches
Any updates to report for the triage meeting?
I'm wondering if it could still be lib/pq#1000. I previously made #68665 to handle this, but maybe the check is too strict; perhaps it should be more lenient.
The reason it could be that is that the error always happens exactly 120.0s after the test starts, which is how long the ramp-up period for this test is. After the ramp period, the context is cancelled and a new context is created; see cockroach/pkg/workload/cli/run.go Lines 443 to 477 in b50be15.
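A rough sketch of that ramp-then-run pattern (illustrative only; the identifiers and durations are assumptions, not the actual run.go code):

```go
// Illustrative sketch of the ramp-then-run pattern; identifiers and durations
// are assumptions, not the real cockroach/pkg/workload/cli/run.go code.
package main

import (
	"context"
	"fmt"
	"time"
)

// runWorkers stands in for the workload's worker loop: it issues queries until
// its context is done.
func runWorkers(ctx context.Context, label string) {
	<-ctx.Done()
	fmt.Println(label, "phase stopped:", ctx.Err())
}

func main() {
	const rampDuration = 2 * time.Second // the real test uses a 120s ramp

	// Ramp phase: runs under its own context, cancelled once the ramp elapses.
	rampCtx, cancelRamp := context.WithCancel(context.Background())
	go func() {
		time.Sleep(rampDuration)
		cancelRamp() // queries in flight on pooled connections get interrupted here
	}()
	runWorkers(rampCtx, "ramp")

	// Measured phase: a brand-new context. If lib/pq marked a pooled connection
	// as broken when the ramp cancellation interrupted it, the next query on
	// that connection can return driver.ErrBadConn even though this context
	// was never cancelled.
	workCtx, cancelWork := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancelWork()
	runWorkers(workCtx, "work")
}
```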
So you're suggesting we ignore ErrBadConn in that case?
Yeah, my thinking was that there might be a weird case where the client hasn't seen the
ErrBadConn is the error you get if the connection is unexpectedly broken, which could perhaps include the node being down. But I'm pretty sure you'd see a timeout first.
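For context, driver.ErrBadConn is the sentinel defined in database/sql/driver; a lenient check along the lines being discussed might look something like the sketch below (the package and function names are made up, and this is not the actual code from #68665):

```go
// Package workloadcheck is an illustrative sketch, not the actual #68665 code.
package workloadcheck

import (
	"context"
	"database/sql/driver"
	"errors"
)

// tolerateRampError reports whether an error observed right at the end of the
// ramp period can be ignored: either it is the ramp context's own cancellation,
// or it is driver.ErrBadConn from a pooled connection that the cancellation
// left in a broken state.
func tolerateRampError(rampCtx context.Context, err error) bool {
	if errors.Is(err, context.Canceled) && rampCtx.Err() != nil {
		return true
	}
	return errors.Is(err, driver.ErrBadConn)
}
```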
We have already picked up the linked PR: lib/pq#1000 first appears in v1.9.0, and we are now using v1.10.2. To be clear, lib/pq#1000 is the PR that caused the behavior of
I added the following diff so I could get a stack trace of which worker was actually getting ErrBadConn:
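A rough stand-in for that kind of instrumentation (illustrative only; the helper below and its name are assumptions, not the original diff):

```go
// Package workloaddebug is an illustrative stand-in for the debugging diff,
// not the original change.
package workloaddebug

import (
	"database/sql/driver"
	"errors"
	"fmt"
	"os"
	"runtime/debug"
)

// logBadConn prints a goroutine stack trace the moment a worker observes
// driver.ErrBadConn, so the failing call site (and the context it ran under)
// shows up in the test logs.
func logBadConn(err error) {
	if errors.Is(err, driver.ErrBadConn) {
		fmt.Fprintf(os.Stderr, "worker got ErrBadConn:\n%s\n", debug.Stack())
	}
}
```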
After running roachstress.sh for a while:
That corresponds to this line: cockroach/pkg/workload/cli/run.go Line 475 in bdb4c1a
The notable thing here is that it's coming from the non-cancelled context. I'm going to try to see where exactly ErrBadConn is coming from...
I wrote this quick test based on what the workload does. It has non-deterministic behavior: sometimes the test passes, but sometimes it gets a "bad connection" error.
I have a non-YCSB repro and was able to reproduce the same behavior on 21.1. So this is almost certainly a workload-side problem, and if not that, it is definitely not a new CRDB v21.2 bug. It seems like a race condition in lib/pq. The test below fails somewhere around 50% of the time. It fails much less frequently when I run against a local single-node CRDB cluster, but more reliably if I run against a 3-node roachprod cluster. If I add a
So I'll remove the GA-blocker tag since it shouldn't block any release, but I'll keep working on this to prevent it from flaking, and will backport whatever workload fix I make to release-21.2.
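A sketch in the spirit of that test, not the original code (the connection string, query, timings, and test name below are assumptions):

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"testing"
	"time"

	_ "github.com/lib/pq" // registers the "postgres" driver
)

// TestBadConnAfterCancel: cancel a context while a query is in flight, then
// issue another query with a fresh context and check whether
// driver.ErrBadConn surfaces on the reused pooled connection.
func TestBadConnAfterCancel(t *testing.T) {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		t.Fatal(err)
	}
	defer db.Close()
	// Force reuse of a single connection so the second query lands on the one
	// that the cancellation interrupted.
	db.SetMaxOpenConns(1)

	rampCtx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel() // cancel while the query below may still be in flight
	}()
	// Expected to fail with a context cancellation error; ignored on purpose.
	_, _ = db.ExecContext(rampCtx, "SELECT pg_sleep(1)")

	// Fresh, non-cancelled context, analogous to the post-ramp context.
	workCtx, cancelWork := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancelWork()
	if _, err := db.ExecContext(workCtx, "SELECT 1"); err != nil {
		if errors.Is(err, driver.ErrBadConn) {
			t.Fatalf("got ErrBadConn on the fresh context: %v", err)
		}
		t.Fatalf("unexpected error: %v", err)
	}
}
```

Pinning the pool to a single connection makes the reuse of the interrupted connection deterministic, which is presumably why a real workload with many pooled connections only hits this intermittently.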
roachtest.ycsb/F/nodes=3/cpu=32 failed with artifacts on release-21.2 @ 99a4816fc272228a63df20dae3cc41d235e705f3:
Reproduce
See: roachtest README
Same failure on other branches