roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [self-delegated snaps] #72083
https://share.polarsignals.com/73a06c8/ @erikgrinaker this seems to be something we should be looking into more actively. It is "sort of" expected that we're seeing lots of memory held up by sideloaded proposals; after all, this phase of the test mostly crams lots of SSTs into our log and then asks us to send them to two followers, who are possibly also a region hop away. But something seems to have changed: we didn't use to see this, #71132 hasn't prevented it from happening, and I looked before and couldn't find any other obvious leaks.

So currently I expect that we'll find we happen to have a lot of groups catching up followers at once, overwhelming the system. If that is the case, it would be difficult to even think of a quick fix. We would need to either delay adding new entries to the log or delay sending entries to followers. The latter happens inside of raft, so the easier choice is the former. Then the question becomes: do we apply it to SSTs only, or to all proposals? SSTs are the easier target since there is already a concept of delaying them, plus they are not that sensitive to it. But first we need to confirm that what I'm describing is really what we're seeing.
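To make the "delay adding new entries to the log" option a bit more concrete, here is a minimal sketch (not CockroachDB code; the gate, its names, and the budget size are all assumptions) of gating AddSSTable proposals on a shared byte budget with `golang.org/x/sync/semaphore`, so that new SST proposals block above raft while earlier ones are still being replicated:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

// sstGate bounds the total size of sideloaded (AddSSTable) proposals in
// flight. Acquire blocks until earlier SSTs have been replicated and their
// budget has been returned via Release.
type sstGate struct {
	sem *semaphore.Weighted
}

func newSSTGate(maxBytes int64) *sstGate {
	return &sstGate{sem: semaphore.NewWeighted(maxBytes)}
}

// Acquire waits for size bytes of budget, or returns early when ctx is done.
func (g *sstGate) Acquire(ctx context.Context, size int64) error {
	return g.sem.Acquire(ctx, size)
}

// Release returns budget once the SST has been appended and sent to followers.
func (g *sstGate) Release(size int64) {
	g.sem.Release(size)
}

func main() {
	g := newSSTGate(128 << 20) // allow at most 128 MiB of SSTs in flight

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	if err := g.Acquire(ctx, 32<<20); err != nil {
		fmt.Println("proposal delayed too long:", err)
		return
	}
	fmt.Println("proposing 32 MiB SST")
	g.Release(32 << 20) // once replication of this entry has caught up
}
```

The delay happens before the entry is appended to the log, i.e. above raft, which is the "easier choice" described above.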
Yeah, this seems bad. We seem to be enforcing per-range size limits that should mostly prevent this, so I agree that this seems likely to be because we're catching up many groups at once. Would it be worth bisecting this to find out what triggered it?
Hard to say; it sure would be nice to know the commit, if there is one. On the other hand, it would likely be extremely painful. I think I used to do hundreds of runs when working on #69414, though, and never saw the OOM there. This was based on ab1fc34, so I think that would be our "good" commit (though it has the inconsistency). Now, when did I first see this OOM? I think it was in #71050. Note that this isn't the exact same OOM (the memory is held in the inefficiency fixed in #71132), but I think it's still the same issue. Hmm, maybe it's fine? Really depends on how clean the repro loop is. I think we should run

edit: test balloon launched, https://teamcity.cockroachdb.com/viewLog.html?buildId=3683316&
Ok, the roachstress-CI thing seems to work. Going to log the bisect here and update as I make progress. I'm using
d1231cf (confirming starting bad commit): https://teamcity.cockroachdb.com/viewQueued.html?itemId=3683412; we expect this to produce the failure.
Hmm, so stressing this test (…)
Screw it, going to try stressing tpccbench as-is. I don't have it in me to patch each commit to just do the import, etc.; let's see what we get.
Oops, that was the old test again. Ok, here for reals, first bad commit: `BRANCH=release-21.2 SHA=d1231cff60125b397ccce6c79c9aeea771cdcca4 TEST=tpccbench/nodes=9/cpu=4/multi-region COUNT=50 ~/roachstress-ci.sh`
They all passed, too. We were supposed to see an OOM here.
Interesting, I suppose there must have been aggravating circumstances in the initial failure -- perhaps a failure mode that caused concurrent …

I had a look at the debug.zip and noticed that we have several nodes with ~200 outbound snapshots in progress concurrently. All of these appear to come via …
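For a sense of what a cap on concurrent outbound snapshots could look like, here is a minimal sketch (hypothetical names, not the actual kvserver code) that hands out a fixed number of slots and turns senders away once they are all taken, rather than letting ~200 snapshots pile up on one node:

```go
package main

import (
	"errors"
	"fmt"
)

// snapshotSlots caps the number of outbound snapshots a node sends at once.
// TryAcquire is non-blocking: if every slot is taken, the caller should retry
// the snapshot later instead of queueing up more memory.
type snapshotSlots struct {
	tokens chan struct{}
}

func newSnapshotSlots(max int) *snapshotSlots {
	s := &snapshotSlots{tokens: make(chan struct{}, max)}
	for i := 0; i < max; i++ {
		s.tokens <- struct{}{}
	}
	return s
}

var errTooManySnapshots = errors.New("too many outbound snapshots in progress")

func (s *snapshotSlots) TryAcquire() error {
	select {
	case <-s.tokens:
		return nil
	default:
		return errTooManySnapshots
	}
}

func (s *snapshotSlots) Release() {
	s.tokens <- struct{}{}
}

func main() {
	slots := newSnapshotSlots(2) // allow at most 2 concurrent outbound snapshots

	for i := 0; i < 3; i++ {
		if err := slots.TryAcquire(); err != nil {
			fmt.Printf("snapshot %d: %v\n", i, err) // the third attempt is rejected
			continue
		}
		fmt.Printf("snapshot %d: sending\n", i)
	}
}
```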
Just for the record, if we wanted to limit the size of the messages, we'd have to work something down into raft, onto this line: https://github.com/cockroachdb/vendored/blob/master/go.etcd.io/etcd/raft/v3/raft.go#L435. Instead of a fixed maxMsgSize, we would need to pass an interface that dynamically limits the budget, i.e. something like

```go
limiter interface {
	Request(size uint64) bool
}
```

and if the limiter returns false, we don't send anything else. The main new thing that comes out of this is that …
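On the CockroachDB side, such a limiter could be backed by a shared byte budget. Here is a minimal sketch of one possible implementation (the `Release` method and all names are assumptions; raft today only takes a fixed max message size):

```go
package main

import (
	"fmt"
	"sync"
)

// budgetLimiter tracks a fixed byte budget shared by all ranges. Bytes are
// reserved as entries are handed to raft for sending and returned once the
// corresponding message has been sent (or dropped).
type budgetLimiter struct {
	mu       sync.Mutex
	capacity uint64 // total bytes allowed in flight
	inFlight uint64 // bytes currently handed out
}

// Request reserves size bytes if the budget allows it. Returning false tells
// the caller (raft, in the proposal above) to stop appending entries to the
// outgoing message.
func (l *budgetLimiter) Request(size uint64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight+size > l.capacity {
		return false
	}
	l.inFlight += size
	return true
}

// Release returns bytes to the budget once the message has left the process.
func (l *budgetLimiter) Release(size uint64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if size > l.inFlight {
		size = l.inFlight
	}
	l.inFlight -= size
}

func main() {
	l := &budgetLimiter{capacity: 64 << 20} // 64 MiB in flight at most
	fmt.Println(l.Request(32 << 20))        // true
	fmt.Println(l.Request(48 << 20))        // false: would exceed the budget
	l.Release(32 << 20)
	fmt.Println(l.Request(48 << 20)) // true again
}
```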
Last failure is [perm denied #72635]
@andrewbaptist @erikgrinaker Thanks for the heads-up. The |
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on master @ 34dc56fbb5789b39be47b110bf22332c7f5654f6:
Same failure on other branches
87533: sqlliveness: add timeouts to heartbeats r=ajwerner a=aadityasondhi
Previously, sqlliveness heartbeat operations could block on the transactions that were involved. This change introduces some timeouts of the length of the heartbeat during the create and refresh operations. Resolves #85541. Release note: None. Release justification: low-risk bugfix to existing functionality.

88293: backupccl: elide expensive ShowCreate call in SHOW BACKUP r=stevendanna a=adityamaru
In #88376 we see the call to `ShowCreate` taking ~all the time on a cluster with 2.5K empty tables. In all cases except `SHOW BACKUP SCHEMAS` we do not need to construct the SQL representation of the table's schema. This results in a marked improvement in the performance of `SHOW BACKUP`, as can be seen in #88376 (comment). Fixes: #88376. Release note (performance improvement): `SHOW BACKUP` on a backup containing several table descriptors is now more performant.

88471: sql/schemachanger: plumb context, check for cancelation sometimes r=ajwerner a=ajwerner
Fixes #87246. This will also improve tracing. Release note: None.

88557: testserver: add ShareMostTestingKnobsWithTenant option r=msbutler a=stevendanna
The new ShareMostTestingKnobs copies nearly all of the testing knobs specified for a TestServer to any tenant started for that server. The goal here is to make it easier to write tests that depend on testing hooks that work under probabilistic tenant testing. Release justification: non-production code change. Release note: None.

88562: upgrade grpc to v1.49.0 r=erikgrinaker a=pavelkalinnikov
Fixes #81881. Touches #72083. Release note: upgraded grpc to v1.49.0 to fix a few panics that the old version caused.

88568: sql: fix panic due to missing schema r=ajwerner a=ajwerner
A schema might not exist because it has been dropped. We need to mark the lookup as required. Fixes #87895. Release note (bug fix): Fixed a bug in pg_catalog tables which could result in an internal error if a schema is concurrently dropped.

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Aaditya Sondhi <aadityas@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
It looks like the issue is that admin scatter takes ~1 hour on these tests. Looking at runs that succeeded and failed, they all take on the order of 55+ minutes to complete this step; running it manually twice also confirmed this number. I'm planning to change the test timeout here: `cockroach/pkg/workload/cli/run.go`, line 428 (at 682b303).
Relates to #72083. Allow scatter to complete. Release note: None
The issue was that scatter was taking between 55 minutes and 1 hour 18 minutes to complete (based on running the test 10 times). Nothing was hung, however, and the tests all completed successfully after bumping the timeout. Since this is a test timing issue only, it would make sense to backport this to 22.2 (and probably also remove the release-blocker label).
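For illustration, a prepare step guarded by a 90-minute timeout could look roughly like the sketch below (a sketch only; the constant, the helper, and their names are made up, while the actual change is the timeout bump in `pkg/workload/cli/run.go`):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// prepareTimeout bounds the workload "prepare" phase, which includes the
// admin scatter. The runs above put scatter at 55 to 78 minutes, so 90
// minutes leaves headroom without letting a truly hung run block forever.
const prepareTimeout = 90 * time.Minute

// runPrepare is a stand-in for the prepare step (import + scatter).
func runPrepare(ctx context.Context, prepare func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, prepareTimeout)
	defer cancel()

	if err := prepare(ctx); err != nil {
		if errors.Is(ctx.Err(), context.DeadlineExceeded) {
			return fmt.Errorf("prepare did not finish within %s: %w", prepareTimeout, err)
		}
		return err
	}
	return nil
}

func main() {
	err := runPrepare(context.Background(), func(ctx context.Context) error {
		// Pretend the scatter finishes quickly in this toy example.
		select {
		case <-time.After(10 * time.Millisecond):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	})
	fmt.Println("prepare result:", err)
}
```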
Do we know why this started failing a month ago? Is this now expected behavior, or is the slowdown pathological?
The best I can tell is that the test was working consistently until about May 26th. After that, it didn't run again until July 8th (although I'm not sure why). From July 8th until now, it has been failing about half the time. I tried to see if there was any related reason for this, but there have been a lot of changes between then and now. Even when it was succeeding before, it ran for about the same time as now (4-5 hours total), so I'm not sure what exactly changed.
Ok, thanks. If we're sure the 1 hour+ scatter times here are expected, then I suggest we close this out with the timeout bump and deal with any new failures separately. Thanks for looking into this!
Sounds good - merging this timeout change. This is definitely a good test to have running consistently. It is possible that the scatter times have gotten slightly worse, but scatter was completely rewritten about 6-9 months ago, so fixes that went into it over the past few months may be related. It is also a strange operation that should be re-examined at some point in the near future, as it runs "out-of-band" of other things.
88550: kvserver: use execution timestamps for verification when available r=erikgrinaker a=tbg
Now that "most" operations save their execution timestamps, use them for verification. This has the undesirable side effect of failing the entire test suite, which didn't bother specifying timestamps for most operations. Now they are required, and need to be present, at least for all mutations. I took the opportunity to also clean up the test helpers a bit, so now we don't have to pass an `error` when it's not required. The big remaining caveat is that operations that return with an ambiguous result don't necessarily have a commit timestamp. I *think* this is only an implementation detail. We *could* ensure that `AmbiguousResultError` always contains the one possible commit timestamp. This should work since `TxnCoordSender` is always local to `kvnemesis`, and so there's no "fallible" component between the two. This would result in a significant simplification of `kvnemesis`, since as is, when there are ambiguous deletions, we have to materialize them but cannot assign them a timestamp. This complicates various code paths, and to be honest I'm not even sure what exactly we verify and how it all works when there are such "half-materialized" writes. I would rather do away with the concept altogether. Clearly we also won't be able to simplify the verification to simply use commit order if there are operations that don't have a timestamp, which is another reason to keep pushing on this. Release note: None

88641: workload: Bump prepare timeout to 90 minutes r=aayushshah15 a=andrewbaptist
Relates to #72083. Allow scatter to complete. Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Andrew Baptist <baptist@cockroachlabs.com>
Two follow-up questions,
These are great questions, and unfortunately I don't know the full answer to either of them.
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on master @ d91fead28392841a943251842fbd43a0affb2eca:
See: roachtest README
See: How To Investigate (internal)
Same failure on other branches
Jira issue: CRDB-10940