
storage: WaitForFullReplication hangs and causes test flakes #40805

Closed
thoszhang opened this issue Sep 16, 2019 · 4 comments
Assignees: irfansharif
Labels: A-kv-replication (Relating to Raft, consensus, and coordination.)

Comments

@thoszhang
Contributor

TestParallel/subquery_retry_multinode timed out and caused a build to fail, and the stack traces show that it got stuck in TestCluster.WaitForFullReplication(). The logs contain roughly 10 seconds' worth of `testutils/testcluster/testcluster.go:718 [n1,s1] has 1 underreplicated ranges` messages.

#38565 is potentially related.

Test logs (internal): https://drive.google.com/file/d/1kQeirJNVxZlUUtgT_SpRYjlfw7U1tw-m/view?usp=sharing
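
For context, the failing tests follow the usual TestCluster pattern: start a multi-node cluster and block on WaitForFullReplication() before running the test body. The following is a minimal sketch of that pattern (not the failing test itself), assuming the standard cockroach testcluster helpers:

```go
package example_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

func TestWaitForFullReplicationSketch(t *testing.T) {
	// Start a three-node cluster with automatic replication.
	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{
		ReplicationMode: base.ReplicationAuto,
	})
	defer tc.Stopper().Stop(context.Background())

	// This is the call the stack traces show the test stuck in: it
	// polls until no range reports being under-replicated, which is
	// what produces the repeated "has 1 underreplicated ranges" log
	// lines mentioned above.
	if err := tc.WaitForFullReplication(); err != nil {
		t.Fatal(err)
	}
}
```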

@thoszhang thoszhang added the A-kv-replication Relating to Raft, consensus, and coordination. label Sep 16, 2019
@ajwerner
Contributor

@irfansharif any chance you could take a look at this? I've also seen

```
make roachprod-stress CLUSTER=ajwerner-stress PKG=./pkg/ccl/partitionccl TESTS=TestRepartitioning
```

fail in a way where we simply never move a replica that needs to move (the SucceedsSoon times out). That may or may not be related. That test is currently quite flaky with fatal errors due to issues fixed in #40751, but when it gets past those it eventually fails and it looks like we're just not doing anything. I verified that the replica that needs to move does indeed get the right zone config.
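
For reference, the part that times out is a testutils.SucceedsSoon retry loop around the placement check. A rough sketch of that pattern, with checkReplicaPlacement as a hypothetical stand-in for the test's real assertion:

```go
package example_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// checkReplicaPlacement is a hypothetical placeholder for the test's real
// assertion about which store the repartitioned range's replica should
// live on.
func checkReplicaPlacement() error {
	// The real test inspects range/replica state; returning nil here just
	// keeps the sketch self-contained.
	return nil
}

func TestRepartitioningSketch(t *testing.T) {
	// SucceedsSoon retries the closure with backoff until it returns nil
	// or its deadline expires. The flake described above is that deadline
	// expiring because the replica is never actually moved.
	testutils.SucceedsSoon(t, func() error {
		return checkReplicaPlacement()
	})
}
```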

@irfansharif irfansharif self-assigned this Sep 16, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 17, 2019
This change demonstrates a totally reliable flake during repartitioning.
My guess is this implies that the intermittent flakes we've observed in cockroachdb#40805
are due to the server's automatic upgrade happening late in the test.

Take this diff and run:
```
make test PKG=./pkg/ccl/partitionccl TESTS=Repartition TESTFLAGS=-v 2>&1 | tee out.$(date +%s)
```

Release Justification: definitely don't release this, it just repros a failure.

Release note: None
@ajwerner
Contributor

Turns out my comment above is totally unrelated; see #40823.

@irfansharif
Contributor

Taking a look at this now.

@irfansharif
Contributor

irfansharif commented Sep 25, 2019

The ~10s you've seen waiting for full replication seems typical, and is what #38565 is tracking.

```
panic: test timed out after 12m0s

goroutine 5157537 [running]:
testing.(*M).startAlarm.func1()
	/usr/local/go/src/testing/testing.go:1334 +0xdf
created by time.goFunc
```

So this was just the test binary's timeout firing; we've since bumped it up in #40838.
