
storage: WaitForFullReplication hangs and causes test flakes #40805

Closed
thoszhang opened this issue Sep 16, 2019 · 4 comments
Assignees: irfansharif
Labels: A-kv-replication (Relating to Raft, consensus, and coordination.)

Comments

@thoszhang
Contributor

TestParallel/subquery_retry_multinode timed out and caused a build to fail, and the stack traces show that it got stuck in TestCluster.WaitForFullReplication(). The logs contain roughly 10 seconds' worth of `testutils/testcluster/testcluster.go:718 [n1,s1] has 1 underreplicated ranges` messages.

#38565 is potentially related.

Test logs (internal): https://drive.google.com/file/d/1kQeirJNVxZlUUtgT_SpRYjlfw7U1tw-m/view?usp=sharing
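
For context, the failing tests follow the usual TestCluster pattern: start a multi-node cluster and block on WaitForFullReplication() before running the test body. The following is a minimal sketch of that pattern (not the failing test itself), assuming the standard cockroach testcluster helpers:

```go
package example_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

func TestWaitForFullReplicationSketch(t *testing.T) {
	// Start a three-node cluster with automatic replication.
	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{
		ReplicationMode: base.ReplicationAuto,
	})
	defer tc.Stopper().Stop(context.Background())

	// This is the call the stack traces show the test stuck in: it
	// polls until no range reports being under-replicated, which is
	// what produces the repeated "has 1 underreplicated ranges" log
	// lines mentioned above.
	if err := tc.WaitForFullReplication(); err != nil {
		t.Fatal(err)
	}
}
```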

@thoszhang thoszhang added the A-kv-replication Relating to Raft, consensus, and coordination. label Sep 16, 2019
@ajwerner
Contributor

@irfansharif any chance you could take a look at this? I've also seen

```
make roachprod-stress CLUSTER=ajwerner-stress PKG=./pkg/ccl/partitionccl TESTS=TestRepartitioning
```

fail in a way where we simply never move a replica that needs to move (the SucceedsSoon times out). That may or may not be related. That test is currently quite flaky with fatal errors due to issues fixed in #40751, but when it gets past those it eventually fails and it looks like we're just not doing anything. I verified that the replica that needs to move does indeed get the right zone config.
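
For reference, the part that times out is a testutils.SucceedsSoon retry loop around the placement check. A rough sketch of that pattern, with checkReplicaPlacement as a hypothetical stand-in for the test's real assertion:

```go
package example_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// checkReplicaPlacement is a hypothetical placeholder for the test's real
// assertion about which store the repartitioned range's replica should
// live on.
func checkReplicaPlacement() error {
	// The real test inspects range/replica state; returning nil here just
	// keeps the sketch self-contained.
	return nil
}

func TestRepartitioningSketch(t *testing.T) {
	// SucceedsSoon retries the closure with backoff until it returns nil
	// or its deadline expires. The flake described above is that deadline
	// expiring because the replica is never actually moved.
	testutils.SucceedsSoon(t, func() error {
		return checkReplicaPlacement()
	})
}
```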

@irfansharif irfansharif self-assigned this Sep 16, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 17, 2019
This change demonstrates a totally reliable flake during repartitioning.
My guess is this implies that the intermittent flakes we've observed in cockroachdb#40805
are due to the server's automatic upgrade happening late in the test.

Take this diff and run:
```
make test PKG=./pkg/ccl/partitionccl TESTS=Repartition TESTFLAGS=-v 2>&1 | tee out.$(date +%s)
```

Release Justification: definitely don't release this, it just repros a failure.

Release note: None
@ajwerner
Contributor

Turns out my comment above is totally unrelated; see #40823.

@irfansharif
Contributor

Taking a look at this now.

@irfansharif
Contributor

irfansharif commented Sep 25, 2019

The ~10s you've seen waiting for full replication seems typical, and is what #38565 is tracking.

```
panic: test timed out after 12m0s

goroutine 5157537 [running]:
testing.(*M).startAlarm.func1()
	/usr/local/go/src/testing/testing.go:1334 +0xdf
created by time.goFunc
```

So this was just the test binary's timeout firing; we've since bumped it up in #40838.
