
teamcity: failed test: TestSystemZoneConfigs #40980

Closed · cockroach-teamcity opened this issue Sep 23, 2019 · 5 comments · Fixed by #41300

Labels: C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone: 19.2

@cockroach-teamcity (Member) commented Sep 23, 2019

The following tests appear to have failed on master (test): TestSystemZoneConfigs

You may want to check for open issues.

#1501930:

TestSystemZoneConfigs
...sks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190923 16:07:39.668000 768737 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
I190923 16:07:39.668309 768736 util/stop/stopper.go:542  quiescing; tasks left:
3      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190923 16:07:39.668347 768736 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190923 16:07:39.668365 768736 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190923 16:07:39.668384 768736 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
I190923 16:07:39.668621 768734 util/stop/stopper.go:542  quiescing; tasks left:
5      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190923 16:07:39.668652 768733 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-rangefeed-subscriber
I190923 16:07:39.668659 768734 util/stop/stopper.go:542  quiescing; tasks left:
4      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190923 16:07:39.668811 768734 util/stop/stopper.go:542  quiescing; tasks left:
3      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190923 16:07:39.668850 768735 util/stop/stopper.go:542  quiescing; tasks left:
5      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190923 16:07:39.668879 768735 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190923 16:07:39.669000 768734 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190923 16:07:39.669031 768734 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190923 16:07:39.669052 768734 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
I190923 16:07:39.689406 768514 rpc/nodedialer/nodedialer.go:160  [ct-client] unable to connect to n1: failed to connect to n1 at 127.0.0.1:33229: context canceled
I190923 16:07:39.689432 768517 rpc/nodedialer/nodedialer.go:160  [ct-client] unable to connect to n3: failed to connect to n3 at 127.0.0.1:38691: context canceled
I190923 16:07:39.689448 768515 rpc/nodedialer/nodedialer.go:160  [ct-client] unable to connect to n5: failed to connect to n5 at 127.0.0.1:34893: context canceled
I190923 16:07:39.689494 768512 rpc/nodedialer/nodedialer.go:160  [ct-client] unable to connect to n6: failed to connect to n6 at 127.0.0.1:42347: context canceled
I190923 16:07:39.689502 768516 rpc/nodedialer/nodedialer.go:160  [ct-client] unable to connect to n2: failed to connect to n2 at 127.0.0.1:34545: context canceled
I190923 16:07:39.689546 768513 rpc/nodedialer/nodedialer.go:160  [ct-client] unable to connect to n4: failed to connect to n4 at 127.0.0.1:43217: context canceled
    client_replica_test.go:1753: condition failed to evaluate within 45s: not all replicas agree on the range descriptor for r14
        goroutine 453325 [running]:
        runtime/debug.Stack(0xc0020d5470, 0x3f405a0, 0xc01005d770)
        	/usr/local/go/src/runtime/debug/stack.go:24 +0x9d
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon(0x402ef20, 0xc00753db00, 0xc0020d5470)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:37 +0x6b
        github.com/cockroachdb/cockroach/pkg/storage_test.TestSystemZoneConfigs(0xc00753db00)
        	/go/src/github.com/cockroachdb/cockroach/pkg/storage/client_replica_test.go:1753 +0x7dd
        testing.tRunner(0xc00753db00, 0x383be88)
        	/usr/local/go/src/testing/testing.go:865 +0xc0
        created by testing.(*T).Run
        	/usr/local/go/src/testing/testing.go:916 +0x35a




Please assign, take a look and update the issue accordingly.
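
For readers unfamiliar with the failure shape: the "condition failed to evaluate within 45s" message comes from testutils.SucceedsSoon (visible in the stack trace above), which re-runs a closure until it returns nil or a roughly 45-second budget expires. Below is a minimal, self-contained sketch of that polling pattern; the helper and descriptor types here are simplified stand-ins, not the actual test code.

```go
package sketch

import (
	"fmt"
	"reflect"
	"time"
)

// rangeDesc is a stripped-down stand-in for a range descriptor: just the
// store IDs that one replica believes are members of the range.
type rangeDesc struct {
	storeIDs []int
}

// succeedsSoon approximates what testutils.SucceedsSoon does: keep retrying
// the condition until it returns nil or roughly 45s have elapsed.
func succeedsSoon(cond func() error) error {
	deadline := time.Now().Add(45 * time.Second)
	for {
		err := cond()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition failed to evaluate within 45s: %v", err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

// descriptorsAgree is the kind of condition the test polls: every replica of
// the range must report the same descriptor before the check passes.
func descriptorsAgree(descsFromStores func() []rangeDesc) error {
	return succeedsSoon(func() error {
		descs := descsFromStores()
		for i := 1; i < len(descs); i++ {
			if !reflect.DeepEqual(descs[0], descs[i]) {
				return fmt.Errorf("not all replicas agree on the range descriptor")
			}
		}
		return nil
	})
}
```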

cockroach-teamcity added the C-test-failure and O-robot labels Sep 23, 2019
cockroach-teamcity added this to the 19.2 milestone Sep 23, 2019
tbg added a commit to tbg/cockroach that referenced this issue Sep 26, 2019
The failure mode in cockroachdb#40980 didn't give out any actionable information.
It now prints the mismatching descriptors.

Release justification: testing-only improvement.

Release note: None
@tbg (Member) commented Sep 26, 2019

@danhhz could you look into this? I sent PR #41119 to repro. The log I got indicates that we wait for full replication (which takes ~10s in that run) and then I don't see any more replication events, but we're stuck with a learner. It's possible I've missed something; to be honest, I haven't taken the time to really figure out what the test expects, but maybe there's a scenario in which we abandon a learner and whatever would clean it up is neither triggered by the test nor happens organically within 45s.

Note also that I can only repro this on top of the SHA of the above failure (9dd1564), not on today's master, so maybe it has gotten rarer recently. @ajwerner's eager replicaGC work comes to mind; it looks like the test examines all the replicas it finds on disk, even the ones that are just waiting for GC. That seems more likely to be the problem than the other thing I mention above.

tbg assigned danhhz and unassigned tbg Sep 26, 2019

@tbg (Member) commented Sep 26, 2019

Nvm, I did get it on master. Here's the output: https://gist.github.com/tbg/c097f503e3136deab5df51eaf1a346cc

@danhhz (Contributor) commented Sep 26, 2019

From your gist: is it just me or does this look wrong?

I190926 08:20:35.248522 649 storage/replica_command.go:1582  [n1,replicate,s1,r1/1:/{Min-System/NodeL…}] change replicas (add [(n4,s4):7VOTER_INCOMING] remove [(n3,s3):4VOTER_OUTGOING]): existing descriptor r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n2,s2):2, (n7,s7):3, (n3,s3):4, (n6,s6):6, (n4,s4):7LEARNER, next=8, gen=12]
I190926 08:20:35.254539 37495 storage/raft_snapshot_queue.go:125  [n1,raftsnapshot,s1,r1/1:/{Min-System/NodeL…}] skipping snapshot; replica is likely a learner in the process of being added: (n4,s4):7LEARNER
I190926 08:20:35.290871 649 storage/replica_raft.go:291  [n1,s1,r1/1:/{Min-System/NodeL…}] proposing ENTER_JOINT ADD_REPLICA[(n4,s4):7VOTER_INCOMING], REMOVE_REPLICA[(n3,s3):4VOTER_OUTGOING]: after=[(n1,s1):1 (n2,s2):2 (n7,s7):3 (n3,s3):4VOTER_OUTGOING (n6,s6):6 (n4,s4):7VOTER_INCOMING] next=8

@danhhz (Contributor) commented Sep 26, 2019

Oh, never mind. I pulled the trigger on that question too soon. I just haven't looked at these logs since joint configs went in.

craig bot pushed a commit that referenced this issue Sep 27, 2019
41119: storage: improve TestSystemZoneConfigs r=danhhz a=tbg

The failure mode in #40980 didn't give out any actionable information.
It now prints the mismatching descriptors.

With this commit (and at this SHA) we see within a few minutes

```
make stress PKG=./pkg/storage/ TESTS=TestSystemZoneConfigs
--- FAIL: TestSystemZoneConfigs (61.72s)
    client_replica_test.go:1749: condition failed to evaluate within 45s: mismatch between
        r1:/{Min-System/NodeLiveness} [(n2,s2):8, (n7,s7):2, (n5,s5):3, (n6,s6):4, (n3,s3):6, next=9, gen=17]
        r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n7,s7):2, (n5,s5):3, (n6,s6):4, (n3,s3):6, (n4,s4):7LEARNER, next=8, gen=12]
```

Release justification: testing-only improvement.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@ajwerner (Contributor) commented Oct 3, 2019

This test still fails sometimes even with #41119.

It seems to happen when we get stuck with a node holding a removed learner that never hears about its removal. The replica GC queue in this case is very conservative (10 days!) about checking whether we should remove the replica. I'm inclined to add a relatively short timeout after which we check on the status of a learner. Note that learners will never campaign, but they may still get ignored.

PR inbound.
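
As a rough sketch of the direction proposed here (a much shorter inactivity window before the replica GC queue checks on a learner), the snippet below uses hypothetical types and an illustrative learner threshold, not necessarily the values the eventual PR picked.

```go
package sketch

import "time"

// Illustrative thresholds only. The 10-day figure mirrors the conservative
// default mentioned above; the learner threshold is a made-up short value,
// not necessarily what the fix uses.
const (
	voterGCCheckInterval   = 10 * 24 * time.Hour
	learnerGCCheckInterval = time.Minute
)

type replica struct {
	isLearner    bool
	lastActivity time.Time
}

// shouldCheckForGC sketches the queue decision: learners that have been
// quiet for a short while become GC candidates, instead of waiting out the
// long voter threshold.
func shouldCheckForGC(r replica, now time.Time) bool {
	interval := voterGCCheckInterval
	if r.isLearner {
		interval = learnerGCCheckInterval
	}
	return now.Sub(r.lastActivity) > interval
}
```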

ajwerner added a commit to ajwerner/cockroach that referenced this issue Oct 3, 2019
This PR fixes a test flake in TestSystemZoneConfigs:

```
    client_replica_test.go:1753: condition failed to evaluate within 45s: mismatch between
        r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):7, (n7,s7):5, next=8, gen=14]
        r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):4, (n7,s7):5, (n3,s3):6LEARNER, next=7, gen=9]
```

The above flake happens because we set the expectation in the map to a
descriptor which contains a learner that has since been removed.

We shouldn't use a range descriptor that contains learners as the expectation.
To avoid that, we return an error from the SucceedsSoon closure if we come across a
descriptor that contains learners. This behavior unveiled another issue:
we are far too conservative with replica GC for learners. Most of the time,
when learners are removed they hear about their own removal, but if they don't,
we won't consider the replica for removal for 10 days! This commit changes
the replica GC queue behavior to treat learners like candidates.

Fixes cockroachdb#40980.

Release Justification: bug fixes and low-risk updates to new functionality.

Release note: None
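
For the test-side half of the fix described in the commit message above, here is a minimal sketch of rejecting learner-bearing descriptors before recording them as the expectation; the types and names are hypothetical stand-ins for the real test code.

```go
package sketch

import "fmt"

type replicaDesc struct {
	nodeID  int
	learner bool
}

type rangeDesc struct {
	rangeID  int
	replicas []replicaDesc
}

// recordExpectation refuses to store a descriptor that still contains a
// learner. Returning an error from the SucceedsSoon closure simply means
// "retry", so the expectation can only ever be a learner-free descriptor.
func recordExpectation(expected map[int]rangeDesc, desc rangeDesc) error {
	for _, r := range desc.replicas {
		if r.learner {
			return fmt.Errorf("r%d descriptor still contains a learner on n%d; retrying",
				desc.rangeID, r.nodeID)
		}
	}
	expected[desc.rangeID] = desc
	return nil
}
```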
craig bot pushed a commit that referenced this issue Oct 4, 2019
41300: storage: more aggressively replica GC learner replicas r=ajwerner a=ajwerner

This PR fixes a test flake in TestSystemZoneConfigs:

```
    client_replica_test.go:1753: condition failed to evaluate within 45s: mismatch between
        r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):7, (n7,s7):5, next=8, gen=14]
        r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):4, (n7,s7):5, (n3,s3):6LEARNER, next=7, gen=9]
```

The above flake happens because we set the expectation in the map to a
descriptor which contains a learner that has since been removed.

We shouldn't use a range descriptor that contains learners as the expectation.
To avoid that, we return an error from the SucceedsSoon closure if we come across a
descriptor that contains learners. This behavior unveiled another issue:
we are far too conservative with replica GC for learners. Most of the time,
when learners are removed they hear about their own removal, but if they don't,
we won't consider the replica for removal for 10 days! This commit changes
the replica GC queue behavior to treat learners like candidates.

Fixes #40980.

Release Justification: bug fixes and low-risk updates to new functionality.

Release note: None

41308: storage: remove error from Replica.applyTimestampCache() r=ajwerner a=ajwerner

Stumbled upon a function with an error in its return signature
that never returns an error. Better to remove it and the stale
comment that goes with it. The removal of the code paths which
could have returned an error occurred in #33396.

Release justification: Low risk, does not change logic. Could also
hold off.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
craig bot closed this as completed in 2e9ce1c Oct 4, 2019