teamcity: failed test: TestSystemZoneConfigs #40980
The failure mode in cockroachdb#40980 didn't give any actionable information. The test now prints the mismatching descriptors.

Release justification: testing-only improvement.
Release note: None
@danhhz could you look into this? I sent PR #41119 to repro. The log I got indicates that we wait for full replication (takes ~10s in that run) and then I don't see any more replication events, but we're stuck with a learner. It's possible I've missed something; to be honest, I haven't taken the time to really figure out what the test expects, but maybe there's a scenario in which we abandon a learner and whatever would clean it up is neither triggered by the test nor happens organically within 45s. Note also that I can only repro this on top of the SHA of the above failure (9dd1564), not on today's master, so maybe it has gotten rarer recently. @ajwerner's eager replicaGC work comes to mind; it looks like the test inspects all the replicas it finds on disk, even ones that are just waiting for GC. That seems more likely to be the problem than the other thing I mention above.
Nvm, I did get it on master. Here's the output: https://gist.github.com/tbg/c097f503e3136deab5df51eaf1a346cc
From your gist: is it just me or does this look wrong?
Oh, nevermind. Pulled the trigger on that question too soon. I just haven't looked at these logs since joint configs went in.
41119: storage: improve TestSystemZoneConfigs r=danhhz a=tbg

The failure mode in #40980 didn't give any actionable information. It now prints the mismatching descriptors. With this commit (and at this SHA) we see within a few minutes:

```
make stress PKG=./pkg/storage/ TESTS=TestSystemZoneConfigs
--- FAIL: TestSystemZoneConfigs (61.72s)
    client_replica_test.go:1749: condition failed to evaluate within 45s: mismatch between
        r1:/{Min-System/NodeLiveness} [(n2,s2):8, (n7,s7):2, (n5,s5):3, (n6,s6):4, (n3,s3):6, next=9, gen=17]
        r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n7,s7):2, (n5,s5):3, (n6,s6):4, (n3,s3):6, (n4,s4):7LEARNER, next=8, gen=12]
```

Release justification: testing-only improvement.
Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
This test still fails sometimes even with #41119. It seems to happen when we get stuck with a node holding a removed learner that never hears about its removal. The replica GC queue is very conservative in this case (10 days!) about checking whether we should remove the replica. I'm inclined to add a relatively short timeout after which we check on the status of a learner. Note that learners will never campaign but may still get ignored. PR inbound.
This PR fixes a test flake in TestSystemZoneConfig:

```
client_replica_test.go:1753: condition failed to evaluate within 45s: mismatch between
    r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):7, (n7,s7):5, next=8, gen=14]
    r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):4, (n7,s7):5, (n3,s3):6LEARNER, next=7, gen=9]
```

The above flake happens because we set the expectation in the map to a descriptor which contains a learner that has since been removed. We shouldn't use a range descriptor which contains learners as the expectation. To avoid that, we return an error in the SucceedsSoon loop if we come across a descriptor which contains learners.

This behavior unveiled another issue: we are way too conservative with replica GC for learners. Most of the time, when learners are removed they hear about their own removal, but if they don't, we won't consider the Replica for removal for 10 days! This commit changes the replica GC queue behavior to treat learners like candidates.

Fixes cockroachdb#40980.

Release Justification: bug fixes and low-risk updates to new functionality.
Release note: None
41300: storage: more aggressively replica GC learner replicas r=ajwerner a=ajwerner

This PR fixes a test flake in TestSystemZoneConfig:

```
client_replica_test.go:1753: condition failed to evaluate within 45s: mismatch between
    r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):7, (n7,s7):5, next=8, gen=14]
    r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):4, (n7,s7):5, (n3,s3):6LEARNER, next=7, gen=9]
```

The above flake happens because we set the expectation in the map to a descriptor which contains a learner that has since been removed. We shouldn't use a range descriptor which contains learners as the expectation. To avoid that, we return an error in the SucceedsSoon loop if we come across a descriptor which contains learners.

This behavior unveiled another issue: we are way too conservative with replica GC for learners. Most of the time, when learners are removed they hear about their own removal, but if they don't, we won't consider the Replica for removal for 10 days! This commit changes the replica GC queue behavior to treat learners like candidates.

Fixes #40980.

Release Justification: bug fixes and low-risk updates to new functionality.
Release note: None

41308: storage: remove error from Replica.applyTimestampCache() r=ajwerner a=ajwerner

Stumbled upon a function with an error in its return signature that never returns an error. Better to remove it and the stale comment that goes with it. The removal of the code paths which could have returned an error occurred in #33396.

Release justification: Low risk, does not change logic. Could also hold off.
Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
The following tests appear to have failed on master (test): TestSystemZoneConfigs
You may want to check for open issues.
#1501930:
Please assign, take a look and update the issue accordingly.