Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: replicagc-changed-peers/restart=false failed #98966

Closed
cockroach-teamcity opened this issue Mar 19, 2023 · 8 comments · Fixed by #99020
Closed

roachtest: replicagc-changed-peers/restart=false failed #98966

cockroach-teamcity opened this issue Mar 19, 2023 · 8 comments · Fixed by #99020
Assignees
Labels
A-testing Testing tools and infrastructure branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Mar 19, 2023

roachtest.replicagc-changed-peers/restart=false failed with artifacts on release-22.2 @ 73df0dc428ff1e516c5b11a7ca1930e3339da118:

test artifacts and logs in: /artifacts/replicagc-changed-peers/restart=false/run_1
(replicagc.go:286).waitForZeroReplicasOnN3: ranges remained on n3 (according to meta2): map[19:{3,5,6} 31:{3,5,6} 48:{3,5,6} 50:{3,4,6} 77:{3,4,6} 83:{3,5,6} 89:{3,5,6} 97:{3,4,6} 121:{3,4,6}]

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/replication

This test on roachdash | Improve this report!

Jira issue: CRDB-25629

@cockroach-teamcity cockroach-teamcity added branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv-replication labels Mar 19, 2023
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Mar 19, 2023
@aliher1911 aliher1911 removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 20, 2023
@aliher1911
Copy link
Contributor

Digging the logs, I can see that remaining replicas listed in failure message are being removed, but removals are happening longer than the test waits. Removal happens past the retry duration.
The question is, should we just bump timeout then?

@erikgrinaker
Copy link
Contributor

Seems fine.

@aliher1911 aliher1911 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels Mar 20, 2023
@aliher1911 aliher1911 self-assigned this Mar 20, 2023
@aliher1911
Copy link
Contributor

I rerun that locally and i think the problem is not that we don't give it enough time (5 min), but moving replicas off node only starts after 5 minutes. So if we are lucky (because of exponential backoff), then once 5 min hits, we quickly remove everything and test succeeds, but if attempt happens at 5 min mark then we fail. I'll rerun with verbose allocator logs to see why it isn't picking up replicas.

@aliher1911
Copy link
Contributor

So 5 minutes is server.time_until_store_dead after which we treat removed node as one that lost its replicas and allocator tries to move node out. I think it should be picking decommissioning/decommissioned status of the node straight away instead of waiting 5 minutes. Otherwise 5 minute wait that we have in test doesn't make sense.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Mar 21, 2023

How about reducing server.time_until_store_dead ? But yeah, would be better if it picked up on the decommissioned state immediately.

@aliher1911
Copy link
Contributor

server.time_until_store_dead is an option to speed up the test. won't fix the issue in allocator though. I'll update the test.

@erikgrinaker
Copy link
Contributor

Write up an issue?

@aliher1911
Copy link
Contributor

Forgot to reference it here. But here it is #99064

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants