Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: unskip test_demo_partitioning interactive test #106722

Conversation

aadityasondhi
Copy link
Collaborator

Fixes #96797.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg
Copy link
Member

tbg commented Jul 13, 2023

The test was skipped due to this. You can it locally once, as far as I can tell. What indication is there that it just won't be flaky again?

The failure mode here was a DistSender-level error leaking up to the client. This is likely to happen again, as I don't think we've fixed it. So either the test needs to be able to handle this (if this bubbling up is expected behavior) or the bubbling up has to be prevented. The former seems really annoying to implement, and the latter seems conducive to higher quality anyway, so that seems to be the better option. But it's also a real little piece of work that likely won't be done as a drive-by.

In terms of what this kind of test failure means to users, it means that CRDB is harder to script against. I think someone had a similar conversation in KV the other day - if, e.g., commands to relocate a range don't reliably work, they are less useful.

@aadityasondhi
Copy link
Collaborator Author

@tbg I see, I overlooked that when the test passed. Thanks for pointing that out! I propose the following (let me know if you disagree):

  1. We add additional retries to this test to make sure it passes without this error. (band-aid fix to get this test unskipped)
  2. Maybe we can handle the distsender issue as part of this kvclient: route requests via followers when leaseholder is unreachable #93503. The solution there will be rewriting a lot of the logic that causes this error in the first place.

cc @andrewbaptist

@tbg
Copy link
Member

tbg commented Jul 17, 2023

We add additional retries to this test to make sure it passes without this error. (band-aid fix to get this test unskipped)

SGTM.

Maybe we can handle the distsender issue as part of this #93503. The solution there will be rewriting a lot of the logic that causes this error in the first place.

it seems unrelated enough to not address this problem automatically. Even if it did, it would be doing so accidentally. The contract about which errors are returned from the KV API (and when) is not well defined currently.

#49868 is a related issue. I would suggest filing, against T-kv, an issue that asks for a clear contract implemented via a doc comment somewhere and a set of checks (maybe using #106508) that reliably catch and report unexpected errors. #49868 and this issue should then be linked against it. #105330 is another one.

This isn't solving the problem, obviously, but then we have a canonical reference that collects instances in which we see unexpected errors reaching KV clients.

Copy link
Contributor

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@aadityasondhi aadityasondhi force-pushed the 20230712.partitioning-interactive-test-unskip branch from f82a596 to c175eb7 Compare August 17, 2023 09:25
@aadityasondhi
Copy link
Collaborator Author

bors r=AlexTalks

@craig
Copy link
Contributor

craig bot commented Aug 17, 2023

Build succeeded:

@craig craig bot merged commit af53e9f into cockroachdb:master Aug 17, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestDockerCLI/test_demo_partitioning.tcl failed
4 participants