core: last node in 3-node cluster fails to shut down gracefully due to livelock #40834
Thanks, that's 1) a separate issue from before and 2) also apparently a new/recent issue. Investigating...
I have updated the issue description with the analysis. This seems to be a new problem. @andreimatei can you have a quick look and help me triage this? For liveness issues like this I'd usually ask Tobias, and you're next in line. This seems to be a low-priority issue (and not a release blocker) because there's an easy production workaround (simply kill the last process). However the issue does cause tests to flake, unfortunately: even though I could tweak the one test I used for the analysis to issue a kill instead of a graceful quit, I suspect other tests are failing non-deterministically because of this. cc @andy-kimball, maybe you have an opinion on how to prioritize this.
40867: cli/interactive_tests: deflake and accelerate test_multiple_nodes r=knz a=knz

Release justification: deflakes a test and makes CI run faster.

Prior to this patch the test would attempt to shut down the cluster gracefully after asserting that the 3 nodes are properly joined. Unfortunately this wait is running into separate issue #40834 and this makes the test clean-up flaky. Since this unit test is not about quitting a cluster but merely checking that the join is successful, this patch both works around the related issue and accelerates the test by simply killing the nodes.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
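The change itself lives in the Tcl/expect test, but the idea generalizes: once the assertion under test has passed, kill the server processes in teardown instead of asking them to drain. A minimal Go sketch of that pattern (hypothetical helper, placeholder command, not the actual test harness):

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// startNode launches a hypothetical long-running server process; the
// command and its argument are placeholders, not the real test invocation.
func startNode(name string) *exec.Cmd {
	cmd := exec.Command("sleep", "3600") // stand-in for a server process
	if err := cmd.Start(); err != nil {
		log.Fatalf("starting %s: %v", name, err)
	}
	return cmd
}

func main() {
	nodes := []*exec.Cmd{startNode("n1"), startNode("n2"), startNode("n3")}

	// ... the test would assert here that the three nodes have joined ...
	time.Sleep(100 * time.Millisecond)

	// Teardown: instead of asking each node to drain gracefully (which can
	// hang on the last node), kill the processes outright and reap them.
	for i, cmd := range nodes {
		if err := cmd.Process.Kill(); err != nil {
			log.Printf("killing node %d: %v", i+1, err)
		}
		_ = cmd.Wait()
	}
}
```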
This reproduces every time? FWIW, I believe that after waiting a minute the shutdown would proceed (or at least try to) because of this code: Line 967 in b8004ff.
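For readers without the source open: the code referenced there bounds the graceful drain with a hard deadline, so a stuck drain eventually gives way to a hard shutdown. A minimal Go sketch of that pattern (the names, structure, and one-minute value are illustrative, not the actual server code):

```go
package main

import (
	"context"
	"log"
	"time"
)

// drain is a stand-in for the graceful drain phase (lease transfers,
// liveness updates, waiting for clients). In the livelock scenario it
// never finishes on its own, so here it simply blocks until the deadline.
func drain(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	// Bound the graceful phase: if draining has not finished within the
	// deadline, give up and fall through to a hard shutdown anyway.
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	if err := drain(ctx); err != nil {
		log.Printf("graceful drain did not complete (%v); proceeding with hard shutdown", err)
	}
	// ... hard shutdown: close stores, stop the process ...
}
```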
No, but that makes me think of something else: is it possible for an internal client to hold up draining in this way?
I'm not completely sure, but I don't think an internal client can hold up draining like that.
No, it does not.
Well then we must have something else to investigate. There is certainly no external client connected at that point.
It's clear what is happening here since #45149: there's no way for the last node to shut down gracefully, since its liveness record is unavailable at that point. The hard shutdown timeout should take care of this now.
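To spell out the livelock: the node liveness record is stored in an ordinary replicated KV range, so updating it needs a Raft quorum. With the other two nodes already stopped, only one of the three replicas is left, the update can never commit, and a drain that waits on it never makes progress. A toy Go illustration of the quorum arithmetic (not CockroachDB code):

```go
package main

import "fmt"

// canCommitWrite reports whether a write to a range can commit, i.e.
// whether a majority of its replicas is reachable.
func canCommitWrite(totalReplicas, liveReplicas int) bool {
	return liveReplicas >= totalReplicas/2+1
}

func main() {
	// Assume the liveness range is replicated across the 3 nodes.
	const replicas = 3

	for live := 3; live >= 1; live-- {
		fmt.Printf("live replicas: %d -> liveness update can commit: %v\n",
			live, canCommitWrite(replicas, live))
	}
	// With only the last node left (1 of 3 replicas), the liveness update
	// can never commit, so a drain that waits on it livelocks; only the
	// hard shutdown timeout breaks the wait.
}
```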
Found by the CLI test test_multiple_nodes.tcl, here: #1493442.

The scenario is pretty simple: start a 3-node cluster (n1, n2, n3) and gracefully quit the nodes one by one, leaving n1 for last.

At this point the quit process on n1 livelocks and n1 fails to shut down. Log file here: cockroach.log; the relevant log lines from n1 are in that file.