-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure only alive leaders and followers when fetching QQ replica states #12709
Conversation
when checking replica states to help avoid missing inactive replicas e.g. on QQ checks from cli tools
and ensure non-existent/inactive/noproc QQ members are not reported.
(T) -> | ||
T | ||
(T = {K, _, _}) -> | ||
case rabbit_process:is_registered_process_alive(K) of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whereis/1
will do here, you are already on the node where the process is running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, i avoided the rabbit_process:is_process_alive/1
which does the extra distribution work. But also noticed we have rabbit_process:is_registered_process_alive/1
which is already doing whereis/1
with is_pid/1
guard check on the local node, which i'd still have to do on the case statement. seems to be doing same thing (if i'm seeing correct)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case whereis(K) of
undefined ->
...
Pid ->
to_replica_state(T)
end
I'm not sure using a helper function for the above is worth it, just causes read indirection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k cool, definitely easier to read. sorry for the back-&-forth on this small item (tendency has been check/use rabbit helpers 1st 😊 after finding myself rewriting some existing utils in the past 😅 ). Yep, the read indirection not necessary here 👍
true -> | ||
to_replica_state(T); | ||
false -> | ||
[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about returning nil here. is that expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I think the noproc
issue, if my understanding this right, is when we have an entry in ra_state
of a member which had already terminated, but wasnt cleaned up from ets from FSM terminate callback. Then its still listed here as if it were still active, yet not alive. But if alive=false, we should handle it as if it had been removed from ra_state
, so returning nil here. (i'm not sure if we should also be doing a deletion from ra_state
here when the fsm proc is not alive)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all other code paths return a tuple {RegName, Status}
so returning []
feels a bit odd, could you either return {K, noproc}
or use lists:filtermap to remove the entry all together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, lists:filtermap
sounds good. this was removing/flattening out the []
on the conversion to a map
, but no need for that extra step. updated 👍
rabbit_quorum_queue:all_replica_states/0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, thanks @Ayanda-D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gmake ct-unit_quorum_queue
fails repeatably, both on Actions and for locally for me, with
Node: rabbit_shard1@localhost
Case: unit_quorum_queue_SUITE:all_replica_states_includes_nonvoters
Reason: {error,
{{badmatch,{rabbit_shard1@localhost,#{}}},
[{unit_quorum_queue_SUITE,all_replica_states_includes_nonvoters,
1,
[{file,"unit_quorum_queue_SUITE.erl"},{line,111}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1302}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1234}]}]}}
Node: rabbit_shard4@localhost
Case: mirrored_supervisor_SUITE:{end_per_group,broker_tests,[]}
Reason: {error,
{{assertEqual,
[{module,rabbit_ct_broker_helpers},
{line,1218},
{expression,"CrashesCount"},
{expected,0},
{value,21}]},
[{rabbit_ct_broker_helpers,find_crashes_in_logs,2,
[{file,"rabbit_ct_broker_helpers.erl"},{line,1218}]},
{rabbit_ct_broker_helpers,stop_rabbitmq_nodes,1,
[{file,"rabbit_ct_broker_helpers.erl"},{line,1174}]},
{rabbit_ct_helpers,run_steps,2,
[{file,"rabbit_ct_helpers.erl"},{line,136}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1390}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1234}]}]}}
test queue processes (since we now check/return only alive members when fetching replica states)
didnt know there were more QQ tests elsewhere :) updated, thanks for the review & QA @kjnilsson @michaelklishin 👍 |
Thank you, @Ayanda-D. |
Merged in #12727. |
Proposed Changes
This addresses issues reported in #12701 and #10423 where quorum queue leader and follower FSMs have been observed to be down (
noproc
) while reporting tools such ascheck_if_node_is_quorum_critical
still pass. With this patch, such terminated processes should now be reported by existing CLI tools to help avoid risks of carrying out operations such as node restarts/upgrades, when some QQ members aren't in a good state.Types of Changes
What types of changes does your code introduce to this project?
Put an
x
in the boxes that applyChecklist
Put an
x
in the boxes that apply.You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.
CONTRIBUTING.md
documentFurther Comments
If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.