Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure only alive leaders and followers when fetching QQ replica states #12709

Closed
wants to merge 4 commits into from

Conversation

Ayanda-D
Copy link
Contributor

@Ayanda-D Ayanda-D commented Nov 12, 2024

Proposed Changes

This addresses issues reported in #12701 and #10423 where quorum queue leader and follower FSMs have been observed to be down (noproc) while reporting tools such as check_if_node_is_quorum_critical still pass. With this patch, such terminated processes should now be reported by existing CLI tools to help avoid risks of carrying out operations such as node restarts/upgrades, when some QQ members aren't in a good state.

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.

when checking replica states to help avoid missing
inactive replicas e.g. on QQ checks from cli tools
and ensure non-existent/inactive/noproc QQ members are
not reported.
(T) ->
T
(T = {K, _, _}) ->
case rabbit_process:is_registered_process_alive(K) of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whereis/1 will do here, you are already on the node where the process is running.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, i avoided the rabbit_process:is_process_alive/1 which does the extra distribution work. But also noticed we have rabbit_process:is_registered_process_alive/1 which is already doing whereis/1 with is_pid/1 guard check on the local node, which i'd still have to do on the case statement. seems to be doing same thing (if i'm seeing correct)

Copy link
Contributor

@kjnilsson kjnilsson Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case whereis(K) of
    undefined ->
         ...
    Pid -> 
        to_replica_state(T)
end

I'm not sure using a helper function for the above is worth it, just causes read indirection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k cool, definitely easier to read. sorry for the back-&-forth on this small item (tendency has been check/use rabbit helpers 1st 😊 after finding myself rewriting some existing utils in the past 😅 ). Yep, the read indirection not necessary here 👍

true ->
to_replica_state(T);
false ->
[]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about returning nil here. is that expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think the noproc issue, if my understanding this right, is when we have an entry in ra_state of a member which had already terminated, but wasnt cleaned up from ets from FSM terminate callback. Then its still listed here as if it were still active, yet not alive. But if alive=false, we should handle it as if it had been removed from ra_state, so returning nil here. (i'm not sure if we should also be doing a deletion from ra_state here when the fsm proc is not alive)

Copy link
Contributor

@kjnilsson kjnilsson Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all other code paths return a tuple {RegName, Status} so returning [] feels a bit odd, could you either return {K, noproc} or use lists:filtermap to remove the entry all together?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, lists:filtermap sounds good. this was removing/flattening out the [] on the conversion to a map, but no need for that extra step. updated 👍

Copy link
Contributor

@kjnilsson kjnilsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, thanks @Ayanda-D

@michaelklishin
Copy link
Member

#12727

Copy link
Member

@michaelklishin michaelklishin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gmake ct-unit_quorum_queue

fails repeatably, both on Actions and for locally for me, with

Node: rabbit_shard1@localhost
Case: unit_quorum_queue_SUITE:all_replica_states_includes_nonvoters
Reason: {error,
            {{badmatch,{rabbit_shard1@localhost,#{}}},
             [{unit_quorum_queue_SUITE,all_replica_states_includes_nonvoters,
                  1,
                  [{file,"unit_quorum_queue_SUITE.erl"},{line,111}]},
              {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
              {test_server,run_test_case_eval1,6,
                  [{file,"test_server.erl"},{line,1302}]},
              {test_server,run_test_case_eval,9,
                  [{file,"test_server.erl"},{line,1234}]}]}}

Node: rabbit_shard4@localhost
Case: mirrored_supervisor_SUITE:{end_per_group,broker_tests,[]}
Reason: {error,
            {{assertEqual,
                 [{module,rabbit_ct_broker_helpers},
                  {line,1218},
                  {expression,"CrashesCount"},
                  {expected,0},
                  {value,21}]},
             [{rabbit_ct_broker_helpers,find_crashes_in_logs,2,
                  [{file,"rabbit_ct_broker_helpers.erl"},{line,1218}]},
              {rabbit_ct_broker_helpers,stop_rabbitmq_nodes,1,
                  [{file,"rabbit_ct_broker_helpers.erl"},{line,1174}]},
              {rabbit_ct_helpers,run_steps,2,
                  [{file,"rabbit_ct_helpers.erl"},{line,136}]},
              {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
              {test_server,run_test_case_eval1,6,
                  [{file,"test_server.erl"},{line,1390}]},
              {test_server,run_test_case_eval,9,
                  [{file,"test_server.erl"},{line,1234}]}]}}

test queue processes (since we now check/return only alive members
when fetching replica states)
@Ayanda-D
Copy link
Contributor Author

didnt know there were more QQ tests elsewhere :) updated, thanks for the review & QA @kjnilsson @michaelklishin 👍

@michaelklishin
Copy link
Member

Thank you, @Ayanda-D.

@michaelklishin
Copy link
Member

Merged in #12727.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants