
Periodically check for unapplied policies on QQs #12412

Conversation

LoisSotoLopez
Contributor

Proposed Changes

As documented in #7863:

If a quorum queue is unavailable when a policy is changed it may never apply the resulting configuration command and thus be out of sync with the matching policy.

This PR adds a function in rabbit_quorum_queue.erl that checks whether the current Ra machine configuration for a queue corresponds to the configuration expected to be in use based on the defined policies. That function is called by each queue process on tick (handle_tick). A simplified sketch of the idea follows.
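For illustration only, the per-tick check could be wired in roughly like this (a heavily simplified sketch, not the exact code in the PR; maybe_apply_policies/2 is the new function discussed later in this thread):

    %% Sketch: handle_tick runs periodic background work for a QQ replica.
    %% The idea is to also verify, on each tick, that the Ra machine
    %% configuration matches the effective policies and repair it if not.
    handle_tick(QName, Overview, _Nodes) ->
        spawn(fun() ->
                      case rabbit_amqqueue:lookup(QName) of
                          {ok, Q} ->
                              %% new check added by this PR (simplified)
                              ok = maybe_apply_policies(Q, Overview);
                          {error, not_found} ->
                              ok
                      end
                      %% ...existing tick work (metrics, membership
                      %% reconciliation, etc.) would continue here
              end),
        ok.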

Types of Changes

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it


@michaelklishin (Member) left a comment

The "maybe log" changes the internal API in ways that are difficult to justify.

deps/rabbit/src/rabbit_quorum_queue.erl (outdated)
end,
Consume([]).

ensure_qq_proc_dead(Config, Server, RaName) ->
Member

If the target process recovers in fewer than 500ms, this function will loop forever.

Contributor

The ra server supervisor has a maximum restart intensity of 2 restarts per 5 seconds (https://github.com/rabbitmq/ra/blob/main/src/ra_server_sup.erl#L36-L37), so the supervisor will give up eventually.
On the other hand, if the process restart takes more than 500 ms, this loop would stop before the process is completely dead, but I think that is highly unlikely for a test queue.
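For reference, a bounded variant of such a helper could look like the following. This is only a sketch: the retry cap, the use of plain rpc:call, and the ct:fail on exhaustion are assumptions, not the suite's actual code.

    %% Hypothetical helper: kill the Ra server process and poll until it is
    %% gone, but give up after a bounded number of attempts so the test can
    %% neither spin forever nor silently mask a restarting process.
    ensure_qq_proc_dead(Config, Server, RaName) ->
        ensure_qq_proc_dead(Config, Server, RaName, 20).

    ensure_qq_proc_dead(_Config, _Server, _RaName, 0) ->
        ct:fail(qq_process_still_alive);
    ensure_qq_proc_dead(Config, Server, RaName, Retries) ->
        case rpc:call(Server, erlang, whereis, [RaName]) of
            undefined ->
                ok;
            Pid when is_pid(Pid) ->
                rpc:call(Server, erlang, exit, [Pid, kill]),
                timer:sleep(500),
                ensure_qq_proc_dead(Config, Server, RaName, Retries - 1)
        end.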

-    rabbit_log:info("~ts: delivery_limit not set, defaulting to ~b",
-                    [rabbit_misc:rs(QName), ?DEFAULT_DELIVERY_LIMIT]),
+    maybe_log(ShouldLog, info,
+              "~ts: delivery_limit not set, defaulting to ~b",
Contributor

Dear Core Team, should this message be logged unconditionally as well? This is not a misconfiguration: if a user is happy with the default value and does not set an explicit delivery-limit, this will be logged all the time for all quorum queues.

Contributor

It's there to make clear that a default is set. We can perhaps lower it to debug in 4.1 or remove it completely, but I think making users aware of this potentially breaking change doesn't hurt.

Member

We can move this to a function that is not called periodically and remove it from this function.

Member

In fact, we don't necessarily have to do it in rabbit_quorum_queue if there's a more suitable alternative where the delivery limit is known.

@LoisSotoLopez
Contributor Author

Is there anything we can do about this PR?

@michaelklishin
Member

@LoisSotoLopez according to git, this PR is over 100 commits behind main. Please rebase it.

LoisSotoLopez and others added 11 commits October 24, 2024 07:23
Instead of checking the values of the current configuration (represented in
`rabbit_quorum_queue:handle_tick` by the `Overview` variable) against the
effective policy, just regenerate the configuration and compare it with the
current one.
(some of this is just reverting to the original format to reduce the
diff against main)
Removes the usage of a ShouldLog parameter on several functions
and limits the logging of the warning about the delivery_limit
not being set to the moment of queue declaration
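Roughly, the regenerate-and-compare approach described in these commits could look like the sketch below. The function names match ones discussed later in the thread (maybe_apply_policies/2, gather_policy_config), but their bodies here are illustrative assumptions, not the exact diff.

    %% Sketch: rebuild the policy-derived part of the rabbit_fifo
    %% configuration and compare it with what the running Ra machine
    %% reports; re-apply only when they differ, so the common case is a no-op.
    maybe_apply_policies(Q, #{config := CurrentConfig}) ->
        ExpectedPolicyConfig = gather_policy_config(Q),
        CurrentPolicyConfig =
            maps:with(maps:keys(ExpectedPolicyConfig), CurrentConfig),
        case ExpectedPolicyConfig =:= CurrentPolicyConfig of
            true ->
                ok;
            false ->
                %% reuse the same code path as an explicit policy change
                %% (simplified; the PR may apply the config differently)
                rabbit_quorum_queue:policy_changed(Q)
        end.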
@LoisSotoLopez force-pushed the periodically-check-for-unaplied-policies-on-qqs branch from 0ccce0b to 9dc9f97 on October 24, 2024 05:28
@michaelklishin (Member) left a comment

To summarize this change: the optional policy re-application happens as part of the periodic QQ replica "tick", also used by the continuous membership reconciliation feature.

In cases where effective policy keys do not change, the new operation is a no-op.

@michaelklishin (Member) left a comment

Both

cd deps/rabbit
gmake ct-quorum_queue

and

bazel test //deps/rabbit:quorum_queue_SUITE

fail. Specifically, one policy-based test does, with Make on Actions and locally, with Khepri and with Mnesia; in other words, reliably:

  • quorum_queue_SUITE > single_node > queue_ttl
  • quorum_queue_SUITE > clustered > cluster_size_3 > queue_ttl

Since maybe_apply_policies/2 is the only new function that
seems relevant, I suggest adding some debug logging there and comparing what it logs vs. main.

@michaelklishin
Member

Some details:

=== Ended at 2024-10-24 21:06:50
=== Location: [{quorum_queue_SUITE,'-queue_ttl/1-AwaitMatchFilter/1-0-',3776},
              {quorum_queue_SUITE,queue_ttl,3765},
              {test_server,ts_tc,1794},
              {test_server,run_test_case_eval1,1303},
              {test_server,run_test_case_eval,1235}]
=== === Reason: {awaitMatch,
                     [{module,quorum_queue_SUITE},
                      {line,3776},
                      {expression,
                          "catch amqp_channel : call ( Ch , # 'queue.declare' { queue = QQ , passive = true , durable = true , auto_delete = false , arguments = QArgs } )"},
                      {pattern,
                          "{ 'EXIT' , { { shutdown , { server_initiated_close , 404 , << \"NOT_FOUND - no queue 'queue_ttl' in vhost '/'\" >> } } , _ } }"},
                      {value,{'queue.declare_ok',<<"queue_ttl">>,0,0}}]}
  in function  quorum_queue_SUITE:'-queue_ttl/1-AwaitMatchFilter/1-0-'/3 (quorum_queue_SUITE.erl, line 3776)
  in call from quorum_queue_SUITE:queue_ttl/1 (quorum_queue_SUITE.erl, line 3765)

While this could have been a flake, it is too persistent to conclude that.

@gomoripeti
Contributor

Indeed, this also fails for me locally with the same error, thanks for the details. We will look at it on Monday.

@michaelklishin
Member

@gomoripeti were you able to collect some more data from the tests/test suite in question?

2577b7e does not explain why the keys were removed.

@gomoripeti
Contributor

The new gather_policy_config sub-function's purpose is to gather the config keys that can be changed by a policy. The problem was that the keys single_active_consumer_on and created are not among those; they are set later in ra_machine_config, so they were accidentally set twice. As a result, the config was always reapplied even though the real policy config did not change (but created changed at every call). Lois knows the details of why the test case failed, but as far as I understand there was a race condition because of this and the TTL setting was overridden or ignored.
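In other words, the fix keeps machine-managed keys out of the comparison. Schematically (the key names come from the comment above, but the real code may build the map differently):

    %% Sketch: gather only the keys a policy can actually change. Keys that
    %% ra_machine_config/1 sets on its own (e.g. created and
    %% single_active_consumer_on) are excluded, otherwise the comparison
    %% differs on every tick and the config is re-applied needlessly.
    gather_policy_config(Q) ->
        maps:without([created, single_active_consumer_on],
                     ra_machine_config(Q)).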

@michaelklishin
Member

In other words, this is ready for another round 👍

@michaelklishin merged commit 2577b7e into rabbitmq:main on Nov 4, 2024
265 of 269 checks passed
@michaelklishin
Member

@gomoripeti @LoisSotoLopez I'm afraid we are still not done here. #12640 and #12641 pass all tests on their PR branches and in v4.0.x, but on main the new test repeatedly fails with:

quorum_queue_SUITE > clustered > cluster_size_3 > policy_repair
    #1. {error,
         {{awaitMatch,
           [{module,quorum_queue_SUITE},
            {line,1446},
            {expression,
             "rpc : call ( Server0 , ra , local_query , [ RaName , QueryFun ] )"},
            {pattern,
             "{ ok , { _ , # { config := # { max_length := ExpectedMaxLength3 } } } , _ }"},
            {value,
             {ok,
              {{77,4},
               #{type => rabbit_fifo,
                 config =>
                  #{name => '%2F_policy_repair',
                    resource => {resource,<<"/">>,queue,<<"policy_repair">>},
                    max_length => 20,msg_ttl => undefined,
                    max_bytes => undefined,delivery_limit => 20,
                    dead_letter_handler => undefined,
                    overflow_strategy => reject_publish,expires => undefined,
                    consumer_strategy => competing,
                    dead_lettering_enabled => false},
                 release_cursors => [],num_checked_out => 32,
                 checkout_message_bytes => 96,enqueue_message_bytes => 0,
                 in_memory_message_bytes => 0,num_active_consumers => 32,
                 num_consumers => 32,num_enqueuers => 1,
                 num_in_memory_ready_messages => 0,num_messages => 32,
                 num_ready_messages => 0,num_ready_messages_high => 0,
                 num_ready_messages_normal => 0,
                 num_ready_messages_return => 0,num_release_cursors => 0,
                 release_cursor_enqueue_counter => 32,
                 smallest_raft_index => 5,discard_checkout_message_bytes => 0,
                 discard_message_bytes => 0,num_discard_checked_out => 0,
                 num_discarded => 0,
                 single_active_consumer_id => {<<"1">>,<19604.8448.0>},
                 single_active_consumer_key => {<<"1">>,<19604.8448.0>},
                 single_active_num_waiting_consumers => 0}},
              {'%2F_policy_repair',
               'rmq-ct-cluster_size_3-2-21192@localhost'}}}]},
          [{quorum_queue_SUITE,'-policy_repair/1-AwaitMatchFilter/1-2-',3,
            [{file,"quorum_queue_SUITE.erl"},{line,1446}]},
           {quorum_queue_SUITE,policy_repair,1,
            [{file,"quorum_queue_SUITE.erl"},{line,1444}]},
           {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
           {test_server,run_test_case_eval1,6,
            [{file,"test_server.erl"},{line,1302}]},
           {test_server,run_test_case_eval,9,
            [{file,"test_server.erl"},{line,1234}]}]}}

We won't revert the PR but please take a look at this failure (with Make) in main. I cannot reproduce it locally to provide anything beyond the above stack trace.

@gomoripeti
Contributor

Do I see correctly that the policy_repair test case only fails in mixed-cluster runs?
What version of RabbitMQ is used for the "old" version in those cases? (is it v4.0.2? https://github.com/rabbitmq/rabbitmq-server/blob/main/bazel/bzlmod/secondary_umbrella.bzl#L34)
Is it possible that sometimes the QQ leader is on an "old" version node, and the code to repair policy is simply not present in that version?

@michaelklishin
Member

@gomoripeti it turned out to be a flake, it seems. The old version is the latest patch of the previous series, so, 3.13.7.

@michaelklishin
Member

Is it possible that sometimes the QQ leader is on an "old" version node, and the code to
repair policy is simply not present in that version?

This is a very good hypothesis. In that case the test should be skipped for mixed-version clusters, because it won't be deterministic by definition. Like here, for example; see the sketch below.
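For reference, such a skip guard typically looks like this in these suites (a sketch assuming the rabbit_ct_helpers:is_mixed_versions/0 helper; the actual change adopted later may differ, and run_policy_repair_test/1 is a hypothetical name standing in for the real test body):

    %% Sketch: bail out early when running against a mixed-version cluster,
    %% since older feeder nodes may not contain the policy repair code.
    policy_repair(Config) ->
        case rabbit_ct_helpers:is_mixed_versions() of
            true ->
                {skip, "policy repair is not supported by older nodes"};
            false ->
                run_policy_repair_test(Config)  %% hypothetical helper
        end.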

@LoisSotoLopez
Contributor Author

... test should be skipped for mixed version clusters because it won't be deterministic by definition

@michaelklishin Maybe you'd prefer to add that change yourself. In case you don't, I opened a PR for it here: #12665. Feel free to close that one if contributing it yourself eases the process 👍

@michaelklishin
Member

@LoisSotoLopez your changes seem fine; I have submitted #12666.
