
Periodically check for unapplied policies on QQs #12412

Conversation

LoisSotoLopez
Contributor

Proposed Changes

As documented in #7863:

If a quorum queue is unavailable when a policy is changed it may never apply the resulting configuration command and thus be out of sync with the matching policy.

This PR adds a function in rabbit_quorum_queue.erl that checks whether the current Ra machine configuration for a queue corresponds to the configuration expected to be in use based on the defined policies. That function is called by each queue process on tick (handle_tick). A simplified sketch of the idea follows.
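For illustration only, the per-tick check could be wired in roughly like this (a heavily simplified sketch, not the exact code in the PR; maybe_apply_policies/2 is the new function discussed later in this thread):

    %% Sketch: handle_tick runs periodic background work for a QQ replica.
    %% The idea is to also verify, on each tick, that the Ra machine
    %% configuration matches the effective policies and repair it if not.
    handle_tick(QName, Overview, _Nodes) ->
        spawn(fun() ->
                      case rabbit_amqqueue:lookup(QName) of
                          {ok, Q} ->
                              %% new check added by this PR (simplified)
                              ok = maybe_apply_policies(Q, Overview);
                          {error, not_found} ->
                              ok
                      end
                      %% ...existing tick work (metrics, membership
                      %% reconciliation, etc.) would continue here
              end),
        ok.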

Types of Changes

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it


@michaelklishin (Member) left a comment

The "maybe log" changes the internal API in ways that are difficult to justify.

deps/rabbit/src/rabbit_quorum_queue.erl (outdated)
end,
Consume([]).

ensure_qq_proc_dead(Config, Server, RaName) ->
Member

If the target process recovers in fewer than 500ms, this function will loop forever.

Contributor

The ra server supervisor has a maximum restart intensity of 2 restarts per 5 seconds (https://github.com/rabbitmq/ra/blob/main/src/ra_server_sup.erl#L36-L37), so the supervisor will give up eventually.
On the other hand, if the process restart takes more than 500 ms, this loop would stop before the process is completely dead, but I think that is highly unlikely for a test queue.
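For reference, a bounded variant of such a helper could look like the following. This is only a sketch: the retry cap, the use of plain rpc:call, and the ct:fail on exhaustion are assumptions, not the suite's actual code.

    %% Hypothetical helper: kill the Ra server process and poll until it is
    %% gone, but give up after a bounded number of attempts so the test can
    %% neither spin forever nor silently mask a restarting process.
    ensure_qq_proc_dead(Config, Server, RaName) ->
        ensure_qq_proc_dead(Config, Server, RaName, 20).

    ensure_qq_proc_dead(_Config, _Server, _RaName, 0) ->
        ct:fail(qq_process_still_alive);
    ensure_qq_proc_dead(Config, Server, RaName, Retries) ->
        case rpc:call(Server, erlang, whereis, [RaName]) of
            undefined ->
                ok;
            Pid when is_pid(Pid) ->
                rpc:call(Server, erlang, exit, [Pid, kill]),
                timer:sleep(500),
                ensure_qq_proc_dead(Config, Server, RaName, Retries - 1)
        end.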

-    rabbit_log:info("~ts: delivery_limit not set, defaulting to ~b",
-                    [rabbit_misc:rs(QName), ?DEFAULT_DELIVERY_LIMIT]),
+    maybe_log(ShouldLog, info,
+              "~ts: delivery_limit not set, defaulting to ~b",
Contributor

Dear Core Team, should this message be logged unconditionally as well? This is not a misconfiguration: if a user is happy with the default value and does not set an explicit delivery-limit, this will be logged all the time for all quorum queues.

Contributor

It's there to make clear that a default is set. We can perhaps lower it to debug in 4.1 or remove it completely, but I think making users aware of this potentially breaking change doesn't hurt.

Member

We can move this to a function that is not called periodically and remove it from this function.

Member

In fact, we don't necessarily have to do it in rabbit_quorum_queue if there's a more suitable alternative where the delivery limit is known.

@LoisSotoLopez
Contributor Author

Is there anything we can do about this PR?

@michaelklishin
Member

@LoisSotoLopez according to git, this PR is over 100 commits behind main. Please rebase it.

LoisSotoLopez and others added 11 commits October 24, 2024 07:23
Instead of checking the values of the current configuration (represented in
`rabbit_quorum_queue:handle_tick` by the `Overview` variable) against the
effective policy, just regenerate the configuration and compare it with the
current one.
(some of this is just reverting to the original format to reduce the
diff against main)
Removes the usage of a ShouldLog parameter on several functions
and limits the logging of the warning about the delivery_limit
not being set to the moment of queue declaration
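Roughly, the regenerate-and-compare approach described in these commits could look like the sketch below. The function names match ones discussed later in the thread (maybe_apply_policies/2, gather_policy_config), but their bodies here are illustrative assumptions, not the exact diff.

    %% Sketch: rebuild the policy-derived part of the rabbit_fifo
    %% configuration and compare it with what the running Ra machine
    %% reports; re-apply only when they differ, so the common case is a no-op.
    maybe_apply_policies(Q, #{config := CurrentConfig}) ->
        ExpectedPolicyConfig = gather_policy_config(Q),
        CurrentPolicyConfig =
            maps:with(maps:keys(ExpectedPolicyConfig), CurrentConfig),
        case ExpectedPolicyConfig =:= CurrentPolicyConfig of
            true ->
                ok;
            false ->
                %% reuse the same code path as an explicit policy change
                %% (simplified; the PR may apply the config differently)
                rabbit_quorum_queue:policy_changed(Q)
        end.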
@LoisSotoLopez force-pushed the periodically-check-for-unaplied-policies-on-qqs branch from 0ccce0b to 9dc9f97 on October 24, 2024 05:28
@michaelklishin (Member) left a comment

To summarize this change: the optional policy re-application happens as part of the periodic QQ replica "tick", also used by the continuous membership reconciliation feature.

In cases where effective policy keys do not change, the new operation is a no-op.

@michaelklishin (Member) left a comment

Both

cd deps/rabbit
gmake ct-quorum_queue

and

bazel test //deps/rabbit:quorum_queue_SUITE

fail. Specifically, one policy-based test does, with Make on Actions and locally, with Khepri and with Mnesia; in other words, reliably:

  • quorum_queue_SUITE > single_node > queue_ttl
  • quorum_queue_SUITE > clustered > cluster_size_3 > queue_ttl

Since maybe_apply_policies/2 is the only new function that
seems relevant, I suggest adding some debug logging there and comparing what it logs vs. main.

@michaelklishin
Member

Some details:

=== Ended at 2024-10-24 21:06:50
=== Location: [{quorum_queue_SUITE,'-queue_ttl/1-AwaitMatchFilter/1-0-',3776},
              {quorum_queue_SUITE,queue_ttl,3765},
              {test_server,ts_tc,1794},
              {test_server,run_test_case_eval1,1303},
              {test_server,run_test_case_eval,1235}]
=== === Reason: {awaitMatch,
                     [{module,quorum_queue_SUITE},
                      {line,3776},
                      {expression,
                          "catch amqp_channel : call ( Ch , # 'queue.declare' { queue = QQ , passive = true , durable = true , auto_delete = false , arguments = QArgs } )"},
                      {pattern,
                          "{ 'EXIT' , { { shutdown , { server_initiated_close , 404 , << \"NOT_FOUND - no queue 'queue_ttl' in vhost '/'\" >> } } , _ } }"},
                      {value,{'queue.declare_ok',<<"queue_ttl">>,0,0}}]}
  in function  quorum_queue_SUITE:'-queue_ttl/1-AwaitMatchFilter/1-0-'/3 (quorum_queue_SUITE.erl, line 3776)
  in call from quorum_queue_SUITE:queue_ttl/1 (quorum_queue_SUITE.erl, line 3765)

While this could have been a flake, it is too persistent to conclude that.

@gomoripeti
Contributor

Indeed, this also fails for me locally with the same error, thanks for the details. We will look at it on Monday.

@michaelklishin
Member

@gomoripeti were you able to collect some more data from the tests/test suite in question?

2577b7e does not explain why the keys were removed.

@gomoripeti
Contributor

The new gather_policy_config sub-function's purpose is to gather the config keys that can be changed by a policy. The problem was that the keys single_active_consumer_on and created are not among those; they are set later in ra_machine_config, so they were accidentally set twice. As a result, the config was always reapplied even though the real policy config did not change (but created changed at every call). Lois knows the details of why the test case failed, but as far as I understand there was a race condition because of this and the TTL setting was overridden or ignored.
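In other words, the fix keeps machine-managed keys out of the comparison. Schematically (the key names come from the comment above, but the real code may build the map differently):

    %% Sketch: gather only the keys a policy can actually change. Keys that
    %% ra_machine_config/1 sets on its own (e.g. created and
    %% single_active_consumer_on) are excluded, otherwise the comparison
    %% differs on every tick and the config is re-applied needlessly.
    gather_policy_config(Q) ->
        maps:without([created, single_active_consumer_on],
                     ra_machine_config(Q)).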

@michaelklishin
Member

In other words, this is ready for another round 👍

@michaelklishin merged commit 2577b7e into rabbitmq:main on Nov 4, 2024
265 of 269 checks passed
@michaelklishin
Member

@gomoripeti @LoisSotoLopez I'm afraid we are still not done here. #12640 and #12641 pass all tests on their PR branches and in v4.0.x, but on main the new test repeatedly fails with:

quorum_queue_SUITE > clustered > cluster_size_3 > policy_repair
    #1. {error,
         {{awaitMatch,
           [{module,quorum_queue_SUITE},
            {line,1446},
            {expression,
             "rpc : call ( Server0 , ra , local_query , [ RaName , QueryFun ] )"},
            {pattern,
             "{ ok , { _ , # { config := # { max_length := ExpectedMaxLength3 } } } , _ }"},
            {value,
             {ok,
              {{77,4},
               #{type => rabbit_fifo,
                 config =>
                  #{name => '%2F_policy_repair',
                    resource => {resource,<<"/">>,queue,<<"policy_repair">>},
                    max_length => 20,msg_ttl => undefined,
                    max_bytes => undefined,delivery_limit => 20,
                    dead_letter_handler => undefined,
                    overflow_strategy => reject_publish,expires => undefined,
                    consumer_strategy => competing,
                    dead_lettering_enabled => false},
                 release_cursors => [],num_checked_out => 32,
                 checkout_message_bytes => 96,enqueue_message_bytes => 0,
                 in_memory_message_bytes => 0,num_active_consumers => 32,
                 num_consumers => 32,num_enqueuers => 1,
                 num_in_memory_ready_messages => 0,num_messages => 32,
                 num_ready_messages => 0,num_ready_messages_high => 0,
                 num_ready_messages_normal => 0,
                 num_ready_messages_return => 0,num_release_cursors => 0,
                 release_cursor_enqueue_counter => 32,
                 smallest_raft_index => 5,discard_checkout_message_bytes => 0,
                 discard_message_bytes => 0,num_discard_checked_out => 0,
                 num_discarded => 0,
                 single_active_consumer_id => {<<"1">>,<19604.8448.0>},
                 single_active_consumer_key => {<<"1">>,<19604.8448.0>},
                 single_active_num_waiting_consumers => 0}},
              {'%2F_policy_repair',
               'rmq-ct-cluster_size_3-2-21192@localhost'}}}]},
          [{quorum_queue_SUITE,'-policy_repair/1-AwaitMatchFilter/1-2-',3,
            [{file,"quorum_queue_SUITE.erl"},{line,1446}]},
           {quorum_queue_SUITE,policy_repair,1,
            [{file,"quorum_queue_SUITE.erl"},{line,1444}]},
           {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
           {test_server,run_test_case_eval1,6,
            [{file,"test_server.erl"},{line,1302}]},
           {test_server,run_test_case_eval,9,
            [{file,"test_server.erl"},{line,1234}]}]}}

We won't revert the PR but please take a look at this failure (with Make) in main. I cannot reproduce it locally to provide anything beyond the above stack trace.

@gomoripeti
Contributor

Do I see correctly that the policy_repair test case only fails in mixed-cluster runs?
What version of RabbitMQ is used for the "old" version in those cases? (is it v4.0.2? https://github.com/rabbitmq/rabbitmq-server/blob/main/bazel/bzlmod/secondary_umbrella.bzl#L34)
Is it possible that sometimes the QQ leader is on an "old" version node, and the code to repair policy is simply not present in that version?

@michaelklishin
Member

@gomoripeti it turned out to be a flake, it seems. The old version is the latest patch of the previous series, so, 3.13.7.

@michaelklishin
Member

Is it possible that sometimes the QQ leader is on an "old" version node, and the code to
repair policy is simply not present in that version?

This is a very good hypothesis. In that case the test should be skipped for mixed-version clusters, because it won't be deterministic by definition. Like here, for example; see the sketch below.
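For reference, such a skip guard typically looks like this in these suites (a sketch assuming the rabbit_ct_helpers:is_mixed_versions/0 helper; the actual change adopted later may differ, and run_policy_repair_test/1 is a hypothetical name standing in for the real test body):

    %% Sketch: bail out early when running against a mixed-version cluster,
    %% since older feeder nodes may not contain the policy repair code.
    policy_repair(Config) ->
        case rabbit_ct_helpers:is_mixed_versions() of
            true ->
                {skip, "policy repair is not supported by older nodes"};
            false ->
                run_policy_repair_test(Config)  %% hypothetical helper
        end.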

@LoisSotoLopez
Contributor Author

... test should be skipped for mixed version clusters because it won't be deterministic by definition

@michaelklishin Maybe you'd prefer to add that change yourself. In case you don't, I opened a PR for it here: #12665. Feel free to close that one if contributing it yourself eases the process 👍

@michaelklishin
Member

@LoisSotoLopez your changes seem fine; I have submitted #12666.
