Periodically check for unapplied policies on QQs #12412
The "maybe log" changes the internal API in ways that are difficult to justify.
end,
Consume([]).

ensure_qq_proc_dead(Config, Server, RaName) ->
If the target process recovers in fewer than 500ms, this function will loop forever.
The ra supervisor has a max restart intensity of 2 restarts per 5 seconds (https://github.com/rabbitmq/ra/blob/main/src/ra_server_sup.erl#L36-L37), so the supervisor will give up eventually.
On the other hand, if the process restart takes more than 500ms, this loop would stop before the process is completely dead. But I think this is highly unlikely for a test queue.
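One way to sidestep both concerns is to bound the number of polls, so the helper cannot loop forever even if the process keeps coming back. The following is only a minimal sketch, not the code in this PR: the module name `wait_dead_sketch` and the function `wait_until_dead/3` are hypothetical, and it checks a locally registered name with `whereis/1`, whereas the real test helper checks a process on a remote node.

```erlang
-module(wait_dead_sketch).
-export([wait_until_dead/3]).

%% Hypothetical helper, for illustration only.
%% Polls every IntervalMs milliseconds until nothing is registered
%% under Name. Gives up after Retries additional polls instead of
%% looping forever when the process keeps being restarted.
wait_until_dead(Name, IntervalMs, Retries)
  when is_atom(Name), is_integer(Retries), Retries >= 0 ->
    case whereis(Name) of
        undefined ->
            ok;
        _ when Retries =:= 0 ->
            {error, still_alive};
        _ ->
            timer:sleep(IntervalMs),
            wait_until_dead(Name, IntervalMs, Retries - 1)
    end.
```

A test could then assert on the result, for example `ok = wait_dead_sketch:wait_until_dead(RaName, 500, 10)`, and fail with a clear error rather than hang.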
rabbit_log:info("~ts: delivery_limit not set, defaulting to ~b",
[rabbit_misc:rs(QName), ?DEFAULT_DELIVERY_LIMIT]),
maybe_log(ShouldLog, info,
"~ts: delivery_limit not set, defaulting to ~b",
Dear Core Team, should this message be logged unconditionally as well? This is not a misconfiguration: if a user is happy with the default value and does not set an explicit delivery-limit, this will be logged all the time for all quorum queues.
It's there to make it clear that a default is set. We can perhaps lower it to debug in 4.1 or remove it completely, but I think making users aware of this potentially breaking change does no harm.
We can move this to a function that is not called periodically and remove it from this function.
In fact, we don't necessarily have to do it in rabbit_quorum_queue if there's a more suitable alternative where the delivery limit is known.
Is there anything we can do about this PR?
@LoisSotoLopez according to
Instead of checking the values for current configuration, represented in `rabbit_quorum_queue:handle_tick` by the `Overview` variable, against the effective policy, just regenerate the configuration and compare with the current configuration.
(some of this is just reverting to the original format to reduce the diff against main)
Removes the usage of a ShouldLog parameter on several functions and limits the logging of the message warning about the delivery_limit not being set to the moment of queue declaration.
0ccce0b to 9dc9f97
To summarize this change: the optional policy re-application happens as part of the periodic QQ replica "tick", also used by the continuous membership reconciliation feature.
In cases where effective policy keys do not change, the new operation is a no-op.
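The no-op behaviour can be illustrated with a small sketch. This is not the implementation of maybe_apply_policies/2 from the PR: the module `policy_sync_sketch`, the function `maybe_apply/2`, and the representation of configurations as plain maps are all assumptions made for illustration. The idea is simply to regenerate the expected configuration from the effective policy and issue a command only when it differs from what the queue is currently running with.

```erlang
-module(policy_sync_sketch).
-export([maybe_apply/2]).

%% Hypothetical helper, for illustration only.
%% Current:  configuration the queue is running with (assumed to be a map).
%% Expected: configuration regenerated from the effective policy.
%% Returns noop when both match, so a periodic tick costs nothing
%% in the common case; otherwise returns the configuration to apply.
maybe_apply(Current, Expected) when is_map(Current), is_map(Expected) ->
    case Current =:= Expected of
        true ->
            noop;
        false ->
            {apply, Expected}
    end.
```

For example, `policy_sync_sketch:maybe_apply(#{delivery_limit => 20}, #{delivery_limit => 20})` returns `noop`, while a differing expected map returns `{apply, Expected}`.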
cd deps/rabbit
gmake ct-quorum_queue
as well as
bazel test //deps/rabbit:quorum_queue_SUITE
fail. Specifically, one policy-based test fails with Make, both on Actions and locally, with Khepri and with Mnesia. In other words, reliably:
quorum_queue_SUITE > single_node > queue_ttl
quorum_queue_SUITE > clustered > cluster_size_3 > queue_ttl
Since maybe_apply_policies/2 is the only new function that seems relevant, I suggest adding some debug logging there and comparing what it logs vs. main.
Some details:
While this could have been a flake, it is too persistent to conclude that.
Indeed, this also fails for me locally with the same error, thanks for the details. We will look at it on Monday.
@gomoripeti were you able to collect some more data from the tests/test suite in question? 2577b7e does not explain why the keys were removed.
The new
In other words, this is ready for another round 👍
@gomoripeti @LoisSotoLopez I'm afraid we are still not done here. #12640 and #12641 pass all tests on PR branches and in
We won't revert the PR but please take a look at this failure (with Make) in
Do I see correctly that the
@gomoripeti it turned out to be a flake, it seems. The old version is the latest patch of the previous series, so,
This is a very good hypothesis. In which case the test should be skipped for mixed version clusters because it won't be deterministic by definition. Like here, for example.
@michaelklishin Maybe you prefer to add that change yourself. In case you don't, I PRed it here: #12665. Feel free to close that one if contributing it yourself eases the process 👍
@LoisSotoLopez your changes seem fine, I have submitted #12666.
Proposed Changes
As documented in #7863:
If a quorum queue is unavailable when a policy is changed it may never apply the resulting configuration command and thus be out of sync with the matching policy.
This PR provides a function in rabbit_quorum_queue.erl that checks whether the current Ra Machine configuration for a queue corresponds to the expected configuration to be in use based on defined policies. That function is called by each queue process on tick (handle_tick).