Improved handling of notification in cluster::node_status_backend
#11164
Conversation
Using a queue to process member updates allows the backend to make notification handling a future-returning function. Thanks to the queue, there is no risk of updates being reordered by the scheduling of the background futures that handle them. Signed-off-by: Michal Maslanka <michal@redpanda.com>
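Roughly, the pattern this commit describes could look like the sketch below. This is not the code from the PR itself: the `member_update` struct, the class and member names, and the use of `ssx::spawn_with_gate` are assumptions made for illustration, on top of a Seastar-style `ss::future<>`/`co_await` environment.

```cpp
// Illustrative sketch only, not the code from this PR. Assumes a Seastar
// coroutine environment (ss::future<>, co_await, ss::gate) and Redpanda's
// ssx::spawn_with_gate utility; names here are made up for the example.
struct member_update {
    model::node_id id;
    model::membership_state state;
};

class notification_dispatcher {
public:
    // Called synchronously from the members_table notification. It never
    // blocks: it only records the update and ensures one drain fiber exists.
    void on_members_updated(member_update update) {
        _pending.push_back(std::move(update));
        // Start a drain fiber only when the queue goes from empty to
        // non-empty; otherwise the running fiber picks the new entry up.
        if (_pending.size() == 1) {
            ssx::spawn_with_gate(_gate, [this] { return drain(); });
        }
    }

private:
    // Single consumer: updates are applied strictly in arrival order, even
    // though handling each one is an asynchronous, future-returning call.
    ss::future<> drain() {
        while (!_pending.empty()) {
            auto& n = _pending.front();
            co_await handle_members_updated_notification(n.id, n.state);
            // Pop after the co_await so the queue stays non-empty while an
            // update is in flight, keeping the size() == 1 check above valid.
            _pending.pop_front();
        }
    }

    // The actual asynchronous work (status table update, etc.) goes here.
    ss::future<> handle_members_updated_notification(
      model::node_id, model::membership_state);

    std::deque<member_update> _pending;
    ss::gate _gate;
};
```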
When a node joins the cluster it should be added to the `node_status_table` so that all the downstream systems relying on the node status are aware that the node appeared in the cluster. Previously, adding a node to the status table was deferred until the first successful heartbeat reply was received. Signed-off-by: Michal Maslanka <michal@redpanda.com>
Previously, nodes were not removed from the status table. Signed-off-by: Michal Maslanka <michal@redpanda.com>
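The intent of these two commits could be sketched as follows. The `add_node()`/`remove_node()` calls are hypothetical stand-ins for whatever the real `node_status_table` exposes, and the `membership_state` handling is an assumption; the point is only that the table is mutated directly from the membership notification instead of waiting for heartbeat traffic.

```cpp
// Illustrative only, not the PR's actual implementation. It assumes the
// notification carries a model::membership_state and that the hypothetical
// add_node()/remove_node() calls stand in for the real node_status_table API.
ss::future<> node_status_backend::handle_members_updated_notification(
  model::node_id id, model::membership_state state) {
    switch (state) {
    case model::membership_state::active:
        // Register the node as soon as it joins, before any heartbeat reply,
        // so a newly joined node that never responds is still tracked.
        _node_status_table.local().add_node(id);
        break;
    case model::membership_state::removed:
        // Drop removed members instead of keeping stale entries around.
        _node_status_table.local().remove_node(id);
        break;
    default:
        break; // other states left to the real implementation
    }
    co_return;
}
```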
/backport v23.1.x
Failed to run cherry-pick command. I executed the commands below:
```cpp
while (!_pending_member_notifications.empty()) {
    auto& notification = _pending_member_notifications.front();
    co_await handle_members_updated_notification(
      notification.id, notification.state);
    _pending_member_notifications.pop_front();
}
```
it looks like this loop is safe against concurrent changes to _pending_member_notifications, except for the last line: when the coroutine resumes after the co_await, if the container is empty then pop_front is undefined behavior.
since the loop is using the common pattern to avoid concurrent modification of the container, should this scenario be a concern?
I guess this is not really possible because this is the only fiber popping from the container. This is like an SPSC queue.
got it. even though this code is safe, the pattern itself is fragile: all it would take to break it is someone fiddling with the notifications queue. how can we avoid having these foot guns? should we focus on building safe utilities (e.g. a queue object that is easy to use correctly), comments, etc.?
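One possible shape for such a utility, purely as a sketch: a wrapper that removes the element from the container before the asynchronous handler can suspend, so resuming after a co_await can never pop from an empty queue. The consume_queue name and its interface are made up here, assuming the same single-consumer, Seastar-style environment discussed above.

```cpp
#include <seastar/core/future.hh> // ss is Redpanda's alias for seastar
#include <deque>
#include <utility>

// Hypothetical helper, not existing Redpanda code. The item is moved out and
// popped *before* the handler suspends, so a resume after co_await can never
// lead to pop_front() on an empty container.
template<typename T>
class consume_queue {
public:
    void push(T item) { _items.push_back(std::move(item)); }

    bool empty() const { return _items.empty(); }

    // Drain everything queued so far, invoking the asynchronous handler once
    // per item, strictly in arrival order.
    template<typename AsyncFn>
    ss::future<> drain(AsyncFn fn) {
        while (!_items.empty()) {
            T item = std::move(_items.front());
            _items.pop_front();
            co_await fn(std::move(item));
        }
    }

private:
    std::deque<T> _items;
};
```

The trade-off is that popping before the co_await means a producer can no longer use "the queue was empty before my push" as the signal that no drain fiber is running, so a wrapper like this would also need to track whether a drain is in progress.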
The partition balancer is going to rely on the `node_status_table` when deciding which cluster members are unresponsive. Previously the `node_status_table` was not updated until the first status response was received from a newly joined node, which made it impossible to recognize a failure of a newly joined node. Additionally, removed members were never removed from the status table.

Backports Required
Release Notes