
Improved handling of notification in cluster::node_status_backend #11164

Merged

Conversation

@mmaslankaprv (Member) commented Jun 2, 2023

The partition balancer is going to rely on the node_status_table when deciding which cluster members are unresponsive. Previously the node_status_table was not updated until the first status response was received from a newly joined node, which made it impossible to recognize a failure of a newly joined node. Additionally, removed members were never removed from the status table.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

  • none

Using a queue to process member updates allows the backend to make the notification handler a future-returning function. Thanks to the queue, there is no risk of updates being reordered by the scheduling of the background futures that handle them.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
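
As an illustration of the pattern this commit message describes, here is a minimal sketch using Seastar coroutines. All names in it (status_backend_sketch, member_update, on_members_updated, handle_update) are hypothetical and do not reflect the actual cluster::node_status_backend code.

// Illustrative sketch only; not the actual Redpanda implementation.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>

#include <deque>
#include <utility>

struct member_update {
    int id;
    int state; // stand-in for the real member state type
};

class status_backend_sketch {
public:
    // Synchronous notification callback: it only enqueues, so the caller
    // never has to wait on the future-returning handler.
    void on_members_updated(member_update update) {
        if (_gate.is_closed()) {
            return; // shutting down, drop the update
        }
        _pending.push_back(std::move(update));
        if (_draining) {
            return; // the running drain fiber will pick this entry up
        }
        _draining = true;
        // Fire-and-forget background fiber; error handling omitted for brevity.
        (void)seastar::with_gate(_gate, [this] { return drain(); });
    }

    seastar::future<> stop() { return _gate.close(); }

private:
    // Single consumer fiber: entries are handled strictly in FIFO order, so
    // scheduling of background futures cannot reorder updates.
    seastar::future<> drain() {
        while (!_pending.empty()) {
            auto& update = _pending.front();
            co_await handle_update(update.id, update.state);
            _pending.pop_front();
        }
        // No suspension point between the empty() check above and this line,
        // so a notification pushed while the fiber was suspended is never missed.
        _draining = false;
    }

    seastar::future<> handle_update(int /*id*/, int /*state*/) {
        co_return; // the real handler would update the node_status_table
    }

    std::deque<member_update> _pending;
    bool _draining{false};
    seastar::gate _gate;
};

The single drain fiber is what prevents reordering: it is the only place that pops from the queue, and the _draining flag is only reset after the final emptiness check, with no co_await in between.
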
When a node joins the cluster it should be added to the `node_status_table` so that all downstream systems relying on the node status are aware that the node appeared in the cluster.

Previously, adding a node to the status table was deferred until the first successful heartbeat reply was received.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Previously, nodes were never removed from the status table.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@mmaslankaprv mmaslankaprv merged commit 80fafd3 into redpanda-data:dev Jun 5, 2023
@mmaslankaprv mmaslankaprv deleted the node-status-improvements branch June 5, 2023 16:04
@vshtokman (Contributor)

/backport v23.1.x

@vbotbuildovich (Collaborator)

Failed to run cherry-pick command. I executed the commands below:

git checkout -b backport-pr-11164-v23.1.x-165 remotes/upstream/v23.1.x
git cherry-pick -x ce0db70f6ae6cde93887315f2a6fb94c8a03b89b 5b69a7c1e067e184606619480a2cc884fd6ffe12 54108c8a311c1c4b1cb4200f5af1ee0d2f9c6fdb

Workflow run logs.

Comment on lines +69 to +73
while (!_pending_member_notifications.empty()) {
    auto& notification = _pending_member_notifications.front();
    co_await handle_members_updated_notification(
      notification.id, notification.state);
    _pending_member_notifications.pop_front();
Member
@mmaslankaprv @bharathv

it looks like this loop is safe against concurrent changes to _pending_member_notifications, except for the last line: when the coroutine resumes after the co_await, if the container is empty, then pop_front is undefined behavior.

since the loop uses the common pattern for avoiding concurrent modification of the container, should this scenario be a concern?
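
For illustration only (not a change the PR makes): one way to sidestep the pop_front-after-resume question is to take the entry out of the container before suspending, so nothing touches the queue across the co_await.

while (!_pending_member_notifications.empty()) {
    // Take ownership of the entry before suspending, so pop_front never
    // runs after the coroutine resumes.
    auto notification = std::move(_pending_member_notifications.front());
    _pending_member_notifications.pop_front();
    co_await handle_members_updated_notification(
      notification.id, notification.state);
}

The trade-off is that an exception from the handler would drop the entry instead of leaving it at the front of the queue.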

Contributor

I guess this is not really possible because this is the only fiber popping from the container. This is like an SPSC queue.

Member

got it. even though this code is safe, the pattern itself is fragile: all it would take to break it is someone fiddling with the notifications queue. how can we avoid having these footguns? should we focus on building safe utilities (e.g. a queue object that is easy to use correctly), comments, etc.?
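
One possible shape for such a safe utility is a small wrapper that owns both the container and the single consumer fiber, so producers can only push. This is a hedged sketch assuming Seastar's condition_variable and gate; serial_consumer_queue is a hypothetical name, not an existing Seastar or Redpanda class.

// Hypothetical single-consumer queue utility; illustrative only.
#include <seastar/core/condition-variable.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>

#include <deque>
#include <functional>
#include <utility>

template<typename T>
class serial_consumer_queue {
public:
    using handler_t = std::function<seastar::future<>(T)>;

    explicit serial_consumer_queue(handler_t handler)
      : _handler(std::move(handler)) {}

    void start() {
        // The one and only consumer fiber lives inside the utility.
        (void)seastar::with_gate(_gate, [this] { return consume_loop(); });
    }

    seastar::future<> stop() {
        _stopping = true;
        _has_work.broadcast();
        return _gate.close();
    }

    // The only operation exposed to producers.
    void push(T item) {
        _items.push_back(std::move(item));
        _has_work.signal();
    }

private:
    seastar::future<> consume_loop() {
        while (!_stopping) {
            co_await _has_work.wait(
              [this] { return _stopping || !_items.empty(); });
            while (!_items.empty()) {
                // Pop before suspending so the container is never touched
                // across a co_await.
                auto item = std::move(_items.front());
                _items.pop_front();
                co_await _handler(std::move(item));
            }
        }
    }

    handler_t _handler;
    std::deque<T> _items;
    seastar::condition_variable _has_work;
    seastar::gate _gate;
    bool _stopping{false};
};

With something like this, the backend's notification callback would reduce to a single push() call, and the fragile invariant (only one fiber pops from the container) would live inside the utility rather than in each call site.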
