
Improved handling of notification in cluster::node_status_backend #11164

Merged

Conversation

@mmaslankaprv (Member) commented Jun 2, 2023

The partition balancer is going to rely on the node_status_table when deciding which cluster members are unresponsive. Previously the node_status_table was not updated until the first status response was received from a newly joined node, which made it impossible to recognize a failure of a newly joined node. Additionally, removed members were never removed from the status table.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

  • none

Using a queue to process member updates allows the backend to make the notification handler a future-returning function. Thanks to the queue, there is no risk of updates being reordered by the scheduling of the background futures that handle them.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
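
As an illustration of the pattern this commit message describes, here is a minimal sketch using Seastar coroutines. All names in it (status_backend_sketch, member_update, on_members_updated, handle_update) are hypothetical and do not reflect the actual cluster::node_status_backend code.

// Illustrative sketch only; not the actual Redpanda implementation.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>

#include <deque>
#include <utility>

struct member_update {
    int id;
    int state; // stand-in for the real member state type
};

class status_backend_sketch {
public:
    // Synchronous notification callback: it only enqueues, so the caller
    // never has to wait on the future-returning handler.
    void on_members_updated(member_update update) {
        if (_gate.is_closed()) {
            return; // shutting down, drop the update
        }
        _pending.push_back(std::move(update));
        if (_draining) {
            return; // the running drain fiber will pick this entry up
        }
        _draining = true;
        // Fire-and-forget background fiber; error handling omitted for brevity.
        (void)seastar::with_gate(_gate, [this] { return drain(); });
    }

    seastar::future<> stop() { return _gate.close(); }

private:
    // Single consumer fiber: entries are handled strictly in FIFO order, so
    // scheduling of background futures cannot reorder updates.
    seastar::future<> drain() {
        while (!_pending.empty()) {
            auto& update = _pending.front();
            co_await handle_update(update.id, update.state);
            _pending.pop_front();
        }
        // No suspension point between the empty() check above and this line,
        // so a notification pushed while the fiber was suspended is never missed.
        _draining = false;
    }

    seastar::future<> handle_update(int /*id*/, int /*state*/) {
        co_return; // the real handler would update the node_status_table
    }

    std::deque<member_update> _pending;
    bool _draining{false};
    seastar::gate _gate;
};

The single drain fiber is what prevents reordering: it is the only place that pops from the queue, and the _draining flag is only reset after the final emptiness check, with no co_await in between.
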
When a node joins the cluster it should be added to the `node_status_table` so that all downstream systems relying on the node status are aware that the node appeared in the cluster.

Previously, adding a node to the status table was deferred until the first successful heartbeat reply was received.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Previously, nodes were never removed from the status table.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@mmaslankaprv mmaslankaprv merged commit 80fafd3 into redpanda-data:dev Jun 5, 2023
@mmaslankaprv mmaslankaprv deleted the node-status-improvements branch June 5, 2023 16:04
@vshtokman (Contributor)

/backport v23.1.x

@vbotbuildovich (Collaborator)

Failed to run cherry-pick command. I executed the commands below:

git checkout -b backport-pr-11164-v23.1.x-165 remotes/upstream/v23.1.x
git cherry-pick -x ce0db70f6ae6cde93887315f2a6fb94c8a03b89b 5b69a7c1e067e184606619480a2cc884fd6ffe12 54108c8a311c1c4b1cb4200f5af1ee0d2f9c6fdb

Workflow run logs.

Comment on lines +69 to +73
while (!_pending_member_notifications.empty()) {
    auto& notification = _pending_member_notifications.front();
    co_await handle_members_updated_notification(
      notification.id, notification.state);
    _pending_member_notifications.pop_front();
Member
@mmaslankaprv @bharathv

it looks like this loop is safe against concurrent changes to _pending_member_notifications, except for the last line: when the coroutine resumes after the co_await, if the container is empty, then pop_front is undefined behavior.

since the loop uses the common pattern for avoiding concurrent modification of the container, should this scenario be a concern?
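
For illustration only (not a change the PR makes): one way to sidestep the pop_front-after-resume question is to take the entry out of the container before suspending, so nothing touches the queue across the co_await.

while (!_pending_member_notifications.empty()) {
    // Take ownership of the entry before suspending, so pop_front never
    // runs after the coroutine resumes.
    auto notification = std::move(_pending_member_notifications.front());
    _pending_member_notifications.pop_front();
    co_await handle_members_updated_notification(
      notification.id, notification.state);
}

The trade-off is that an exception from the handler would drop the entry instead of leaving it at the front of the queue.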

Contributor

I guess this is not really possible because this is the only fiber popping from the container. This is like an SPSC queue.

Member

got it. even though this code is safe, the pattern itself is fragile: all it would take to break it is someone fiddling with the notifications queue. how can we avoid having these footguns? should we focus on building safe utilities (e.g. a queue object that is easy to use correctly), comments, etc.?
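
One possible shape for such a safe utility is a small wrapper that owns both the container and the single consumer fiber, so producers can only push. This is a hedged sketch assuming Seastar's condition_variable and gate; serial_consumer_queue is a hypothetical name, not an existing Seastar or Redpanda class.

// Hypothetical single-consumer queue utility; illustrative only.
#include <seastar/core/condition-variable.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>

#include <deque>
#include <functional>
#include <utility>

template<typename T>
class serial_consumer_queue {
public:
    using handler_t = std::function<seastar::future<>(T)>;

    explicit serial_consumer_queue(handler_t handler)
      : _handler(std::move(handler)) {}

    void start() {
        // The one and only consumer fiber lives inside the utility.
        (void)seastar::with_gate(_gate, [this] { return consume_loop(); });
    }

    seastar::future<> stop() {
        _stopping = true;
        _has_work.broadcast();
        return _gate.close();
    }

    // The only operation exposed to producers.
    void push(T item) {
        _items.push_back(std::move(item));
        _has_work.signal();
    }

private:
    seastar::future<> consume_loop() {
        while (!_stopping) {
            co_await _has_work.wait(
              [this] { return _stopping || !_items.empty(); });
            while (!_items.empty()) {
                // Pop before suspending so the container is never touched
                // across a co_await.
                auto item = std::move(_items.front());
                _items.pop_front();
                co_await _handler(std::move(item));
            }
        }
    }

    handler_t _handler;
    std::deque<T> _items;
    seastar::condition_variable _has_work;
    seastar::gate _gate;
    bool _stopping{false};
};

With something like this, the backend's notification callback would reduce to a single push() call, and the fragile invariant (only one fiber pops from the container) would live inside the utility rather than in each call site.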
