
raft: Implement force partition reconfiguration API #9785

Merged
17 commits merged into redpanda-data:dev on May 10, 2023

Conversation

@bharathv (Contributor) commented Apr 3, 2023

Adds POST /v1/debug/partitions/{namespace}/{topic}/{partition}/force_replicas to uncleanly force reconfigure a given partition into a smaller raft group. Only to be used in exceptional circumstances, such as recovering from a lost majority.

Fixes #9096
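For illustration, a minimal sketch of invoking the new endpoint against a broker's admin API with Python's requests library. The admin port and the request-body shape (a JSON list of target replicas with node_id and core) are assumptions for this sketch, not taken from this PR; consult the swagger changes in the diff for the actual contract.

```python
# Illustrative sketch only: the admin address/port and the body shape below
# are assumptions, not the documented contract of this endpoint.
import requests

ADMIN = "http://localhost:9644"  # hypothetical broker admin address
ns, topic, partition = "kafka", "my-topic", 0

# Assumed body: the surviving replicas the partition should be forced onto.
new_replicas = [
    {"node_id": 1, "core": 0},
    {"node_id": 2, "core": 0},
]

resp = requests.post(
    f"{ADMIN}/v1/debug/partitions/{ns}/{topic}/{partition}/force_replicas",
    json=new_replicas,
    timeout=10,
)
resp.raise_for_status()
print("force reconfiguration accepted:", resp.status_code)
```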

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Features

  • Implements a force reconfiguration API to uncleanly reconfigure a raft group into a smaller replica set. Meant to be used as an escape hatch (or by equivalent higher-level abstractions) to recover raft groups from an otherwise unrecoverable state.

@bharathv bharathv requested a review from a team as a code owner April 3, 2023 05:11
@bharathv bharathv requested review from ivotron and removed request for a team April 3, 2023 05:11
@bharathv bharathv requested review from mmaslankaprv and ztlpn and removed request for ivotron April 3, 2023 17:03
@bharathv bharathv force-pushed the eh_force_reconfig branch from ff73327 to 2f4117d Compare April 18, 2023 01:04
@bharathv (Contributor Author)

The latest force push rebases on the controller snapshot changes for the topic table.

I think the controller snapshot doesn't require any special changes for the new force reconfiguration command type. I was trying to convince myself of that, and I think it boils down to a couple of (interesting) cases around when the snapshot is taken (dumping my notes here for reviewers).

  1. Snapshot is taken after force_reconfiguration is finished.
  2. Snapshot is taken while the force_reconfiguration is in progress (very small window but theoretically feasible) but hasn't finished yet.

Since force_reconfiguration is also supported for in-progress updates, there are two more possibilities (inp_update=True/False). Overall we have the following four cases:

a. (case1, inp_update=True)
b. (case1, inp_update=False)
c. (case2, inp_update=True)
d. (case2, inp_update=False)

case a/b: the snapshot applies the final forced assignment; simple.
case d: the snapshot applies the current assignment on the node applying it, which will be cleaned up if needed (depending on whether the node is part of the replica set) when update_finished is processed from the controller log.
case c: the snapshot applies the delta for the in-progress move, which will be marked done when update_finished from the log is processed.
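To make the enumeration above concrete, here is a purely illustrative Python sketch of the expected handling per case; the function and parameter names are hypothetical and do not mirror actual Redpanda types:

```python
# Purely illustrative sketch: hypothetical names, not Redpanda code. Maps the
# four (snapshot timing, in-progress update) combinations above to the
# expected outcome when a node applies the controller snapshot.
def snapshot_handling(snapshot_taken_after_force_finished: bool,
                      update_was_in_progress: bool) -> str:
    if snapshot_taken_after_force_finished:
        # Cases a and b: the snapshot already contains the final forced
        # assignment; nothing further to reconcile.
        return "apply final forced assignment"
    if not update_was_in_progress:
        # Case d: apply the current assignment; any cleanup happens when
        # update_finished is processed from the controller log.
        return "apply current assignment, cleanup on update_finished"
    # Case c: apply the delta for the in-progress move; it is marked done
    # when update_finished from the log is processed.
    return "apply in-progress delta, completed on update_finished"
```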

@@ -211,6 +211,15 @@ topic_updates_dispatcher::apply_update(model::record_batch b) {
*
* target_replicas -> previous_replicas
*
* Note on force reconfiguration:
* force reconfiguration can happen in a couple of cases
* 1. An in_progress update
Member:

Do we need to complicate force reconfiguration with being able to cancel ongoing reallocations? I am wondering, would it be enough to simply force-cancel the reconfiguration and then force-update? The two cases here make things much more complicated.

Contributor Author:

Undone, as discussed. In the latest rev, force_update is not allowed if there is an in-progress update.

Comment on lines 1382 to 1375
if (update_type == topic_table_delta::op_type::force_update) {
    // Pick the first node to issue an update_finished
    // once succeeded.
    return current_replicas.front().node_id == _self;
Member:

What if the first node from the replica set is down? Maybe we can finish that from any node?

Contributor Author:

What if the first node from the replica set is down?

That's true, but I was operating under the assumption that this is unlikely since the window is super small. Anyway, bad assumption; it needs fixing.

Maybe we can finish that from any node?

I think the problem then is that we get duplicate finish commands? The issue with duplicate commands is that allocator updates in topic_updates_dispatcher would be accounted for twice. We do it for force_abort_update though... am I missing something?

@@ -255,6 +255,40 @@
}
}
]
},
{
"path": "/v1/debug/partitions/{namespace}/{topic}/{partition}/force_replicas",
Member:

nit: maybe for consistency we should put it in /v1/partitions?

Contributor Author:

I put it in debug since it is not meant to be widely used and needs supervision.

Comment on lines 1185 to 1178
if (type == topic_table_delta::op_type::force_update) {
    // The shard-local partition is no longer part of the new forced
    // assignment, so there is nothing to do here; any cleanup happens
    // when the update_finished command is applied.
    co_return errc::success;
}
Member:

Why do we not want to force reconfigure the partition in this case, i.e. it exists on the current shard but should not? I think we should; imagine a situation in which the replicas become alive again after some time.

Contributor Author:

This branch is when the shard-local partition is no longer a part of the new forced assignment. In that case there is nothing to do during force_update; any cleanup will be done as part of the update_finished command. Am I missing something?

src/v/cluster/types.h (resolved)
src/v/cluster/controller_backend.h (outdated, resolved)
if (in_progress_it != in_progress.end()) {
    const auto& update = (*in_progress_it).second;
    const auto state = update.get_state();
    if (state != cluster::reconfiguration_state::in_progress) {
Contributor:

I think we should check all state preconditions in the topic table apply method, because only there are we guaranteed to observe up-to-date state; checking here can be an optimization.

src/v/redpanda/admin_server.cc (outdated, resolved)
src/v/cluster/topic_table.cc (resolved)
@piyushredpanda piyushredpanda modified the milestone: v23.1.8 Apr 20, 2023
@dotnwat (Member) commented Apr 21, 2023

/ci-repeat 1

@toddfarmer

Not planning to backport at this time.

@bharathv bharathv force-pushed the eh_force_reconfig branch from 142caef to 4099132 Compare May 5, 2023 20:58
@bharathv (Contributor Author) commented May 6, 2023

Test failure: #10497 (known issue, slow startup in debug).

src/v/cluster/types.h (outdated, resolved)
if controller_snapshots:
    # Wait for a few seconds to make sure snapshots happen.
    time.sleep(3)
Contributor:

You can use self.redpanda.wait_for_controller_snapshot() here to ensure that the snapshot has really been created. There was a small problem with this because controller snapshots depended on the initialization of metrics_reporter, and sometimes it was slow to initialize, but this should now be fixed by #10387.
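A minimal sketch of the suggested change, assuming the wait_for_controller_snapshot() helper mentioned above is exposed on self.redpanda and takes no arguments:

```python
if controller_snapshots:
    # Wait until the controller snapshot has actually been created instead
    # of sleeping for a fixed interval (helper referenced in the review
    # comment above; exact signature assumed).
    self.redpanda.wait_for_controller_snapshot()
```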

replication=reduced_replica_set_size,
hosts=alive_nodes_after)
if acks == -1:
    self.run_validation()
Contributor:

In theory we can still experience data loss, right? E.g. if a message is replicated to the killed nodes (and only to them) and then we kill them and reconfigure to alive nodes, they won't have that message.

Contributor Author:

That's right, although the window for it in the test is probably quite small. Will ci-repeat just in case and adjust the test accordingly.

tests/rptest/tests/partition_force_reconfiguration_test.py (outdated, resolved)
src/v/cluster/topic_updates_dispatcher.cc (resolved)
template<typename Func>
ss::future<std::error_code>
consensus::force_replace_configuration_locally(Func&& f) {
    return _op_lock.get_units()
Contributor:

nit: coroutines?

Member:

IIRC there are some issues with coroutine templates in at least the version of clang we are using (14). That might be what's happening here?

Contributor Author:

Done (simplified a bit too). I was reusing some code and ended up retaining the original continuations.

Contributor:

IIRC there are some issues with coroutine templates in at least the version of clang we are using (14).

Right, I remember experiencing those too... but if it compiles then it is okay :)

bharathv added 17 commits May 8, 2023 22:29
.. with new broker set. This can cause data loss when the replicas
chosen are lagging behind the majority. Hence it should be used with caution
and under exceptional circumstances.
A force_update operation is like update but, as the name suggests, it
is intended to force a reconfiguration. It is separately applied
on the raft instances of the target assignment. Adds a placeholder
implementation that is intentionally empty, to be filled in the next commit.
This command drives the force replica set update.
In order to distinguish force_update from other in-progress states, this adds
a new state.
.. in change_partition_replicas(). This makes it possible to reuse
the method for forced moves by passing the appropriate boolean.

Reverts e37c308 and redoes it in a different way now that we have
a new reconfiguration_state.
'debug' because it is not meant to be used without a good understanding
of the cluster state.
.. to be (re)used in the next method.
@bharathv bharathv force-pushed the eh_force_reconfig branch from 4099132 to a88a7c5 Compare May 9, 2023 07:22
@bharathv bharathv requested review from ztlpn and dotnwat May 9, 2023 07:23
@bharathv (Contributor Author)

/ci-repeat 3
skip-units
dt-repeat=50
tests/rptest/tests/partition_force_reconfiguration_test.py

@bharathv bharathv merged commit 2937f26 into redpanda-data:dev May 10, 2023
@bharathv bharathv deleted the eh_force_reconfig branch May 10, 2023 17:36
@mattschumpert

Follow up is #10574

Development

Successfully merging this pull request may close these issues.

raft: recover from situations where majority of replicas are lost
7 participants