
raft: Implement force partition reconfiguration API #9785

Merged
17 commits merged into redpanda-data:dev on May 10, 2023

Conversation

@bharathv (Contributor) commented Apr 3, 2023

Adds POST /v1/debug/partitions/{namespace}/{topic}/{partition}/force_replicas to uncleanly force reconfigure a given partition into a smaller raft group. Only to be used in exceptional circumstances, such as recovering from a lost majority.

Fixes #9096
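For illustration, a minimal sketch of invoking the new endpoint against a broker's admin API with Python's requests library. The admin port and the request-body shape (a JSON list of target replicas with node_id and core) are assumptions for this sketch, not taken from this PR; consult the swagger changes in the diff for the actual contract.

```python
# Illustrative sketch only: the admin address/port and the body shape below
# are assumptions, not the documented contract of this endpoint.
import requests

ADMIN = "http://localhost:9644"  # hypothetical broker admin address
ns, topic, partition = "kafka", "my-topic", 0

# Assumed body: the surviving replicas the partition should be forced onto.
new_replicas = [
    {"node_id": 1, "core": 0},
    {"node_id": 2, "core": 0},
]

resp = requests.post(
    f"{ADMIN}/v1/debug/partitions/{ns}/{topic}/{partition}/force_replicas",
    json=new_replicas,
    timeout=10,
)
resp.raise_for_status()
print("force reconfiguration accepted:", resp.status_code)
```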

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Features

  • Implements a force reconfiguration API to uncleanly reconfigure a raft group into a smaller replica set. Meant to be used as an escape hatch (or by equivalent higher-level abstractions) to recover raft groups from an otherwise unrecoverable state.

@bharathv bharathv requested a review from a team as a code owner April 3, 2023 05:11
@bharathv bharathv requested review from ivotron and removed request for a team April 3, 2023 05:11
@bharathv bharathv requested review from mmaslankaprv and ztlpn and removed request for ivotron April 3, 2023 17:03
@bharathv bharathv force-pushed the eh_force_reconfig branch from ff73327 to 2f4117d Compare April 18, 2023 01:04
@bharathv (Contributor Author)

The latest force push rebases on the controller snapshot changes for the topic table.

I think the controller snapshot doesn't require any special changes for the new force reconfiguration command type. I was trying to convince myself of that, and I think it boils down to a couple of (interesting) cases around when the snapshot is taken (dumping my notes here for reviewers).

  1. Snapshot is taken after force_reconfiguration is finished.
  2. Snapshot is taken while the force_reconfiguration is in progress (very small window but theoretically feasible) but hasn't finished yet.

Since force_reconfiguration is also supported for in-progress updates, there are two more possibilities (inp_update=True/False). Overall we have the following four cases:

a. (case1, inp_update=True)
b. (case1, inp_update=False)
c. (case2, inp_update=True)
d. (case2, inp_update=False)

case a/b: the snapshot applies the final forced assignment; simple.
case d: the snapshot applies the current assignment on the node applying it, which will be cleaned up if needed (depending on whether the node is part of the replica set) when update_finished is processed from the controller log.
case c: the snapshot applies the delta for the in-progress move, which will be marked done when update_finished from the log is processed.
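To make the enumeration above concrete, here is a purely illustrative Python sketch of the expected handling per case; the function and parameter names are hypothetical and do not mirror actual Redpanda types:

```python
# Purely illustrative sketch: hypothetical names, not Redpanda code. Maps the
# four (snapshot timing, in-progress update) combinations above to the
# expected outcome when a node applies the controller snapshot.
def snapshot_handling(snapshot_taken_after_force_finished: bool,
                      update_was_in_progress: bool) -> str:
    if snapshot_taken_after_force_finished:
        # Cases a and b: the snapshot already contains the final forced
        # assignment; nothing further to reconcile.
        return "apply final forced assignment"
    if not update_was_in_progress:
        # Case d: apply the current assignment; any cleanup happens when
        # update_finished is processed from the controller log.
        return "apply current assignment, cleanup on update_finished"
    # Case c: apply the delta for the in-progress move; it is marked done
    # when update_finished from the log is processed.
    return "apply in-progress delta, completed on update_finished"
```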

@@ -211,6 +211,15 @@ topic_updates_dispatcher::apply_update(model::record_batch b) {
*
* target_replicas -> previous_replicas
*
* Note on force reconfiguration:
* force reconfiguration can happen in a couple of cases
* 1. An in_progress update
Member:

Do we need to complicate force reconfiguration with being able to cancel ongoing reallocations? I am wondering, would it be enough to simply force-cancel the reconfiguration and then force-update? The two cases here make things much more complicated.

Contributor Author:

Undone, as discussed. In the latest rev, force_update is not allowed if there is an in-progress update.

Comment on lines 1382 to 1375
if (update_type == topic_table_delta::op_type::force_update) {
    // Pick the first node to issue an update_finished
    // once succeeded.
    return current_replicas.front().node_id == _self;
Member:

What if the first node from the replica set is down? Maybe we can finish that from any node?

Contributor Author:

What if the first node from the replica set is down?

That's true, but I was operating under the assumption that this is unlikely since the window is super small. Anyway, bad assumption; it needs fixing.

Maybe we can finish that from any node?

I think the problem then is that we get duplicate finish commands? The issue with duplicate commands is that allocator updates in topic_updates_dispatcher would be accounted for twice. We do it for force_abort_update though... am I missing something?

@@ -255,6 +255,40 @@
}
}
]
},
{
"path": "/v1/debug/partitions/{namespace}/{topic}/{partition}/force_replicas",
Member:

nit: maybe for consistency we should put it in /v1/partitions?

Contributor Author:

I put it in debug since it is not meant to be widely used and needs supervision.

Comment on lines 1185 to 1178
if (type == topic_table_delta::op_type::force_update) {
    // The shard-local partition is no longer part of the new forced
    // assignment, so there is nothing to do here; any cleanup happens
    // when the update_finished command is applied.
    co_return errc::success;
}
Member:

Why do we not want to force reconfigure the partition in this case, i.e. it exists on the current shard but should not? I think we should; imagine a situation in which the replicas become alive again after some time.

Contributor Author:

This branch is when the shard-local partition is no longer a part of the new forced assignment. In that case there is nothing to do during force_update; any cleanup will be done as part of the update_finished command. Am I missing something?

src/v/cluster/types.h (resolved)
src/v/cluster/controller_backend.h (outdated, resolved)
if (in_progress_it != in_progress.end()) {
    const auto& update = (*in_progress_it).second;
    const auto state = update.get_state();
    if (state != cluster::reconfiguration_state::in_progress) {
Contributor:

I think we should check all state preconditions in the topic table apply method, because only there are we guaranteed to observe up-to-date state; checking here can be an optimization.

src/v/redpanda/admin_server.cc (outdated, resolved)
src/v/cluster/topic_table.cc (resolved)
@piyushredpanda piyushredpanda modified the milestone: v23.1.8 Apr 20, 2023
@dotnwat (Member) commented Apr 21, 2023

/ci-repeat 1

@toddfarmer

Not planning to backport at this time.

@bharathv bharathv force-pushed the eh_force_reconfig branch from 142caef to 4099132 Compare May 5, 2023 20:58
@bharathv (Contributor Author) commented May 6, 2023

Test failure: #10497 (known issue, slow startup in debug).

src/v/cluster/types.h (outdated, resolved)
if controller_snapshots:
    # Wait for a few seconds to make sure snapshots happen.
    time.sleep(3)
Contributor:

You can use self.redpanda.wait_for_controller_snapshot() here to ensure that the snapshot has really been created. There was a small problem with this because controller snapshots depended on the initialization of metrics_reporter, and sometimes it was slow to initialize, but this should now be fixed by #10387.
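A minimal sketch of the suggested change, assuming the wait_for_controller_snapshot() helper mentioned above is exposed on self.redpanda and takes no arguments:

```python
if controller_snapshots:
    # Wait until the controller snapshot has actually been created instead
    # of sleeping for a fixed interval (helper referenced in the review
    # comment above; exact signature assumed).
    self.redpanda.wait_for_controller_snapshot()
```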

replication=reduced_replica_set_size,
hosts=alive_nodes_after)
if acks == -1:
    self.run_validation()
Contributor:

In theory we can still experience data loss, right? E.g. if a message is replicated to the killed nodes (and only to them) and then we kill them and reconfigure to alive nodes, they won't have that message.

Contributor Author:

That's right, although the window for it in the test is probably quite small. Will ci-repeat just in case and adjust the test accordingly.

tests/rptest/tests/partition_force_reconfiguration_test.py (outdated, resolved)
src/v/cluster/topic_updates_dispatcher.cc (resolved)
template<typename Func>
ss::future<std::error_code>
consensus::force_replace_configuration_locally(Func&& f) {
    return _op_lock.get_units()
Contributor:

nit: coroutines?

Member:

IIRC there are some issues with coroutine templates in at least the version of clang we are using (14). That might be what's happening here?

Contributor Author:

Done (simplified a bit too). I was reusing some code and ended up retaining the original continuations.

Contributor:

IIRC there are some issues with coroutine templates in at least the version of clang we are using (14).

Right, I remember experiencing those too... but if it compiles then it is okay :)

bharathv added 17 commits May 8, 2023 22:29
.. with new broker set. This can cause data loss when the replicas
chosen are lagging behind the majority. Hence it should be used with caution
and under exceptional circumstances.
A force_update operation is like update but, as the name suggests, it
is intended to force a reconfiguration. It is separately applied
on the raft instances of the target assignment. Adds a placeholder
implementation that is intentionally empty, to be filled in the next commit.
This command drives the force replica set update.
In order to distinguish force_update from other in-progress states, this adds
a new state.
.. in change_partition_replicas(). This makes it possible to reuse
the method for forced moves by passing the appropriate boolean.

Reverts e37c308 and redoes it in a different way now that we have
a new reconfiguration_state.
'debug' because it is not meant to be used without a good understanding
of the cluster state.
.. to be (re)used in the next method.
@bharathv bharathv force-pushed the eh_force_reconfig branch from 4099132 to a88a7c5 Compare May 9, 2023 07:22
@bharathv bharathv requested review from ztlpn and dotnwat May 9, 2023 07:23
@bharathv (Contributor Author)

/ci-repeat 3
skip-units
dt-repeat=50
tests/rptest/tests/partition_force_reconfiguration_test.py

@bharathv bharathv merged commit 2937f26 into redpanda-data:dev May 10, 2023
@bharathv bharathv deleted the eh_force_reconfig branch May 10, 2023 17:36
@mattschumpert

Follow up is #10574

Development

Successfully merging this pull request may close these issues.

raft: recover from situations where majority of replicas are lost
7 participants