raft: Implement force partition reconfiguration API #9785
Conversation
Force-pushed from ff73327 to 2f4117d.
Latest force push rebases on the controller snapshot changes for the topic table. I think the controller snapshot doesn't require any special changes for the new force reconfiguration command type. I was trying to convince myself of that, and I think it boils down to a couple of (interesting) cases depending on when the snapshot is taken (dumping my notes here for reviewers).
Since force_reconfiguration is also supported for in-progress updates, we have two more possibilities (inp_update=True/False), giving four cases overall. (case 1, inp_update=True), case a/b: the snapshot applies the final forced assignment, simple.
@@ -211,6 +211,15 @@ topic_updates_dispatcher::apply_update(model::record_batch b) {
*
* target_replicas -> previous_replicas
*
* Note on force reconfiguration:
* force reconfiguration can happen in a couple of cases
* 1. An in_progress update
Do we need to complicate force reconfiguration with being able to cancel ongoing reallocations? I am wondering, would it be enough to simply force-cancel the reconfiguration and then force-update? The two cases here make things much more complicated.
Undone, as discussed. In the latest rev, force_update is not allowed if there is an in-progress update.
src/v/cluster/controller_backend.cc
Outdated
if (update_type == topic_table_delta::op_type::force_update) {
    // Pick the first node to issue an update_finished
    // once succeeded.
    return current_replicas.front().node_id == _self;
What if the first node from the replica set is down? Maybe we can finish that from any node?
> What if the first node from the replica set is down?

That's true, but I was operating under the assumption that this is unlikely since the window is super small... anyway, bad assumption, and it needs fixing.

> Maybe we can finish that from any node?

I think the problem is that we would then have duplicate finish commands. The issue with duplicate commands is that the allocator updates in topic_updates_dispatcher would be accounted for twice. We do it for force_abort_update though... am I missing something?
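If update_finished may be issued from any replica, the dispatcher has to tolerate duplicates. A purely illustrative Python sketch (not the real C++ dispatcher code; all names invented) of making the apply idempotent so allocator bookkeeping is not adjusted twice:

class DispatcherSketch:
    def __init__(self):
        self.updates_in_progress = set()  # partitions being reconfigured
        self.allocator_pending = 0        # toy stand-in for allocator state

    def start_update(self, ntp):
        self.updates_in_progress.add(ntp)
        self.allocator_pending += 1

    def apply_update_finished(self, ntp):
        if ntp not in self.updates_in_progress:
            # A duplicate finish from another replica: ignore it rather
            # than double-counting the allocator update.
            return
        self.updates_in_progress.remove(ntp)
        self.allocator_pending -= 1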
@@ -255,6 +255,40 @@
      }
    }
  ]
},
{
  "path": "/v1/debug/partitions/{namespace}/{topic}/{partition}/force_replicas",
nit: maybe for consistency we should put it in /v1/partitions?
I put it in debug since it is not meant to be widely used and needs supervision.
src/v/cluster/controller_backend.cc
Outdated
if (type == topic_table_delta::op_type::force_update) {
    // The shard-local partition is not part of the new forced
    // assignment, so there is nothing to do here; any cleanup
    // happens as part of the update_finished command.
    co_return errc::success;
}
Why do we not want to force reconfigure the partition in this case, i.e. when it exists on the current shard but should not? I think we should; imagine a situation in which the replicas become alive after some time.
This branch is for when the shard-local partition is no longer a part of the new forced assignment. In that case there is nothing to do during force_update; any cleanup will be done as part of the update_finished command. Am I missing something?
src/v/redpanda/admin_server.cc
Outdated
if (in_progress_it != in_progress.end()) {
    const auto& update = (*in_progress_it).second;
    const auto state = update.get_state();
    if (state != cluster::reconfiguration_state::in_progress) {
I think we should check all state preconditions in the topic table apply method, because only there are we guaranteed to observe up-to-date state; checking here can be an optimization.
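A toy sketch of that split, in Python rather than the real C++ (all names here are invented stand-ins): the state machine's apply method performs the authoritative precondition check, while the admin layer may run the same check earlier purely as a fast-fail optimization.

class TopicTableSketch:
    """Illustrative stand-in for the controller state machine."""

    def __init__(self):
        self.in_progress = {}  # ntp -> reconfiguration state

    def apply_force_update(self, ntp, new_replicas):
        # Authoritative check: only the apply path is guaranteed to
        # observe state that is up to date with the controller log.
        if ntp in self.in_progress:
            return "update_in_progress"
        # ... record the forced assignment for ntp ...
        return "success"

The admin server can run the same check before submitting the command, but the apply result remains the source of truth.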
/ci-repeat 1
Not planning to backport at this time.
Force-pushed from 2f4117d to c2f98dd.
Force-pushed from c2f98dd to 142caef.
Force-pushed from 142caef to 4099132.
Test failure: #10497 (known issue, slow startup in debug).
if controller_snapshots:
    # Wait for a few seconds to make sure snapshots
    # happen.
    time.sleep(3)
You can use self.redpanda.wait_for_controller_snapshot() here to ensure that the snapshot has really been created. There was a small problem with this because controller snapshots depended on the initialization of metrics_reporter and it was sometimes slow to initialize, but this should now be fixed by #10387.
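A sketch of the suggested change (assuming wait_for_controller_snapshot() takes no required arguments, as written in the comment above):

if controller_snapshots:
    # Block until a controller snapshot has actually been created,
    # instead of sleeping for a fixed interval and hoping it happened.
    self.redpanda.wait_for_controller_snapshot()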
replication=reduced_replica_set_size,
hosts=alive_nodes_after)
if acks == -1:
    self.run_validation()
In theory we can still experience data loss, right? E.g. if a message is replicated to the killed nodes (and only to them) and then we kill them and reconfigure to alive nodes, they won't have that message.
That's right... although the window for it in the test is probably quite small. Will ci-repeat just in case and adjust the test accordingly.
src/v/raft/consensus.cc
Outdated
template<typename Func>
ss::future<std::error_code>
consensus::force_replace_configuration_locally(Func&& f) {
    return _op_lock.get_units()
nit: coroutines?
IIRC there are some issues with coroutine templates in at least the clang version we are using (14). That might be what's happening here?
Done (simplified a bit too)... I was reusing some code and ended up retaining the original continuations.
> IIRC there are some issues with coroutine templates in at least the clang version we are using (14).

Right, I remember experiencing those too... but if it compiles then it is okay :)
.. with the new broker set. This can cause data loss when the chosen replicas are lagging behind the majority. Hence it should be used with caution and under exceptional circumstances.
A force_update operation is like update but, as the name suggests, it is intended to force a reconfiguration. It is separately applied on the raft instances of the target assignment. Adds a placeholder implementation that is intentionally empty, to be filled in the next commit.
This command drives the force replica set update.
In order to distinguish force_update from other in-progress states, adds a new state.
.. in change_partition_replicas(). This makes it possible to reuse the method for forced moves by passing the appropriate boolean. Reverts e37c308 and redoes it in a different way now that we have a new reconfiguration_state.
'debug' because it is not meant to be used without a good understanding of the cluster state.
.. to be (re)used in the next method.
Force-pushed from 4099132 to a88a7c5.
/ci-repeat 3
Follow-up is #10574
Adds
POST /v1/debug/partitions/{namespace}/{topic}/{partition}/force_replicas
to force reconfigure a given partition uncleanly into a smaller raft group. Only to be used in exceptional circumstances, like recovering from a lost majority. Fixes #9096
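A hypothetical usage sketch in Python. The admin address and the request-body schema are assumptions, not confirmed by this PR (the replica list mirrors the [{"node_id": ..., "core": ...}] shape used by the non-debug replicas endpoint), so check them against the shipped API:

import requests

# Assumed admin API address; adjust for your cluster.
ADMIN = "http://localhost:9644"
# Force the partition kafka/my-topic/0 onto a smaller replica set.
path = "/v1/debug/partitions/kafka/my-topic/0/force_replicas"
# Assumed body schema: same replica-list shape as the move endpoint.
new_replicas = [{"node_id": 1, "core": 0}]

resp = requests.post(ADMIN + path, json=new_replicas)
resp.raise_for_status()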
Backports Required
Release Notes
Features