Transfer leadership before stepping down after reconfiguration #19966

Merged
merged 3 commits into redpanda-data:dev from transfer-step-down on Jun 28, 2024

Conversation

@mmaslankaprv (Member) commented Jun 24, 2024

When a node that is currently a raft group leader is not part of the new
configuration it must step down and become a follower. When stepping
down, the leader stops sending heartbeats to the followers, allowing
them to trigger an election. The election starts only after an election
timeout elapses on one of the followers. This makes the whole process
slow, and during that whole time clients cannot write to or read from
the raft group as it is leaderless. To address this issue a new method
of stepping down was introduced. The new stepdown implementation, which
is going to be used for reconfiguration, requests one of the followers
to time out immediately and trigger a leader election. This speeds up
the whole process and makes it much less disruptive, as the stepdown is
now comparable to a leadership transfer.
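
A minimal sketch of the idea, assuming hypothetical names (raft_state, request_immediate_timeout) that are not the actual consensus.cc API: instead of silently stopping heartbeats, the outgoing leader asks one member of the new configuration to treat its election timeout as already expired, so an election starts right away.

#include <optional>
#include <vector>

// Hypothetical sketch only; names do not match the real consensus.cc code.
using node_id = int;

enum class role { leader, follower };

struct raft_state {
    role current_role = role::leader;
    std::optional<node_id> leader_id;
    std::vector<node_id> new_configuration; // voters after reconfiguration
};

// Stand-in for an RPC asking a follower to act as if its election timeout
// has already elapsed (the mechanism a leadership transfer also relies on).
void request_immediate_timeout(node_id target);

void step_down_after_reconfiguration(raft_state& st) {
    // Ask one member of the new configuration to start an election right
    // away instead of waiting for a full election timeout to elapse.
    if (!st.new_configuration.empty()) {
        request_immediate_timeout(st.new_configuration.front());
    }
    // Relinquish leadership locally and become a follower.
    st.current_role = role::follower;
    st.leader_id = std::nullopt;
}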

Time to elect a leader before the fix:

INFO  2024-06-24 11:19:34,717 [shard  0:main] raft_test - basic_raft_fixture_test.cc:782 - leadership_transfer - new leader reported after: 74 ms

after reconfiguration:

INFO  2024-06-24 11:19:36,607 [shard  0:main] raft_test - basic_raft_fixture_test.cc:815 - reconfiguration - new leader reported after: 1690 ms

with the fix:

INFO  2024-06-24 12:14:45,170 [shard  0:main] raft_test - basic_raft_fixture_test.cc:817 - reconfiguration - new leader reported after: 66 ms

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • Made leadership changes related to reconfiguration faster and less disruptive

bharathv previously approved these changes Jun 24, 2024
bharathv previously approved these changes Jun 25, 2024
@dotnwat (Member) commented Jun 25, 2024

When a node that is currently a raft group leader is not part of the new
configuration it must step down and become a follower.

does this make sense? if a node is removed from a raft configuration, why would it be a follower?

@bharathv (Contributor)

When a node that is currently a raft group leader is not part of the new
configuration it must step down and become a follower.

does this make sense? if a node is removed from a raft configuration, why would it be a follower?

Michal was probably referring to the implementation detail there. The "becoming a follower" part of the implementation (see do_step_down("reason")) relinquishes leadership, letting a new leader take charge. This is the terminal state for that replica, as it can neither request votes (it is no longer part of the configuration) nor receive any heartbeats, since the rest of the quorum has already forgotten about it. It will be GC-ed by the controller.
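
A minimal sketch of that terminal state, assuming a hypothetical helper (may_request_votes is not the real consensus.cc API): a replica that is no longer a voter can never campaign again, so it stays a follower until it is garbage collected.

#include <algorithm>
#include <vector>

// Hypothetical illustration only; names do not match the real consensus.cc code.
using node_id = int;

bool may_request_votes(const std::vector<node_id>& voters, node_id self) {
    // A replica outside the configuration never campaigns: it cannot request
    // votes, and the remaining quorum no longer sends it heartbeats, so it
    // simply waits to be GC-ed by the controller.
    return std::find(voters.begin(), voters.end(), self) != voters.end();
}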

Added a missing trigger of the Raft leadership notification after stepping
down when a leader node is no longer part of the raft group configuration.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@mmaslankaprv (Member, Author)

/ci-repeat 1

Comment on lines +3031 to +3034
if (_leader_id) {
    _leader_id = std::nullopt;
    trigger_leadership_notification();
}
Contributor:

should this be part of do_step_down?

Member Author:

In some cases (when processing requests) we do not trigger the notification with no leader, but immediately update the leader with the new leader node id.
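
A hypothetical illustration of that request-processing path (names are not the actual consensus.cc API): the leader id is swapped directly to the new leader, so no leaderless notification is emitted.

#include <optional>

// Hypothetical sketch only; not the actual consensus.cc code.
using node_id = int;

struct leadership_state {
    std::optional<node_id> leader_id;
};

void on_request_from_new_leader(leadership_state& st, node_id new_leader) {
    // Jump straight from the old leader to the new one; clients never observe
    // a leaderless interval, so this path skips the notification that the
    // snippet above triggers when the leader id is cleared.
    st.leader_id = new_leader;
}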

src/v/raft/tests/basic_raft_fixture_test.cc
    co_await stop_node(vn.id());
}

auto tolerance = 0.15;
Contributor:

nit: I have a feeling this could be flaky in a noisy debug environment. In the end, what we care about is that this interval is much smaller than the election timeout; maybe we can test that directly.

Member Author:

I was worried about that, and this is why I expressed the expected value based on the leadership transfer that is executed right before the reconfiguration. I was thinking that in a debug environment the leadership transfer would also be slower, hence the test will self-adapt to the environment.
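
A hypothetical sketch of such a self-adapting check (the actual test in basic_raft_fixture_test.cc may be structured differently): the allowed time is derived from the leadership transfer measured in the same run, so a slow debug environment inflates both sides of the comparison.

#include <cassert>
#include <chrono>

// Hypothetical sketch only; the real test may differ.
void check_reconfiguration_election_speed(
  std::chrono::milliseconds leadership_transfer_time,
  std::chrono::milliseconds reconfiguration_election_time) {
    // Allow a 15% margin relative to the leadership transfer measured in the
    // same run, so the expectation adapts to the environment.
    constexpr double tolerance = 0.15;
    const double max_allowed_ms
      = leadership_transfer_time.count() * (1.0 + tolerance);
    assert(reconfiguration_election_time.count() <= max_allowed_ms);
}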

When a node that is currently a raft group leader is not part of the new
configuration it must step down and become a follower. When stepping
down, the leader stops sending heartbeats to the followers, allowing
them to trigger an election. The election starts only after an election
timeout elapses on one of the followers. This makes the whole process
slow, and during that whole time clients cannot write to or read from
the raft group as it is leaderless. To address this issue a new method
of stepping down was introduced. The new stepdown implementation, which
is going to be used for reconfiguration, requests one of the followers
to time out immediately and trigger a leader election. This speeds up
the whole process and makes it much less disruptive, as the stepdown is
now comparable to a leadership transfer.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Added a test validating that a leader election caused by removing the
leader from the replica set takes an amount of time comparable to a
leadership transfer.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@mmaslankaprv (Member, Author)

/ci-repeat 1

@mmaslankaprv (Member, Author)

ci failure: #19012

@mmaslankaprv mmaslankaprv merged commit 1aed2fd into redpanda-data:dev Jun 28, 2024
15 of 18 checks passed
@mmaslankaprv mmaslankaprv deleted the transfer-step-down branch June 28, 2024 14:54
@vbotbuildovich (Collaborator)

/backport v24.1.x

@vbotbuildovich (Collaborator)

/backport v23.3.x

@vbotbuildovich (Collaborator)

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-19966-v24.1.x-655 remotes/upstream/v24.1.x
git cherry-pick -x 8f4db30afde5beed6c5ca8f2ac979c296972061c 9f7a5085d1d1161d7384acc76e2064f47f898a6a 9c581091ea38afcfa3461d2f1d1524b054ec8269

Workflow run logs.

@vbotbuildovich (Collaborator)

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-19966-v23.3.x-253 remotes/upstream/v23.3.x
git cherry-pick -x 8f4db30afde5beed6c5ca8f2ac979c296972061c 9f7a5085d1d1161d7384acc76e2064f47f898a6a 9c581091ea38afcfa3461d2f1d1524b054ec8269

Workflow run logs.
