Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guard from concurrent elections #21361

Merged

Conversation

bashtanov
Copy link
Contributor

@bashtanov bashtanov commented Jul 12, 2024

At the moment nothing prevents a raft node from being a candidate in two elections at the same time, two vote_stms actively working. This may lead to a nasty race condition where one of them has succeeded (becoming a leader, last heartbeat goes max), and then the other one fails (role becomes follower, no changes to last heartbeat). As a result, a node never becomes a candidate, as it's not a leader but has max last heartbeat.

This PR is mostly should guard against concurrent elections with the same candidate. Some minor fixes and refactorings included.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@bashtanov bashtanov force-pushed the guard-from-concurrent-elections branch from 006353b to 62aa52f Compare July 12, 2024 16:17
@bashtanov
Copy link
Contributor Author

/dt

1 similar comment
@bashtanov
Copy link
Contributor Author

/dt

@vbotbuildovich
Copy link
Collaborator

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51446#0190a954-a6c1-44e8-bbf6-044ccd4e04d1:
pandatriage cache was not found

@bashtanov
Copy link
Contributor Author

DT failure: #21376

@bashtanov bashtanov force-pushed the guard-from-concurrent-elections branch from 62aa52f to a4e6535 Compare July 15, 2024 14:59
@bashtanov bashtanov marked this pull request as ready for review July 15, 2024 14:59
@bashtanov bashtanov marked this pull request as draft July 15, 2024 15:00
@bashtanov bashtanov marked this pull request as ready for review July 15, 2024 15:11
@bashtanov bashtanov force-pushed the guard-from-concurrent-elections branch from a4e6535 to deabb87 Compare July 16, 2024 16:45
@bashtanov
Copy link
Contributor Author

I removed the consensus.cc continuations refactoring, as we decided to do it differently and it doesn't really interfere with my other changes in this PR (I originally did them because I thought of touching this part of the code). Maybe next time.

src/v/raft/consensus.h Show resolved Hide resolved
@@ -479,4 +479,13 @@ ss::future<> vote_stm::self_vote() {
auto m = _replies.find(_ptr->self());
m->second.set_value(reply);
}

void vote_stm::lose() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a more descriptive name would be nicer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to lose_election, do you think it's better? (I'm not quite sure what else we can lose in a class whose purpose is to run an election)

Also if we're after naming things right I'd rename half of the vote to election to clarify whether we arrange an election or merely vote in an election already organised. Say, consensus::vote() vs vote_stm::vote confused me, as the latter runs an election, not just votes really. How conservative are we regarding existing names? Is it okay to rename things if I touch the code around them?

@bashtanov bashtanov force-pushed the guard-from-concurrent-elections branch 2 times, most recently from 81fdfbe to ef89f73 Compare July 18, 2024 13:21
@bashtanov
Copy link
Contributor Author

failure in previous build unrelated: https://redpandadata.atlassian.net/browse/CORE-5736

@bashtanov bashtanov force-pushed the guard-from-concurrent-elections branch from ef89f73 to 5e8643a Compare July 22, 2024 10:42
vstm has already been moved-from, it makes no sense to capture it here
will be used later to guard from concurrent elections
Use a lock held by critical part of the voting process,
released when decision is made
@bashtanov bashtanov force-pushed the guard-from-concurrent-elections branch from 5e8643a to 8a933e6 Compare July 22, 2024 10:42
@bashtanov bashtanov merged commit c925d0d into redpanda-data:dev Jul 22, 2024
19 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v24.1.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.3.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.2.x

@vbotbuildovich
Copy link
Collaborator

Oops! Something went wrong.

Workflow run logs.

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-21361-v23.3.x-580 remotes/upstream/v23.3.x
git cherry-pick -x cafb4458f7fe29071ddc2d4e9ffa1a4224b6cfae f8f7561a9e1739916244b98d1de51560a0171ce6 a7f1d75a91477e55522d836b8ae9e8da204992c5 97e52de555701a0ee5ce6129bca8bef82920cceb 8a933e6976fa7bc9ebb93942fcb36d59e3e46b1c

Workflow run logs.

@@ -1046,9 +1046,7 @@ void consensus::dispatch_vote(bool leadership_transfer) {
}
// background
ssx::spawn_with_gate(
_bg,
[vstm = std::move(vstm),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not just dead, but i wonder why clang-tidy didn't warn against what looks like a use-after-move (or at least a move-after-move).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://clang.llvm.org/extra/clang-tidy/checks/bugprone/use-after-move.html#use:

An exception to this are objects of type std::unique_ptr, std::shared_ptr and std::weak_ptr, which have defined move behavior (objects of these classes are guaranteed to be empty after they have been moved from). Therefore, an object of these classes will only be considered to be used if it is dereferenced, i.e. if operator*, operator-> or operator[] (in the case of std::unique_ptr<T []>) is called on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh interesting TIL

@bashtanov
Copy link
Contributor Author

/backport v24.2.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants