[WIP] storage: combine Raft batches when possible #15648
Conversation
pkg/storage/replica.go, line 3968 at r1:
I'm pretty sure the test failures are due to exactly this lack of safety.
Did you try this with no pre splits? How about …
Actually, it is probably the serial evaluation of write commands that is the limiter in the …
pkg/storage/replica.go, line 2849 at r1:
It's unsafe to send any Raft messages before committing new log entries and the HardState to stable storage (any new log entries will always be accompanied by at least one outgoing message). This severely limits the scope of batches that can be safely combined. We could perhaps move this down below the processing of CommittedEntries to have more room to combine batches, but I'm not sure how much of the performance improvement will be left at that point (and delaying outgoing messages may make latency worse in other ways).
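For context, this is the ordering the etcd `raft` package documents for its Ready loop: persist HardState and new entries, then send messages, then apply committed entries. A minimal sketch in Go; the `persistHardStateAndEntries`, `sendMessages`, and `applyCommitted` helpers are hypothetical placeholders, not CockroachDB functions.

```go
package raftexample

import (
	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// handleReady sketches the ordering constraint described above: the
// HardState and any new log entries must be durably written before any of
// this Ready's outgoing messages are sent.
func handleReady(n raft.Node, rd raft.Ready) {
	// 1. Persist HardState and appended entries. This is the write that
	//    #15648 wants to combine with neighboring batches.
	persistHardStateAndEntries(rd.HardState, rd.Entries)

	// 2. Only after that write is durable is it safe to send outgoing
	//    messages (MsgApp, MsgAppResp, MsgVoteResp, ...).
	sendMessages(rd.Messages)

	// 3. Apply committed entries to the state machine.
	for _, ent := range rd.CommittedEntries {
		applyCommitted(ent)
	}

	// 4. Tell raft this Ready has been fully processed.
	n.Advance()
}

// Hypothetical placeholders for the storage, transport, and apply layers.
func persistHardStateAndEntries(hs raftpb.HardState, ents []raftpb.Entry) {}
func sendMessages(msgs []raftpb.Message)                                  {}
func applyCommitted(ent raftpb.Entry)                                     {}
```

Any batch-combining scheme has to preserve the placement of step 1 relative to step 2, which is the limitation being pointed out here.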
pkg/storage/replica.go, line 2849 at r1:
Previously, bdarnell (Ben Darnell) wrote…
Thanks for pointing this out. I'll add another TODO.
DO NOT MERGE: this change is not safe, only an experiment to show the benefit of this approach.

Provides a 21% speed-up on a 100% write workload:
kv --read-percent 0 --duration 1m --splits 100
836b778 to f5ce759
pkg/storage/replica.go, line 2849 at r1:
Previously, petermattis (Peter Mattis) wrote…
Actually, can you expand on what is causing this to be unsafe? When the leader is appending a log entry to itself and to followers, why does its write to stable storage have to occur before the followers'? I can imagine there is some subtlety with HardState; just curious to know what it is.
pkg/storage/replica.go, line 2849 at r1:
Previously, petermattis (Peter Mattis) wrote…
The concern is mainly for the followers: a follower must not send its MsgAppResp acknowledging receipt of a log entry until the entry has been persisted to stable storage. Otherwise, if the follower crashes, there may be a log entry which the leader has considered committed but which is not present on a majority of replicas. Similarly, the HardState is used to record the vote so that a node doesn't cast two conflicting votes in the same term, so the HardState must be persisted before any MsgVoteResp is sent.
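A small follow-on sketch of that follower-side rule, assuming an engine-style write batch with a sync flag and the `Ready.MustSync` hint exposed by newer versions of the raft package; the `Batch` and `Transport` interfaces are hypothetical stand-ins, not the real storage or transport types.

```go
package raftexample

import (
	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// Batch and Transport are hypothetical stand-ins for the storage engine's
// write batch and the Raft transport.
type Batch interface {
	Commit(sync bool) error
}

type Transport interface {
	Send(msg raftpb.Message)
}

// persistThenAck makes the acknowledgment rule explicit: the MsgAppResp (or
// MsgVoteResp) that leaves this node must describe state that is already on
// disk, so the batch commit happens strictly before any send.
func persistThenAck(rd raft.Ready, batch Batch, transport Transport) error {
	// HardState (term/vote/commit) and appended entries are assumed to have
	// been staged in the batch already. rd.MustSync reports whether they
	// changed and therefore require an fsync.
	if err := batch.Commit(rd.MustSync); err != nil {
		return err
	}
	// Only now may the acknowledgments leave this node.
	for _, m := range rd.Messages {
		transport.Send(m)
	}
	return nil
}
```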
@petermattis Could you expand on why you abandoned this? Was the issue raised by @bdarnell a hard blocker?
@nvanbenschoten I think the concern was about combining the batch containing changes to Raft state with the batches containing the Raft commands. If we're only combining the batches containing Raft commands, I think this approach is safe.
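To make the "combine only the command batches" idea concrete, here is a minimal sketch under assumed interfaces; `ApplyBatchRepr` and `Commit` mirror the general shape of engine write-batch APIs but are hypothetical stand-ins, not a reference to the actual storage/engine code.

```go
package raftexample

// Engine and CommandBatch are hypothetical stand-ins for the storage engine
// and for a write batch that can absorb serialized batch representations.
type CommandBatch interface {
	// ApplyBatchRepr folds a serialized write batch into this batch.
	ApplyBatchRepr(repr []byte) error
	// Commit writes the batch to the engine, optionally fsyncing.
	Commit(sync bool) error
}

type Engine interface {
	NewBatch() CommandBatch
}

// combineAndApply applies a run of committed Raft commands with a single
// engine commit (and at most one fsync) instead of one commit per command.
func combineAndApply(eng Engine, commandReprs [][]byte, sync bool) error {
	combined := eng.NewBatch()
	for _, repr := range commandReprs {
		if err := combined.ApplyBatchRepr(repr); err != nil {
			return err
		}
	}
	return combined.Commit(sync)
}
```

The safety argument above is that none of these command batches touch the HardState or the Raft log itself, so grouping their commits doesn't interact with the message-ordering constraint discussed earlier.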
…ecute

This change addresses the first optimization discussed in cockroachdb#17500. The change seems to work and provides a modest performance boost. Unfortunately, I don't think we'll want to consider merging it at the moment.

The problem is that while it is technically safe to respond to clients before performing the Raft command application, doing so is a nightmare for testing. Pretty much every test in the `storage` package expects to be able to perform an operation and then "reach beneath raft" immediately to operate on the result. This can range from inspecting Raft entries to working on the most up-to-date `Replica` state. To support this change, all of these tests would need to be updated to handle the now asynchronous operations performed in `handleEvalResultRaftMuLocked`.

I addressed this by adding a testing knob called `DisableRaftRespBeforeApplication` in this change. The problem is that I don't feel very comfortable with it, because we basically need to use it for all tests (indirectly through `multiTestContext` and `LocalTestCluster`), which means that we probably aren't testing this optimization thoroughly. We could disable the optimization at a finer granularity, but this would become a serious issue for maintainability and I'm not sure it would be worth it.

Perhaps there's some middle ground: returning to the client after performing the in-memory state updates but before performing the persistent state updates? Something like calling (see the sketch after this message):

1. `handleEvalResultRaftMuLocked`
2. `maybeRespondToClient`
3. `applyRaftCommand`

This would solve a lot of the testing issues present here without the need to use the `DisableRaftRespBeforeApplication` knob, but I'm almost certain that wouldn't be safe to do.

I think cockroachdb#15648 will run into a similar issue to this. We'll either need to block clients while we combine Raft batches, or we'll need to update tests which expect a client response to be an indication that the command has already been applied in all cases. Things might not be as bad in that case, though, because less is being done asynchronously.
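Purely to make the floated middle ground concrete (the commit message itself doubts its safety), a sketch of the proposed ordering; the functions are hypothetical placeholders standing in for the `Replica` methods named above, not the actual implementation.

```go
package raftexample

// Command is a hypothetical stand-in for one committed Raft command.
type Command struct{}

// processCommitted illustrates the ordering under discussion: in-memory
// side effects, then the client acknowledgment, then the persistent apply.
// The commit message above suspects acknowledging before the persistent
// apply is not safe; this sketch only shows the proposed sequence.
func processCommitted(cmd Command) {
	handleInMemorySideEffects(cmd) // 1. in-memory state updates
	respondToClient(cmd)           // 2. early client acknowledgment
	applyToEngine(cmd)             // 3. persistent state update
}

// Hypothetical placeholders.
func handleInMemorySideEffects(cmd Command) {}
func respondToClient(cmd Command)           {}
func applyToEngine(cmd Command)             {}
```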