
[WIP] storage: combine Raft batches when possible #15648

Closed

Conversation

petermattis
Collaborator

DO NOT MERGE: this change is not safe, only an experiment to show the
benefit of this approach.

Provides a 21% speed-up on a 100% write workload:

kv --read-percent 0 --duration 1m --splits 100

@cockroach-teamcity
Member

This change is Reviewable

@petermattis added this to the 1.1 milestone May 3, 2017
@petermattis
Collaborator Author

Sprinkling maybeCommitRaftBatch() calls before any side effect that wants to see the committed state (e.g. before gossiping the system config) is fragile. I'm open to suggestions for a cleaner approach.
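
(For context, a minimal sketch of the shape being described; engineBatch, replicaSketch, and the maybeGossipSystemConfig body below are hypothetical stand-ins, not the actual pkg/storage/replica.go code.)

```go
// Illustrative sketch only: engineBatch and replicaSketch are hypothetical
// stand-ins for the engine write batch and the Replica state this PR threads
// through the Raft handling code.
package storage

type engineBatch interface {
	Commit() error
}

type replicaSketch struct {
	// pendingBatch accumulates the writes of several Raft commands so they
	// can be committed to the engine in one call instead of one per command.
	pendingBatch engineBatch
}

// maybeCommitRaftBatch flushes the accumulated batch, if any.
func (r *replicaSketch) maybeCommitRaftBatch() error {
	if r.pendingBatch == nil {
		return nil
	}
	err := r.pendingBatch.Commit()
	r.pendingBatch = nil
	return err
}

// maybeGossipSystemConfig is an example of a side effect that reads engine
// state: it has to remember to commit the pending batch first, which is the
// fragility called out above.
func (r *replicaSketch) maybeGossipSystemConfig() error {
	if err := r.maybeCommitRaftBatch(); err != nil {
		return err
	}
	// ... read the system config span from the engine and gossip it ...
	return nil
}
```

The fragility is that every such call site has to be remembered by hand; a forgotten call means the side effect observes stale engine state.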


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion.


pkg/storage/replica.go, line 3968 at r1 (raw file):

	// (processRaftCommand) will be finishing the pending command. We could fix
	// this by only finishing the pending commands once the batch is finally
	// committed.

I'm pretty sure the test failures are due to exactly this lack of safety.



@knz removed the in progress label May 3, 2017
@spencerkimball
Member

Did you try this with no pre-splits? How about --splits=10?

@petermattis
Collaborator Author

--splits=0 shows a 29% improvement. --splits=10 shows a 42% improvement. I'm mildly surprised --splits=0 didn't show a bigger improvement. It is likely that the range actually split during the test.

@petermattis
Collaborator Author

It is likely that the range actually split during the test.

Actually, it is probably the serial evaluation of write commands that is the limiter in the --splits=0 case.

@bdarnell
Contributor

bdarnell commented May 8, 2017

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions.


pkg/storage/replica.go, line 2849 at r1 (raw file):

	for _, msg := range rd.Messages {
		r.sendRaftMessage(ctx, msg)

It's unsafe to send any raft messages before committing new log entries and HardState to stable storage (any new log entries will always be accompanied by at least one outgoing message). This severely limits the scope of batches that can be safely combined. We could perhaps move this down below the processing of CommittedEntries to have more room to combine batches, but I'm not sure how much of the performance improvement will be left at that point (and delaying outgoing messages may make latency worse in other ways).
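
(For readers without the etcd/raft contract in their head: everything staged in a Ready is expected to be durable before that Ready's messages go out. Below is a minimal sketch of the ordering; stableStorage, transport, and stateMachine are hypothetical interfaces, and the import paths are the present-day standalone ones rather than the 2017 in-tree paths.)

```go
// Sketch of the standard Ready-handling order described above.
package raftsketch

import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

type stableStorage interface {
	// AppendAndSync persists new log entries and the HardState (fsync).
	AppendAndSync(hs raftpb.HardState, ents []raftpb.Entry) error
}

type transport interface {
	Send(msgs []raftpb.Message)
}

type stateMachine interface {
	Apply(ents []raftpb.Entry) error
}

func handleReady(rd raft.Ready, st stableStorage, tr transport, sm stateMachine) error {
	// 1. Make new entries and HardState durable first.
	if len(rd.Entries) > 0 || !raft.IsEmptyHardState(rd.HardState) {
		if err := st.AppendAndSync(rd.HardState, rd.Entries); err != nil {
			return err
		}
	}
	// 2. Only now send the messages produced by this Ready; a MsgAppResp or
	//    MsgVoteResp implicitly promises the durability established in step 1.
	tr.Send(rd.Messages)
	// 3. Apply committed entries. This engine write is the part that could
	//    plausibly be combined with later batches, which is the window this
	//    PR is trying to widen.
	return sm.Apply(rd.CommittedEntries)
}
```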



@petermattis
Collaborator Author

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions.


pkg/storage/replica.go, line 2849 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's unsafe to send any raft messages before committing new log entries and HardState to stable storage (any new log entries will always be accompanied by at least one outgoing message). This severely limits the scope of batches that can be safely combined. We could perhaps move this down below the processing of CommittedEntries to have more room to combine batches, but I'm not sure how much of the performance improvement will be left at that point (and delaying outgoing messages may make latency worse in other ways)

Thanks for pointing this out. I'll add another TODO.



@petermattis
Collaborator Author

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


pkg/storage/replica.go, line 2849 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Thanks for pointing this out. I'll add another TODO.

Actually, can you expand on what is causing this to be unsafe? When the leader is appending a log entry to itself and to followers, why does its own write to stable storage have to occur before it sends the entry to the followers? I can imagine there is some subtlety with HardState; just curious to know what it is.



@bdarnell
Contributor

bdarnell commented May 9, 2017

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


pkg/storage/replica.go, line 2849 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Actually, can you expand on what is causing this to be unsafe? When the leader is appending a log entry to itself and to followers, why does its write to stable storage have to occur before followers? I can imagine there is some subtlety with HardState, just curious to know what it is.

The concern is mainly for the followers: the follower must not send its MsgAppResp acknowledging receipt of the log entry until it has been persisted to stable storage. Otherwise if the follower crashes, there may be a log entry which the leader has considered committed but is not present on a majority of replicas. Similarly, the HardState is used to record the vote so a node doesn't cast two conflicting votes in the same term, so the HardState must be persisted before any MsgVoteResp is sent.
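
(As an illustration of which messages carry that durability promise, here is a hedged sketch. The helper names are hypothetical, and whether the remaining messages can safely be sent before the sync is a separate question this thread doesn't settle.)

```go
// Hedged sketch: classify which outgoing messages acknowledge durable state
// and therefore must not be sent before the corresponding fsync completes.
package raftsketch

import "go.etcd.io/etcd/raft/v3/raftpb"

func acksDurableState(m raftpb.Message) bool {
	switch m.Type {
	case raftpb.MsgAppResp: // acknowledges appended (persisted) log entries
		return true
	case raftpb.MsgVoteResp: // acknowledges a vote recorded in HardState
		return true
	default:
		return false
	}
}

// splitMessages partitions a Ready's messages into those that could go out
// earlier and those that must wait for the entries/HardState sync.
func splitMessages(msgs []raftpb.Message) (early, afterSync []raftpb.Message) {
	for _, m := range msgs {
		if acksDurableState(m) {
			afterSync = append(afterSync, m)
		} else {
			early = append(early, m)
		}
	}
	return early, afterSync
}
```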



@nvanbenschoten
Member

@petermattis Could you expand on why you abandoned this? Was the issue raised by @bdarnell a hard blocker?

@petermattis
Collaborator Author

@nvanbenschoten I think the concern was about combining the batch containing changes to Raft state with the batches containing the Raft commands. If we're only combining the batches containing Raft commands, I think this approach is safe.
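
(A rough sketch of that narrower variant, using hypothetical engine/batch interfaces: the log append and HardState update keep their own synced batch elsewhere, while the application of successive committed commands is staged into a single batch and committed once.)

```go
// Illustrative sketch only; writeBatch and engineLike are hypothetical
// stand-ins for the storage engine's batch API.
package storage

type writeBatch interface {
	Put(key, value []byte) error
	Commit(sync bool) error
}

type engineLike interface {
	NewBatch() writeBatch
}

// applyCommittedCommands stages the writes of each committed Raft command in
// one shared batch and commits it once, instead of one engine commit per
// command. The Raft log append and HardState update are handled separately in
// their own synced batch, which is what keeps this variant safe.
func applyCommittedCommands(eng engineLike, cmds [][2][]byte) error {
	batch := eng.NewBatch()
	for _, kv := range cmds {
		// In the real code each command's evaluated effects would be staged
		// here; a bare key/value Put keeps the sketch self-contained.
		if err := batch.Put(kv[0], kv[1]); err != nil {
			return err
		}
	}
	// A single commit; sync=false is plausible here since committed entries
	// can be reapplied from the durable Raft log after a crash.
	return batch.Commit(false)
}
```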

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 23, 2017
…ecute

This change addresses the first optimization discussed in cockroachdb#17500.

The change seems to work and provides a modest performance boost.
Unfortunately, I don't think we'll want to consider merging it at
the moment. The problem is that while it is technically safe to
respond to clients before performing the Raft command application,
doing so is a nightmare for testing. Pretty much every
test in the `storage` package expects to be able to perform an
operation and then "reach beneath raft" immediately to operate
on the result. This can range from inspecting Raft entries to
working on the most up-to-date `Replica` state.

To support this change, all of these tests would need to be
updated to handle the now asynchronous operations performed
in `handleEvalResultRaftMuLocked`. I addressed this by adding
a testing knob called `DisableRaftRespBeforeApplication` in
this change. The problem is that I don't feel very comfortable
with it because we basically need to use it for all tests
(indirectly through `multiTestContext` and `LocalTestCluster`)
which means that we probably aren't testing this optimization
thoroughly. We could disable the optimization on a finer
granularity but this would become a serious issue for
maintainability and I'm not sure it would be worth it.

Perhaps there's some middle ground between returning to the
client after performing in-memory state updates but before
performing persistent state updates? Something like calling:
1. `handleEvalResultRaftMuLocked`
2. `maybeRespondToClient`
3. `applyRaftCommand`

This would solve a lot of the testing issues present here without
the need to use the `DisableRaftRespBeforeApplication` knob, but
I'm almost certain that wouldn't be safe to do.

I think cockroachdb#15648 will run into a similar issue to this. We'll either
need to block clients while we combine Raft batches or we'll need
to update tests which expect a client response to be an indication
that the command has already been applied in all cases. Things
might not be as bad in that case though because less is being
done asynchronously.
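
(To make the knob described above concrete, a hedged sketch follows. The method names come from the commit message, but the struct, signatures, exact call ordering, and knob wiring are simplified guesses rather than the real pkg/storage code.)

```go
// Hedged sketch of the testing knob discussed in the commit message.
package storage

type StoreTestingKnobs struct {
	// DisableRaftRespBeforeApplication restores the pre-change behavior in
	// which the client is only acknowledged after the command has applied,
	// so tests can "reach beneath raft" right after getting a response.
	DisableRaftRespBeforeApplication bool
}

type replicaStub struct {
	knobs StoreTestingKnobs
}

func (r *replicaStub) processRaftCommandSketch() error {
	if !r.knobs.DisableRaftRespBeforeApplication {
		// The optimization: acknowledge the client before application.
		r.maybeRespondToClient()
	}
	if err := r.applyRaftCommand(); err != nil {
		return err
	}
	r.handleEvalResultRaftMuLocked()
	if r.knobs.DisableRaftRespBeforeApplication {
		r.maybeRespondToClient()
	}
	return nil
}

// Stubs so the sketch is self-contained.
func (r *replicaStub) maybeRespondToClient()         {}
func (r *replicaStub) applyRaftCommand() error       { return nil }
func (r *replicaStub) handleEvalResultRaftMuLocked() {}
```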
@petermattis deleted the pmattis/raft-batch branch December 13, 2017 14:47