
[WIP] storage: combine Raft batches when possible #15648

Closed

Conversation

petermattis
Collaborator

DO NOT MERGE: this change is not safe, only an experiment to show the
benefit of this approach.

Provides a 21% speed-up on a 100% write workload:

kv --read-percent 0 --duration 1m --splits 100

@cockroach-teamcity
Member

This change is Reviewable

@petermattis added this to the 1.1 milestone May 3, 2017
@petermattis
Collaborator Author

Sprinkling maybeCommitRaftBatch() calls before any side effect that wants to see the committed state (e.g. before gossiping the system config) is fragile. I'm open to suggestions for a cleaner approach.
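
(For context, a minimal sketch of the shape being described; engineBatch, replicaSketch, and the maybeGossipSystemConfig body below are hypothetical stand-ins, not the actual pkg/storage/replica.go code.)

```go
// Illustrative sketch only: engineBatch and replicaSketch are hypothetical
// stand-ins for the engine write batch and the Replica state this PR threads
// through the Raft handling code.
package storage

type engineBatch interface {
	Commit() error
}

type replicaSketch struct {
	// pendingBatch accumulates the writes of several Raft commands so they
	// can be committed to the engine in one call instead of one per command.
	pendingBatch engineBatch
}

// maybeCommitRaftBatch flushes the accumulated batch, if any.
func (r *replicaSketch) maybeCommitRaftBatch() error {
	if r.pendingBatch == nil {
		return nil
	}
	err := r.pendingBatch.Commit()
	r.pendingBatch = nil
	return err
}

// maybeGossipSystemConfig is an example of a side effect that reads engine
// state: it has to remember to commit the pending batch first, which is the
// fragility called out above.
func (r *replicaSketch) maybeGossipSystemConfig() error {
	if err := r.maybeCommitRaftBatch(); err != nil {
		return err
	}
	// ... read the system config span from the engine and gossip it ...
	return nil
}
```

The fragility is that every such call site has to be remembered by hand; a forgotten call means the side effect observes stale engine state.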


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion.


pkg/storage/replica.go, line 3968 at r1 (raw file):

	// (processRaftCommand) will be finishing the pending command. We could fix
	// this by only finishing the pending commands once the batch is finally
	// committed.

I'm pretty sure the test failures are due to exactly this lack of safety.



@knz removed the in progress label May 3, 2017
@spencerkimball
Member

Did you try this with no pre-splits? How about --splits=10?

@petermattis
Collaborator Author

--splits=0 shows a 29% improvement. --splits=10 shows a 42% improvement. I'm mildly surprised --splits=0 didn't show a bigger improvement. It is likely that the range actually split during the test.

@petermattis
Collaborator Author

It is likely that the range actually split during the test.

Actually, it is probably the serial evaluation of write commands that is the limiter in the --splits=0 case.

@bdarnell
Contributor

bdarnell commented May 8, 2017

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions.


pkg/storage/replica.go, line 2849 at r1 (raw file):

	for _, msg := range rd.Messages {
		r.sendRaftMessage(ctx, msg)

It's unsafe to send any raft messages before committing new log entries and HardState to stable storage (any new log entries will always be accompanied by at least one outgoing message). This severely limits the scope of batches that can be safely combined. We could perhaps move this down below the processing of CommittedEntries to have more room to combine batches, but I'm not sure how much of the performance improvement will be left at that point (and delaying outgoing messages may make latency worse in other ways).
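
(For readers without the etcd/raft contract in their head: everything staged in a Ready is expected to be durable before that Ready's messages go out. Below is a minimal sketch of the ordering; stableStorage, transport, and stateMachine are hypothetical interfaces, and the import paths are the present-day standalone ones rather than the 2017 in-tree paths.)

```go
// Sketch of the standard Ready-handling order described above.
package raftsketch

import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

type stableStorage interface {
	// AppendAndSync persists new log entries and the HardState (fsync).
	AppendAndSync(hs raftpb.HardState, ents []raftpb.Entry) error
}

type transport interface {
	Send(msgs []raftpb.Message)
}

type stateMachine interface {
	Apply(ents []raftpb.Entry) error
}

func handleReady(rd raft.Ready, st stableStorage, tr transport, sm stateMachine) error {
	// 1. Make new entries and HardState durable first.
	if len(rd.Entries) > 0 || !raft.IsEmptyHardState(rd.HardState) {
		if err := st.AppendAndSync(rd.HardState, rd.Entries); err != nil {
			return err
		}
	}
	// 2. Only now send the messages produced by this Ready; a MsgAppResp or
	//    MsgVoteResp implicitly promises the durability established in step 1.
	tr.Send(rd.Messages)
	// 3. Apply committed entries. This engine write is the part that could
	//    plausibly be combined with later batches, which is the window this
	//    PR is trying to widen.
	return sm.Apply(rd.CommittedEntries)
}
```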



@petermattis
Collaborator Author

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions.


pkg/storage/replica.go, line 2849 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's unsafe to send any raft messages before committing new log entries and HardState to stable storage (any new log entries will always be accompanied by at least one outgoing message). This severely limits the scope of batches that can be safely combined. We could perhaps move this down below the processing of CommittedEntries to have more room to combine batches, but I'm not sure how much of the performance improvement will be left at that point (and delaying outgoing messages may make latency worse in other ways)

Thanks for pointing this out. I'll add another TODO.



@petermattis
Collaborator Author

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


pkg/storage/replica.go, line 2849 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Thanks for pointing this out. I'll add another TODO.

Actually, can you expand on what is causing this to be unsafe? When the leader is appending a log entry to itself and to followers, why does its own write to stable storage have to occur before it sends the entry to the followers? I can imagine there is some subtlety with HardState; just curious to know what it is.



@bdarnell
Contributor

bdarnell commented May 9, 2017

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


pkg/storage/replica.go, line 2849 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Actually, can you expand on what is causing this to be unsafe? When the leader is appending a log entry to itself and to followers, why does its write to stable storage have to occur before followers? I can imagine there is some subtlety with HardState, just curious to know what it is.

The concern is mainly for the followers: the follower must not send its MsgAppResp acknowledging receipt of the log entry until it has been persisted to stable storage. Otherwise if the follower crashes, there may be a log entry which the leader has considered committed but is not present on a majority of replicas. Similarly, the HardState is used to record the vote so a node doesn't cast two conflicting votes in the same term, so the HardState must be persisted before any MsgVoteResp is sent.
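
(As an illustration of which messages carry that durability promise, here is a hedged sketch. The helper names are hypothetical, and whether the remaining messages can safely be sent before the sync is a separate question this thread doesn't settle.)

```go
// Hedged sketch: classify which outgoing messages acknowledge durable state
// and therefore must not be sent before the corresponding fsync completes.
package raftsketch

import "go.etcd.io/etcd/raft/v3/raftpb"

func acksDurableState(m raftpb.Message) bool {
	switch m.Type {
	case raftpb.MsgAppResp: // acknowledges appended (persisted) log entries
		return true
	case raftpb.MsgVoteResp: // acknowledges a vote recorded in HardState
		return true
	default:
		return false
	}
}

// splitMessages partitions a Ready's messages into those that could go out
// earlier and those that must wait for the entries/HardState sync.
func splitMessages(msgs []raftpb.Message) (early, afterSync []raftpb.Message) {
	for _, m := range msgs {
		if acksDurableState(m) {
			afterSync = append(afterSync, m)
		} else {
			early = append(early, m)
		}
	}
	return early, afterSync
}
```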



@nvanbenschoten
Member

@petermattis Could you expand on why you abandoned this? Was the issue raised by @bdarnell a hard blocker?

@petermattis
Collaborator Author

@nvanbenschoten I think the concern was about combining the batch containing changes to Raft state with the batches containing the Raft commands. If we're only combining the batches containing Raft commands, I think this approach is safe.
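
(A rough sketch of that narrower variant, using hypothetical engine/batch interfaces: the log append and HardState update keep their own synced batch elsewhere, while the application of successive committed commands is staged into a single batch and committed once.)

```go
// Illustrative sketch only; writeBatch and engineLike are hypothetical
// stand-ins for the storage engine's batch API.
package storage

type writeBatch interface {
	Put(key, value []byte) error
	Commit(sync bool) error
}

type engineLike interface {
	NewBatch() writeBatch
}

// applyCommittedCommands stages the writes of each committed Raft command in
// one shared batch and commits it once, instead of one engine commit per
// command. The Raft log append and HardState update are handled separately in
// their own synced batch, which is what keeps this variant safe.
func applyCommittedCommands(eng engineLike, cmds [][2][]byte) error {
	batch := eng.NewBatch()
	for _, kv := range cmds {
		// In the real code each command's evaluated effects would be staged
		// here; a bare key/value Put keeps the sketch self-contained.
		if err := batch.Put(kv[0], kv[1]); err != nil {
			return err
		}
	}
	// A single commit; sync=false is plausible here since committed entries
	// can be reapplied from the durable Raft log after a crash.
	return batch.Commit(false)
}
```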

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 23, 2017
…ecute

This change addresses the first optimization discussed in cockroachdb#17500.

The change seems to work and provides a modest performance boost.
Unfortunately, I don't think we'll want to consider merging it at
the moment. The problem is that while it is technically safe to
respond to clients before performing the Raft command application,
doing so is a nightmare for testing. Pretty much every
test in the `storage` package expects to be able to perform an
operation and then "reach beneath raft" immediately to operate
on the result. This can range from inspecting Raft entries to
working on the most up-to-date `Replica` state.

To support this change, all of these tests would need to be
updated to handle the now asynchronous operations performed
in `handleEvalResultRaftMuLocked`. I addressed this by adding
a testing knob called `DisableRaftRespBeforeApplication` in
this change. The problem is that I don't feel very comfortable
with it because we basically need to use it for all tests
(indirectly through `multiTestContext` and `LocalTestCluster`)
which means that we probably aren't testing this optimization
thoroughly. We could disable the optimization on a finer
granularity but this would become a serious issue for
maintainability and I'm not sure it would be worth it.

Perhaps there's some middle ground between returning to the
client after performing in-memory state updates but before
performing persistent state updates? Something like calling:
1. `handleEvalResultRaftMuLocked`
2. `maybeRespondToClient`
3. `applyRaftCommand`

This would solve a lot of the testing issues present here without
the need to use the `DisableRaftRespBeforeApplication` knob, but
I'm almost certain that wouldn't be safe to do.

I think cockroachdb#15648 will run into a similar issue to this. We'll either
need to block clients while we combine Raft batches or we'll need
to update tests which expect a client response to be an indication
that the command has already been applied in all cases. Things
might not be as bad in that case though because less is being
done asynchronously.
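
(To make the knob described above concrete, a hedged sketch follows. The method names come from the commit message, but the struct, signatures, exact call ordering, and knob wiring are simplified guesses rather than the real pkg/storage code.)

```go
// Hedged sketch of the testing knob discussed in the commit message.
package storage

type StoreTestingKnobs struct {
	// DisableRaftRespBeforeApplication restores the pre-change behavior in
	// which the client is only acknowledged after the command has applied,
	// so tests can "reach beneath raft" right after getting a response.
	DisableRaftRespBeforeApplication bool
}

type replicaStub struct {
	knobs StoreTestingKnobs
}

func (r *replicaStub) processRaftCommandSketch() error {
	if !r.knobs.DisableRaftRespBeforeApplication {
		// The optimization: acknowledge the client before application.
		r.maybeRespondToClient()
	}
	if err := r.applyRaftCommand(); err != nil {
		return err
	}
	r.handleEvalResultRaftMuLocked()
	if r.knobs.DisableRaftRespBeforeApplication {
		r.maybeRespondToClient()
	}
	return nil
}

// Stubs so the sketch is self-contained.
func (r *replicaStub) maybeRespondToClient()         {}
func (r *replicaStub) applyRaftCommand() error       { return nil }
func (r *replicaStub) handleEvalResultRaftMuLocked() {}
```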
@petermattis deleted the pmattis/raft-batch branch December 13, 2017 14:47