[DNM] storage: respond to Raft proposals after entries commit, not execute #18710

Conversation

nvanbenschoten (Member)

This change addresses the first optimization discussed in #17500.

The change seems to work and provides a modest performance boost.
Unfortunately, I don't think we'll want to consider merging it at
the moment. The problem is that while it is technically safe to
respond to clients before performing the Raft command
application, doing so is a nightmare for testing. Pretty much every
test in the `storage` package expects to be able to perform an
operation and then "reach beneath raft" immediately to operate
on the result. This can range from inspecting Raft entries to
working on the most up-to-date `Replica` state.

To support this change, all of these tests would need to be
updated to handle the now-asynchronous operations performed
in `handleEvalResultRaftMuLocked`. I addressed this by adding
a testing knob called `DisableRaftRespBeforeApplication` in
this change. The problem is that I don't feel very comfortable
with it because we basically need to use it for all tests
(indirectly through `multiTestContext` and `LocalTestCluster`),
which means that we probably aren't testing this optimization
thoroughly. We could disable the optimization at a finer
granularity, but this would become a serious issue for
maintainability and I'm not sure it would be worth it.

Perhaps there's some middle ground: returning to the client
after performing in-memory state updates but before performing
persistent state updates? Something like calling:

  1. `handleEvalResultRaftMuLocked`
  2. `maybeRespondToClient`
  3. `applyRaftCommand`

This would solve a lot of the testing issues present here without
the need to use the `DisableRaftRespBeforeApplication` knob, but
I'm almost certain that wouldn't be safe to do.

I think #15648 will run into a similar issue to this. We'll either
need to block clients while we combine Raft batches or we'll need
to update tests which expect a client response to be an indication
that the command has already been applied in all cases. Things
might not be as bad in that case though because less is being
done asynchronously.

nvanbenschoten requested review from a team · September 23, 2017 04:11
@petermattis (Collaborator)

I ran a few quick tests using this PR and it showed a 1-2% boost in throughput. What are you seeing in your testing?

@nvanbenschoten (Member, Author)

Depending on the workload, anywhere from a 1-10% decrease in average latency. I was primarily benchmarking with KV, where the larger the block size, the larger the latency improvement became. This is in line with my expectation, because the larger the block size, the more work we're avoiding on each command before returning to clients.

I'm actually a little surprised you saw any movement in throughput though as we're still doing the same amount of work overall. That must mean that write-batch application is not the bottleneck of whatever benchmark you were running.

@petermattis (Collaborator)

I was running `kv --read-percent 0` with the default value size (1 byte). The throughput change might have just been noise. I didn't notice any change in avg, p50, p95, or p99 latency.

nvanbenschoten deleted the nvanbenschoten/earlyResp branch · August 9, 2018 21:11
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 18, 2019
…apply

Informs cockroachdb#17500.

This is a partial revival of cockroachdb#18710 and a culmination of more recent
thinking in cockroachdb#17500 (comment).

The change adjusts the Raft processing loop so that it acknowledges the success
of raft entries as soon as it learns that they have been durably committed to
the raft log instead of after they have been applied to the proposer replica's
replicated state machine. This not only pulls the application latency out of the
hot path for Raft proposals, but it also pulls the next raft ready iteration's
write to its Raft log (with the associated fsync) out of the hot path for Raft
proposals.

This is safe because a proposal through raft is known to have succeeded as
soon as it is replicated to a quorum of replicas (i.e. has committed in the
raft log). The proposal does not need to wait for its effects to be applied
in order to know whether its changes will succeed or fail. The raft log is
the provider of atomicity and durability for replicated writes, not (ignoring
log truncation) the replicated state machine itself, so a client can be
confident in the result of a write as soon as the raft log confirms that it
has succeeded.

However, there are a few complications to acknowledging the success of a
proposal at this stage:

1. Committing an entry in the raft log and having the command in that entry
   succeed are similar but not equivalent concepts. Even if the entry succeeds
   in achieving durability by replicating to a quorum of replicas, its command
   may still be rejected "beneath raft". This means that a (deterministic)
   check after replication decides that the command will not be applied to the
   replicated state machine. In that case, the client waiting on the result of
   the command should not be informed of its success. Luckily, this check is
   cheap to perform so we can do it here and when applying the command. See
   `Replica.shouldApplyCommand`.

2. Some commands perform non-trivial work such as updating Replica configuration
   state or performing Range splits. In those cases, it's likely that the client
   is interested in not only knowing whether it has succeeded in sequencing the
   change in the raft log, but also in knowing when the change has gone into
   effect. There's currently no exposed hook to ask for an acknowledgement only
   after a command has been applied, so for simplicity the current implementation
   only ever acks transactional writes before they have gone into effect. All
   other commands wait until they have been applied to ack their client.

3. Even though we can determine whether a command has succeeded without applying
   it, the effect of the command will not be visible to conflicting commands until
   it is applied. Because of this, the client can be informed of the success of
   a write at this point, but we cannot release that write's latches until the
   write has applied. See `ProposalData.signalProposalResult`/`finishApplication`.

### Benchmarks

The change appears to result in an **8-10%** improvement to throughput and a
**6-10%** reduction in p50 latency across the board on kv0. I ran a series of
tests with different node sizes and different workload concurrencies and the
win seemed pretty stable. This was also true regardless of whether the writes
were to a single Raft group or a large number of Raft groups.

```
name                           old ops/sec  new ops/sec  delta
kv0/cores=16/nodes=3/conc=32    24.1k ± 0%   26.1k ± 1%   +8.35%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=48    30.4k ± 1%   32.8k ± 1%   +8.02%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=64    34.6k ± 1%   37.6k ± 0%   +8.79%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=72    46.6k ± 1%   50.8k ± 0%   +8.99%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=108   58.8k ± 1%   64.0k ± 1%   +8.99%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=144   68.1k ± 1%   74.5k ± 1%   +9.45%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=144   55.8k ± 1%   59.7k ± 2%   +7.12%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=216   64.4k ± 4%   68.1k ± 4%   +5.65%  (p=0.016 n=5+5)
kv0/cores=72/nodes=3/conc=288   68.8k ± 2%   74.5k ± 3%   +8.39%  (p=0.008 n=5+5)

name                           old p50(ms)  new p50(ms)  delta
kv0/cores=16/nodes=3/conc=32     1.30 ± 0%    1.20 ± 0%   -7.69%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=48     1.50 ± 0%    1.40 ± 0%   -6.67%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=64     1.70 ± 0%    1.60 ± 0%   -5.88%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=72     1.40 ± 0%    1.30 ± 0%   -7.14%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=108    1.60 ± 0%    1.50 ± 0%   -6.25%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=144    1.84 ± 3%    1.70 ± 0%   -7.61%  (p=0.000 n=5+4)
kv0/cores=72/nodes=3/conc=144    2.00 ± 0%    1.80 ± 0%  -10.00%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=216    2.46 ± 2%    2.20 ± 0%  -10.57%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=288    2.80 ± 0%    2.60 ± 0%   -7.14%  (p=0.079 n=4+5)

name                           old p99(ms)  new p99(ms)  delta
kv0/cores=16/nodes=3/conc=32     3.50 ± 0%    3.50 ± 0%     ~     (all equal)
kv0/cores=16/nodes=3/conc=48     4.70 ± 0%    4.58 ± 3%     ~     (p=0.167 n=5+5)
kv0/cores=16/nodes=3/conc=64     5.50 ± 0%    5.20 ± 0%   -5.45%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=72     5.00 ± 0%    4.70 ± 0%   -6.00%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=108    5.80 ± 0%    5.50 ± 0%   -5.17%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=144    6.48 ± 3%    6.18 ± 3%   -4.63%  (p=0.079 n=5+5)
kv0/cores=72/nodes=3/conc=144    11.0 ± 0%    10.5 ± 0%   -4.55%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=216    13.4 ± 2%    13.2 ± 5%     ~     (p=0.683 n=5+5)
kv0/cores=72/nodes=3/conc=288    18.2 ± 4%    17.2 ± 3%   -5.70%  (p=0.079 n=5+5)
```

Release note (performance improvement): Raft entries no longer wait to
be applied to the RocksDB storage engine before signaling their success
to clients; they now only wait until they are committed in their Raft log.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 12, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 13, 2019
craig bot pushed a commit that referenced this pull request Aug 13, 2019
38954: storage: ack Raft proposals after Raft log commit, not state machine apply r=ajwerner a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>