storage: ack Raft proposals after Raft log commit, not state machine apply #38954

Merged 1 commit on Aug 13, 2019

Commits on Aug 13, 2019

  1. storage: ack Raft proposals after Raft log commit, not state machine apply
    
    Informs cockroachdb#17500.
    
    This is a partial revival of cockroachdb#18710 and a culmination of more recent
    thinking in cockroachdb#17500 (comment).
    
    The change adjusts the Raft processing loop so that it acknowledges the success
    of raft entries as soon as it learns that they have been durably committed to
    the raft log instead of after they have been applied to the proposer replica's
    replicated state machine. This not only pulls the application latency out of the
    hot path for Raft proposals, but it also pulls the next raft ready iteration's
    write to its Raft log (with the associated fsync) out of the hot path for Raft
    proposals.
    
    This is safe because a proposal through raft is known to have succeeded as
    soon as it is replicated to a quorum of replicas (i.e. has committed in the
    raft log). The proposal does not need to wait for its effects to be applied
    in order to know whether its changes will succeed or fail. The raft log is
    the provider of atomicity and durability for replicated writes, not (ignoring
    log truncation) the replicated state machine itself, so a client can be
    confident in the result of a write as soon as the raft log confirms that it
    has succeeded.
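    The decoupling of acknowledgement from application can be sketched as below.
    This is a minimal illustration, not CockroachDB's actual types or control
    flow; `proposal`, `ackOnCommit`, and `apply` are hypothetical names.

    ```go
    package main

    import "fmt"

    // proposal models a client write waiting for acknowledgement.
    // (Illustrative only; not CockroachDB's ProposalData.)
    type proposal struct {
    	id    int
    	acked chan struct{}
    }

    // ackOnCommit is called once the entry is durably committed in the raft
    // log on a quorum. With this change, the client is signaled here, rather
    // than after apply.
    func ackOnCommit(p *proposal) {
    	close(p.acked) // client unblocks now, before apply
    }

    // apply runs later, off the proposal's hot path; the client is no longer
    // waiting on it.
    func apply(p *proposal, sm map[int]string) {
    	sm[p.id] = "applied"
    }

    func main() {
    	sm := make(map[int]string)
    	p := &proposal{id: 1, acked: make(chan struct{})}
    	ackOnCommit(p) // entry committed in the raft log: ack immediately
    	<-p.acked      // client observes success before apply
    	fmt.Println("acked before apply:", sm[p.id] == "")
    	apply(p, sm)
    	fmt.Println("applied:", sm[p.id])
    }
    ```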
    
    However, there are a few complications to acknowledging the success of a
    proposal at this stage:
    
    1. Committing an entry in the raft log and having the command in that entry
       succeed are similar but not equivalent concepts. Even if the entry succeeds
       in achieving durability by replicating to a quorum of replicas, its command
       may still be rejected "beneath raft". This means that a (deterministic)
       check after replication decides that the command will not be applied to the
       replicated state machine. In that case, the client waiting on the result of
       the command should not be informed of its success. Luckily, this check is
       cheap to perform so we can do it here and when applying the command. See
       Replica.shouldApplyCommand.
    
    2. Some commands perform non-trivial work such as updating Replica configuration
       state or performing Range splits. In those cases, it's likely that the client
       is interested in not only knowing whether it has succeeded in sequencing the
       change in the raft log, but also in knowing when the change has gone into
       effect. There's currently no exposed hook to ask for an acknowledgement only
       after a command has been applied, so for simplicity the current implementation
       only ever acks transactional writes before they have gone into effect. All
       other commands wait until they have been applied to ack their client.
    
    3. Even though we can determine whether a command has succeeded without applying
       it, the effect of the command will not be visible to conflicting commands until
       it is applied. Because of this, the client can be informed of the success of
       a write at this point, but we cannot release that write's latches until the
       write has applied. See ProposalData.signalProposalResult/finishApplication.
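    Points 1 and 2 above can be sketched as a gating predicate. The rejection
    rule shown (a lease-sequence match) and all names are hypothetical
    stand-ins for the real below-raft checks (cf. Replica.shouldApplyCommand);
    the key property is that the check is deterministic, so running it at
    ack time and again at apply time yields the same answer.

    ```go
    package main

    import "fmt"

    // cmd models a replicated command. Fields are illustrative.
    type cmd struct {
    	leaseSeq   int  // lease under which the command was proposed
    	isTxnWrite bool // only transactional writes are acked early
    }

    // shouldApply is the deterministic "beneath raft" check: it must return
    // the same result when evaluated before apply and during apply.
    func shouldApply(c cmd, currentLeaseSeq int) bool {
    	return c.leaseSeq == currentLeaseSeq
    }

    // maybeAckEarly reports whether the client can be signaled at log-commit
    // time. Commands that will be rejected, and non-transactional commands
    // whose effects the client needs to observe, wait for apply instead.
    func maybeAckEarly(c cmd, currentLeaseSeq int) bool {
    	return c.isTxnWrite && shouldApply(c, currentLeaseSeq)
    }

    func main() {
    	lease := 5
    	fmt.Println(maybeAckEarly(cmd{leaseSeq: 5, isTxnWrite: true}, lease))  // early ack
    	fmt.Println(maybeAckEarly(cmd{leaseSeq: 4, isTxnWrite: true}, lease))  // rejected beneath raft
    	fmt.Println(maybeAckEarly(cmd{leaseSeq: 5, isTxnWrite: false}, lease)) // waits for apply
    }
    ```

    Note that even when `maybeAckEarly` returns true, the write's latches are
    held until apply so that its effects become visible to conflicting
    commands atomically (point 3).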
    
    ### Benchmarks
    
    The change appears to result in an **8-10%** improvement to throughput and a
    **6-10%** reduction in p50 latency across the board on kv0. I ran a series of
    tests with different node sizes and different workload concurrencies, and the
    win seemed pretty stable. This was also true regardless of whether the writes
    were to a single Raft group or a large number of Raft groups.
    
    ```
    name                           old ops/sec  new ops/sec  delta
    kv0/cores=16/nodes=3/conc=32    24.1k ± 0%   26.1k ± 1%   +8.35%  (p=0.008 n=5+5)
    kv0/cores=16/nodes=3/conc=48    30.4k ± 1%   32.8k ± 1%   +8.02%  (p=0.008 n=5+5)
    kv0/cores=16/nodes=3/conc=64    34.6k ± 1%   37.6k ± 0%   +8.79%  (p=0.008 n=5+5)
    kv0/cores=36/nodes=3/conc=72    46.6k ± 1%   50.8k ± 0%   +8.99%  (p=0.008 n=5+5)
    kv0/cores=36/nodes=3/conc=108   58.8k ± 1%   64.0k ± 1%   +8.99%  (p=0.008 n=5+5)
    kv0/cores=36/nodes=3/conc=144   68.1k ± 1%   74.5k ± 1%   +9.45%  (p=0.008 n=5+5)
    kv0/cores=72/nodes=3/conc=144   55.8k ± 1%   59.7k ± 2%   +7.12%  (p=0.008 n=5+5)
    kv0/cores=72/nodes=3/conc=216   64.4k ± 4%   68.1k ± 4%   +5.65%  (p=0.016 n=5+5)
    kv0/cores=72/nodes=3/conc=288   68.8k ± 2%   74.5k ± 3%   +8.39%  (p=0.008 n=5+5)
    
    name                           old p50(ms)  new p50(ms)  delta
    kv0/cores=16/nodes=3/conc=32     1.30 ± 0%    1.20 ± 0%   -7.69%  (p=0.008 n=5+5)
    kv0/cores=16/nodes=3/conc=48     1.50 ± 0%    1.40 ± 0%   -6.67%  (p=0.008 n=5+5)
    kv0/cores=16/nodes=3/conc=64     1.70 ± 0%    1.60 ± 0%   -5.88%  (p=0.008 n=5+5)
    kv0/cores=36/nodes=3/conc=72     1.40 ± 0%    1.30 ± 0%   -7.14%  (p=0.008 n=5+5)
    kv0/cores=36/nodes=3/conc=108    1.60 ± 0%    1.50 ± 0%   -6.25%  (p=0.008 n=5+5)
    kv0/cores=36/nodes=3/conc=144    1.84 ± 3%    1.70 ± 0%   -7.61%  (p=0.000 n=5+4)
    kv0/cores=72/nodes=3/conc=144    2.00 ± 0%    1.80 ± 0%  -10.00%  (p=0.008 n=5+5)
    kv0/cores=72/nodes=3/conc=216    2.46 ± 2%    2.20 ± 0%  -10.57%  (p=0.008 n=5+5)
    kv0/cores=72/nodes=3/conc=288    2.80 ± 0%    2.60 ± 0%   -7.14%  (p=0.079 n=4+5)
    
    name                           old p99(ms)  new p99(ms)  delta
    kv0/cores=16/nodes=3/conc=32     3.50 ± 0%    3.50 ± 0%     ~     (all equal)
    kv0/cores=16/nodes=3/conc=48     4.70 ± 0%    4.58 ± 3%     ~     (p=0.167 n=5+5)
    kv0/cores=16/nodes=3/conc=64     5.50 ± 0%    5.20 ± 0%   -5.45%  (p=0.008 n=5+5)
    kv0/cores=36/nodes=3/conc=72     5.00 ± 0%    4.70 ± 0%   -6.00%  (p=0.008 n=5+5)
    kv0/cores=36/nodes=3/conc=108    5.80 ± 0%    5.50 ± 0%   -5.17%  (p=0.008 n=5+5)
    kv0/cores=36/nodes=3/conc=144    6.48 ± 3%    6.18 ± 3%   -4.63%  (p=0.079 n=5+5)
    kv0/cores=72/nodes=3/conc=144    11.0 ± 0%    10.5 ± 0%   -4.55%  (p=0.008 n=5+5)
    kv0/cores=72/nodes=3/conc=216    13.4 ± 2%    13.2 ± 5%     ~     (p=0.683 n=5+5)
    kv0/cores=72/nodes=3/conc=288    18.2 ± 4%    17.2 ± 3%   -5.70%  (p=0.079 n=5+5)
    ```
    
    Release note (performance improvement): Raft entries no longer wait to
    be applied to the RocksDB storage engine before signaling their success
    to clients; they now only wait until they are committed in their Raft log.
    nvanbenschoten committed Aug 13, 2019
    Commit: 87aaea7