kv: re-tune Raft knobs for high throughput WAN replication
Fixes #49564.

The knobs were tuned four years ago with a focus on stability during the
code yellow (see 89e8fe3 and 36b7640). They were tuned primarily in
response to observed instability due to long handleRaftReady pauses.
Since then, a lot has changed:

- we now batch the application of Raft entries in the same raft.Ready struct
- we now acknowledge the proposers of Raft entries before their application
- we now set a MaxCommittedSizePerReady value to prevent the size of the
committed entries in a single raft.Ready struct from growing too large. We
introduced this knob a few years ago, and even on its own it appears to
invalidate the motivating reason for tuning the other knobs so low
- we now default to 512 MB ranges, so we expect a node to hold roughly 1/8
as many ranges as it did before

In response to these changes, this commit makes the following adjustments
to the replication knobs:
- increase `RaftMaxSizePerMsg` from 16 KB to 32 KB
- increase `RaftMaxInflightMsgs` from 64 to 128
- increase `RaftLogTruncationThreshold` from 4 MB to 8 MB
- increase `RaftProposalQuota` from 1 MB to 4 MB

Combined, these changes increase the per-replica replication window size
(RaftMaxSizePerMsg * RaftMaxInflightMsgs) from 1 MB to 4 MB. This should
allow for higher throughput replication, especially over high-latency links.
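
For reference, the window arithmetic works out as follows. This is a
standalone sketch, not the actual config code; the constant names merely
mirror the knobs:

```go
package main

import "fmt"

func main() {
	// Per-replica replication window = RaftMaxSizePerMsg * RaftMaxInflightMsgs:
	// the most Raft log bytes a leader keeps in flight to one follower before
	// it must wait for acknowledgements.
	const (
		oldMaxSizePerMsg = 16 << 10 // 16 KB
		oldMaxInflight   = 64
		newMaxSizePerMsg = 32 << 10 // 32 KB
		newMaxInflight   = 128
	)
	fmt.Printf("old window: %d MB\n", oldMaxSizePerMsg*oldMaxInflight>>20) // 1 MB
	fmt.Printf("new window: %d MB\n", newMaxSizePerMsg*newMaxInflight>>20) // 4 MB
}
```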

To test this, we run a global cluster (nodes in us-east1, us-west1, and
europe-west1) and write 10 KB blocks as fast as possible to a single
Range. This is similar to a workload we see customers run in testing
and production environments.

```
# Setup cluster
roachprod create nathan-geo -n=4 --gce-machine-type=n1-standard-16 --geo --gce-zones='us-east1-b,us-west1-b,europe-west1-b'
roachprod stage nathan-geo cockroach
roachprod start nathan-geo:1-3

# Setup dataset
roachprod run nathan-geo:4 -- './cockroach workload init kv {pgurl:1}'
roachprod sql nathan-geo:1 -- -e "ALTER TABLE kv.kv CONFIGURE ZONE USING constraints = COPY FROM PARENT, lease_preferences = '[[+region=us-east1]]'"

# Run workload before tuning
roachprod stop nathan-geo:1-3 && roachprod start nathan-geo:1-3
roachprod run nathan-geo:4 -- './cockroach workload run kv --ramp=15s --duration=3m --sequential --min-block-bytes=10000 --max-block-bytes=10000 --concurrency=128 {pgurl:1}'

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0         115524          641.8    199.3    201.3    251.7    285.2    604.0  write

# Run workload after tuning
roachprod stop nathan-geo:1-3 && COCKROACH_RAFT_MAX_INFLIGHT_MSGS=128 COCKROACH_RAFT_MAX_SIZE_PER_MSG=32768 COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD=16777216 roachprod start nathan-geo:1-3
roachprod run nathan-geo:4 -- './cockroach workload run kv --ramp=15s --duration=3m --sequential --min-block-bytes=10000 --max-block-bytes=10000 --concurrency=128 --write-seq=S123829 {pgurl:1}'

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0         288512         1602.9     79.8     75.5    104.9    209.7    738.2  write
```
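
Note that the after-tuning run overrides the defaults through the same
COCKROACH_RAFT_* environment variables that the config diff below reads via
envutil.EnvOrDefaultInt. A minimal sketch of that helper pattern, with an
assumed simplified signature (the real helper lives in pkg/util/envutil and
does more validation and bookkeeping):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envOrDefaultInt mimics the envutil.EnvOrDefaultInt pattern: a knob takes
// its value from the environment when set, falling back to the compiled-in
// default otherwise.
func envOrDefaultInt(name string, def int) int {
	if s, ok := os.LookupEnv(name); ok {
		if v, err := strconv.Atoi(s); err == nil {
			return v
		}
	}
	return def
}

func main() {
	// With COCKROACH_RAFT_MAX_INFLIGHT_MSGS=128 exported, the override wins;
	// otherwise the default of 64 is used.
	fmt.Println(envOrDefaultInt("COCKROACH_RAFT_MAX_INFLIGHT_MSGS", 64))
}
```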

Before the change, we see p50 latency at nearly 3x the expected replication
latency. This is due to throttling on the leader replica. After the change,
we see p50 latencies exactly where we expect them to be, and higher
percentile latencies improve accordingly. We also see a 150% increase in
throughput on the workload, which is reflected in the rate at which each
node writes to disk: it jumps from ~45 MB/s to ~120 MB/s. Finally, we do not
see a corresponding increase in Raft ready handling latency, which was the
driving reason the knobs were originally tuned so low.
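
One way to see the leader-side throttling: the workload's aggregate
in-flight demand exceeds the old RaftProposalQuota. A back-of-envelope
sketch; the ~75 ms expected replication latency is an assumed us-east1 to
us-west1 round trip, not a number measured in this run:

```go
package main

import "fmt"

func main() {
	// The kv workload keeps 128 writers in flight, each proposing a ~10 KB
	// block. Every proposal must acquire proposal quota on the leader before
	// replicating, so when aggregate demand exceeds RaftProposalQuota, writers
	// queue waiting for quota released by earlier proposals, adding whole
	// replication round trips to their latency.
	const (
		concurrency = 128
		blockBytes  = 10_000
		oldQuota    = 1 << 20 // 1 MB, the old default
		newQuota    = 4 << 20 // 4 MB, the new default
	)
	demand := concurrency * blockBytes
	fmt.Printf("in-flight demand: %.2f MB (old quota %d MB, new quota %d MB)\n",
		float64(demand)/(1<<20), oldQuota>>20, newQuota>>20)
	// demand (~1.22 MB) > old quota (1 MB): proposals queue behind the quota
	// pool, so p50 latency becomes a multiple of the quorum round trip
	// (observed: 201 ms against an assumed ~75 ms round trip). demand < new
	// quota: proposals acquire quota immediately and p50 drops to roughly one
	// round trip (observed: 75.5 ms).
}
```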

Release note (performance improvement): default replication
configurations have been tuned to support higher replication throughput
in high-latency replication quorums.
nvanbenschoten committed May 28, 2020
1 parent d5dd67b commit 258b965
Showing 2 changed files with 29 additions and 11 deletions.
pkg/base/config.go (21 additions, 11 deletions)
```diff
@@ -107,23 +107,23 @@ var (
 	// is responsible for ensuring the raft log doesn't grow without bound by
 	// making sure the leader doesn't get too far ahead.
 	defaultRaftLogTruncationThreshold = envutil.EnvOrDefaultInt64(
-		"COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD", 4<<20 /* 4 MB */)
+		"COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD", 8<<20 /* 8 MB */)
 
 	// defaultRaftMaxSizePerMsg specifies the maximum aggregate byte size of Raft
 	// log entries that a leader will send to followers in a single MsgApp.
 	defaultRaftMaxSizePerMsg = envutil.EnvOrDefaultInt(
-		"COCKROACH_RAFT_MAX_SIZE_PER_MSG", 16<<10 /* 16 KB */)
+		"COCKROACH_RAFT_MAX_SIZE_PER_MSG", 32<<10 /* 32 KB */)
 
 	// defaultRaftMaxCommittedSizePerReady specifies the maximum aggregate
 	// byte size of the committed log entries which a node will receive in a
 	// single Ready.
 	defaultRaftMaxCommittedSizePerReady = envutil.EnvOrDefaultInt(
 		"COCKROACH_RAFT_MAX_COMMITTED_SIZE_PER_READY", 64<<20 /* 64 MB */)
 
-	// defaultRaftMaxSizePerMsg specifies how many "inflight" messages a leader
-	// will send to a follower without hearing a response.
+	// defaultRaftMaxInflightMsgs specifies how many "inflight" MsgApps a leader
+	// will send to a given follower without hearing a response.
 	defaultRaftMaxInflightMsgs = envutil.EnvOrDefaultInt(
-		"COCKROACH_RAFT_MAX_INFLIGHT_MSGS", 64)
+		"COCKROACH_RAFT_MAX_INFLIGHT_MSGS", 128)
 )
 
 type lazyHTTPClient struct {
@@ -493,19 +493,22 @@ type RaftConfig struct {
 	RaftMaxUncommittedEntriesSize uint64
 
 	// RaftMaxSizePerMsg controls the maximum aggregate byte size of Raft log
-	// entries the leader will send to followers in a single MsgApp.
+	// entries the leader will send to followers in a single MsgApp. A smaller
+	// value lowers the raft recovery cost (during initial probing and after
+	// message loss during normal operation). On the other hand, it limits the
+	// throughput during normal replication.
 	RaftMaxSizePerMsg uint64
 
 	// RaftMaxCommittedSizePerReady controls the maximum aggregate byte size of
 	// committed Raft log entries a replica will receive in a single Ready.
 	RaftMaxCommittedSizePerReady uint64
 
-	// RaftMaxInflightMsgs controls how many "inflight" messages Raft will send
+	// RaftMaxInflightMsgs controls how many "inflight" MsgApps Raft will send
 	// to a follower without hearing a response. The total number of Raft log
 	// entries is a combination of this setting and RaftMaxSizePerMsg. The
-	// current default settings provide for up to 1 MB of raft log to be sent
+	// current default settings provide for up to 4 MB of raft log to be sent
 	// without acknowledgement. With an average entry size of 1 KB that
-	// translates to ~1024 commands that might be executed in the handling of a
+	// translates to ~4096 commands that might be executed in the handling of a
 	// single raft.Ready operation.
 	RaftMaxInflightMsgs int
 
@@ -536,7 +539,7 @@ func (cfg *RaftConfig) SetDefaults() {
 	if cfg.RaftProposalQuota == 0 {
 		// By default, set this to a fraction of RaftLogMaxSize. See the comment
 		// on the field for the tradeoffs of setting this higher or lower.
-		cfg.RaftProposalQuota = cfg.RaftLogTruncationThreshold / 4
+		cfg.RaftProposalQuota = cfg.RaftLogTruncationThreshold / 2
 	}
 	if cfg.RaftMaxUncommittedEntriesSize == 0 {
 		// By default, set this to twice the RaftProposalQuota. The logic here
@@ -554,7 +557,6 @@ func (cfg *RaftConfig) SetDefaults() {
 	if cfg.RaftMaxInflightMsgs == 0 {
 		cfg.RaftMaxInflightMsgs = defaultRaftMaxInflightMsgs
 	}
-
 	if cfg.RaftDelaySplitToSuppressSnapshotTicks == 0 {
 		// The Raft Ticks interval defaults to 200ms, and an election is 15
 		// ticks. Add a generous amount of ticks to make sure even a backed up
@@ -567,6 +569,14 @@ func (cfg *RaftConfig) SetDefaults() {
 		// The resulting delay configured here is about 50s.
 		cfg.RaftDelaySplitToSuppressSnapshotTicks = 3*cfg.RaftElectionTimeoutTicks + 200
 	}
+
+	// Minor validation to ensure sane tuning.
+	if cfg.RaftProposalQuota > int64(cfg.RaftMaxUncommittedEntriesSize) {
+		panic("raft proposal quota should not be above max uncommitted entries size")
+	}
+	if cfg.RaftProposalQuota < int64(cfg.RaftMaxSizePerMsg)*int64(cfg.RaftMaxInflightMsgs) {
+		panic("raft proposal quota should not be below per-replica replication window size")
+	}
 }
 
 // RaftElectionTimeout returns the raft election timeout, as computed from the
```
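
With the new defaults, the derived values line up so that neither panic can
fire. A standalone sketch of how the defaults compose, using simplified
names rather than the actual SetDefaults code:

```go
package main

import "fmt"

func main() {
	// Defaults after this commit, derived as SetDefaults does.
	truncationThreshold := int64(8 << 20)       // RaftLogTruncationThreshold: 8 MB
	proposalQuota := truncationThreshold / 2    // RaftProposalQuota: 4 MB
	maxUncommitted := uint64(2 * proposalQuota) // RaftMaxUncommittedEntriesSize: 8 MB
	window := int64(32<<10) * 128               // RaftMaxSizePerMsg * RaftMaxInflightMsgs: 4 MB

	// The two invariants the new panics enforce:
	fmt.Println(proposalQuota <= int64(maxUncommitted)) // true: quota fits under the uncommitted cap
	fmt.Println(proposalQuota >= window)                // true: quota covers the replication window
}
```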
pkg/kv/kvserver/client_raft_test.go (8 additions, 0 deletions)
```diff
@@ -1400,6 +1400,11 @@ func TestLogGrowthWhenRefreshingPendingCommands(t *testing.T) {
 	sc.RaftElectionTimeoutTicks = 1000000
 	// Reduce the max uncommitted entry size.
 	sc.RaftMaxUncommittedEntriesSize = 64 << 10 // 64 KB
+	// RaftProposalQuota cannot exceed RaftMaxUncommittedEntriesSize.
+	sc.RaftProposalQuota = int64(sc.RaftMaxUncommittedEntriesSize)
+	// RaftMaxInflightMsgs * RaftMaxSizePerMsg cannot exceed RaftProposalQuota.
+	sc.RaftMaxInflightMsgs = 16
+	sc.RaftMaxSizePerMsg = 1 << 10 // 1 KB
 	// Disable leader transfers during leaseholder changes so that we
 	// can easily create leader-not-leaseholder scenarios.
 	sc.TestingKnobs.DisableLeaderFollowsLeaseholder = true
@@ -5169,6 +5174,9 @@ func TestReplicaRemovalClosesProposalQuota(t *testing.T) {
 				// Set the proposal quota to a tiny amount so that each write will
 				// exceed it.
 				RaftProposalQuota: 512,
+				// RaftMaxInflightMsgs * RaftMaxSizePerMsg cannot exceed RaftProposalQuota.
+				RaftMaxInflightMsgs: 2,
+				RaftMaxSizePerMsg:   256,
 			},
 		},
 		ReplicationMode: base.ReplicationManual,
```