
kv: re-tune Raft replication knobs #49564

Closed
nvanbenschoten opened this issue May 26, 2020 · 0 comments · Fixed by #49619
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@nvanbenschoten
Member

nvanbenschoten commented May 26, 2020

There's a history of tuning the Raft configurations (raft.Config) to strike a balance between stability and high throughput under steady-state load. Most of this tuning was performed back during the stability code yellow: for instance, see 89e8fe3 and 36b7640.

These knobs were tuned back when we were observing instability due to long handleRaftReady pauses. Since then, a lot has changed:

  • we now batch the application of Raft entries in the same raft.Ready struct
  • we now acknowledge the proposers of Raft entries before their application
  • we now set a MaxCommittedSizePerReady value to prevent the size of the committed entries in a single raft.Ready struct from growing too large. We introduced this knob a few years ago. Even on its own, it appears to invalidate the driving reason for the tuning of the other knobs
  • we now default to 512MB ranges, so we expect to have 1/8 of the number of total ranges on a node

Since so much has changed in this area, the current values are probably no longer ideal. We should re-tune them.

Meanwhile, we've also seen customer experiments indicating that the default tuning is too conservative, especially when replicating over a WAN link. Even under low load, the customer saw unexpected delays due to the small default values for MaxSizePerMsg and MaxInflightMsgs, which, combined, effectively set a 1 MB window size for each follower replica in a Range. Assuming a 50ms RTT to each replica in a WAN replication topology, this caps replication throughput for a range at roughly 20 MB/s, which seems low. Ideally, we'd get this up closer to 80-100 MB/s. Bumping these values up resolved the customer's issue.
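
For a rough sense of where that ceiling comes from, here is a back-of-the-envelope sketch using the numbers above (the 50ms RTT is an assumption, and etcd/raft's actual flow control is more nuanced than a single fixed window):

```go
package main

import "fmt"

func main() {
	const (
		maxSizePerMsg   = 16 << 10 // 16 KB per MsgApp (current default)
		maxInflightMsgs = 64       // outstanding MsgApps per follower (current default)
		rttSeconds      = 0.050    // assumed 50ms WAN round-trip time
	)
	// At most one window of data can be in flight to a follower per RTT.
	window := maxSizePerMsg * maxInflightMsgs
	fmt.Printf("per-follower window: %d KB\n", window>>10)                             // 1024 KB ≈ 1 MB
	fmt.Printf("throughput ceiling: ~%.0f MB/s\n", float64(window)/rttSeconds/(1<<20)) // ≈ 20 MB/s
}
```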

My current guess is that we'll want to double the default values of defaultRaftMaxSizePerMsg and defaultRaftMaxInflightMsgs to 32 KB and 128, respectively. This will allow us to more effectively batch large writes and create deeper pipelines of MsgApps over WAN links.
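
For context, these values feed into etcd/raft's raft.Config. A minimal sketch of what the proposed defaults would look like there (MaxSizePerMsg and MaxInflightMsgs are real go.etcd.io/etcd/raft fields; the constructor and its wiring are simplified for illustration and are not CockroachDB's actual code):

```go
package example // hypothetical package, for illustration only

import "go.etcd.io/etcd/raft"

// newRaftConfig sketches the proposed defaults. The real wiring in
// CockroachDB sets many more fields than shown here.
func newRaftConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:              id,
		Storage:         storage,
		MaxSizePerMsg:   32 << 10, // 32 KB per MsgApp (previous default: 16 KB)
		MaxInflightMsgs: 128,      // 128 outstanding MsgApps (previous default: 64)
		// ElectionTick, HeartbeatTick, MaxCommittedSizePerReady, etc. omitted.
	}
}
```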

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. labels May 26, 2020
@nvanbenschoten nvanbenschoten self-assigned this May 26, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 27, 2020
Fixes cockroachdb#49564.

The knobs were tuned four years ago with a focus on stability during the
code yellow (see 89e8fe3 and 36b7640). They were tuned primarily in
response to observed instability due to long handleRaftReady pauses.
Since then, a lot has changed:

- we now batch the application of Raft entries in the same raft.Ready struct
- we now acknowledge the proposers of Raft entries before their application
- we now set a MaxCommittedSizePerReady value to prevent the size of the
committed entries in a single raft.Ready struct from growing too large. We
introduced this knob a few years ago. Even on its own, it appears to
invalidate the motivating reason for the tuning of the other knobs
- we now default to 512MB ranges, so we expect to have 1/8 of the number of
total ranges on a node

In response to these database changes, this commit makes the following
adjustments to the replication knobs:
- increase `RaftMaxSizePerMsg` from 16 KB to 32 KB
- increase `RaftMaxInflightMsgs` from 64 to 128
- increase `RaftLogTruncationThreshold` from 4 MB to 8 MB
- increase `RaftProposalQuota` from 1 MB to 4 MB

Combined, these changes increase the per-replica replication window size
from 1 MB to 4 MB. This should allow for higher throughput replication,
especially over high latency links.
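
As a quick sanity check on the 1 MB → 4 MB window claim, the same back-of-the-envelope arithmetic as in the issue (assuming a 50ms WAN RTT):

```go
package main

import "fmt"

func main() {
	const rtt = 0.050 // assumed 50ms WAN round-trip time
	// Per-follower window = RaftMaxSizePerMsg * RaftMaxInflightMsgs.
	oldWindow := (16 << 10) * 64  // previous defaults: 16 KB * 64  = 1 MB
	newWindow := (32 << 10) * 128 // new defaults:      32 KB * 128 = 4 MB
	fmt.Printf("old: %d MB window, ~%.0f MB/s ceiling\n", oldWindow>>20, float64(oldWindow)/rtt/(1<<20))
	fmt.Printf("new: %d MB window, ~%.0f MB/s ceiling\n", newWindow>>20, float64(newWindow)/rtt/(1<<20))
	// Prints roughly: old: 1 MB window, ~20 MB/s ceiling
	//                 new: 4 MB window, ~80 MB/s ceiling
}
```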

To test this, we run a global cluster (nodes in us-east1, us-west1, and
europe-west1) and write 10 KB blocks as fast as possible to a single
Range. This is similar to a workload we see customers run in testing
and production environments.

```
# Setup cluster
roachprod create nathan-geo -n=4 --gce-machine-type=n1-standard-16 --geo --gce-zones='us-east1-b,us-west1-b,europe-west1-b'
roachprod stage nathan-geo cockroach
roachprod start nathan-geo:1-3

# Setup dataset
roachprod run nathan-geo:4 -- './cockroach workload init kv {pgurl:1}'
roachprod sql nathan-geo:1 -- -e "ALTER TABLE kv.kv CONFIGURE ZONE USING constraints = COPY FROM PARENT, lease_preferences = '[[+region=us-east1]]'"

# Run workload before tuning
roachprod stop nathan-geo:1-3 && roachprod start nathan-geo:1-3
roachprod run nathan-geo:4 -- './cockroach workload run kv --ramp=15s --duration=3m --sequential --min-block-bytes=10000 --max-block-bytes=10000 --concurrency=128 {pgurl:1}'

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
180.0s        0         115524          641.8    199.3    201.3    251.7    285.2    604.0  write

# Run workload after tuning
roachprod stop nathan-geo:1-3 && COCKROACH_RAFT_MAX_INFLIGHT_MSGS=128 COCKROACH_RAFT_MAX_SIZE_PER_MSG=32768 COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD=16777216 roachprod start nathan-geo:1-3
roachprod run nathan-geo:4 -- './cockroach workload run kv --ramp=15s --duration=3m --sequential --min-block-bytes=10000 --max-block-bytes=10000 --concurrency=128 --write-seq=S123829 {pgurl:1}'

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0         288512         1602.9     79.8     75.5    104.9    209.7    738.2  write
```

Before the change, we see p50 latency at 3x the expected replication
latency. This is due to throttling on the leader replica. After the
change, we see p50 latencies exactly where we expect them to be. Higher
percentile latencies improve accordingly. We also see a 150% increase in
throughput on the workload. This is reflected in the rate at which we
write to disk, which jumps from ~45 MB/s on each node to ~120 MB/s on
each node. Finally, we do not see a corresponding increase in Raft ready
latency, which was the driving reason for the knobs being tuned so low.

Release note (performance improvement): default replication
configurations have been tuned to support higher replication throughput
in high-latency replication quorums.
craig bot pushed a commit that referenced this issue May 28, 2020
49619: kv: re-tune Raft knobs for high throughput WAN replication r=nvanbenschoten a=nvanbenschoten

Fixes #49564.

The knobs were tuned four years ago with a focus on stability during the code yellow (see 89e8fe3 and 36b7640). They were tuned primarily in response to observed instability due to long handleRaftReady pauses. Since then, a lot has changed:

- we now batch the application of Raft entries in the same `raft.Ready` struct
- we now acknowledge the proposers of Raft entries before their application
- we now set a `MaxCommittedSizePerReady` value to prevent the size of the committed entries in a single `raft.Ready` struct from growing too large. We introduced this knob a few years ago. Even on its own, it appears to invalidate the motivating reason for the tuning of the other knobs
- we now default to 512MB ranges, so we expect to have 1/8 of the number of total ranges on a node

In response to these database changes, this commit makes the following adjustments to the replication knobs:
- **increase `RaftMaxSizePerMsg` from 16 KB to 32 KB**
- **increase `RaftMaxInflightMsgs` from 64 to 128**
- **increase `RaftLogTruncationThreshold` from 4 MB to 8 MB**
- **increase `RaftProposalQuota` from 1 MB to 4 MB**

Combined, these changes increase the per-replica replication window size from 1 MB to 4 MB. This should allow for higher throughput replication, especially over high latency links.

To test this, we run a global cluster (nodes in us-east1, us-west1, and europe-west1) and write 10 KB blocks as fast as possible to a single Range. This is similar to a workload we see customers run in testing and production environments.

<img width="485" alt="Screen Shot 2020-05-27 at 7 10 50 PM" src="https://user-images.githubusercontent.com/5438456/83081584-0b42a900-a04f-11ea-8757-90b251ff498b.png">

```
# Setup cluster
roachprod create nathan-geo -n=4 --gce-machine-type=n1-standard-16 --geo --gce-zones='us-east1-b,us-west1-b,europe-west1-b'
roachprod stage nathan-geo cockroach
roachprod start nathan-geo:1-3

# Setup dataset
roachprod run nathan-geo:4 -- './cockroach workload init kv {pgurl:1}'
roachprod sql nathan-geo:1 -- -e "ALTER TABLE kv.kv CONFIGURE ZONE USING constraints = COPY FROM PARENT, lease_preferences = '[[+region=us-east1]]'"

# Run workload before tuning
roachprod stop nathan-geo:1-3 && roachprod start nathan-geo:1-3
roachprod run nathan-geo:4 -- './cockroach workload run kv --ramp=15s --duration=3m --sequential --min-block-bytes=10000 --max-block-bytes=10000 --concurrency=128 {pgurl:1}'

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
180.0s        0         115524          641.8    199.3    201.3    251.7    285.2    604.0  write

# Run workload after tuning
roachprod stop nathan-geo:1-3 && COCKROACH_RAFT_MAX_INFLIGHT_MSGS=128 COCKROACH_RAFT_MAX_SIZE_PER_MSG=32768 COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD=16777216 roachprod start nathan-geo:1-3
roachprod run nathan-geo:4 -- './cockroach workload run kv --ramp=15s --duration=3m --sequential --min-block-bytes=10000 --max-block-bytes=10000 --concurrency=128 --write-seq=S123829 {pgurl:1}'

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0         288512         1602.9     79.8     75.5    104.9    209.7    738.2  write
```

Before the change, we see p50 latency at 3x the expected replication latency. This is due to throttling on the leader replica. After the change, we see p50 latencies exactly where we expect them to be. Higher percentile latencies improve accordingly.

<img width="982" alt="Screen Shot 2020-05-27 at 7 10 31 PM" src="https://user-images.githubusercontent.com/5438456/83081610-17c70180-a04f-11ea-8a5e-c87808b69450.png">

We also see a 150% increase in throughput on the workload. This is reflected in the rate at which we write to disk, which jumps from ~45 MB/s on each node to ~120 MB/s on each node.

<img width="978" alt="Screen Shot 2020-05-27 at 7 10 21 PM" src="https://user-images.githubusercontent.com/5438456/83081623-1d244c00-a04f-11ea-8b48-d6c095643d9f.png">

Finally, we do not see a corresponding increase in Raft ready latency, which was the driving reason for the knobs being tuned so low.

Release note (performance improvement): default replication configurations have been tuned to support higher replication throughput in high-latency replication quorums.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in 258b965 May 28, 2020
jbowens pushed a commit to jbowens/cockroach that referenced this issue Jun 1, 2020