Merge #98584
98584: base: increase `RaftTickInterval` from 200 ms to 500 ms r=erikgrinaker a=erikgrinaker

**base: don't express `RaftDelaySplitToSuppressSnapshot` in ticks**

Expressing this parameter in Raft ticks was confusing, and changing the Raft tick interval would inadvertently change its value, even though it has no functional dependence on Raft ticks.

The wall-time value remains roughly the same.

Epic: none
Release note: None
  
**base: increase `RaftTickInterval` from 200 ms to 500 ms**

Tick costs for unquiesced ranges can use a large amount of CPU on nodes with many replicas. Increasing the tick interval from 200 ms to 500 ms reduces this CPU cost by 60%. On a 3-node cluster with 50,000 unquiesced ranges, this reduced the total CPU usage when idle from 54% to 32%.

All derived intervals and timeouts have been adjusted such that they remain the same in wall time.

This increases the latency (from 200 to 500 ms) for tick-driven actions:

* Transfers of Raft leadership to leaseholders.
* Follower overload pausing.
* Updating the node liveness map.
* Updating the IO thresholds map.

Furthermore, because it reduces the resolution of the randomized Raft election timeout interval from [10-20) ticks to [4-8) ticks, it increases the chance of collisions and thus the chance of unsuccessful elections.
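A rough way to gauge the effect is a toy model, not a description of the actual election logic: assume every replica starts its timer at the same moment and draws an integer tick count uniformly from the interval, then ask how likely it is that at least two replicas draw the same value. The function name and the choice of three replicas are assumptions for illustration.

```go
package main

import "fmt"

// collisionProbability returns the chance that at least two of n replicas
// draw the same timeout when each picks uniformly among k distinct tick
// values. This is the classic birthday-problem calculation.
func collisionProbability(n, k int) float64 {
	if n > k {
		return 1 // more replicas than distinct values: a collision is certain
	}
	distinct := 1.0
	for i := 0; i < n; i++ {
		distinct *= float64(k-i) / float64(k)
	}
	return 1 - distinct
}

func main() {
	const replicas = 3 // assumed replication factor, for illustration only
	fmt.Printf("[10-20) ticks: %.3f\n", collisionProbability(replicas, 10))
	fmt.Printf("[4-8) ticks:   %.3f\n", collisionProbability(replicas, 4))
}
```

Under these simplifying assumptions the chance of a tie among three replicas rises from about 0.28 to about 0.63. Real clusters will differ, since replicas do not start their timers simultaneously and a tie does not always cause a failed election.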

Environment variables have been added to adjust this and any tick-dependent values at runtime in case problems arise.

Epic: none
Release note (performance improvement): The Raft tick interval has been increased from 200 ms to 500 ms in order to reduce per-replica CPU costs, and can now be adjusted via `COCKROACH_RAFT_TICK_INTERVAL`. Dependent parameters such as the Raft election timeout (`COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS`), reproposal timeout (`COCKROACH_RAFT_REPROPOSAL_TIMEOUT_TICKS`), and heartbeat interval (`COCKROACH_RAFT_HEARTBEAT_INTERVAL_TICKS`) have been adjusted such that their wall-time value remains the same.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
craig[bot] and erikgrinaker committed Mar 16, 2023
2 parents 4dc10b5 + 5e6698e commit b2c9fa6
Showing 6 changed files with 36 additions and 40 deletions.
42 changes: 20 additions & 22 deletions pkg/base/config.go
@@ -53,18 +53,10 @@ const (
 defaultSQLAddr = ":" + DefaultPort
 defaultHTTPAddr = ":" + DefaultHTTPPort

-// defaultRaftTickInterval is the default resolution of the Raft timer.
-defaultRaftTickInterval = 200 * time.Millisecond
-
 // NB: this can't easily become a variable as the UI hard-codes it to 10s.
 // See https://github.com/cockroachdb/cockroach/issues/20310.
 DefaultMetricsSampleInterval = 10 * time.Second

-// defaultRaftHeartbeatIntervalTicks is the default value for
-// RaftHeartbeatIntervalTicks, which determines the number of ticks between
-// each heartbeat.
-defaultRaftHeartbeatIntervalTicks = 5
-
 // defaultRangeLeaseRenewalFraction specifies what fraction the range lease
 // renewal duration should be of the range lease active time. For example,
 // with a value of 0.2 and a lease duration of 10 seconds, leases would be
@@ -222,6 +214,16 @@ var (
 // https://github.com/cockroachdb/cockroach/issues/93397.
 defaultRPCHeartbeatTimeout = 3 * NetworkTimeout

+// defaultRaftTickInterval is the default resolution of the Raft timer.
+defaultRaftTickInterval = envutil.EnvOrDefaultDuration(
+"COCKROACH_RAFT_TICK_INTERVAL", 500*time.Millisecond)
+
+// defaultRaftHeartbeatIntervalTicks is the default value for
+// RaftHeartbeatIntervalTicks, which determines the number of ticks between
+// each heartbeat.
+defaultRaftHeartbeatIntervalTicks = envutil.EnvOrDefaultInt(
+"COCKROACH_RAFT_HEARTBEAT_INTERVAL_TICKS", 2)
+
 // defaultRaftElectionTimeoutTicks specifies the minimum number of Raft ticks
 // before holding an election. The actual election timeout per replica is
 // multiplied by a random factor of 1-2, to avoid ties.
@@ -233,12 +235,12 @@ var (
 // SystemClass, avoiding head-of-line blocking by general RPC traffic. The 1-2
 // random factor provides an additional buffer.
 defaultRaftElectionTimeoutTicks = envutil.EnvOrDefaultInt(
-"COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS", 10)
+"COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS", 4)

 // defaultRaftReproposalTimeoutTicks is the number of ticks before reproposing
 // a Raft command.
 defaultRaftReproposalTimeoutTicks = envutil.EnvOrDefaultInt(
-"COCKROACH_RAFT_REPROPOSAL_TIMEOUT_TICKS", 15)
+"COCKROACH_RAFT_REPROPOSAL_TIMEOUT_TICKS", 6)

 // defaultRaftLogTruncationThreshold specifies the upper bound that a single
 // Range's Raft log can grow to before log truncations are triggered while at
@@ -552,7 +554,7 @@ type RaftConfig struct {
 // backup/restore.
 //
 // -1 to disable.
-RaftDelaySplitToSuppressSnapshotTicks int
+RaftDelaySplitToSuppressSnapshot time.Duration
 }

 // SetDefaults initializes unset fields.
@@ -616,17 +618,13 @@ func (cfg *RaftConfig) SetDefaults() {
 cfg.RaftMaxInflightBytes = other
 }

-if cfg.RaftDelaySplitToSuppressSnapshotTicks == 0 {
-// The Raft Ticks interval defaults to 200ms, and an election is 10
-// ticks. Add a generous amount of ticks to make sure even a backed up
-// Raft snapshot queue is going to make progress when a (not overly
-// concurrent) amount of splits happens.
-// The generous amount should result in a delay sufficient to
-// transmit at least one snapshot with the slow delay, which
-// with default settings is max 512MB at 32MB/s, ie 16 seconds.
-//
-// The resulting delay configured here is 46s.
-cfg.RaftDelaySplitToSuppressSnapshotTicks = 3*cfg.RaftElectionTimeoutTicks + 200
+if cfg.RaftDelaySplitToSuppressSnapshot == 0 {
+// Use a generous delay to make sure even a backed up Raft snapshot queue is
+// going to make progress when a (not overly concurrent) amount of splits
+// happens. The generous amount should result in a delay sufficient to
+// transmit at least one snapshot with the slow delay, which with default
+// settings is max 512MB at 32MB/s, ie 16 seconds.
+cfg.RaftDelaySplitToSuppressSnapshot = 45 * time.Second
 }

 // Minor validation to ensure sane tuning.
10 changes: 5 additions & 5 deletions pkg/base/testdata/raft_config
@@ -1,10 +1,10 @@
 echo
 ----
 (base.RaftConfig) {
-RaftTickInterval: (time.Duration) 200ms,
-RaftElectionTimeoutTicks: (int) 10,
-RaftReproposalTimeoutTicks: (int) 15,
-RaftHeartbeatIntervalTicks: (int) 5,
+RaftTickInterval: (time.Duration) 500ms,
+RaftElectionTimeoutTicks: (int) 4,
+RaftReproposalTimeoutTicks: (int) 6,
+RaftHeartbeatIntervalTicks: (int) 2,
 RangeLeaseDuration: (time.Duration) 6s,
 RangeLeaseRenewalFraction: (float64) 0.5,
 RaftLogTruncationThreshold: (int64) 16777216,
@@ -14,7 +14,7 @@ echo
 RaftMaxCommittedSizePerReady: (uint64) 67108864,
 RaftMaxInflightMsgs: (int) 128,
 RaftMaxInflightBytes: (uint64) 268435456,
-RaftDelaySplitToSuppressSnapshotTicks: (int) 230
+RaftDelaySplitToSuppressSnapshot: (time.Duration) 45s
 }
 RaftHeartbeatInterval: 1s
 RaftElectionTimeout: 2s
1 change: 0 additions & 1 deletion pkg/kv/kvserver/client_raft_test.go
@@ -5231,7 +5231,6 @@ func TestProcessSplitAfterRightHandSideHasBeenRemoved(t *testing.T) {
 },
 },
 RaftConfig: base.RaftConfig{
-RaftDelaySplitToSuppressSnapshotTicks: 0,
 // Make the tick interval short so we don't need to wait too long for the
 // partitioned leader to time out.
 RaftTickInterval: 10 * time.Millisecond,
2 changes: 1 addition & 1 deletion pkg/kv/kvserver/client_split_test.go
@@ -1416,7 +1416,7 @@ func runSetupSplitSnapshotRace(
 RaftConfig: base.RaftConfig{
 // Disable the split delay mechanism, or it'll spend 10s going in circles.
 // (We can't set it to zero as otherwise the default overrides us).
-RaftDelaySplitToSuppressSnapshotTicks: -1,
+RaftDelaySplitToSuppressSnapshot: -1,
 },
 }
 }
17 changes: 8 additions & 9 deletions pkg/kv/kvserver/split_delay_helper.go
@@ -25,7 +25,7 @@ import (

 type splitDelayHelperI interface {
 RaftStatus(context.Context) (roachpb.RangeID, *raft.Status)
-MaxTicks() int
+MaxDelay() time.Duration
 TickDuration() time.Duration
 Sleep(context.Context, time.Duration)
 }
@@ -56,21 +56,21 @@ func (sdh *splitDelayHelper) Sleep(ctx context.Context, dur time.Duration) {
 }
 }

-func (sdh *splitDelayHelper) MaxTicks() int {
+func (sdh *splitDelayHelper) MaxDelay() time.Duration {
 // There is a related mechanism regarding snapshots and splits that is worth
 // pointing out here: Incoming MsgApp (see the _ assignment below) are
 // dropped if they are addressed to uninitialized replicas likely to become
 // initialized via a split trigger. These MsgApp are sent approximately once
 // per heartbeat interval, but sometimes there's an additional delay thanks
 // to having to wait for a GC run. In effect, it shouldn't take more than a
 // small number of heartbeats until the follower leaves probing status, so
-// MaxTicks should at least match that.
+// MaxDelay should at least match that.
 _ = maybeDropMsgApp // guru assignment
 // Snapshots can come up for other reasons and at the end of the day, the
 // delay introduced here needs to make sure that the snapshot queue
 // processes at a higher rate than splits happen, so the number of attempts
 // will typically be much higher than what's suggested by maybeDropMsgApp.
-return (*Replica)(sdh).store.cfg.RaftDelaySplitToSuppressSnapshotTicks
+return (*Replica)(sdh).store.cfg.RaftDelaySplitToSuppressSnapshot
 }

 func (sdh *splitDelayHelper) TickDuration() time.Duration {
@@ -79,9 +79,8 @@ func (sdh *splitDelayHelper) TickDuration() time.Duration {
 }

 func maybeDelaySplitToAvoidSnapshot(ctx context.Context, sdh splitDelayHelperI) string {
-maxDelaySplitToAvoidSnapshotTicks := sdh.MaxTicks()
 tickDur := sdh.TickDuration()
-budget := tickDur * time.Duration(maxDelaySplitToAvoidSnapshotTicks)
+budget := sdh.MaxDelay()

 var slept time.Duration
 var problems []string
@@ -173,13 +172,13 @@ func maybeDelaySplitToAvoidSnapshot(ctx context.Context, sdh splitDelayHelperI)

 lastProblems = problems

-// The second factor starts out small and reaches ~0.7 approximately at i=maxDelaySplitToAvoidSnapshotTicks.
-// In effect we loop approximately 2*maxDelaySplitToAvoidSnapshotTicks to exhaust the entire budget we have.
+// The second factor starts out small and reaches ~0.7 approximately at i=budget/tickDur.
+// In effect we loop approximately 2*MaxDelay to exhaust the entire budget we have.
 // By having shorter sleeps at the beginning, we optimize for the common case in which things get fixed up
 // quickly early on. In particular, splitting in a tight loop will usually always wait on the election of the
 // previous split's right-hand side, which finishes within a few network latencies (which is typically much
 // less than a full tick).
-sleepDur := time.Duration(float64(tickDur) * (1.0 - math.Exp(-float64(i-1)/float64(maxDelaySplitToAvoidSnapshotTicks+1))))
+sleepDur := time.Duration(float64(tickDur) * (1.0 - math.Exp(-float64(i-1)/float64(budget/tickDur+1))))
 sdh.Sleep(ctx, sleepDur)
 slept += sleepDur

4 changes: 2 additions & 2 deletions pkg/kv/kvserver/split_delay_helper_test.go
@@ -37,8 +37,8 @@ func (h *testSplitDelayHelper) RaftStatus(context.Context) (roachpb.RangeID, *ra
 return h.rangeID, h.raftStatus
 }

-func (h *testSplitDelayHelper) MaxTicks() int {
-return h.numAttempts
+func (h *testSplitDelayHelper) MaxDelay() time.Duration {
+return time.Duration(h.numAttempts) * h.TickDuration()
 }

 func (h *testSplitDelayHelper) TickDuration() time.Duration {
