backport-19.1: storage: truncate raft log less aggressively when replica is missing #38642
Backport 2/2 commits from #38484.
/cc @cockroachdb/release
Previously, we'd hold off on truncating the raft log if a replica was
missing but contactable in the last 10 seconds. This meant that if a
node was down for more than 10 seconds, there was a good chance that
we'd truncate logs for some of its replicas (especially for its
high-traffic ones) and it would need snapshots for them when it came
back up.
This was for two reasons. First, we've historically assumed that it's
cheaper to catch a far-behind node up with a snapshot than with entries.
Second, when we replicated the raft log truncation, we'd have to include
it in snapshots anyway, but that's changed recently.
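To make the old behavior concrete, here is a minimal Go sketch of this kind of truncation decision, assuming invented names (`followerState`, `truncatableIndex`); it is not the actual raft log queue code.

```go
// Minimal sketch of the old behavior (illustrative only, not CockroachDB code):
// the raft log queue only holds back truncation for followers that were heard
// from within a short recency threshold.
package main

import (
	"fmt"
	"time"
)

type followerState struct {
	matchIndex   uint64    // highest log index known to be on the follower
	lastActivity time.Time // last time the leader heard from the follower
}

// truncatableIndex returns the highest log index that can be truncated without
// cutting off any follower that was active within recencyThreshold. Followers
// silent for longer than the threshold are ignored, which is what forces them
// onto a snapshot once they come back.
func truncatableIndex(followers []followerState, committed uint64, recencyThreshold time.Duration, now time.Time) uint64 {
	idx := committed
	for _, f := range followers {
		if now.Sub(f.lastActivity) <= recencyThreshold {
			// Recently active: keep the entries this follower still needs.
			if f.matchIndex < idx {
				idx = f.matchIndex
			}
		}
	}
	return idx
}

func main() {
	now := time.Now()
	followers := []followerState{
		{matchIndex: 90, lastActivity: now.Add(-2 * time.Second)}, // healthy
		{matchIndex: 50, lastActivity: now.Add(-2 * time.Minute)}, // down for a couple minutes
	}
	// With the 10-second threshold, the briefly-down follower is ignored:
	// we truncate up to index 90, guaranteeing it a snapshot when it returns.
	fmt.Println(truncatableIndex(followers, 100, 10*time.Second, now)) // 90
}
```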
The problem is when a node is down for a short time but more than 10
seconds (think a couple minutes). It might come back up to find that it
needs a snapshot for most ranges. We rate limit snapshots fairly
aggressively because they've been disruptive in the past, but this means
that it could potentially take hours for a node to recover from a 2
minute outage.
This would be merely unfortunate if there wasn't a second compounding
issue. A recently restarted node has a cold leaseholder cache. When it
gets traffic for one of its replicas, it first tries itself as the
leaseholder (maybe it will get lucky and won't need the network hop).
Unfortunately, if the replica needs a snapshot, this decision currently
blocks on it. This means that requests sent to the recently started node
could block for as long as the heavily-throttled snapshots take, hours
or even days.
Short outages of more than 10 seconds are reasonably common with routine
maintenance (rolling to a new version, swapping hardware, etc), so it's
likely that customers will hit this (and one did).
This commit ties the threshold that the raft log queue uses for recent
activity to `server.time_until_store_dead`, which is already the threshold
at which we give up on a down node and start transferring away its
replicas. It defaults to 5 minutes, which is still short, but it's more
reasonable than 10 seconds. Crucially, it is also a cluster setting, so it
can be overridden.
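A hypothetical sketch of the shape of the change (not the real diff): the recency threshold now tracks the `server.time_until_store_dead` cluster setting instead of a hard-coded constant. The `clusterSettings` type and `recentActiveThreshold` function below are invented for illustration.

```go
// Hypothetical sketch, not the actual CockroachDB code: the recency threshold
// used when deciding whether to truncate past a follower now follows the
// server.time_until_store_dead cluster setting instead of a 10-second constant.
package main

import (
	"fmt"
	"time"
)

// clusterSettings stands in for the node's live cluster settings.
type clusterSettings struct {
	timeUntilStoreDead time.Duration // server.time_until_store_dead
}

// recentActiveThreshold returns how long a silent follower is still treated
// as "recently active" for raft log truncation purposes.
func recentActiveThreshold(st clusterSettings) time.Duration {
	// Before: return 10 * time.Second
	// After: reuse the threshold at which we'd give up on a down store and
	// start transferring its replicas away.
	return st.timeUntilStoreDead
}

func main() {
	st := clusterSettings{timeUntilStoreDead: 5 * time.Minute} // the default
	fmt.Println(recentActiveThreshold(st))                     // 5m0s
}
```

Since this is the same knob operators already tune for dead-store detection, raising it to cover longer maintenance windows also lengthens how long the cluster waits before transferring replicas away from a down node.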
We'd like to move even further in the direction of leniency about raft
log truncation while a replica is missing, but this is a compromise
that's intended to be less risky to backport to 19.1.x.
Partial mitigation for #37906
Potentially also helps with #36879
Release note (bug fix): Nodes that have been down now recover more
quickly when they rejoin, assuming they weren't down for much more
than the value of the `server.time_until_store_dead` cluster setting
(which defaults to 5 minutes).