
storage: truncate raft log less aggressively when replica is missing #38484

Merged 2 commits on Jul 2, 2019

Commits on Jul 2, 2019

  1. storage: truncate raft log less aggressively when replica is missing

    Previously, we'd hold off on truncating the raft log if a replica was
    missing but contactable in the last 10 seconds. This meant that if a
    node was down for *more* than 10 seconds, there was a good chance that
    we'd truncate logs for some of its replicas (especially for its
    high-traffic ones) and it would need snapshots for them when it came
    back up.
    
    This was for two reasons. First, we've historically assumed that it's
    cheaper to catch a far-behind node up with a snapshot than with entries.
    Second, snapshots historically had to include the Raft log which implied
    a need to keep the size of the Raft log tightly controlled due to being
    pulled into memory at the snapshot receiver, but that's changed
    recently.
    
    The problem is when a node is down for longer than 10 seconds but
    shorter than the time it takes to upreplicate all of its ranges onto new
    nodes. It might come back up to find that it needs a snapshot for most
    ranges. We rate limit snapshots fairly aggressively because they've been
    disruptive in the past, but this means that it could potentially take
    hours for a node to recover from a 2-minute outage.
    
    This would be merely unfortunate if there wasn't a second compounding
    issue. A recently restarted node has a cold leaseholder cache. When it
    gets traffic for one of its replicas, it first tries itself as the
    leaseholder (maybe it will get lucky and won't need the network hop).
    Unfortunately, if the replica needs a snapshot, this decision currently
    blocks on it. This means that requests sent to the recently started node
    could block for as long as the heavily-throttled snapshots take, hours
    or even days.
    
    Short outages of more than 10 seconds are reasonably common with routine
    maintenance (rolling to a new version, swapping hardware, etc.), so it's
    likely that customers will hit this (and one did).
    
    This commit avoids truncating the log past any follower's position when
    all replicas have recently been active (the quota pool keeps it from
    growing without bound in this case). If at least one replica hasn't
    recently been active, it holds off any truncation until the log reaches
    a size threshold (see the sketch after the commit list below).
    
    Partial mitigation for cockroachdb#37906
    
    Potentially also helps with cockroachdb#36879
    
    Release note (bug fix): Nodes that have been down now recover more
    quickly when they rejoin, assuming they weren't down for much more
    than the value of the `server.time_until_store_dead` cluster setting
    (which defaults to 5 minutes).
    danhhz authored and tbg committed Jul 2, 2019 · 3740399
  2. roachtest: touch up restart test

    tbg committed Jul 2, 2019 · 72fd63d
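
The first commit's truncation heuristic can be illustrated with a minimal Go sketch. Everything here is an assumption for illustration only: the names (`followerState`, `truncationIndex`, `activityWindow`, `maxLogSizeBytes`), the signature, and the surrounding structure are not the identifiers or code used in the CockroachDB source; they just mirror the decision described in the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// followerState is a hypothetical summary of what the raft leader tracks
// for one follower: the last raft index it has acknowledged and when it
// was last heard from. These names are illustrative, not CockroachDB's.
type followerState struct {
	matchIndex uint64
	lastActive time.Time
}

// truncationIndex sketches the decision described in the commit message.
// It returns the highest raft index that may be truncated (0 means "don't
// truncate yet"):
//   - If every follower has been active within activityWindow, truncate
//     only up to the slowest follower's acknowledged index; the quota pool
//     keeps the log from growing without bound in this case.
//   - If some follower has not been active recently, hold off on any
//     truncation until the log exceeds maxLogSizeBytes, and only then
//     truncate up to the commit index, accepting that the missing
//     follower will need a snapshot when it returns.
func truncationIndex(
	followers []followerState,
	commitIndex uint64,
	logSizeBytes, maxLogSizeBytes int64,
	activityWindow time.Duration,
	now time.Time,
) uint64 {
	allActive := true
	minMatch := commitIndex
	for _, f := range followers {
		if now.Sub(f.lastActive) > activityWindow {
			allActive = false
		}
		if f.matchIndex < minMatch {
			minMatch = f.matchIndex
		}
	}

	if allActive {
		// Never truncate past a live follower's position.
		return minMatch
	}
	if logSizeBytes < maxLogSizeBytes {
		// A follower is missing but the log is still small enough to keep:
		// retain the entries so it can catch up without a snapshot.
		return 0
	}
	// The log grew past the size threshold while a follower was missing;
	// truncate anyway and let that follower catch up via snapshot.
	return commitIndex
}

func main() {
	now := time.Now()
	followers := []followerState{
		{matchIndex: 95, lastActive: now.Add(-2 * time.Second)},
		{matchIndex: 80, lastActive: now.Add(-2 * time.Minute)}, // down for 2 minutes
	}
	idx := truncationIndex(followers, 100, 4<<20, 64<<20, 10*time.Second, now)
	fmt.Println("truncate up to index:", idx) // 0: hold off, the log is still small
}
```

The behavioral change relative to the old policy is the middle branch: a follower that has been unreachable for more than the activity window no longer triggers immediate truncation; truncation is deferred until keeping the log becomes expensive, trading some disk space for avoiding a snapshot when the follower returns.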