
storage: truncate raft log less aggressively when replica is missing #38484

Merged (2 commits) on Jul 2, 2019

Conversation

@danhhz (Contributor) commented Jun 27, 2019

Previously, we'd hold off on truncating the raft log if a replica was
missing but contactable in the last 10 seconds. This meant that if a
node was down for more than 10 seconds, there was a good chance that
we'd truncate logs for some of its replicas (especially for its
high-traffic ones) and it would need snapshots for them when it came
back up.

This was for two reasons. First, we've historically assumed that it's
cheaper to catch a far-behind node up with a snapshot than with entries.
Second, when we replicated the raft log truncation, we'd have to include
it in snapshots anyway, but that's changed recently.

The problem is when a node is down for a short time but more than 10
seconds (think a couple minutes). It might come back up to find that it
needs a snapshot for most ranges. We rate limit snapshots fairly
aggressively because they've been disruptive in the past, but this means
that it could potentially take hours for a node to recover from a 2
minute outage.

This would be merely unfortunate if there wasn't a second compounding
issue. A recently restarted node has a cold leaseholder cache. When it
gets traffic for one of its replicas, it first tries itself as the
leaseholder (maybe it will get lucky and won't need the network hop).
Unfortunately, if the replica needs a snapshot, this decision currently
blocks on it. This means that requests sent to the recently started node
could block for as long as the heavily-throttled snapshots take, hours
or even days.

Short outages of more than 10 seconds are reasonably common with routine
maintenance (rolling to a new version, swapping hardware, etc), so it's
likely that customers will hit this (and one did).

This commit ties the threshold that the raft log queue uses for recent
activity to server.time_until_store_dead, which is already the
threshold where we give up on a down node and start transferring away
its replicas. It defaults to 5 minutes, which is still short, but it's
more reasonable than 10 seconds. Crucially, it is also a cluster
setting, so it can be overridden.
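
As a rough illustration of the idea (a sketch, not the exact change in this PR: recentActiveThreshold is a made-up helper, while TimeUntilStoreDead is the pkg/storage setting backing server.time_until_store_dead):

	package storage

	import (
		"time"

		"github.com/cockroachdb/cockroach/pkg/settings/cluster"
	)

	// recentActiveThreshold derives the raft log queue's "recently active"
	// window from the same cluster setting that decides when a down store is
	// considered dead, instead of a hard-coded 10s constant.
	func recentActiveThreshold(st *cluster.Settings) time.Duration {
		return TimeUntilStoreDead.Get(&st.SV)
	}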

We'd like to move even further in the direction of leniency about raft
log truncation while a replica is missing, but this is a compromise
that's intended to be less risky to backport to 19.1.x.

Partial mitigation for #37906

Potentially also helps with #36879

Release note (bug fix): Nodes that have been down now recover more
quickly when they rejoin, assuming they weren't down for much more
than the value of the server.time_until_store_dead cluster setting
(which defaults to 5 minutes).

@danhhz danhhz requested review from tbg and a team June 27, 2019 02:47
@cockroach-teamcity (Member): This change is Reviewable

@danhhz (Contributor, Author) commented Jun 27, 2019

I think this still needs a little work. The roachtest is a transcription of what I was doing manually, and the first time I ran it, I saw more raft log truncations than I had been seeing during my manual testing. But I ran out of time for the day and figured I'd get out what I had.

@danhhz (Contributor, Author) commented Jun 27, 2019

Oh, never mind: the raft log queue successes I was seeing in the graph were times it queued because it wanted to recompute the size. It wasn't truncating anything.

@danhhz (Contributor, Author) commented Jun 27, 2019

Before: expected to recover within 2m0s took 13m4.808935s
After: connecting and query finished in 12.962869s

@tbg (Member) left a comment

I think we've diverged on the course of action between

#37906 (comment)

and

#37906 (comment)

("don't truncate when a follower is down except when Raft log is too large" vs "don't truncate for five minutes or when the Raft log is too large")

We should figure out which one it's going to be. My argument for removing the timeout is that with the timeout, we're fixing the problem for <5-minute outages but it's still there for >5-minute outages. It's quite awkward from the user's perspective to know (or worse, not know) that there's a clock running against you.

Now, there could be an argument that doing the timeout-based thing is simpler and fixes part of the problem, and is thus more likely to be useful for inclusion in a short-notice patch release, but I'm not so sure about that. The main reason I think that is this code:

} else if !input.LogTooLarge() && decision.NewFirstIndex > progress.Match {
	decision.NewFirstIndex = progress.Match
	decision.ChosenVia = truncatableIndexChosenViaFollowers
}

which is broken with the code in this PR as written. Essentially, while a node is "RecentActive", which is now five minutes, the truncation is super-cooperative in that it will never cut the dead node off, no matter how large the Raft log gets. We now move this window from 10s to 5m, which I think we agree isn't a regime we want to make accessible in a patch release.

So no matter which option we take, we have to change the truncations somewhat. They must be super-cooperative (as "defined" above) only for a short period of time (probably the current period of 10s because that matches the quota pool which protects from runaway Raft log). After that, they remain "pretty cooperative", taking the dead node's progress into account, unless the Raft log is too large, in which case they stop doing that.

All of that boils down to changing

if !progress.RecentActive {

to if !progress.RecentActive && !input.LogTooLarge() { (and adjust the comment)

Because then

  1. If recently active (remains at 10s), behave like today.
  2. If not recently active and the log is not too large, never cut off the follower (dead or alive).
  3. If not recently active and the log is too large, cut off all not-recently-active followers if we have to.

This seems like a remarkably straightforward change, but as always there's a snag (hacks beget hacks):

if progress.State == raft.ProgressStateProbe {

A down follower will end up in Probing state pretty quickly, and the heuristics make the assumption (ill-founded after changing the line above) that a probing node is definitely online. So that line above should also get a && progress.RecentActive.

At the end of all this, it implements my suggestion (we don't care how long the node is dead; the problem will resolve itself when the replica is removed from it or the log gets too large). We could stuff the 5m timeout in there, but it feels pretty unnatural and seems to make things worse, not better.
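
For concreteness, a minimal sketch of that heuristic, reusing the names from the snippets quoted above (the loop over input.RaftStatus.Progress is an assumption about the surrounding code, and this is not the exact diff):

	for _, progress := range input.RaftStatus.Progress {
		// NB: per the snag above, the separate ProgressStateProbe branch (not
		// shown here) should also be gated on progress.RecentActive.
		if progress.RecentActive || !input.LogTooLarge() {
			// 1. A recently active follower is never cut off, no matter how
			//    large the log gets.
			// 2. An inactive follower is still protected while the log is small.
			// 3. Once the log is too large, inactive followers get cut off.
			if decision.NewFirstIndex > progress.Match {
				decision.NewFirstIndex = progress.Match
				decision.ChosenVia = truncatableIndexChosenViaFollowers
			}
		}
	}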

WDYT?

This was for two reasons. First, we've historically assumed that it's
cheaper to catch a far-behind node up with a snapshot than with entries.
Second, when we replicated the raft log truncation, we'd have to include
it in snapshots anyway, but that's changed recently.

I'd change the second half of that to

Second, snapshots historically had to include the Raft log which implied a need to keep the size of the Raft log tightly controlled due to being pulled into memory at the snapshot receiver, but that's changed recently.

We never "replicated" raft log truncation (rather, we made a poor judgment call in setting it up like that to begin with), and in my mind the main problem with that was always having to pull it into memory at the recipient.

Reviewed 8 of 8 files at r1.

@danhhz (Contributor, Author) commented Jun 27, 2019

I don't think we've diverged in spirit; you're probably reading too much intention into what was really me misunderstanding how this code worked as I rushed to get something out.

I was going for exactly what you've said in that comment (and in this one), except indeed with the 5m timeout. My thinking was that it doesn't fix anything over 5m, but it's a cluster setting, so that 5m is adjustable if you're scheduling downtime. That'd be the absolute safest thing we could get into 19.1.3.

I was thinking we should immediately follow up by removing the 5m and getting that into some later 19.1.x release.

My worry here is that the 10s threshold has historically kicked in so fast that I'm nervous about how often the other heuristics are actually exercised. In particular, I thought you mentioned something about store fullness in our 1:1 discussion of this, but now that I look, I don't see that in the code. I really just haven't had time to understand this code in detail yet, which makes it difficult for me to have an informed and detailed opinion.

Anyway, you've historically been much more conservative about stability implications than I have, and I certainly don't fully grok the practical risks involved here yet, so I'm happy to defer to your judgement that jumping straight to no timeout is fine. I'll update the PR as you mention.

@tbg (Member) left a comment

I thought you mentioned something about store fullness in our 1:1

I might've mentioned that, but I wasn't meaning to suggest that that's something that exists today or should exist in the patch release. Store fullness as a criterion becomes important if we reach the point at which we're willing to let the Raft log get "very large" because we decide that the benefit outweighs the cost. We're not anywhere close to that; right now we're still talking "truncate after a few MB", so we don't have to worry about disk usage just yet.

I was going for exactly what you've said in that comment (and in this one), except indeed with the 5m timeout. My thinking was that it doesn't fix anything over 5m, but it's a cluster setting, so that 5m is adjustable if you're scheduling downtime. That'd be the absolute safest thing we could get into 19.1.3.

I am sympathetic to containing the risk. The only difference I can see between the two approaches is that mine will potentially give you more ranges that have a few MBs of logs. Maybe that can cause an issue? Definitely not worth finding out in 19.1.3.

I think you'll need all of the code changes I mentioned anyway (i.e. there will be an "is this node recently active" check and an "is that node inactive for at least the dead duration" check), so what I'm proposing should precisely be the "dead duration = infinity when in the log queue" limit of your code. If that roughly checks out and you have a preference for including the duration, go ahead and do it. I'm slightly worried that plumbing yet another number around is a bit of a hassle and that we'll have to be careful we're not shooting ourselves in the foot (i.e. by adding a few more tests), but I expect you'll be doing that anyway, since it seems we're also going to change the impact of the LogTooLarge parameter on the heuristics.


@danhhz (Contributor, Author) commented Jun 27, 2019

Okay, I've updated this PR to do what we've discussed, without the 5m timeout. The sad news is that it doesn't work: there are tons of ranges larger than the current 4MB limit that get truncated as soon as they fall out of the 10s window during the roachtest downtime.

Should we be using a larger size limit while a node is down? In #37906 (comment), you mention "if the Raft log is excessively large (say > 2x larger than a snapshot would be, and at least a few MB - maybe much more; tuning this is TBD), sacrifice the down followers (i.e. truncate the log)". Was that a suggestion for the long term, or something that you feel is backportable into 19.1.3?

I190627 22:06:38.247073 19331 storage/raft_log_queue.go:577  [n1,raftlog,s1,r83/1:/Table/61/1/2{5/9/17…-7/9/525}] should truncate: true [truncate 4 entries to first index 23 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:06:46.285351 28049 storage/raft_log_queue.go:577  [n1,raftlog,s1,r82/1:/Table/61/1/6{5/9/14…-7/9/208}] should truncate: true [truncate 3 entries to first index 20 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:06:48.285883 28097 storage/raft_log_queue.go:577  [n1,raftlog,s1,r80/1:/Table/61/1/6{1/10/8…-3/9/26…}] should truncate: true [truncate 3 entries to first index 23 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:06:53.850823 36867 storage/raft_log_queue.go:577  [n1,raftlog,s1,r103/1:/Table/60/1/{3/13867-4/18488}] should truncate: true [truncate 3 entries to first index 18 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:06:55.115486 39172 storage/raft_log_queue.go:577  [n1,raftlog,s1,r25/1:/Table/5{5-6}] should truncate: true [truncate 5 entries to first index 16 (chosen via: followers)]
I190627 22:06:58.675756 46811 storage/raft_log_queue.go:577  [n1,raftlog,s1,r61/1:/Table/59/1/{39/6/-…-79/1/-…}] should truncate: true [truncate 4 entries to first index 23 (chosen via: followers)]
I190627 22:06:59.075528 47215 storage/raft_log_queue.go:577  [n1,raftlog,s1,r199/1:/Table/60/1/6{0/81464-1/86064}] should truncate: true [truncate 26 entries to first index 43 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:06:59.092497 47235 storage/raft_log_queue.go:577  [n1,raftlog,s1,r56/1:/Table/60/1/5{4/39540-5/44186}] should truncate: true [truncate 39 entries to first index 56 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:00.684390 50581 storage/raft_log_queue.go:577  [n1,raftlog,s1,r178/1:/Table/57/1/4{4/7/-1…-6/7/-2…}] should truncate: true [truncate 24 entries to first index 36 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:02.344022 54200 storage/raft_log_queue.go:577  [n1,raftlog,s1,r207/1:/Table/60/1/9{1/56489-2/61096}] should truncate: true [truncate 35 entries to first index 52 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:03.273847 56288 storage/raft_log_queue.go:577  [n1,raftlog,s1,r128/1:/Table/60/1/1{2/72157-3/76755}] should truncate: true [truncate 28 entries to first index 45 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:03.709008 56934 storage/raft_log_queue.go:577  [n1,raftlog,s1,r258/1:/Table/60/1/9{7/83524-8/88128}] should truncate: true [truncate 30 entries to first index 45 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:04.650213 59668 storage/raft_log_queue.go:577  [n1,raftlog,s1,r249/1:/Table/57/1/7{5/3/-2…-7/2/-2…}] should truncate: true [truncate 28 entries to first index 46 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:04.885852 59942 storage/raft_log_queue.go:577  [n1,raftlog,s1,r187/1:/Table/60/1/8{0/28293-1/32922}] should truncate: true [truncate 29 entries to first index 52 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:05.153087 60262 storage/raft_log_queue.go:577  [n1,raftlog,s1,r181/1:/Table/60/1/4{6/2579-7/7197}] should truncate: true [truncate 25 entries to first index 42 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:05.250531 60383 storage/raft_log_queue.go:577  [n1,raftlog,s1,r113/1:/Table/60/1/{7/30881-8/35518}] should truncate: true [truncate 4 entries to first index 31 (chosen via: followers)]
I190627 22:07:07.139022 64189 storage/raft_log_queue.go:577  [n1,raftlog,s1,r225/1:/Table/57/1/9{2/3/-1…-4/2/-7…}] should truncate: true [truncate 39 entries to first index 54 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:07.196871 64304 storage/raft_log_queue.go:577  [n1,raftlog,s1,r247/1:/Table/57/1/7{3/3/-2…-5/3/-2…}] should truncate: true [truncate 44 entries to first index 62 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:07.498510 65032 storage/raft_log_queue.go:577  [n1,raftlog,s1,r259/1:/Table/57/1/6{0/4/-2…-2/3/-1…}] should truncate: true [truncate 48 entries to first index 65 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:07.510621 65021 storage/raft_log_queue.go:577  [n1,raftlog,s1,r136/1:/Table/61/1/5{1/7/13…-3/7/78}] should truncate: true [truncate 63 entries to first index 77 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:07:07.798300 65608 storage/raft_log_queue.go:577  [n1,raftlog,s1,r177/1:/Table/60/1/8{2/7927-3/12559}] should truncate: true [truncate 30 entries to first index 45 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:07.916871 65863 storage/raft_log_queue.go:577  [n1,raftlog,s1,r204/1:/Table/60/1/7{4/15082-5/19695}] should truncate: true [truncate 34 entries to first index 51 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:08.015045 66185 storage/raft_log_queue.go:577  [n1,raftlog,s1,r244/1:/Table/57/1/3{7/8/-1…-9/8/-2…}] should truncate: true [truncate 33 entries to first index 50 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:08.032318 66276 storage/raft_log_queue.go:577  [n1,raftlog,s1,r271/1:/Table/57/1/6{2/3/-1…-4/2/-5…}] should truncate: true [truncate 36 entries to first index 53 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:08.257849 66585 storage/raft_log_queue.go:577  [n1,raftlog,s1,r200/1:/Table/57/1/{9/7/-1…-11/6/-…}] should truncate: true [truncate 51 entries to first index 76 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:08.377245 66688 storage/raft_log_queue.go:577  [n1,raftlog,s1,r180/1:/Table/60/1/8{4/17169-5/21792}] should truncate: true [truncate 30 entries to first index 47 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:08.725875 67369 storage/raft_log_queue.go:577  [n1,raftlog,s1,r46/1:/Table/54/1/{59/"\x…-74/"\x…}] should truncate: true [truncate 6 entries to first index 25 (chosen via: followers)]
I190627 22:07:08.749852 67429 storage/raft_log_queue.go:577  [n1,raftlog,s1,r60/1:/Table/60/1/7{1/1240-2/5855}] should truncate: true [truncate 23 entries to first index 41 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:09.013368 67891 storage/raft_log_queue.go:577  [n1,raftlog,s1,r107/1:/Table/61/1/4{5/8/21…-7/8/857}] should truncate: true [truncate 43 entries to first index 58 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:07:09.068036 68150 storage/raft_log_queue.go:577  [n1,raftlog,s1,r146/1:/Table/61/1/8{0/8/710-2/7/24…}] should truncate: true [truncate 39 entries to first index 57 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:07:09.202782 68379 storage/raft_log_queue.go:577  [n1,raftlog,s1,r172/1:/Table/60/1/{39/121…-40/273}] should truncate: true [truncate 28 entries to first index 45 (chosen via: quorum); log too large (28 MiB > 4.0 MiB)]
I190627 22:07:09.462028 69074 storage/raft_log_queue.go:577  [n1,raftlog,s1,r202/1:/Table/60/1/7{2/5855-3/10462}] should truncate: true [truncate 34 entries to first index 49 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:09.516353 69133 storage/raft_log_queue.go:577  [n1,raftlog,s1,r211/1:/Table/60/1/6{1/86064-2/90664}] should truncate: true [truncate 32 entries to first index 49 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:09.527323 69139 storage/raft_log_queue.go:577  [n1,raftlog,s1,r44/1:/Table/54/1/{15/"&y…-29/"L\…}] should truncate: true [truncate 4 entries to first index 15 (chosen via: followers)]
I190627 22:07:09.579469 69331 storage/raft_log_queue.go:577  [n1,raftlog,s1,r276/1:/Table/57/1/{69/10/…-71/4/-…}] should truncate: true [truncate 33 entries to first index 50 (chosen via: quorum); log too large (18 MiB > 4.0 MiB)]
I190627 22:07:09.995765 69987 storage/raft_log_queue.go:577  [n1,raftlog,s1,r228/1:/Table/57/1/9{7/10/-…-8/10/-…}] should truncate: true [truncate 44 entries to first index 59 (chosen via: quorum); log too large (12 MiB > 4.0 MiB)]
I190627 22:07:10.415306 70629 storage/raft_log_queue.go:577  [n1,raftlog,s1,r260/1:/Table/60/1/9{8/88128-9/69871}] should truncate: true [truncate 45 entries to first index 62 (chosen via: quorum); log too large (26 MiB > 4.0 MiB)]
I190627 22:07:10.891188 71686 storage/raft_log_queue.go:577  [n1,raftlog,s1,r175/1:/Table/57/1/{7/8/-2…-9/7/-1…}] should truncate: true [truncate 42 entries to first index 54 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:11.188821 72427 storage/raft_log_queue.go:577  [n1,raftlog,s1,r76/1:/Table/61/1/{39/9/1…-41/8/2…}] should truncate: true [truncate 4 entries to first index 21 (chosen via: followers)]
I190627 22:07:11.718932 73275 storage/raft_log_queue.go:577  [n1,raftlog,s1,r253/1:/Table/60/1/9{5/74915-6/78928}] should truncate: true [truncate 41 entries to first index 57 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:11.876554 73411 storage/raft_log_queue.go:577  [n1,raftlog,s1,r144/1:/Table/60/1/4{4/97961-6/2579}] should truncate: true [truncate 37 entries to first index 59 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:11.910626 73471 storage/raft_log_queue.go:577  [n1,raftlog,s1,r140/1:/Table/60/1/2{0/54525-1/59129}] should truncate: true [truncate 42 entries to first index 58 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:12.015883 73740 storage/raft_log_queue.go:577  [n1,raftlog,s1,r251/1:/Table/60/1/9{4/70317-5/74915}] should truncate: true [truncate 33 entries to first index 50 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:12.336867 74537 storage/raft_log_queue.go:577  [n1,raftlog,s1,r201/1:/Table/60/1/{69/966…-71/1240}] should truncate: true [truncate 29 entries to first index 41 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:13.388553 76854 storage/raft_log_queue.go:577  [n1,raftlog,s1,r234/1:/Table/57/1/3{1/10/-…-3/9/-5…}] should truncate: true [truncate 40 entries to first index 57 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:13.532842 76936 storage/raft_log_queue.go:577  [n1,raftlog,s1,r120/1:/Table/61/1/9{5/7/897-7/6/27…}] should truncate: true [truncate 61 entries to first index 77 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:07:13.534668 76968 storage/raft_log_queue.go:577  [n1,raftlog,s1,r156/1:/Table/60/1/{29/963…-31/914}] should truncate: true [truncate 26 entries to first index 48 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:13.761636 77380 storage/raft_log_queue.go:577  [n1,raftlog,s1,r261/1:/Table/57/1/7{7/2/-2…-9/1/-2…}] should truncate: true [truncate 56 entries to first index 73 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:13.795031 77390 storage/raft_log_queue.go:577  [n1,raftlog,s1,r84/1:/Table/61/1/2{7/9/525-9/8/23…}] should truncate: true [truncate 57 entries to first index 75 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:07:13.830113 77459 storage/raft_log_queue.go:577  [n1,raftlog,s1,r142/1:/Table/60/1/1{3/76755-4/81362}] should truncate: true [truncate 72 entries to first index 88 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:13.958887 77871 storage/raft_log_queue.go:577  [n1,raftlog,s1,r134/1:/Table/61/1/{49/7/2…-51/7/1…}] should truncate: true [truncate 50 entries to first index 65 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:07:14.402879 78779 storage/raft_log_queue.go:577  [n1,raftlog,s1,r109/1:/Table/61/1/4{7/8/857-9/7/25…}] should truncate: true [truncate 69 entries to first index 85 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:07:14.515087 79155 storage/raft_log_queue.go:577  [n1,raftlog,s1,r254/1:/Table/60/1/9{6/78928-7/83524}] should truncate: true [truncate 40 entries to first index 55 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:14.532131 78888 storage/raft_log_queue.go:577  [n1,raftlog,s1,r274/1:/Table/57/1/6{6/1/-2…-8/1/-2…}] should truncate: true [truncate 25 entries to first index 40 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:14.535870 78890 storage/raft_log_queue.go:577  [n1,raftlog,s1,r196/1:/Table/57/1/{5/6/-1…-7/5/-6…}] should truncate: true [truncate 47 entries to first index 70 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:14.594667 79383 storage/raft_log_queue.go:577  [n1,raftlog,s1,r219/1:/Table/57/1/2{0/9/-8…-2/9/-2…}] should truncate: true [truncate 37 entries to first index 54 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:14.704772 79658 storage/raft_log_queue.go:577  [n1,raftlog,s1,r227/1:/Table/57/1/9{6/1/-5…-7/10/-…}] should truncate: true [truncate 48 entries to first index 65 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:14.772354 79807 storage/raft_log_queue.go:577  [n1,raftlog,s1,r263/1:/Table/57/1/{79/1/-…-80/10/…}] should truncate: true [truncate 41 entries to first index 57 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:14.793502 79848 storage/raft_log_queue.go:577  [n1,raftlog,s1,r110/1:/Table/60/1/1{6/36038-7/40662}] should truncate: true [truncate 43 entries to first index 58 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:15.820468 81441 storage/raft_log_queue.go:577  [n1,raftlog,s1,r188/1:/Table/60/1/{49/164…-50/210…}] should truncate: true [truncate 40 entries to first index 60 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:15.917991 81630 storage/raft_log_queue.go:577  [n1,raftlog,s1,r39/1:/Table/61/1/1{3/1/28…-5/1/16…}] should truncate: true [truncate 2 entries to first index 13 (chosen via: followers)]
I190627 22:07:16.255723 82473 storage/raft_log_queue.go:577  [n1,raftlog,s1,r118/1:/Table/60/1/1{1/67550-2/72157}] should truncate: true [truncate 4 entries to first index 39 (chosen via: followers)]
I190627 22:07:16.389210 82741 storage/raft_log_queue.go:577  [n1,raftlog,s1,r127/1:/Table/60/1/4{1/4884-2/9507}] should truncate: true [truncate 34 entries to first index 51 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:16.421114 82791 storage/raft_log_queue.go:577  [n1,raftlog,s1,r159/1:/Table/60/1/3{5/98256-7/2880}] should truncate: true [truncate 40 entries to first index 56 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:16.716081 83494 storage/raft_log_queue.go:577  [n1,raftlog,s1,r218/1:/Table/57/1/2{8/1/-1…-9/10/-…}] should truncate: true [truncate 44 entries to first index 61 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:17.229670 83535 storage/raft_log_queue.go:577  [n1,raftlog,s1,r267/1:/Table/57/1/8{2/10/-…-4/5/-2…}] should truncate: true [truncate 30 entries to first index 47 (chosen via: quorum); log too large (18 MiB > 4.0 MiB)]
I190627 22:07:17.284987 83974 storage/raft_log_queue.go:577  [n1,raftlog,s1,r135/1:/Table/60/1/1{8/45287-9/49915}] should truncate: true [truncate 37 entries to first index 53 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:17.371468 84200 storage/raft_log_queue.go:577  [n1,raftlog,s1,r250/1:/Table/57/1/5{2/5/-2…-4/4/-8…}] should truncate: true [truncate 63 entries to first index 80 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:17.581771 84757 storage/raft_log_queue.go:577  [n1,raftlog,s1,r48/1:/Table/5{4/1/89/"…-5}] should truncate: true [truncate 3 entries to first index 19 (chosen via: followers)]
I190627 22:07:17.774277 85112 storage/raft_log_queue.go:577  [n1,raftlog,s1,r148/1:/Table/60/1/6{4/33081-5/37693}] should truncate: true [truncate 38 entries to first index 55 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:17.981578 85736 storage/raft_log_queue.go:577  [n1,raftlog,s1,r160/1:/Table/60/1/3{7/2880-8/7500}] should truncate: true [truncate 35 entries to first index 51 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:18.225947 86337 storage/raft_log_queue.go:577  [n1,raftlog,s1,r198/1:/Table/57/1/{1/10/-…-3/7/-2…}] should truncate: true [truncate 48 entries to first index 67 (chosen via: quorum); log too large (21 MiB > 4.0 MiB)]
I190627 22:07:18.935365 87161 storage/raft_log_queue.go:577  [n1,raftlog,s1,r130/1:/Table/60/1/4{3/14127-4/18745}] should truncate: true [truncate 44 entries to first index 61 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:19.001543 87333 storage/raft_log_queue.go:577  [n1,raftlog,s1,r168/1:/Table/60/1/32/{5533-84448}] should truncate: true [truncate 35 entries to first index 51 (chosen via: quorum); log too large (26 MiB > 4.0 MiB)]
I190627 22:07:19.002493 87335 storage/raft_log_queue.go:577  [n1,raftlog,s1,r210/1:/Table/60/1/9{3/65725-4/70317}] should truncate: true [truncate 41 entries to first index 57 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:19.166579 87758 storage/raft_log_queue.go:577  [n1,raftlog,s1,r141/1:/Table/60/1/44/{18745-97961}] should truncate: true [truncate 35 entries to first index 53 (chosen via: quorum); log too large (26 MiB > 4.0 MiB)]
I190627 22:07:19.253234 88073 storage/raft_log_queue.go:577  [n1,raftlog,s1,r255/1:/Table/57/1/5{6/5/-2…-8/4/-1…}] should truncate: true [truncate 52 entries to first index 67 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:19.716212 89309 storage/raft_log_queue.go:577  [n1,raftlog,s1,r275/1:/Table/57/1/6{8/1/-2…-9/10/-…}] should truncate: true [truncate 33 entries to first index 48 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:20.223321 90386 storage/raft_log_queue.go:577  [n1,raftlog,s1,r133/1:/Table/60/1/1{7/40662-8/45287}] should truncate: true [truncate 56 entries to first index 74 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:21.007768 91564 storage/raft_log_queue.go:577  [n1,raftlog,s1,r102/1:/Table/61/1/4{3/9/294-5/8/21…}] should truncate: true [truncate 4 entries to first index 21 (chosen via: followers)]
I190627 22:07:21.567367 93016 storage/raft_log_queue.go:577  [n1,raftlog,s1,r150/1:/Table/60/1/6{5/37693-6/42311}] should truncate: true [truncate 42 entries to first index 59 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:21.704082 93355 storage/raft_log_queue.go:577  [n1,raftlog,s1,r152/1:/Table/60/1/2{7/87120-8/91722}] should truncate: true [truncate 46 entries to first index 61 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:21.760458 93467 storage/raft_log_queue.go:577  [n1,raftlog,s1,r203/1:/Table/60/1/7{3/10462-4/15082}] should truncate: true [truncate 44 entries to first index 61 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:22.953117 95656 storage/raft_log_queue.go:577  [n1,raftlog,s1,r171/1:/Table/60/1/3{8/7500-9/12119}] should truncate: true [truncate 47 entries to first index 64 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:22.965604 95692 storage/raft_log_queue.go:577  [n1,raftlog,s1,r269/1:/Table/57/1/{88/5/-…-90/4/-…}] should truncate: true [truncate 48 entries to first index 67 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:24.313747 98187 storage/raft_log_queue.go:577  [n1,raftlog,s1,r167/1:/Table/60/1/6{8/50580-9/55193}] should truncate: true [truncate 38 entries to first index 62 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:24.402963 98439 storage/raft_log_queue.go:577  [n1,raftlog,s1,r94/1:/Table/61/1/7{4/9/13…-6/9/113}] should truncate: true [truncate 4 entries to first index 21 (chosen via: followers)]
I190627 22:07:24.986739 100036 storage/raft_log_queue.go:577  [n1,raftlog,s1,r85/1:/Table/61/1/6{7/9/208-9/8/20…}] should truncate: true [truncate 62 entries to first index 79 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:07:25.055744 100050 storage/raft_log_queue.go:577  [n1,raftlog,s1,r158/1:/Table/60/1/3{4/93658-5/98256}] should truncate: true [truncate 40 entries to first index 56 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:25.072774 100293 storage/raft_log_queue.go:577  [n1,raftlog,s1,r233/1:/Table/57/1/1{5/5/-6…-6/10/-…}] should truncate: true [truncate 42 entries to first index 59 (chosen via: quorum); log too large (19 MiB > 4.0 MiB)]
I190627 22:07:25.532012 101072 storage/raft_log_queue.go:577  [n1,raftlog,s1,r38/1:/Table/61/1/1{8/1/718-9/10/2…}] should truncate: true [truncate 70 entries to first index 97 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:07:26.076855 102189 storage/raft_log_queue.go:577  [n1,raftlog,s1,r96/1:/Table/61/1/7{6/9/210-8/8/19…}] should truncate: true [truncate 4 entries to first index 31 (chosen via: followers)]
I190627 22:07:26.733543 103878 storage/raft_log_queue.go:577  [n1,raftlog,s1,r149/1:/Table/61/1/8{2/7/24…-4/7/11…}] should truncate: true [truncate 64 entries to first index 80 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:07:26.767626 103888 storage/raft_log_queue.go:577  [n1,raftlog,s1,r41/1:/Table/53{-/1/54/1}] should truncate: true [truncate 12253 entries to first index 12264 (chosen via: quorum); log too large (4.0 MiB > 4.0 MiB)]
I190627 22:07:27.755498 105603 storage/raft_log_queue.go:577  [n1,raftlog,s1,r237/1:/Table/57/1/3{3/9/-5…-5/9/-2…}] should truncate: true [truncate 63 entries to first index 80 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:28.514236 107487 storage/raft_log_queue.go:577  [n1,raftlog,s1,r170/1:/Table/60/1/7{6/51801-7/56418}] should truncate: true [truncate 83 entries to first index 99 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:30.419055 111482 storage/raft_log_queue.go:577  [n1,raftlog,s1,r121/1:/Table/60/1/2{4/88684-5/93284}] should truncate: true [truncate 48 entries to first index 65 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:33.680632 117828 storage/raft_log_queue.go:577  [n1,raftlog,s1,r163/1:/Table/60/1/6{7/46934-8/50580}] should truncate: true [truncate 46 entries to first index 63 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:35.051610 120307 storage/raft_log_queue.go:577  [n1,raftlog,s1,r91/1:/Table/61/1/4{1/9/15…-3/9/294}] should truncate: true [truncate 74 entries to first index 92 (chosen via: quorum); log too large (33 MiB > 4.0 MiB)]
I190627 22:07:35.276196 120843 storage/raft_log_queue.go:577  [n1,raftlog,s1,r157/1:/Table/60/1/3{3/89063-4/93658}] should truncate: true [truncate 42 entries to first index 57 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:36.135407 122733 storage/raft_log_queue.go:577  [n1,raftlog,s1,r161/1:/Table/60/1/6{6/42311-7/46934}] should truncate: true [truncate 50 entries to first index 68 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]
I190627 22:07:41.288571 133761 storage/raft_log_queue.go:577  [n1,raftlog,s1,r24/1:/Table/57{-/1/1/10/…}] should truncate: true [truncate 48 entries to first index 86 (chosen via: quorum); log too large (24 MiB > 4.0 MiB)]
I190627 22:07:42.175631 135517 storage/raft_log_queue.go:577  [n1,raftlog,s1,r87/1:/Table/61/1/{69/8/2…-70/10/…}] should truncate: true [truncate 4 entries to first index 24 (chosen via: followers)]
I190627 22:07:44.430069 141520 storage/raft_log_queue.go:577  [n1,raftlog,s1,r185/1:/Table/60/1/{79/656…-80/282…}] should truncate: true [truncate 78 entries to first index 94 (chosen via: quorum); log too large (20 MiB > 4.0 MiB)]
I190627 22:07:45.402069 143010 storage/raft_log_queue.go:577  [n1,raftlog,s1,r164/1:/Table/61/1/8{6/6/29…-7/8/28…}] should truncate: true [truncate 79 entries to first index 97 (chosen via: quorum); log too large (20 MiB > 4.0 MiB)]
I190627 22:07:46.439351 145930 storage/raft_log_queue.go:577  [n1,raftlog,s1,r191/1:/Table/60/1/8{8/9369…-9/47235}] should truncate: true [truncate 64 entries to first index 81 (chosen via: quorum); log too large (17 MiB > 4.0 MiB)]
I190627 22:07:47.394119 148038 storage/raft_log_queue.go:577  [n1,raftlog,s1,r40/1:/Table/61/1/3{4/9/25…-5/10/4…}] should truncate: true [truncate 4 entries to first index 32 (chosen via: followers)]
I190627 22:07:50.964429 157459 storage/raft_log_queue.go:577  [n1,raftlog,s1,r213/1:/Table/60/1/6{2/90664-3/28451}] should truncate: true [truncate 76 entries to first index 92 (chosen via: quorum); log too large (12 MiB > 4.0 MiB)]
I190627 22:07:52.515959 161336 storage/raft_log_queue.go:577  [n1,raftlog,s1,r115/1:/Table/60/1/11/{33585-67550}] should truncate: true [truncate 6 entries to first index 25 (chosen via: followers)]
I190627 22:08:07.860932 198796 storage/raft_log_queue.go:577  [n1,raftlog,s1,r291/1:/Table/53/1/{54/1-86/6/0}] should truncate: true [truncate 11714 entries to first index 11725 (chosen via: quorum); log too large (4.0 MiB > 4.0 MiB)]
I190627 22:08:16.330629 219048 storage/raft_log_queue.go:577  [n1,raftlog,s1,r138/1:/Table/61/1/5{3/7/78-4/1/28…}] should truncate: true [truncate 96 entries to first index 113 (chosen via: quorum); log too large (8.4 MiB > 4.0 MiB)]
I190627 22:08:22.373529 234240 storage/raft_log_queue.go:577  [n1,raftlog,s1,r174/1:/Table/57/1/7/{5/-69…-8/-26…}] should truncate: true [truncate 2 entries to first index 17 (chosen via: followers)]

@danhhz (Contributor, Author) commented Jun 27, 2019

The roachtest finally finished and the query ended up taking 9m39s.

@tbg (Member) left a comment

Those messages that have non-cooperative truncations ("chosen via: quorum") all have... super large logs, with very few exceptions. For example:

I190627 22:07:30.419055 111482 storage/raft_log_queue.go:577 [n1,raftlog,s1,r121/1:/Table/60/1/2{4/88684-5/93284}] should truncate: true [truncate 48 entries to first index 65 (chosen via: quorum); log too large (34 MiB > 4.0 MiB)]

That's 0.7 MB per entry. I'm willing to bet that there's at least one AddSSTable early in the log that hasn't been truncated yet from the IMPORT in the test. That's sort of good news, because it means this isn't necessarily representative of the common case we're trying to fix right now. At 34 MB, you're also at least in the general area of where a snapshot makes sense, so this might be working as intended.

Try a run in which you wait for 11 minutes after the IMPORT finishes before killing the node. The difference is that in 10 minutes, the scanner should've suggested every range to the log queue at least once, and all of the SSTs should have been truncated away, leaving only small logs behind. If that passes, I think we're in a decent enough place for now.

Also try a run with a 1h downtime (or however long it takes for all ranges to be off the dead node, hopefully much less). We'll want to see that all logs get truncated regularly (if they see writes), using a cooperative approach (i.e. logs never hit 4mb).

We need some due diligence on the quota pool issue Ben mentioned. He was worried that when a node comes back up and has to catch up on (say) 4mb of Raft log, the quota pool on the leader may start blocking proposals until the follower has caught up. The quota pool claims to avoid this problem, but I doubt it mattered much so far so please take a look at the code (and tests!) to convince yourself that it all checks out.

Reviewed 7 of 7 files at r2.


pkg/storage/raft_log_queue.go, line 361 at r2 (raw file):

// be truncating at a point past it. If a change is made, the ChosenVia is
// updated with the given one.
func (td *truncateDecision) EnsureNotPastIndex(index uint64, chosenVia string) {

Nice improvement. For clarity (there's always the risk of off-by-ones in truncation because it's unclear whether the "truncated" index itself is gone or stays), consider renaming this MaybeLowerNewFirstIndex, which leaves no ambiguity.
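
For reference, a plausible body for this helper, consistent with the doc comment quoted above (a sketch; the committed code may differ):

	func (td *truncateDecision) EnsureNotPastIndex(index uint64, chosenVia string) {
		// Lower the proposed truncation point so that it never moves past the
		// given index, and record how the resulting index was chosen.
		if td.NewFirstIndex > index {
			td.NewFirstIndex = index
			td.ChosenVia = chosenVia
		}
	}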

@danhhz (Contributor, Author) left a comment

A version of the roachtest with an 11m sleep works (though that's super unfortunate; I wish import weren't such a special case everywhere). There are a bunch of lines like the following, but that makes sense since it can truncate up to where the follower was before it went away. They seemed to mostly (but not entirely) trail off after the first 60s or so of downtime.

I190627 23:20:15.843432 32749 storage/raft_log_queue.go:577  [n1,raftlog,s1,r4/1:/System{/tsd-tse}] should truncate: true [truncate 2 entries to first index 585 (chosen via: followers)]

Running the 60m variant now. Were you thinking I should also add that as a roachtest or just a one-off?

We need some due diligence on the quota pool issue Ben mentioned. He was worried that when a node comes back up and has to catch up on (say) 4mb of Raft log, the quota pool on the leader may start blocking proposals until the follower has caught up. The quota pool claims to avoid this problem, but I doubt it mattered much so far so please take a look at the code (and tests!) to convince yourself that it all checks out.

I've never looked at the quota pool at all, so sadly I doubt I'll make it to this before I'm ooo.



pkg/storage/raft_log_queue.go, line 361 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Nice improvement. For clarity (there's always the risk of off-by-ones in truncation because it's unclear whether the "truncated" index itself is gone or stays), consider renaming this MaybeLowerNewFirstIndex, which leaves no ambiguity.

I've gone through 10 variants of this, none of which I'm happy with. I was hoping to capture what it was semantically doing (and not just what it was literally doing). My hesitation with something that's just a literal description like MaybeLowerNewFirstIndex is that I spent a lot of mental overhead reminding myself that bigger index means truncate more and smaller index means truncate less. I'd love to name it something like DontCutMeOff but less cute-sy.

I wonder if we can make any of this clearer if we rename NewFirstIndex to something like ProposedTruncationIndex and also make ShouldTruncate() into a bool that's calculated at the end of computeTruncateDecision


pkg/storage/raft_log_queue.go, line 443 at r2 (raw file):

	}

	// Advance to the first index, but never truncate past the quorum commit

Btw, does this do anything? It seems almost perfectly redundant with the other check below. Basically, there are some cases where we'll end up with the "first index" from this instead of the "last index" from the next if branch, right? Is that meaningful?

	if decision.NewFirstIndex < decision.Input.FirstIndex {
		decision.NewFirstIndex = decision.Input.FirstIndex
		decision.ChosenVia = truncatableIndexChosenViaFirstIndex
	}

@tbg (Member) left a comment

I've never looked at the quota pool at all, so sadly I doubt I'll make it to this before I'm ooo.

I can pick this up.

just a one-off?

^- that one



pkg/storage/raft_log_queue.go, line 361 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

I've gone through 10 variants of this, none of which I'm happy with. I was hoping to capture what it was semantically doing (and not just what it was literally doing). My hesitation with something that's just a literal description like MaybeLowerNewFirstIndex is that I spent a lot of mental overhead reminding myself that bigger index means truncate more and smaller index means truncate less. I'd love to name it something like DontCutMeOff but less cute-sy.

I wonder if we can make any of this clearer if we rename NewFirstIndex to something like ProposedTruncationIndex and also make ShouldTruncate() into a bool that's calculated at the end of computeTruncateDecision

I like cleaning this up, but I'd hate to have to go back to guessing about whether the "truncation index" is the first index or the one before that.

Just to throw an 11th in there, ProtectIndex(), i.e. "make sure that this index is not truncated away".


pkg/storage/raft_log_queue.go, line 443 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Btw, does this do anything? It seems almost perfectly redundant with the other check below. Basically, there are some cases where we'll end up with the "first index" from this instead of the "last index" from the next if branch, right? Is that meaningful?

	if decision.NewFirstIndex < decision.Input.FirstIndex {
		decision.NewFirstIndex = decision.Input.FirstIndex
		decision.ChosenVia = truncatableIndexChosenViaFirstIndex
	}

I think you could replace both by

decision.EnsureNotPastIndex(input.LastIndex, truncatableIndexChosenViaLastIndex)
if decision.NewFirstIndex < decision.Input.FirstIndex {
		decision.NewFirstIndex = decision.Input.FirstIndex
		decision.ChosenVia = truncatableIndexChosenViaFirstIndex
}

For clarity, you may just want to lower QuorumIndex right where it is created so that it's known to be lower than LastIndex. Nobody in their right mind would think that it could've been otherwise for most of the method, but it can. And then what's left to do here is only the "don't accidentally drop below the old first index" which is the snippet you have in your comment. I think that should be it?
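
Roughly, that suggestion reads like this (a sketch using names from the snippets above; getQuorumIndex and the QuorumIndex field are assumptions about the surrounding code):

	// Clamp the quorum index where it is first computed, so the rest of
	// computeTruncateDecision can assume QuorumIndex <= LastIndex.
	decision.QuorumIndex = getQuorumIndex(&input.RaftStatus)
	if decision.QuorumIndex > input.LastIndex {
		decision.QuorumIndex = input.LastIndex
	}

	// ...then only the "don't accidentally drop below the old first index"
	// check remains at the end:
	if decision.NewFirstIndex < decision.Input.FirstIndex {
		decision.NewFirstIndex = decision.Input.FirstIndex
		decision.ChosenVia = truncatableIndexChosenViaFirstIndex
	}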

@danhhz (Contributor, Author) left a comment

I ran the 60m test twice. The first time, it errored at the query with restart.go:62,restart.go:78,test.go:1249: pq: no inbound stream connection; the second time, the query finished in 20s.

The only thing I see in the node 3 logs is this (and I don't see that flow id in any other node logs):

E190628 02:17:32.710603 814 sql/distsqlrun/flow_registry.go:221  [n3,client=73.83.8.34:34954,user=root] flow id:26d52b7a-4f58-4406-a135-8459d3bdf3a9 : 1 inbound streams timed out after 10s; propagated error throughout flow

I can pick this up.

thanks



pkg/storage/raft_log_queue.go, line 361 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I like cleaning this up, but I'd hate to have to go back to guessing about whether the "truncation index" is the first index or the one before that.

Just to throw an 11th in there, ProtectIndex(), i.e. "make sure that this index is not truncated away".

ProtectIndex sgtm


pkg/storage/raft_log_queue.go, line 443 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I think you could replace both by

decision.EnsureNotPastIndex(input.LastIndex, truncatableIndexChosenViaLastIndex)
if decision.NewFirstIndex < decision.Input.FirstIndex {
		decision.NewFirstIndex = decision.Input.FirstIndex
		decision.ChosenVia = truncatableIndexChosenViaFirstIndex
}

For clarity, you may just want to lower QuorumIndex right where it is created so that it's known to be lower than LastIndex. Nobody in their right mind would think that it could've been otherwise for most of the method, but it can. And then what's left to do here is only the "don't accidentally drop below the old first index" which is the snippet you have in your comment. I think that should be it?

Made a few more adjustments that we discussed offline.

@tbg (Member) left a comment

The first time it errored at the query with restart.go:62,restart.go:78,test.go:1249: pq: no inbound stream connection

Uh, hmm, oof. I don't like seeing that. In the worst case, this means catching up on "more Raft logs" than before somehow starved the machine to the point of that failure. See anything interesting in the runtime stats printouts?


@danhhz (Contributor, Author) commented Jun 28, 2019

Not particularly.

I190628 02:17:32.311988 167 server/status/runtime.go:498  [n3] runtime stats: 268 MiB RSS, 200 goroutines, 138 MiB/112 MiB/267 MiB GO alloc/idle/total, 42 MiB/46 MiB CGO alloc/total, 0.0 CGO/sec, 0.0/0.0 %(u/s)time, 0.0 %gc (12x), 72 MiB/293 KiB (r/w)net
I190628 02:17:42.312565 167 server/status/runtime.go:498  [n3] runtime stats: 314 MiB RSS, 201 goroutines, 263 MiB/52 MiB/312 MiB GO alloc/idle/total, 26 MiB/34 MiB CGO alloc/total, 411.4 CGO/sec, 84.9/11.0 %(u/s)time, 0.0 %gc (1x), 13 MiB/777 KiB (r/w)net
I190628 02:17:52.312580 167 server/status/runtime.go:498  [n3] runtime stats: 454 MiB RSS, 202 goroutines, 156 MiB/286 MiB/442 MiB GO alloc/idle/total, 70 MiB/79 MiB CGO alloc/total, 277.0 CGO/sec, 76.2/16.7 %(u/s)time, 0.0 %gc (1x), 18 MiB/155 KiB (r/w)net

Full logs from after the restart
cockroach.dan-1561683572-restart-down-for-2m-0003.dan.2019-06-28T02_17_21Z.003715.log

None of them seem particularly worse than, say, node 2's around the same time:

I190628 02:16:28.901042 164 server/status/runtime.go:498  [n2] runtime stats: 6.4 GiB RSS, 550 goroutines, 197 MiB/1.4 GiB/596 MiB GO alloc/idle/total, 3.7 GiB/4.6 GiB CGO alloc/total, 40138.8 CGO/sec, 190.1/25.8 %(u/s)time, 0.0 %gc (4x), 20 MiB/24 MiB (r/w)net
I190628 02:16:38.900133 164 server/status/runtime.go:498  [n2] runtime stats: 6.4 GiB RSS, 536 goroutines, 240 MiB/1.4 GiB/596 MiB GO alloc/idle/total, 3.7 GiB/4.6 GiB CGO alloc/total, 43868.0 CGO/sec, 172.8/24.2 %(u/s)time, 0.0 %gc (4x), 22 MiB/27 MiB (r/w)net
I190628 02:16:48.900568 164 server/status/runtime.go:498  [n2] runtime stats: 6.4 GiB RSS, 528 goroutines, 305 MiB/1.4 GiB/596 MiB GO alloc/idle/total, 3.7 GiB/4.6 GiB CGO alloc/total, 45523.7 CGO/sec, 176.3/23.9 %(u/s)time, 0.0 %gc (4x), 23 MiB/28 MiB (r/w)net
I190628 02:16:58.900899 164 server/status/runtime.go:498  [n2] runtime stats: 6.3 GiB RSS, 477 goroutines, 238 MiB/1.4 GiB/596 MiB GO alloc/idle/total, 3.7 GiB/4.6 GiB CGO alloc/total, 46825.9 CGO/sec, 194.0/23.0 %(u/s)time, 0.0 %gc (5x), 23 MiB/29 MiB (r/w)net
I190628 02:17:08.900330 164 server/status/runtime.go:498  [n2] runtime stats: 6.3 GiB RSS, 481 goroutines, 276 MiB/1.4 GiB/596 MiB GO alloc/idle/total, 3.7 GiB/4.6 GiB CGO alloc/total, 44561.9 CGO/sec, 204.2/25.5 %(u/s)time, 0.0 %gc (4x), 21 MiB/27 MiB (r/w)net
I190628 02:17:18.900509 164 server/status/runtime.go:498  [n2] runtime stats: 6.4 GiB RSS, 280 goroutines, 303 MiB/1.4 GiB/596 MiB GO alloc/idle/total, 3.7 GiB/4.7 GiB CGO alloc/total, 43323.0 CGO/sec, 159.8/23.1 %(u/s)time, 0.0 %gc (4x), 20 MiB/26 MiB (r/w)net
I190628 02:17:28.901401 164 server/status/runtime.go:498  [n2] runtime stats: 6.4 GiB RSS, 212 goroutines, 183 MiB/1.4 GiB/643 MiB GO alloc/idle/total, 3.7 GiB/4.7 GiB CGO alloc/total, 547.9 CGO/sec, 58.2/8.5 %(u/s)time, 0.0 %gc (14x), 424 MiB/362 MiB (r/w)net
I190628 02:17:38.900953 164 server/status/runtime.go:498  [n2] runtime stats: 6.4 GiB RSS, 211 goroutines, 251 MiB/1.4 GiB/644 MiB GO alloc/idle/total, 3.7 GiB/4.6 GiB CGO alloc/total, 1094.0 CGO/sec, 26.5/5.0 %(u/s)time, 0.0 %gc (6x), 216 MiB/172 MiB (r/w)net
I190628 02:17:48.900973 164 server/status/runtime.go:498  [n2] runtime stats: 6.4 GiB RSS, 211 goroutines, 303 MiB/1.3 GiB/646 MiB GO alloc/idle/total, 3.7 GiB/4.6 GiB CGO alloc/total, 540.5 CGO/sec, 5.6/1.6 %(u/s)time, 0.0 %gc (0x), 581 KiB/6.4 MiB (r/w)net
I190628 02:17:58.902162 164 server/status/runtime.go:498  [n2] runtime stats: 6.4 GiB RSS, 211 goroutines, 312 MiB/1.3 GiB/646 MiB GO alloc/idle/total, 3.7 GiB/4.6 GiB CGO alloc/total, 175.4 CGO/sec, 1.6/0.6 %(u/s)time, 0.0 %gc (0x), 479 KiB/247 KiB (r/w)net

@tbg (Member) left a comment

All of the changes so far LGTM. I suppose you're out from this point on? I can take over this PR and a) figure out this inbound stream thing and b) look at the quota pool.

@jordanlewis what's the latest wisdom on "no inbound stream" right after restarting a node? Is that something we've seen before? The explanations for it that I've heard so far revolve around overloading a node, so I'm worried that's what we're doing here, even though that's unlikely and nothing in the logs suggests it.



pkg/cmd/roachtest/restart.go, line 42 at r3 (raw file):

	// TODO(dan): It seems like some part of `fixtures import` should do this for
	// you. It's odd that a cluster that's freshly initialized with fixtures
	// import behaves so differently than it does 10m after.

You could add a comment that in all likelihood it'll be a lot less than 11 minutes until we're good, but it's sort of difficult to say exactly. The raft log truncation is reactive, so really you're expecting all of the ranges to be processed eagerly, and I'd be surprised if that took >1m. But that's not what we're testing here, so the generous timeout is good. The comment could still help folks who just want to run it a bunch locally and want to lower the timeout.

@danhhz (Contributor, Author) left a comment

Ran it a third time and got the inbound stream error again, so this is definitely a real problem. I left it up at dan-1561741365-restart-down-for-2m in case you can gather anything from it.


@tbg (Member) left a comment

Won't be able to look today, but that's not good. Maybe we are horking up grpc somehow by sending loads of entries at once.


@danhhz (Contributor, Author) commented Jun 28, 2019

Out of curiosity, I ran the 60m roachtest with 19.1.2 and it passed.

@tbg (Member) left a comment

Great news, this reproduced immediately

restart.go:62,restart.go:78,test.go:1249: pq: no inbound stream connection

Going to poke more.


@tbg (Member) left a comment

Unfortunately my attempts at containing that repro haven't borne fruit yet. The roachtest stops n3, runs tpcc with --nowait for ten minutes, then restarts n3 and immediately runs a COUNT(*) against it, which then tends to catch the inbound stream error. I got this on my first try, during which I left the node offline for 10 minutes.

I've tried to repeat this on the same cluster multiple times now, and it never worked. I made various adjustments that basically disable log truncation while n3 is down (to have more log to catch up on, which I thought would exacerbate the problem), as well as logging RequestFlow etc. similar to #31361 (comment).
I also switched to kv (which is much faster at generating Raft log; a few minutes give me many 100ks of entries for n3 to catch up on). But this "just works", and it works so flawlessly that I'm really quite surprised. For example, here's the raft log "behind" graph:

[screenshot: raft log "behind" graph]

The ingredient that is missing with kv is that I didn't ingest a dataset; I bet I need lots of ranges catching up at around the same time. But somehow with tpcc I'm also not reproducing this any more. Maybe residual snapshots play a role; I bumped the max snapshot rate up significantly on this cluster at some point, so those would go away very quickly, and maybe that changes the outcome.

Anyway. Hmm. Going to run this some more.


@tbg (Member) left a comment

I ran this PR vanilla, as is (with the 2 minute timeout), and got one failure out of ten runs. I can work with that; more tomorrow.


@tbg tbg (Member) left a comment

[screenshots from the first failure]

From the first failure. Nothing really interesting to see. In the logs I don't see anything noteworthy before the inbound stream timeout, except that there are some Raft snapshots (this may or may not matter).

The second batch (in which I left the node down for 10 minutes and lowered the inbound conn timeout to 5s) has passed 8 runs (2 still going), sigh.


@tbg tbg (Member) left a comment

All 10 have passed. I'll start another 16, this time leaving the node down for 20 minutes (and also incorporating the gRPC-level debugging from #31361, so that if it strikes, I hopefully get something out of it).


@tbg (Member) commented Jul 2, 2019

So I think I'm seeing something unexpected, but it has nothing to do with networking issues. I have logging in gRPC every time a connection closes or a keepalive times out, and there's nothing (and I know the logging works because it does fire when n3 is shut down initially).

    20:40:10.630837 3169061 sql/distsqlrun/server.go:561  received SetupFlow request from n3 for flow deb5307f-a257-4212-b382-2d98e8c7cf9c
    20:40:10.631002 2725334 sql/distsqlrun/server.go:561  received SetupFlow request from n3 for flow deb5307f-a257-4212-b382-2d98e8c7cf9c
    20:40:10.631202 2725335 sql/distsqlrun/outbox.go:226  [n2] outbox: calling FlowStream
    20:40:10.631274 2725335 sql/distsqlrun/outbox.go:237  [n2] outbox: FlowStream returned

There's never a corresponding "calling FlowStream" from n1, and so of course n3 times out waiting for the incoming stream.
I'm relatively sure I'm just rediscovering some variant of

#31361 (comment),

that is, the circuit breaker for dialing n3 is open, which makes some sense given that n3 just got restarted, as happens in this test.
I'm going to start another round of repros with logging for that case and the 2 min downtime (after all, the downtime shouldn't matter, and by all accounts so far, it doesn't).
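To make the suspected race concrete, here's a small self-contained Go sketch (all names are made up; it doesn't use the real nodedialer or circuit-breaker code): the peer's breaker for the recently restarted node is still tripped, so the dial fails fast, the outbox never gets to call FlowStream, and the consumer times out waiting for its inbound stream.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// toy circuit breaker: refuses connections until the trip expires.
// (Illustrative only; the real circuit-breaker package works differently.)
type breaker struct{ trippedUntil time.Time }

var errBreakerOpen = errors.New("breaker open: refusing to dial")

// dial fails fast while the breaker is tripped, the way a node dialer
// refuses to reach a peer it recently observed as down.
func dial(b *breaker, now time.Time) error {
	if now.Before(b.trippedUntil) {
		return errBreakerOpen
	}
	return nil // pretend the connection succeeds
}

func main() {
	now := time.Now()
	// n3 just restarted; n1's breaker for n3 hasn't reset yet.
	n1ToN3 := &breaker{trippedUntil: now.Add(2 * time.Second)}

	// n3 (the gateway) sends SetupFlow to n1; to stream results back, n1's
	// outbox must dial n3 and call FlowStream. While the breaker is still
	// tripped that dial fails fast, no inbound stream ever reaches n3, and
	// the consumer there eventually reports "no inbound stream connection".
	if err := dial(n1ToN3, now); err != nil {
		fmt.Println("outbox never calls FlowStream:", err)
	}

	// Moments later the breaker has reset and the same dial succeeds, which
	// is why this only bites right after the restart.
	if err := dial(n1ToN3, now.Add(5*time.Second)); err == nil {
		fmt.Println("later dial succeeds; it's purely a timing race")
	}
}
```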

@tbg (Member) commented Jul 2, 2019

Yeah, I was able to confirm this in another repro. I bungled the logging (so I'm not seeing the actual error returned from nodedialer.Dial), but it's definitely that.

I'm also seeing in the logging that shortly after that, the breaker to n3 resets (closes again), so I think this is just a timing issue where a node is asked to connect to n3 while its breaker for n3 is still tripped (I can see from the logs that the breaker trips when the node goes down and only untrips right after it comes back online, as it's supposed to).

The good news is that this is all unrelated to what this PR is trying to accomplish, and so I feel comfortable merging.

The bad news is that something isn't working perfectly here, but it's not major. Filing an issue and moving on.

@tbg tbg (Member) left a comment

Ran the original experiment from #37906 (comment) again, too.

I saw 56 Raft snapshots after restarting the node, and ~1.2 million Raft log entries were transferred in approximately 15 minutes. (This doesn't necessarily mean they were going at full speed for that whole time: the workload wasn't running, so ranges were likely quiesced and got woken up sporadically by scanners, at which point they'd catch up.) There's still lots to investigate here, but I think this is a good first PR to improve things.

[screenshot]

Reviewed 3 of 3 files at r3.

danhhz and others added 2 commits July 2, 2019 16:16
Previously, we'd hold off on truncating the raft log if a replica was
missing but contactable in the last 10 seconds. This meant that if a
node was down for *more* than 10 seconds, there was a good chance that
we'd truncate logs for some of its replicas (especially for its
high-traffic ones) and it would need snapshots for them when it came
back up.

This was for two reasons. First, we've historically assumed that it's
cheaper to catch a far-behind node up with a snapshot than with entries.
Second, snapshots historically had to include the Raft log which implied
a need to keep the size of the Raft log tightly controlled due to being
pulled into memory at the snapshot receiver, but that's changed
recently.

The problem is when a node is down for longer than 10 seconds but
shorter than the time it takes to upreplicate all of its ranges onto new
nodes. It might come back up to find that it needs a snapshot for most
ranges. We rate limit snapshots fairly aggressively because they've been
disruptive in the past, but this means that it could potentially take
hours for a node to recover from a 2 minute outage.

This would be merely unfortunate if there wasn't a second compounding
issue. A recently restarted node has a cold leaseholder cache. When it
gets traffic for one of its replicas, it first tries itself as the
leaseholder (maybe it will get lucky and won't need the network hop).
Unfortunately, if the replica needs a snapshot, this decision currently
blocks on it. This means that requests sent to the recently started node
could block for as long as the heavily-throttled snapshots take, hours
or even days.

Short outages of more than 10 seconds are reasonably common with routine
maintenance (rolling to a new version, swapping hardware, etc), so it's
likely that customers will hit this (and one did).

This commit avoids truncating the log past any follower's position when
all replicas have recently been active (the quota pool keeps it from
growing without bound in this case). If at least one replica hasn't
recently been active, it holds off any truncation until the log reaches
a size threshold.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down now recover more
quickly when they rejoin, assuming they weren't down for much more
than the value of the `server.time_until_store_dead` cluster setting
(which defaults to 5 minutes).
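A hedged sketch of the truncation rule described in the commit message above (the names, the signature, and the size threshold are all illustrative; the real logic lives in the raft log queue and weighs more inputs than this):

```go
package main

import "fmt"

// replicaProgress is a toy stand-in for the per-follower state raft tracks.
type replicaProgress struct {
	matchIndex     uint64 // highest log index known to be on the follower
	recentlyActive bool
}

// chooseTruncationIndex sketches the rule: if every replica has been
// recently active, never truncate past the slowest follower (the quota pool
// already bounds the log in that case). If some replica is missing, hold
// off on truncation entirely until the log exceeds a size threshold, and
// only then truncate even though the laggard will need a snapshot.
// A return value of 0 means "don't truncate yet".
func chooseTruncationIndex(
	progress []replicaProgress, lastIndex uint64, logBytes, sizeThreshold int64,
) uint64 {
	truncateTo := lastIndex
	allActive := true
	for _, p := range progress {
		if !p.recentlyActive {
			allActive = false
			continue
		}
		if p.matchIndex < truncateTo {
			truncateTo = p.matchIndex
		}
	}
	switch {
	case allActive:
		return truncateTo // never cut off a recently active follower
	case logBytes <= sizeThreshold:
		return 0 // a replica is missing and the log is still small: wait
	default:
		return truncateTo // log too large: truncate despite the missing replica
	}
}

func main() {
	progress := []replicaProgress{
		{matchIndex: 100, recentlyActive: true},
		{matchIndex: 100, recentlyActive: true},
		{matchIndex: 40, recentlyActive: false}, // the down node
	}
	fmt.Println(chooseTruncationIndex(progress, 100, 1<<20, 4<<20)) // 0: hold off
	fmt.Println(chooseTruncationIndex(progress, 100, 8<<20, 4<<20)) // 100: size wins
}
```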
@tbg (Member) commented Jul 2, 2019

bors r+

craig bot pushed a commit that referenced this pull request Jul 2, 2019
38484: storage: truncate raft log less aggressively when replica is missing r=tbg a=danhhz


Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig craig bot (Contributor) commented Jul 2, 2019

Build succeeded
