
changefeedccl: cdc/crdb-chaos sometimes takes a while to recover from chaos #36879

Closed

danhhz opened this issue Apr 16, 2019 · 9 comments

Labels: A-cdc Change Data Capture · C-investigation Further steps needed to qualify. C-label will change. · T-cdc

Comments

danhhz commented Apr 16, 2019

I've seen the latency make it up to 10-20m before, but it really shouldn't ever go higher than 1-2m. Definitely something to look into here; it's easy enough to reproduce by extending the duration of the test.

Jira issue: CRDB-4477

danhhz commented Apr 18, 2019

Okay, well first of all, something is bonkers about the test. The resolved timestamp it's looking at should never regress, so a 42m jump in 2s means something is broken.

changefeed: 21:13:54 cdc.go:833: end-to-end steady latency 1m42.086645442s; max steady latency so far 18m58.209893436s
changefeed: 21:13:56 cdc.go:833: end-to-end steady latency 43m54.855883321s; max steady latency so far 43m54.855883321s
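
For illustration, here is a minimal Go sketch (not the actual cdc.go verifier; the type and field names are made up) of the check the test is effectively doing: the resolved timestamp feeding the latency calculation should only ever move forward, and end-to-end steady latency is just "now" minus that timestamp.

```go
package main

import (
	"fmt"
	"time"
)

// latencyVerifier is an illustrative stand-in for the roachtest helper in
// cdc.go. It tracks the most recent resolved timestamp seen by the changefeed
// and the worst end-to-end steady latency observed so far.
type latencyVerifier struct {
	lastResolved     time.Time
	maxSteadyLatency time.Duration
}

// noteResolved records a resolved timestamp emitted by the changefeed. A
// resolved timestamp should only ever move forward; a regression means either
// the test or the changefeed is broken.
func (v *latencyVerifier) noteResolved(resolved time.Time) error {
	if resolved.Before(v.lastResolved) {
		return fmt.Errorf("resolved timestamp regressed: %s -> %s",
			v.lastResolved, resolved)
	}
	v.lastResolved = resolved

	latency := time.Since(resolved) // end-to-end steady latency
	if latency > v.maxSteadyLatency {
		v.maxSteadyLatency = latency
	}
	fmt.Printf("end-to-end steady latency %s; max steady latency so far %s\n",
		latency, v.maxSteadyLatency)
	return nil
}

func main() {
	v := &latencyVerifier{}
	_ = v.noteResolved(time.Now().Add(-2 * time.Minute))
	_ = v.noteResolved(time.Now().Add(-90 * time.Second))
}
```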

danhhz self-assigned this Apr 18, 2019

danhhz commented Apr 22, 2019

Got back into this and hit a high-water checkpoint that's behind the timestamp the feed was restarted with. Still digging into why, but perhaps we want to make sure we .Forward the old progress as a safety net. It's still odd that this ever happened, though.

I190422 19:02:23.407085 3713 ccl/changefeedccl/changefeed_dist.go:92  [n1] WIP starting changefeed from 1555959671.535291852,0 is behind by 1m11.871743607s
I190422 19:02:49.985172 3713 ccl/changefeedccl/changefeed.go:323  [n1] WIP wrote high-water checkpoint 1555959404.146301576,0 is behind by 6m5.83883919s
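
A minimal sketch of the ".Forward the old progress as a safety net" idea, using an illustrative timestamp type rather than the real hlc.Timestamp: persisted progress is only ever ratcheted forward, so a checkpoint computed behind the restart timestamp gets dropped instead of overwriting it.

```go
package main

import "fmt"

// ts is a simplified stand-in for an HLC timestamp; only ordering matters here.
type ts int64

// highWater is an illustrative holder for a changefeed's checkpointed progress.
type highWater struct{ v ts }

// forward ratchets the high-water: it only ever moves it forward, so a stale
// or duplicated checkpoint can never regress previously persisted progress.
func (h *highWater) forward(newTS ts) bool {
	if newTS <= h.v {
		return false // ignore anything at or behind the current high-water
	}
	h.v = newTS
	return true
}

func main() {
	// On restart, seed the in-memory progress with the job's saved high-water.
	hw := &highWater{v: 1555959671}

	// A checkpoint computed from stale rangefeed state is simply dropped.
	fmt.Println(hw.forward(1555959404), hw.v) // false 1555959671
	fmt.Println(hw.forward(1555959700), hw.v) // true 1555959700
}
```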


danhhz commented Apr 22, 2019

A little overwhelmed with all the threads to pull on here, so I'm just going to write them down for now and follow up later. Just got an unexpectedly failed changefeed, with no useful stack trace from the errors package. I suspect this happened while trying to compute the spans to track (the fetchSpansForTargets call in distChangefeedFlow). fetchSpansForTargets uses db.Txn, so it should have retried anything retryable. This entire cluster is pretty new, so I'd hope the descriptor it's looking for hasn't been GC'd yet.

I190422 19:04:45.882727 5450 ccl/changefeedccl/changefeed_dist.go:92  [n1] WIP starting changefeed from 1555959404.146301576,0 is behind by 8m1.736391533s
W190422 19:04:45.883095 5450 ccl/changefeedccl/changefeed_stmt.go:459  [n1] CHANGEFEED job 445281624778735617 returning with error: descriptor not found
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.init
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/structured.go:179
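
For context, a shape-only sketch of what the span computation looks like under a db.Txn-style retry loop; the runTxn helper and the span strings below are hypothetical stand-ins, not the real client.DB or sqlbase APIs. The point is that retryable txn errors get retried, while something like "descriptor not found" is terminal and surfaces as a job error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errRetryable marks errors that a KV transaction runner would retry, roughly
// playing the role that retryable txn errors play under db.Txn.
var errRetryable = errors.New("retryable")

// runTxn is an illustrative stand-in for db.Txn: it re-runs fn until it
// returns nil or a non-retryable error.
func runTxn(ctx context.Context, fn func(context.Context) error) error {
	for {
		if err := fn(ctx); !errors.Is(err, errRetryable) {
			return err
		}
	}
}

// fetchSpansForTargets sketches the shape of the real helper: resolve each
// target's descriptor inside one transaction and return its primary index
// span. A "descriptor not found" here is not retryable, which is why the
// changefeed surfaced it as a terminal job error.
func fetchSpansForTargets(ctx context.Context, targets []string) ([]string, error) {
	var spans []string
	err := runTxn(ctx, func(ctx context.Context) error {
		spans = spans[:0]
		for _, t := range targets {
			if t == "" {
				return errors.New("descriptor not found") // terminal, not retried
			}
			spans = append(spans, "/Table/"+t+"/{1-2}")
		}
		return nil
	})
	return spans, err
}

func main() {
	spans, err := fetchSpansForTargets(context.Background(), []string{"53"})
	fmt.Println(spans, err)
}
```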


danhhz commented Apr 22, 2019

Progress! Figured out that RangeFeed will return checkpoints from before the timestamp you started it at because it just happily forwards closed timestamps as long as there are no outstanding intents before them. When combined with job restarts, this could cause the changefeed high-water to regress (since we don't initialize the spans in the spanFrontier at any particular timestamp). This is likely one cause of the very high end-to-end latencies in the roachtest. It remains to be seen if it's the only cause.

I currently have a local fix for changefeeds to throw out any rangefeed checkpoints from before the timestamp we started the rangefeed with. I also think it'd be appropriate to add a high-level assertion that the high-water never regresses (perhaps a Fatal in tests, Sentry in prod kind of thing).

@nvanbenschoten I can't really see a reason why a rangefeed would ever need to return checkpoints from below the timestamp specified in its request. What do you think about filtering these inside the rangefeed registration?
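
A minimal sketch of the local fix described above, with made-up types: checkpoints resolved below the rangefeed's start timestamp are thrown out, and a high-water regression trips an assertion (a Fatal here stands in for the "Fatal in tests, Sentry in prod" idea).

```go
package main

import (
	"fmt"
	"log"
)

type ts int64

// checkpoint is an illustrative stand-in for a rangefeed checkpoint event.
type checkpoint struct{ resolved ts }

// frontier tracks the changefeed's high-water for this sketch.
type frontier struct{ highWater ts }

// onCheckpoint applies the fix: checkpoints resolved below the timestamp the
// rangefeed was started at are thrown out, and the high-water is asserted to
// never move backwards.
func (f *frontier) onCheckpoint(startTS ts, c checkpoint) {
	if c.resolved < startTS {
		return // stale checkpoint from before the rangefeed's start; ignore
	}
	if c.resolved < f.highWater {
		log.Fatalf("high-water regressed: %d -> %d", f.highWater, c.resolved)
	}
	f.highWater = c.resolved
}

func main() {
	f := &frontier{highWater: 100}
	start := ts(100)
	f.onCheckpoint(start, checkpoint{resolved: 90})  // dropped
	f.onCheckpoint(start, checkpoint{resolved: 120}) // advances
	fmt.Println(f.highWater)                         // 120
}
```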

nvanbenschoten (Member) commented

Yeah, we very explicitly allow this situation:

// Always publish checkpoint notifications, regardless
// of a registration's starting timestamp.

I don't remember why though. I think I wrote it that way because it seemed like more information about resolved timestamps is always better than less. If that violates the rangefeed contract that consumers are expecting, I don't see a reason why we shouldn't change it.

The one case where this might come up is immediately after a rangefeed's catch-up scan. If we filter checkpoints with timestamps below the rangefeed's start timestamp, a rangefeed may be unable to signal that it has completed its catch-up scan. Is that ok?


danhhz commented Apr 22, 2019

It's easy to argue that changefeeds should be doing this filtering (and like I mentioned above, that's where I have my fix right now). I'm wary of generalizing anything about what contract rangefeed consumers expect from only the changefeed data point. Given that you thought this was a good idea back when you had the most context loaded up in your head, it's probably safest to keep the filtering in changefeed for now (and document this more loudly on at least some of RangeFeedRequest/RangeFeedCheckpoint/(*DistSender).RangeFeed). We can reconsider this later once rangefeed has more consumers.

> The one case where this might come up is immediately after a rangefeed's catch-up scan. If we filter checkpoints with timestamps below the rangefeed's start timestamp, a rangefeed may be unable to signal that it has completed its catch-up scan. Is that ok?

Rangefeed currently signals the end of a catchup scan with an empty timestamp, no? I'd imagine it's not surprising to exempt this from the filtering if we did it.
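
A sketch of what that exemption could look like if the filtering did move into rangefeed, assuming (as above) that an empty timestamp is the catch-up-scan completion signal; the helper name is illustrative.

```go
package main

import "fmt"

type ts int64

// isEmpty treats the zero value as the "empty timestamp" that, per the
// comment above, signals the end of a catch-up scan in this sketch.
func (t ts) isEmpty() bool { return t == 0 }

// keepCheckpoint is an illustrative filter: stale checkpoints from before the
// rangefeed's start timestamp are dropped, but the empty-timestamp
// catch-up-scan completion signal is exempted so it still gets through.
func keepCheckpoint(startTS, resolved ts) bool {
	if resolved.isEmpty() {
		return true // catch-up scan complete; not a real resolved timestamp
	}
	return resolved >= startTS
}

func main() {
	fmt.Println(keepCheckpoint(100, 0))   // true: catch-up completion signal
	fmt.Println(keepCheckpoint(100, 90))  // false: stale checkpoint
	fmt.Println(keepCheckpoint(100, 110)) // true
}
```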

awoods187 added the C-investigation Further steps needed to qualify. C-label will change. label Apr 22, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue Apr 22, 2019
When the ChangeFrontier processor for a changefeed is started up, it
creates a spanFrontier to track the overall progression for the feed.
Each watched span is set to the empty timestamp so we can tell the
difference between a rangefeed being behind and one still initializing.

However, rangefeed will happily return resolved timestamps less than the
one it was started with. If we feed this into the spanFrontier, it can
report a timestamp behind the one everything started up at. In the case
of a new changefeed, this is fine (though perhaps odd). In the case of a
changefeed being restarted from the job's high-water, this means we can
overwrite the high-water progress with the lower value, regressing it.
At the least, this is very surprising UX, but it's easy to argue that
it's a correctness issue.

Touches cockroachdb#36879

Release note (bug fix): Fixed a bug where `CHANGEFEED` job progress
would regress when the job was restarted.
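
To make the mechanism in this commit message concrete, here is a toy spanFrontier (not the real CockroachDB code): each watched span starts at the empty timestamp, the frontier is the minimum across spans, and a pre-restart rangefeed checkpoint can therefore drag the reported frontier behind the job's saved high-water.

```go
package main

import "fmt"

type ts int64

// spanFrontier is a much-simplified sketch of the frontier the ChangeFrontier
// processor keeps: one resolved timestamp per watched span, with the overall
// frontier being the minimum across spans. Each span starts at the empty
// (zero) timestamp so "still initializing" is distinguishable from "behind".
type spanFrontier struct {
	spans map[string]ts
}

func newSpanFrontier(spans ...string) *spanFrontier {
	f := &spanFrontier{spans: map[string]ts{}}
	for _, s := range spans {
		f.spans[s] = 0 // empty timestamp: not yet initialized
	}
	return f
}

// forward ratchets a single span's resolved timestamp.
func (f *spanFrontier) forward(span string, resolved ts) {
	if resolved > f.spans[span] {
		f.spans[span] = resolved
	}
}

// frontier returns the minimum resolved timestamp across all watched spans.
func (f *spanFrontier) frontier() ts {
	first, min := true, ts(0)
	for _, t := range f.spans {
		if first || t < min {
			min, first = t, false
		}
	}
	return min
}

func main() {
	// Restarted feed: the job's saved high-water is 1555959671, but the
	// spans themselves start at zero.
	f := newSpanFrontier("a", "b")

	// A rangefeed checkpoint from *before* the restart timestamp still
	// forwards span "a" (it only has to beat zero)...
	f.forward("a", 1555959404)
	f.forward("b", 1555959700)

	// ...so the reported frontier lands behind the saved high-water, which is
	// exactly the regression the fix guards against.
	fmt.Println(f.frontier()) // 1555959404
}
```
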
craig bot pushed a commit that referenced this issue Apr 23, 2019
37009: changefeedccl: fix job progress regressing r=nvanbenschoten a=danhhz

37030: changefeedccl: log more details when returning with non-retryable errors r=tbg a=danhhz

The `%v` and `%+v` were switched from what I intended in the line that
logs retryable errors and the one that handles non-retryable errors.

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
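
For the `%v` vs `%+v` point in #37030: with a stack-capturing errors package (github.com/pkg/errors shown here purely as an illustration; CockroachDB's own error handling differs in the details), `%v` prints only the message while `%+v` includes the captured stack trace, which is the difference between a useless and a useful log line for a non-retryable changefeed error.

```go
package main

import (
	"fmt"

	"github.com/pkg/errors"
)

func main() {
	// errors.New here attaches a stack trace at construction time.
	err := errors.New("descriptor not found")

	// %v prints only the message; %+v also prints the captured stack trace.
	// Using the wrong verb on the non-retryable path is how a terminal
	// changefeed error can end up in the logs with no useful stack.
	fmt.Printf("%v\n", err)
	fmt.Printf("%+v\n", err)
}
```
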
danhhz added a commit to danhhz/cockroach that referenced this issue May 6, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue May 13, 2019
Also make the crdb-chaos test more lenient to recovery times to work
around the flakes we're seeing. This thread should probably get pulled
at some point so leaving cockroachdb#36879 open to track it.

Closes cockroachdb#36905
Closes cockroachdb#36979

Release note: None
@danhhz
Copy link
Contributor Author

danhhz commented May 13, 2019

We're still seeing long recovery times on this test so I made it more lenient in #37498. Definitely a thread to pull on here, but it does seem to eventually recover, so not an emergency.

(Interestingly, the last 3 failures are tightly grouped: 11m17s, 11m12s, 11m23s. The failure condition is 10m, so we wouldn't see anything less than that, but there's nothing between 10m and 11m or above 12m. There were a few failures with much longer times before the last round of fixes, so I don't know whether to read anything into this.)

craig bot pushed a commit that referenced this issue May 13, 2019
37498: roachtest: unskip cdc/{crdb,sink}-chaos on 19.1 r=tbg a=danhhz

Also make the crdb-chaos test more lenient to recovery times to work
around the flakes we're seeing. This thread should probably get pulled
at some point so leaving #36879 open to track it.

Closes #36905
Closes #36979

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
danhhz removed their assignment May 14, 2019
ajwerner self-assigned this Jun 4, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue Jun 27, 2019
Previously, we'd hold off on truncating the raft log if a replica was
missing but contactable in the last 10 seconds. This meant that if a
node was down for *more* than 10 seconds, there was a good chance that
we'd truncate logs for some of its replicas (especially for its
high-traffic ones) and it would need snapshots for them when it came
back up.

This was for two reasons. First, we've historically assumed that it's
cheaper to catch a far-behind node up with a snapshot than with entries.
Second, when we replicated the raft log truncation, we'd have to include
it in snapshots anyway, but that's changed recently.

The problem is when a node is down for a short time but more than 10
seconds (think a couple minutes). It might come back up to find that it
needs a snapshot for most ranges. We rate limit snapshots fairly
aggressively because they've been disruptive in the past, but this means
that it could potentially take hours for a node to recover from a 2
minute outage.

This would be merely unfortunate if there wasn't a second compounding
issue. A recently restarted node has a cold leaseholder cache. When it
gets traffic for one of its replicas, it first tries itself as the
leaseholder (maybe it will get lucky and won't need the network hop).
Unfortunately, if the replica needs a snapshot, this decision currently
blocks on it. This means that requests sent to the recently started node
could block for as long as the heavily-throttled snapshots take, hours
or even days.

Short outages of more than 10 seconds are reasonably common with routine
maintenance (rolling to a new version, swapping hardware, etc), so it's
likely that customers will hit this (and one did).

This commit ties the threshold that the raft log queue uses for recent
activity to `server.time_until_store_dead`, which is already the
threshold where we give up on a down node and start transferring away
its replicas. It defaults to 5 minutes, which is still short, but it's
more reasonable than 10 seconds. Crucially, it also is a cluster
setting, so it can be overridden.

We'd like to move even further in the direction of leniency about raft
log truncation while a replica is missing, but this is a compromise
that's intended to be less risky to backport to 19.1.x.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down for less than
`server.time_until_store_dead` now recover more quickly when they
rejoin.
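
A toy sketch of the policy change in this commit (illustrative types, not the raftLogQueue code): a follower's log entries are protected from truncation if it was active within the recent-activity threshold, and that threshold moves from roughly 10 seconds to the value of server.time_until_store_dead.

```go
package main

import (
	"fmt"
	"time"
)

// follower is an illustrative view of a replica's progress as seen by the
// raft log queue: the last raft log index it is known to have and when the
// leaseholder last heard from it.
type follower struct {
	index        uint64
	lastActivity time.Time
}

// truncationIndex sketches the policy: a follower's log entries are protected
// from truncation if it was active within recentActiveThreshold. Before this
// change that threshold was ~10s; after it, it is tied to the
// server.time_until_store_dead setting (default 5m).
func truncationIndex(followers []follower, committed uint64, recentActiveThreshold time.Duration) uint64 {
	truncate := committed
	for _, f := range followers {
		if time.Since(f.lastActivity) <= recentActiveThreshold && f.index < truncate {
			truncate = f.index // don't truncate past a recently-active follower
		}
	}
	return truncate
}

func main() {
	followers := []follower{
		{index: 1000, lastActivity: time.Now()},
		// Down for 2 minutes: protected under a 5m threshold, not under 10s.
		{index: 400, lastActivity: time.Now().Add(-2 * time.Minute)},
	}
	fmt.Println(truncationIndex(followers, 1000, 10*time.Second)) // 1000
	fmt.Println(truncationIndex(followers, 1000, 5*time.Minute))  // 400
}
```
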
danhhz added a commit to danhhz/cockroach that referenced this issue Jun 27, 2019
Previously, we'd hold off on truncating the raft log if a replica was
missing but contactable in the last 10 seconds. This meant that if a
node was down for *more* than 10 seconds, there was a good chance that
we'd truncate logs for some of its replicas (especially for its
high-traffic ones) and it would need snapshots for them when it came
back up.

This was for two reasons. First, we've historically assumed that it's
cheaper to catch a far-behind node up with a snapshot than with entries.
Second, snapshots historically had to include the Raft log which implied
a need to keep the size of the Raft log tightly controlled due to being
pulled into memory at the snapshot receiver, but that's changed
recently.

The problem is when a node is down for longer than 10 seconds but
shorter than the time it takes to upreplicate all of its ranges onto new
nodes. It might come back up to find that it needs a snapshot for most
ranges. We rate limit snapshots fairly aggressively because they've been
disruptive in the past, but this means that it could potentially take
hours for a node to recover from a 2 minute outage.

This would be merely unfortunate if there wasn't a second compounding
issue. A recently restarted node has a cold leaseholder cache. When it
gets traffic for one of its replicas, it first tries itself as the
leaseholder (maybe it will get lucky and won't need the network hop).
Unfortunately, if the replica needs a snapshot, this decision currently
blocks on it. This means that requests sent to the recently started node
could block for as long as the heavily-throttled snapshots take, hours
or even days.

Short outages of more than 10 seconds are reasonably common with routine
maintenance (rolling to a new version, swapping hardware, etc), so it's
likely that customers will hit this (and one did).

This commit avoids truncating the log past any follower's position when
all replicas have recently been active (the quota pool keeps it from
growing without bound in this case). If at least one replica hasn't
recently been active, it holds off any truncation until the log reaches
a size threshold.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down for less than
`server.time_until_store_dead` now recover more quickly when they
rejoin.
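
And a toy sketch of the revised policy in this second version of the commit (again with made-up types): never truncate past the slowest follower while every replica is recently active, and once a follower goes quiet, hold off truncation only until the log passes a size threshold.

```go
package main

import (
	"fmt"
	"time"
)

type follower struct {
	index        uint64
	lastActivity time.Time
}

// chooseTruncation sketches the revised policy: while every follower has been
// recently active, never truncate past the slowest follower (the quota pool
// bounds the log in that case). Once some follower has gone quiet, keep its
// entries only until the log grows past a size threshold, after which it will
// need a snapshot anyway.
func chooseTruncation(followers []follower, committed uint64,
	recentlyActive time.Duration, logSize, sizeThreshold int64) uint64 {

	minIndex, allActive := committed, true
	for _, f := range followers {
		if time.Since(f.lastActivity) > recentlyActive {
			allActive = false
		}
		if f.index < minIndex {
			minIndex = f.index
		}
	}
	if allActive || logSize < sizeThreshold {
		return minIndex // hold the log for the slowest follower
	}
	return committed // log too large; truncate and accept a snapshot later
}

func main() {
	fs := []follower{
		{index: 1000, lastActivity: time.Now()},
		{index: 400, lastActivity: time.Now().Add(-3 * time.Minute)},
	}
	fmt.Println(chooseTruncation(fs, 1000, time.Minute, 1<<20, 4<<20)) // 400: under size threshold
	fmt.Println(chooseTruncation(fs, 1000, time.Minute, 8<<20, 4<<20)) // 1000: log too large
}
```
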
danhhz added a commit to danhhz/cockroach that referenced this issue Jun 28, 2019
tbg pushed a commit to danhhz/cockroach that referenced this issue Jul 2, 2019
craig bot pushed a commit that referenced this issue Jul 2, 2019
38484: storage: truncate raft log less aggressively when replica is missing r=tbg a=danhhz

Partial mitigation for #37906

Potentially also helps with #36879

Release note (bug fix): Nodes that have been down now recover more
quickly when they rejoin, assuming they weren't down for much more
than the value of the `server.time_until_store_dead` cluster setting
(which defaults to 5 minutes).

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg pushed a commit to tbg/cockroach that referenced this issue Jul 3, 2019
craig bot pushed a commit that referenced this issue Feb 3, 2021
54201: roachtest: unskip cdc/crdb-chaos r=aayushshah15 a=aayushshah15

I ran this test a total of 15 times in parallel and wasn't able to reproduce. Since it's been skipped for 2+ releases, it's hard to know what fixed it, but a good guess is #48561.

Release note: None

Fixes #37716 

Informs #36879

Release justification: testing only

57170: util/log: new experimental integration with Fluentd  r=itsbilal a=knz

Release note (cli change): It is now possible to redirect logging to
[Fluentd](https://www.fluentd.org)-compatible network collectors. See
the documentation for details. This is an alpha-quality feature.


59741: opt: fix panic in GenerateLookupJoin r=mgartner a=mgartner

#### opt: fix panic in GenerateLookupJoin

In #57690 a new code path was introduced that calls `findConstantFilterCols`
from `GenerateLookupJoins`. This new code path made it possible for the
filters passed to `findConstantFilterCols` to contain columns that are not
part of the given table. This violated the assumption that the filter only
contains columns in the given table and caused a panic. This commit fixes the
issue by ignoring constant filters on columns that are not part of the given
table.

Fixes #59738

Release note (bug fix): A bug has been fixed that caused errors when
joining two tables when one of the tables had a computed column. This
bug was present since version 21.1.0-alpha.2 and not present in any
production releases.

#### opt: move findConstantFilterCols to general_funcs.go

Release note: None


59779: flowinfra: deflake a test r=yuzefovich a=yuzefovich

Previously, a unit test could fail in rare circumstances when relocating
a range to a remote node, and now we will use SucceedsSoon to avoid
that. Also unskip the vectorized option.

Fixes: #59712

Release note: None

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
ajwerner (Contributor) commented

I do not know anything about this. I suspect it had to do with catch-up scans. Moving off my board.

ajwerner removed their assignment Mar 23, 2022
miretskiy (Contributor) commented

Stale.
