changefeedccl: cdc/crdb-chaos sometimes takes a while to recover from chaos #36879
Okay, well first of all, something is bonkers about the test. The resolved timestamp it's looking at should never regress, so a 42m jump in 2s means something is broken.
Started back into this and got a high-water checkpoint that's behind what the feed was restarted with. Still digging into why, but perhaps we want to make sure we
A little overwhelmed with all the threads to pull on here, so just going to write them in here for now and follow up later. Just got an unexpectedly failed changefeed, with no useful stacktrace from the errors stuff. I suspect this happened while trying to compute the spans to track (the
Progress! Figured out that RangeFeed will return checkpoints from before the timestamp you started it at because it just happily forwards closed timestamps as long as there are no outstanding intents before them. When combined with job restarts, this could cause the changefeed high-water to regress (since we don't initialize the spans in the

I currently have a local fix for changefeeds to throw out any rangefeed checkpoints from before the timestamp we started the rangefeed with. I also think it'd be appropriate to add a high-level assertion that the high-water never regresses (perhaps a Fatal in tests, Sentry in prod kind of thing).

@nvanbenschoten I can't really see a reason why a rangefeed would ever need to return checkpoints from below the timestamp specified in its request. What do you think about filtering these inside the rangefeed registration?
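To make the fix described here concrete, a minimal sketch of the consumer-side filtering plus the high-water assertion (this is illustrative, not the actual changefeedccl code; `Timestamp`, `feedState`, and the field names are stand-ins for the real `hlc.Timestamp` and job-progress plumbing):

```go
package changefeedfilter

import "fmt"

// Timestamp is a stand-in for CockroachDB's hlc.Timestamp, with only what the
// sketch needs.
type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

type feedState struct {
	startTS   Timestamp // timestamp the rangefeed was started at
	highWater Timestamp // progress persisted in the job record
}

// handleResolved applies the local fix described above: resolved-timestamp
// checkpoints from before the rangefeed's start timestamp are dropped, and any
// remaining regression of the high-water is treated as an assertion failure.
func (s *feedState) handleResolved(resolved Timestamp) error {
	if resolved.Less(s.startTS) {
		// RangeFeed can forward closed timestamps from before the start
		// timestamp; ignoring them keeps them out of the frontier entirely.
		return nil
	}
	if resolved.Less(s.highWater) {
		// Should be impossible after the filter above; surface it loudly
		// (a Fatal in tests, an error report in production).
		return fmt.Errorf("high-water regressed: %+v < %+v", resolved, s.highWater)
	}
	s.highWater = resolved
	return nil
}
```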
Yeah, we very explicitly allow this situation: cockroach/pkg/storage/rangefeed/registry.go, lines 368 to 369 at b46cf9e.
I don't remember why though. I think I wrote it that way because it seemed like more information about resolved timestamps is always better than less. If that violates the rangefeed contract that consumers are expecting, I don't see a reason why we shouldn't change it.

The one case where this might come up is immediately after a rangefeed's catch-up scan. If we filter checkpoints with timestamps below the rangefeed's start timestamp, a rangefeed may be unable to signal that it has completed its catch-up scan. Is that ok?
It's easy to argue that changefeeds should be doing this filtering (and like I mentioned above, that's where I have my fix right now). I'm wary of generalizing anything about what contract rangefeed consumers expect from only the changefeed data point. Given that you thought this was a good idea while this was most loaded up in your head, it's probably safest to keep the filtering in changefeed for now (and document this more loudly on at least some of RangeFeedRequest/RangeFeedCheckpoint/(*DistSender).RangeFeed). We can reconsider this later once Rangefeed has more consumers.
Rangefeed currently signals the end of a catchup scan with an empty timestamp, no? I'd imagine it's not surprising to exempt this from the filtering if we did it.
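A tiny sketch of how that exemption could look on the consumer side, reusing the illustrative `Timestamp` stand-in from the earlier sketch (again an assumption about shape, not the actual rangefeed or changefeed code):

```go
// shouldSkipResolved drops checkpoints from before the rangefeed's start
// timestamp, but lets the empty timestamp through since, per the discussion
// above, that would be how the end of the catch-up scan is signaled.
func shouldSkipResolved(resolved, startTS Timestamp) bool {
	if resolved == (Timestamp{}) {
		return false // catch-up scan completion marker: always deliver
	}
	return resolved.Less(startTS)
}
```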
When the ChangeFrontier processor for a changefeed is started up, it creates a spanFrontier to track the overall progression for the feed. Each watched span is set to the empty timestamp so we can tell the difference between a rangefeed being behind and one still initializing.

However, rangefeed will happily return resolved timestamps less than the one it was started with. If we feed this into the spanFrontier, it can report a timestamp behind the one everything started up at. In the case of a new changefeed, this is fine (though perhaps odd). In the case of a changefeed being restarted from the job's high-water, this means we can overwrite the high-water progress with the lower value, regressing it. At the least, this is very surprising UX, but it's easy to argue that it's a correctness issue.

Touches cockroachdb#36879

Release note (bug fix): Fixed a bug where `CHANGEFEED` job progress would regress when the job was restarted.
37009: changefeedccl: fix job progress regressing r=nvanbenschoten a=danhhz

When the ChangeFrontier processor for a changefeed is started up, it creates a spanFrontier to track the overall progression for the feed. Each watched span is set to the empty timestamp so we can tell the difference between a rangefeed being behind and one still initializing.

However, rangefeed will happily return resolved timestamps less than the one it was started with. If we feed this into the spanFrontier, it can report a timestamp behind the one everything started up at. In the case of a new changefeed, this is fine (though perhaps odd). In the case of a changefeed being restarted from the job's high-water, this means we can overwrite the high-water progress with the lower value, regressing it. At the least, this is very surprising UX, but it's easy to argue that it's a correctness issue.

Touches #36879

Release note (bug fix): Fixed a bug where `CHANGEFEED` job progress would regress when the job was restarted.

37030: changefeedccl: log more details when returning with non-retryable errors r=tbg a=danhhz

The `%v` and `%+v` were switched from what I intended in the line that logs retryable errors and the one that handles non-retryable errors.

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
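To illustrate the frontier mechanics the commit message describes, here is a toy version of a span frontier (reusing the `Timestamp` stand-in from the earlier sketch; the real spanFrontier is keyed on key spans and considerably more involved): the reported progress is the minimum resolved timestamp across watched spans, so forwarding one uninitialized span to a value below the job's starting high-water drags the reported progress backwards.

```go
// spanProgress tracks one resolved timestamp per watched span; the empty
// timestamp means "still initializing", as distinct from "behind".
type spanProgress struct {
	spans map[string]Timestamp
}

func newSpanProgress(watched []string) *spanProgress {
	p := &spanProgress{spans: make(map[string]Timestamp)}
	for _, sp := range watched {
		p.spans[sp] = Timestamp{} // not yet heard from
	}
	return p
}

// forward records a resolved timestamp for one span, never moving that span
// backwards. Forwarding an uninitialized span to a timestamp below the job's
// starting high-water still counts as "forward" here, which is how the
// regression described above happens if such checkpoints aren't filtered.
func (p *spanProgress) forward(span string, ts Timestamp) {
	if p.spans[span].Less(ts) {
		p.spans[span] = ts
	}
}

// progress returns the overall frontier: the minimum timestamp across spans.
func (p *spanProgress) progress() Timestamp {
	first := true
	var min Timestamp
	for _, ts := range p.spans {
		if first || ts.Less(min) {
			min, first = ts, false
		}
	}
	return min
}
```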
Also make the crdb-chaos test more lenient to recovery times to work around the flakes we're seeing. This thread should probably get pulled at some point so leaving cockroachdb#36879 open to track it.

Closes cockroachdb#36905
Closes cockroachdb#36979

Release note: None
We're still seeing long recovery times on this test so I made it more lenient in #37498. Definitely a thread to pull on here, but it does seem to eventually recover, so not an emergency. (Interestingly the last 3 are tightly grouped: 11m17s, 11m12s, 11m23s. The failure condition is 10m, so we wouldn't see anything less than that, but nothing between 10m and 11m or above 12m. There were a few failures with much longer times before the last round of fixes, so dunno whether to read anything into this.)
37498: roachtest: unskip cdc/{crdb,sink}-chaos on 19.1 r=tbg a=danhhz

Also make the crdb-chaos test more lenient to recovery times to work around the flakes we're seeing. This thread should probably get pulled at some point so leaving #36879 open to track it.

Closes #36905
Closes #36979

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
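For readers unfamiliar with the test, a hypothetical sketch of the kind of recovery check being made more lenient here (this is not the actual cdc/crdb-chaos roachtest code; `waitForRecovery` and `latestResolved` are assumed helpers, with the real test reading resolved messages from the sink):

```go
package main

import (
	"fmt"
	"time"
)

// waitForRecovery polls the changefeed's latest resolved timestamp after chaos
// ends and fails if it hasn't caught back up to within maxLag of real time
// before the deadline (the "failure condition" discussed above).
func waitForRecovery(latestResolved func() time.Time, maxLag, deadline time.Duration) error {
	start := time.Now()
	for time.Since(start) < deadline {
		if time.Since(latestResolved()) < maxLag {
			return nil // resolved timestamps are near real time again
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("changefeed did not recover within %s", deadline)
}

func main() {
	// Toy usage: a feed whose resolved timestamp is always 30s behind recovers
	// immediately under a 1m lag allowance and a lenient 10m deadline.
	resolved := func() time.Time { return time.Now().Add(-30 * time.Second) }
	fmt.Println(waitForRecovery(resolved, time.Minute, 10*time.Minute))
}
```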
Previously, we'd hold off on truncating the raft log if a replica was missing but contactable in the last 10 seconds. This meant that if a node was down for *more* than 10 seconds, there was a good chance that we'd truncate logs for some of its replicas (especially for its high-traffic ones) and it would need snapshots for them when it came back up. This was for two reasons. First, we've historically assumed that it's cheaper to catch a far-behind node up with a snapshot than with entries. Second, when we replicated the raft log truncation, we'd have to include it in snapshots anyway, but that's changed recently.

The problem is when a node is down for a short time but more than 10 seconds (think a couple minutes). It might come back up to find that it needs a snapshot for most ranges. We rate limit snapshots fairly aggressively because they've been disruptive in the past, but this means that it could potentially take hours for a node to recover from a 2 minute outage.

This would be merely unfortunate if there wasn't a second compounding issue. A recently restarted node has a cold leaseholder cache. When it gets traffic for one of its replicas, it first tries itself as the leaseholder (maybe it will get lucky and won't need the network hop). Unfortunately, if the replica needs a snapshot, this decision currently blocks on it. This means that requests sent to the recently started node could block for as long as the heavily-throttled snapshots take, hours or even days.

Short outages of more than 10 seconds are reasonably common with routine maintenance (rolling to a new version, swapping hardware, etc), so it's likely that customers will hit this (and one did).

This commit ties the threshold that the raft log queue uses for recent activity to `server.time_until_store_dead`, which is already the threshold where we give up on a down node and start transferring away its replicas. It defaults to 5 minutes, which is still short, but it's more reasonable than 10 seconds. Crucially, it also is a cluster setting, so it can be overridden. We'd like to move even further in the direction of leniency about raft log truncation while a replica is missing, but this is a compromise that's intended to be less risky to backport to 19.1.x.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down for less than `server.time_until_store_dead` now recover more quickly when they rejoin.
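An illustrative sketch of this first approach (not the actual raftLogQueue code; `follower` and its fields are stand-ins): a follower counts as "recently active", and thus worth holding log entries for, if it was seen within `server.time_until_store_dead` rather than within a fixed 10 seconds.

```go
package raftlogsketch

import "time"

// follower is how the truncation decision might see a replica's progress and
// liveness; field names are illustrative.
type follower struct {
	lastSeen   time.Time
	matchIndex uint64
}

// truncatableIndex returns the highest log index we are willing to truncate
// up to: never past the position of a follower we still expect to come back,
// where "expect to come back" now uses the store-dead threshold instead of 10s.
func truncatableIndex(followers []follower, lastIndex uint64, timeUntilStoreDead time.Duration) uint64 {
	idx := lastIndex
	for _, f := range followers {
		if time.Since(f.lastSeen) <= timeUntilStoreDead && f.matchIndex < idx {
			idx = f.matchIndex
		}
	}
	return idx
}
```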
Previously, we'd hold off on truncating the raft log if a replica was missing but contactable in the last 10 seconds. This meant that if a node was down for *more* than 10 seconds, there was a good chance that we'd truncate logs for some of its replicas (especially for its high-traffic ones) and it would need snapshots for them when it came back up. This was for two reasons. First, we've historically assumed that it's cheaper to catch a far-behind node up with a snapshot than with entries. Second, snapshots historically had to include the Raft log which implied a need to keep the size of the Raft log tightly controlled due to being pulled into memory at the snapshot receiver, but that's changed recently.

The problem is when a node is down for longer than 10 seconds but shorter than the time it takes to upreplicate all of its ranges onto new nodes. It might come back up to find that it needs a snapshot for most ranges. We rate limit snapshots fairly aggressively because they've been disruptive in the past, but this means that it could potentially take hours for a node to recover from a 2 minute outage.

This would be merely unfortunate if there wasn't a second compounding issue. A recently restarted node has a cold leaseholder cache. When it gets traffic for one of its replicas, it first tries itself as the leaseholder (maybe it will get lucky and won't need the network hop). Unfortunately, if the replica needs a snapshot, this decision currently blocks on it. This means that requests sent to the recently started node could block for as long as the heavily-throttled snapshots take, hours or even days.

Short outages of more than 10 seconds are reasonably common with routine maintenance (rolling to a new version, swapping hardware, etc), so it's likely that customers will hit this (and one did).

This commit avoids truncating the log past any follower's position when all replicas have recently been active (the quota pool keeps it from growing without bound in this case). If at least one replica hasn't recently been active, it holds off any truncation until the log reaches a size threshold.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down for less than `server.time_until_store_dead` now recover more quickly when they rejoin.
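A sketch of this revised behavior, again with illustrative names rather than the real code: when every replica has been active recently, never truncate past the slowest follower (the quota pool bounds log growth in that case); when some replica has gone quiet, hold off truncation until the log exceeds a size threshold, and only then stop waiting for it.

```go
package raftlogsketch2

// follower is how the truncation decision might see a replica's state; the
// fields are stand-ins, not the actual storage package types.
type follower struct {
	recentlyActive bool
	matchIndex     uint64
}

// chooseTruncationIndex returns the highest log index we are willing to
// truncate up to under the policy described above.
func chooseTruncationIndex(followers []follower, lastIndex uint64, logSize, maxLogSize int64) uint64 {
	allActive := true
	minAll, minActive := lastIndex, lastIndex
	for _, f := range followers {
		if f.matchIndex < minAll {
			minAll = f.matchIndex
		}
		if f.recentlyActive {
			if f.matchIndex < minActive {
				minActive = f.matchIndex
			}
		} else {
			allActive = false
		}
	}
	if allActive || logSize < maxLogSize {
		// Either everyone is healthy, or a replica is missing but the log is
		// still small: keep entries back to the slowest follower's position so
		// it can catch up from the log instead of a snapshot.
		return minAll
	}
	// The log has outgrown the size threshold: stop waiting for quiet
	// followers and truncate up to the slowest recently-active one.
	return minActive
}
```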
Previously, we'd hold off on truncating the raft log if a replica was missing but contactable in the last 10 seconds. This meant that if a node was down for *more* than 10 seconds, there was a good chance that we'd truncate logs for some of its replicas (especially for its high-traffic ones) and it would need snapshots for them when it came back up. This was for two reasons. First, we've historically assumed that it's cheaper to catch a far-behind node up with a snapshot than with entries. Second, snapshots historically had to include the Raft log which implied a need to keep the size of the Raft log tightly controlled due to being pulled into memory at the snapshot receiver, but that's changed recently.

The problem is when a node is down for longer than 10 seconds but shorter than the time it takes to upreplicate all of its ranges onto new nodes. It might come back up to find that it needs a snapshot for most ranges. We rate limit snapshots fairly aggressively because they've been disruptive in the past, but this means that it could potentially take hours for a node to recover from a 2 minute outage.

This would be merely unfortunate if there wasn't a second compounding issue. A recently restarted node has a cold leaseholder cache. When it gets traffic for one of its replicas, it first tries itself as the leaseholder (maybe it will get lucky and won't need the network hop). Unfortunately, if the replica needs a snapshot, this decision currently blocks on it. This means that requests sent to the recently started node could block for as long as the heavily-throttled snapshots take, hours or even days.

Short outages of more than 10 seconds are reasonably common with routine maintenance (rolling to a new version, swapping hardware, etc), so it's likely that customers will hit this (and one did).

This commit avoids truncating the log past any follower's position when all replicas have recently been active (the quota pool keeps it from growing without bound in this case). If at least one replica hasn't recently been active, it holds off any truncation until the log reaches a size threshold.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down now recover more quickly when they rejoin, assuming they weren't down for much more than the value of the `server.time_until_store_dead` cluster setting (which defaults to 5 minutes).
38484: storage: truncate raft log less aggressively when replica is missing r=tbg a=danhhz

Previously, we'd hold off on truncating the raft log if a replica was missing but contactable in the last 10 seconds. This meant that if a node was down for *more* than 10 seconds, there was a good chance that we'd truncate logs for some of its replicas (especially for its high-traffic ones) and it would need snapshots for them when it came back up. This was for two reasons. First, we've historically assumed that it's cheaper to catch a far-behind node up with a snapshot than with entries. Second, when we replicated the raft log truncation, we'd have to include it in snapshots anyway, but that's changed recently.

The problem is when a node is down for a short time but more than 10 seconds (think a couple minutes). It might come back up to find that it needs a snapshot for most ranges. We rate limit snapshots fairly aggressively because they've been disruptive in the past, but this means that it could potentially take hours for a node to recover from a 2 minute outage.

This would be merely unfortunate if there wasn't a second compounding issue. A recently restarted node has a cold leaseholder cache. When it gets traffic for one of its replicas, it first tries itself as the leaseholder (maybe it will get lucky and won't need the network hop). Unfortunately, if the replica needs a snapshot, this decision currently blocks on it. This means that requests sent to the recently started node could block for as long as the heavily-throttled snapshots take, hours or even days.

Short outages of more than 10 seconds are reasonably common with routine maintenance (rolling to a new version, swapping hardware, etc), so it's likely that customers will hit this (and one did).

This commit ties the threshold that the raft log queue uses for recent activity to `server.time_until_store_dead`, which is already the threshold where we give up on a down node and start transferring away its replicas. It defaults to 5 minutes, which is still short, but it's more reasonable than 10 seconds. Crucially, it also is a cluster setting, so it can be overridden. We'd like to move even further in the direction of leniency about raft log truncation while a replica is missing, but this is a compromise that's intended to be less risky to backport to 19.1.x.

Partial mitigation for #37906

Potentially also helps with #36879

Release note (bug fix): Nodes that have been down now recover more quickly when they rejoin, assuming they weren't down for much more than the value of the `server.time_until_store_dead` cluster setting (which defaults to 5 minutes).

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
54201: roachtest: unskip cdc/crdb-chaos r=aayushshah15 a=aayushshah15

I ran this test a total of 15 times in parallel and wasn't able to reproduce. Since it's been skipped for 2+ releases, it's hard to know what fixed it, but a good guess is #48561.

Release note: None

Fixes #37716
Informs #36879

Release justification: testing only

57170: util/log: new experimental integration with Fluentd r=itsbilal a=knz

Release note (cli change): It is now possible to redirect logging to [Fluentd](https://www.fluentd.org)-compatible network collectors. See the documentation for details. This is an alpha-quality feature.

59741: opt: fix panic in GenerateLookupJoin r=mgartner a=mgartner

#### opt: fix panic in GenerateLookupJoin

In #57690 a new code path was introduced to `findConstantFilterCols` from `GenerateLookupJoins`. This new code path made it possible for the filters passed to `findConstantFilterCols` to contain columns that are not part of the given table. This violated the assumption that the filter only contains columns in the given table and caused a panic. This commit fixes the issue by neglecting constant filters for columns not in the given table.

Fixes #59738

Release note (bug fix): A bug has been fixed that caused errors when joining two tables when one of the tables had a computed column. This bug was present since version 21.1.0-alpha.2 and not present in any production releases.

#### opt: move findConstantFilterCols to general_funcs.go

Release note: None

59779: flowinfra: deflake a test r=yuzefovich a=yuzefovich

Previously, a unit test could fail in rare circumstances when relocating a range to a remote node, and now we will use SucceedsSoon to avoid that. Also unskip the vectorized option.

Fixes: #59712

Release note: None

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
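As an aside on the GenerateLookupJoins fix quoted above, a hypothetical sketch of the idea (the `filter` type and `findConstantFilterCols` signature here are illustrative stand-ins, not the optimizer's real API): constant filters are collected only for columns that actually belong to the table under consideration.

```go
package optsketch

// filter is a stand-in for an equality filter as the optimizer might see it.
type filter struct {
	col       string
	constVal  string
	isConstEq bool
}

// findConstantFilterCols collects columns constrained to constants, skipping
// any filter column that isn't part of the given table instead of assuming it
// must be (the assumption whose violation caused the panic described above).
func findConstantFilterCols(tableCols map[string]bool, filters []filter) map[string]string {
	constCols := make(map[string]string)
	for _, f := range filters {
		if !f.isConstEq {
			continue
		}
		if !tableCols[f.col] {
			// Column from another table: ignore it rather than panic.
			continue
		}
		constCols[f.col] = f.constVal
	}
	return constCols
}
```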
I do not know anything about this. I suspect it had to do with catch-up scans. Moving off my board.
Stale.
I've seen the latency make it up to 10-20m before, but it really shouldn't ever go higher than 1-2m. Definitely something to look into here; it's easy enough to repro by extending the duration of the test.
Jira issue: CRDB-4477