
changefeedccl: cdc/crdb-chaos sometimes takes a while to recover from chaos #36879

Closed

danhhz opened this issue Apr 16, 2019 · 9 comments

Labels: A-cdc Change Data Capture · C-investigation Further steps needed to qualify. C-label will change. · T-cdc

Comments

danhhz commented Apr 16, 2019

I've seen the latency make it up to 10-20m before, but it really shouldn't ever go higher than 1-2m. Definitely something to look into here; it's easy enough to reproduce by extending the duration of the test.

Jira issue: CRDB-4477

danhhz commented Apr 18, 2019

Okay, well first of all, something is bonkers about the test. The resolved timestamp it's looking at should never regress, so a 42m jump in 2s means something is broken.

changefeed: 21:13:54 cdc.go:833: end-to-end steady latency 1m42.086645442s; max steady latency so far 18m58.209893436s
changefeed: 21:13:56 cdc.go:833: end-to-end steady latency 43m54.855883321s; max steady latency so far 43m54.855883321s
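
For illustration, here is a minimal Go sketch (not the actual cdc.go verifier; the type and field names are made up) of the check the test is effectively doing: the resolved timestamp feeding the latency calculation should only ever move forward, and end-to-end steady latency is just "now" minus that timestamp.

```go
package main

import (
	"fmt"
	"time"
)

// latencyVerifier is an illustrative stand-in for the roachtest helper in
// cdc.go. It tracks the most recent resolved timestamp seen by the changefeed
// and the worst end-to-end steady latency observed so far.
type latencyVerifier struct {
	lastResolved     time.Time
	maxSteadyLatency time.Duration
}

// noteResolved records a resolved timestamp emitted by the changefeed. A
// resolved timestamp should only ever move forward; a regression means either
// the test or the changefeed is broken.
func (v *latencyVerifier) noteResolved(resolved time.Time) error {
	if resolved.Before(v.lastResolved) {
		return fmt.Errorf("resolved timestamp regressed: %s -> %s",
			v.lastResolved, resolved)
	}
	v.lastResolved = resolved

	latency := time.Since(resolved) // end-to-end steady latency
	if latency > v.maxSteadyLatency {
		v.maxSteadyLatency = latency
	}
	fmt.Printf("end-to-end steady latency %s; max steady latency so far %s\n",
		latency, v.maxSteadyLatency)
	return nil
}

func main() {
	v := &latencyVerifier{}
	_ = v.noteResolved(time.Now().Add(-2 * time.Minute))
	_ = v.noteResolved(time.Now().Add(-90 * time.Second))
}
```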

danhhz self-assigned this Apr 18, 2019

danhhz commented Apr 22, 2019

Got back into this and hit a high-water checkpoint that's behind the timestamp the feed was restarted with. Still digging into why, but perhaps we want to make sure we .Forward the old progress as a safety net. It's still odd that this ever happened, though.

I190422 19:02:23.407085 3713 ccl/changefeedccl/changefeed_dist.go:92  [n1] WIP starting changefeed from 1555959671.535291852,0 is behind by 1m11.871743607s
I190422 19:02:49.985172 3713 ccl/changefeedccl/changefeed.go:323  [n1] WIP wrote high-water checkpoint 1555959404.146301576,0 is behind by 6m5.83883919s
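
A minimal sketch of the ".Forward the old progress as a safety net" idea, using an illustrative timestamp type rather than the real hlc.Timestamp: persisted progress is only ever ratcheted forward, so a checkpoint computed behind the restart timestamp gets dropped instead of overwriting it.

```go
package main

import "fmt"

// ts is a simplified stand-in for an HLC timestamp; only ordering matters here.
type ts int64

// highWater is an illustrative holder for a changefeed's checkpointed progress.
type highWater struct{ v ts }

// forward ratchets the high-water: it only ever moves it forward, so a stale
// or duplicated checkpoint can never regress previously persisted progress.
func (h *highWater) forward(newTS ts) bool {
	if newTS <= h.v {
		return false // ignore anything at or behind the current high-water
	}
	h.v = newTS
	return true
}

func main() {
	// On restart, seed the in-memory progress with the job's saved high-water.
	hw := &highWater{v: 1555959671}

	// A checkpoint computed from stale rangefeed state is simply dropped.
	fmt.Println(hw.forward(1555959404), hw.v) // false 1555959671
	fmt.Println(hw.forward(1555959700), hw.v) // true 1555959700
}
```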


danhhz commented Apr 22, 2019

A little overwhelmed with all the threads to pull on here, so I'm just going to write them down for now and follow up later. Just got an unexpectedly failed changefeed, with no useful stack trace from the errors package. I suspect this happened while trying to compute the spans to track (the fetchSpansForTargets call in distChangefeedFlow). fetchSpansForTargets uses db.Txn, so it should have retried anything retryable. This entire cluster is pretty new, so I'd hope the descriptor it's looking for hasn't been GC'd yet.

I190422 19:04:45.882727 5450 ccl/changefeedccl/changefeed_dist.go:92  [n1] WIP starting changefeed from 1555959404.146301576,0 is behind by 8m1.736391533s
W190422 19:04:45.883095 5450 ccl/changefeedccl/changefeed_stmt.go:459  [n1] CHANGEFEED job 445281624778735617 returning with error: descriptor not found
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.init
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/structured.go:179
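
For context, a shape-only sketch of what the span computation looks like under a db.Txn-style retry loop; the runTxn helper and the span strings below are hypothetical stand-ins, not the real client.DB or sqlbase APIs. The point is that retryable txn errors get retried, while something like "descriptor not found" is terminal and surfaces as a job error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errRetryable marks errors that a KV transaction runner would retry, roughly
// playing the role that retryable txn errors play under db.Txn.
var errRetryable = errors.New("retryable")

// runTxn is an illustrative stand-in for db.Txn: it re-runs fn until it
// returns nil or a non-retryable error.
func runTxn(ctx context.Context, fn func(context.Context) error) error {
	for {
		if err := fn(ctx); !errors.Is(err, errRetryable) {
			return err
		}
	}
}

// fetchSpansForTargets sketches the shape of the real helper: resolve each
// target's descriptor inside one transaction and return its primary index
// span. A "descriptor not found" here is not retryable, which is why the
// changefeed surfaced it as a terminal job error.
func fetchSpansForTargets(ctx context.Context, targets []string) ([]string, error) {
	var spans []string
	err := runTxn(ctx, func(ctx context.Context) error {
		spans = spans[:0]
		for _, t := range targets {
			if t == "" {
				return errors.New("descriptor not found") // terminal, not retried
			}
			spans = append(spans, "/Table/"+t+"/{1-2}")
		}
		return nil
	})
	return spans, err
}

func main() {
	spans, err := fetchSpansForTargets(context.Background(), []string{"53"})
	fmt.Println(spans, err)
}
```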


danhhz commented Apr 22, 2019

Progress! Figured out that RangeFeed will return checkpoints from before the timestamp you started it at because it just happily forwards closed timestamps as long as there are no outstanding intents before them. When combined with job restarts, this could cause the changefeed high-water to regress (since we don't initialize the spans in the spanFrontier at any particular timestamp). This is likely one cause of the very high end-to-end latencies in the roachtest. It remains to be seen if it's the only cause.

I currently have a local fix for changefeeds to throw out any rangefeed checkpoints from before the timestamp we started the rangefeed with. I also think it'd be appropriate to add a high-level assertion that the high-water never regresses (perhaps a Fatal in tests, Sentry in prod kind of thing).

@nvanbenschoten I can't really see a reason why a rangefeed would ever need to return checkpoints from below the timestamp specified in its request. What do you think about filtering these inside the rangefeed registration?
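
A minimal sketch of the local fix described above, with made-up types: checkpoints resolved below the rangefeed's start timestamp are thrown out, and a high-water regression trips an assertion (a Fatal here stands in for the "Fatal in tests, Sentry in prod" idea).

```go
package main

import (
	"fmt"
	"log"
)

type ts int64

// checkpoint is an illustrative stand-in for a rangefeed checkpoint event.
type checkpoint struct{ resolved ts }

// frontier tracks the changefeed's high-water for this sketch.
type frontier struct{ highWater ts }

// onCheckpoint applies the fix: checkpoints resolved below the timestamp the
// rangefeed was started at are thrown out, and the high-water is asserted to
// never move backwards.
func (f *frontier) onCheckpoint(startTS ts, c checkpoint) {
	if c.resolved < startTS {
		return // stale checkpoint from before the rangefeed's start; ignore
	}
	if c.resolved < f.highWater {
		log.Fatalf("high-water regressed: %d -> %d", f.highWater, c.resolved)
	}
	f.highWater = c.resolved
}

func main() {
	f := &frontier{highWater: 100}
	start := ts(100)
	f.onCheckpoint(start, checkpoint{resolved: 90})  // dropped
	f.onCheckpoint(start, checkpoint{resolved: 120}) // advances
	fmt.Println(f.highWater)                         // 120
}
```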

nvanbenschoten (Member) commented

Yeah, we very explicitly allow this situation:

// Always publish checkpoint notifications, regardless
// of a registration's starting timestamp.

I don't remember why though. I think I wrote it that way because it seemed like more information about resolved timestamps is always better than less. If that violates the rangefeed contract that consumers are expecting, I don't see a reason why we shouldn't change it.

The one case where this might come up is immediately after a rangefeed's catch-up scan. If we filter checkpoints with timestamps below the rangefeed's start timestamp, a rangefeed may be unable to signal that it has completed its catch-up scan. Is that ok?


danhhz commented Apr 22, 2019

It's easy to argue that changefeeds should be doing this filtering (and like I mentioned above, that's where I have my fix right now). I'm wary of generalizing anything about what contract rangefeed consumers expect from only the changefeed data point. Given that you thought this was a good idea back when you had the most context loaded up in your head, it's probably safest to keep the filtering in changefeed for now (and document this more loudly on at least some of RangeFeedRequest/RangeFeedCheckpoint/(*DistSender).RangeFeed). We can reconsider this later once rangefeed has more consumers.

> The one case where this might come up is immediately after a rangefeed's catch-up scan. If we filter checkpoints with timestamps below the rangefeed's start timestamp, a rangefeed may be unable to signal that it has completed its catch-up scan. Is that ok?

Rangefeed currently signals the end of a catchup scan with an empty timestamp, no? I'd imagine it's not surprising to exempt this from the filtering if we did it.
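
A sketch of what that exemption could look like if the filtering did move into rangefeed, assuming (as above) that an empty timestamp is the catch-up-scan completion signal; the helper name is illustrative.

```go
package main

import "fmt"

type ts int64

// isEmpty treats the zero value as the "empty timestamp" that, per the
// comment above, signals the end of a catch-up scan in this sketch.
func (t ts) isEmpty() bool { return t == 0 }

// keepCheckpoint is an illustrative filter: stale checkpoints from before the
// rangefeed's start timestamp are dropped, but the empty-timestamp
// catch-up-scan completion signal is exempted so it still gets through.
func keepCheckpoint(startTS, resolved ts) bool {
	if resolved.isEmpty() {
		return true // catch-up scan complete; not a real resolved timestamp
	}
	return resolved >= startTS
}

func main() {
	fmt.Println(keepCheckpoint(100, 0))   // true: catch-up completion signal
	fmt.Println(keepCheckpoint(100, 90))  // false: stale checkpoint
	fmt.Println(keepCheckpoint(100, 110)) // true
}
```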

awoods187 added the C-investigation Further steps needed to qualify. C-label will change. label Apr 22, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue Apr 22, 2019
When the ChangeFrontier processor for a changefeed is started up, it
creates a spanFrontier to track the overall progression for the feed.
Each watched span is set to the empty timestamp so we can tell the
difference between a rangefeed being behind and one still initializing.

However, rangefeed will happily return resolved timestamps less than the
one it was started with. If we feed this into the spanFrontier, it can
report a timestamp behind the one everything started up at. In the case
of a new changefeed, this is fine (though perhaps odd). In the case of a
changefeed being restarted from the job's high-water, this means we can
overwrite the high-water progress with the lower value, regressing it.
At the least, this is very surprising UX, but it's easy to argue that
it's a correctness issue.

Touches cockroachdb#36879

Release note (bug fix): Fixed a bug where `CHANGEFEED` job progress
would regress when the job was restarted.
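
To make the mechanism in this commit message concrete, here is a toy spanFrontier (not the real CockroachDB code): each watched span starts at the empty timestamp, the frontier is the minimum across spans, and a pre-restart rangefeed checkpoint can therefore drag the reported frontier behind the job's saved high-water.

```go
package main

import "fmt"

type ts int64

// spanFrontier is a much-simplified sketch of the frontier the ChangeFrontier
// processor keeps: one resolved timestamp per watched span, with the overall
// frontier being the minimum across spans. Each span starts at the empty
// (zero) timestamp so "still initializing" is distinguishable from "behind".
type spanFrontier struct {
	spans map[string]ts
}

func newSpanFrontier(spans ...string) *spanFrontier {
	f := &spanFrontier{spans: map[string]ts{}}
	for _, s := range spans {
		f.spans[s] = 0 // empty timestamp: not yet initialized
	}
	return f
}

// forward ratchets a single span's resolved timestamp.
func (f *spanFrontier) forward(span string, resolved ts) {
	if resolved > f.spans[span] {
		f.spans[span] = resolved
	}
}

// frontier returns the minimum resolved timestamp across all watched spans.
func (f *spanFrontier) frontier() ts {
	first, min := true, ts(0)
	for _, t := range f.spans {
		if first || t < min {
			min, first = t, false
		}
	}
	return min
}

func main() {
	// Restarted feed: the job's saved high-water is 1555959671, but the
	// spans themselves start at zero.
	f := newSpanFrontier("a", "b")

	// A rangefeed checkpoint from *before* the restart timestamp still
	// forwards span "a" (it only has to beat zero)...
	f.forward("a", 1555959404)
	f.forward("b", 1555959700)

	// ...so the reported frontier lands behind the saved high-water, which is
	// exactly the regression the fix guards against.
	fmt.Println(f.frontier()) // 1555959404
}
```
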
craig bot pushed a commit that referenced this issue Apr 23, 2019
37009: changefeedccl: fix job progress regressing r=nvanbenschoten a=danhhz

37030: changefeedccl: log more details when returning with non-retryable errors r=tbg a=danhhz

The `%v` and `%+v` were switched from what I intended in the line that
logs retryable errors and the one that handles non-retryable errors.

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
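
For the `%v` vs `%+v` point in #37030: with a stack-capturing errors package (github.com/pkg/errors shown here purely as an illustration; CockroachDB's own error handling differs in the details), `%v` prints only the message while `%+v` includes the captured stack trace, which is the difference between a useless and a useful log line for a non-retryable changefeed error.

```go
package main

import (
	"fmt"

	"github.com/pkg/errors"
)

func main() {
	// errors.New here attaches a stack trace at construction time.
	err := errors.New("descriptor not found")

	// %v prints only the message; %+v also prints the captured stack trace.
	// Using the wrong verb on the non-retryable path is how a terminal
	// changefeed error can end up in the logs with no useful stack.
	fmt.Printf("%v\n", err)
	fmt.Printf("%+v\n", err)
}
```
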
danhhz added a commit to danhhz/cockroach that referenced this issue May 6, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue May 13, 2019
Also make the crdb-chaos test more lenient to recovery times to work
around the flakes we're seeing. This thread should probably get pulled
at some point so leaving cockroachdb#36879 open to track it.

Closes cockroachdb#36905
Closes cockroachdb#36979

Release note: None
@danhhz
Copy link
Contributor Author

danhhz commented May 13, 2019

We're still seeing long recovery times on this test so I made it more lenient in #37498. Definitely a thread to pull on here, but it does seem to eventually recover, so not an emergency.

(Interestingly, the last 3 failures are tightly grouped: 11m17s, 11m12s, 11m23s. The failure condition is 10m, so we wouldn't see anything less than that, but there's nothing between 10m and 11m or above 12m. There were a few failures with much longer times before the last round of fixes, so I don't know whether to read anything into this.)

craig bot pushed a commit that referenced this issue May 13, 2019
37498: roachtest: unskip cdc/{crdb,sink}-chaos on 19.1 r=tbg a=danhhz

Also make the crdb-chaos test more lenient to recovery times to work
around the flakes we're seeing. This thread should probably get pulled
at some point so leaving #36879 open to track it.

Closes #36905
Closes #36979

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
danhhz removed their assignment May 14, 2019
ajwerner self-assigned this Jun 4, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue Jun 27, 2019
Previously, we'd hold off on truncating the raft log if a replica was
missing but contactable in the last 10 seconds. This meant that if a
node was down for *more* than 10 seconds, there was a good chance that
we'd truncate logs for some of its replicas (especially for its
high-traffic ones) and it would need snapshots for them when it came
back up.

This was for two reasons. First, we've historically assumed that it's
cheaper to catch a far-behind node up with a snapshot than with entries.
Second, when we replicated the raft log truncation, we'd have to include
it in snapshots anyway, but that's changed recently.

The problem is when a node is down for a short time but more than 10
seconds (think a couple minutes). It might come back up to find that it
needs a snapshot for most ranges. We rate limit snapshots fairly
aggressively because they've been disruptive in the past, but this means
that it could potentially take hours for a node to recover from a 2
minute outage.

This would be merely unfortunate if there wasn't a second compounding
issue. A recently restarted node has a cold leaseholder cache. When it
gets traffic for one of its replicas, it first tries itself as the
leaseholder (maybe it will get lucky and won't need the network hop).
Unfortunately, if the replica needs a snapshot, this decision currently
blocks on it. This means that requests sent to the recently started node
could block for as long as the heavily-throttled snapshots take, hours
or even days.

Short outages of more than 10 seconds are reasonably common with routine
maintenance (rolling to a new version, swapping hardware, etc), so it's
likely that customers will hit this (and one did).

This commit ties the threshold that the raft log queue uses for recent
activity to `server.time_until_store_dead`, which is already the
threshold where we give up on a down node and start transferring away
its replicas. It defaults to 5 minutes, which is still short, but it's
more reasonable than 10 seconds. Crucially, it also is a cluster
setting, so it can be overridden.

We'd like to move even further in the direction of leniency about raft
log truncation while a replica is missing, but this is a compromise
that's intended to be less risky to backport to 19.1.x.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down for less than
`server.time_until_store_dead` now recover more quickly when they
rejoin.
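
A toy sketch of the policy change in this commit (illustrative types, not the raftLogQueue code): a follower's log entries are protected from truncation if it was active within the recent-activity threshold, and that threshold moves from roughly 10 seconds to the value of server.time_until_store_dead.

```go
package main

import (
	"fmt"
	"time"
)

// follower is an illustrative view of a replica's progress as seen by the
// raft log queue: the last raft log index it is known to have and when the
// leaseholder last heard from it.
type follower struct {
	index        uint64
	lastActivity time.Time
}

// truncationIndex sketches the policy: a follower's log entries are protected
// from truncation if it was active within recentActiveThreshold. Before this
// change that threshold was ~10s; after it, it is tied to the
// server.time_until_store_dead setting (default 5m).
func truncationIndex(followers []follower, committed uint64, recentActiveThreshold time.Duration) uint64 {
	truncate := committed
	for _, f := range followers {
		if time.Since(f.lastActivity) <= recentActiveThreshold && f.index < truncate {
			truncate = f.index // don't truncate past a recently-active follower
		}
	}
	return truncate
}

func main() {
	followers := []follower{
		{index: 1000, lastActivity: time.Now()},
		// Down for 2 minutes: protected under a 5m threshold, not under 10s.
		{index: 400, lastActivity: time.Now().Add(-2 * time.Minute)},
	}
	fmt.Println(truncationIndex(followers, 1000, 10*time.Second)) // 1000
	fmt.Println(truncationIndex(followers, 1000, 5*time.Minute))  // 400
}
```
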
danhhz added a commit to danhhz/cockroach that referenced this issue Jun 27, 2019
Previously, we'd hold off on truncating the raft log if a replica was
missing but contactable in the last 10 seconds. This meant that if a
node was down for *more* than 10 seconds, there was a good chance that
we'd truncate logs for some of its replicas (especially for its
high-traffic ones) and it would need snapshots for them when it came
back up.

This was for two reasons. First, we've historically assumed that it's
cheaper to catch a far-behind node up with a snapshot than with entries.
Second, snapshots historically had to include the Raft log which implied
a need to keep the size of the Raft log tightly controlled due to being
pulled into memory at the snapshot receiver, but that's changed
recently.

The problem is when a node is down for longer than 10 seconds but
shorter than the time it takes to upreplicate all of its ranges onto new
nodes. It might come back up to find that it needs a snapshot for most
ranges. We rate limit snapshots fairly aggressively because they've been
disruptive in the past, but this means that it could potentially take
hours for a node to recover from a 2 minute outage.

This would be merely unfortunate if there wasn't a second compounding
issue. A recently restarted node has a cold leaseholder cache. When it
gets traffic for one of its replicas, it first tries itself as the
leaseholder (maybe it will get lucky and won't need the network hop).
Unfortunately, if the replica needs a snapshot, this decision currently
blocks on it. This means that requests sent to the recently started node
could block for as long as the heavily-throttled snapshots take, hours
or even days.

Short outages of more than 10 seconds are reasonably common with routine
maintenance (rolling to a new version, swapping hardware, etc), so it's
likely that customers will hit this (and one did).

This commit avoids truncating the log past any follower's position when
all replicas have recently been active (the quota pool keeps it from
growing without bound in this case). If at least one replica hasn't
recently been active, it holds off any truncation until the log reaches
a size threshold.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down for less than
`server.time_until_store_dead` now recover more quickly when they
rejoin.
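
And a toy sketch of the revised policy in this second version of the commit (again with made-up types): never truncate past the slowest follower while every replica is recently active, and once a follower goes quiet, hold off truncation only until the log passes a size threshold.

```go
package main

import (
	"fmt"
	"time"
)

type follower struct {
	index        uint64
	lastActivity time.Time
}

// chooseTruncation sketches the revised policy: while every follower has been
// recently active, never truncate past the slowest follower (the quota pool
// bounds the log in that case). Once some follower has gone quiet, keep its
// entries only until the log grows past a size threshold, after which it will
// need a snapshot anyway.
func chooseTruncation(followers []follower, committed uint64,
	recentlyActive time.Duration, logSize, sizeThreshold int64) uint64 {

	minIndex, allActive := committed, true
	for _, f := range followers {
		if time.Since(f.lastActivity) > recentlyActive {
			allActive = false
		}
		if f.index < minIndex {
			minIndex = f.index
		}
	}
	if allActive || logSize < sizeThreshold {
		return minIndex // hold the log for the slowest follower
	}
	return committed // log too large; truncate and accept a snapshot later
}

func main() {
	fs := []follower{
		{index: 1000, lastActivity: time.Now()},
		{index: 400, lastActivity: time.Now().Add(-3 * time.Minute)},
	}
	fmt.Println(chooseTruncation(fs, 1000, time.Minute, 1<<20, 4<<20)) // 400: under size threshold
	fmt.Println(chooseTruncation(fs, 1000, time.Minute, 8<<20, 4<<20)) // 1000: log too large
}
```
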
danhhz added a commit to danhhz/cockroach that referenced this issue Jun 28, 2019
tbg pushed a commit to danhhz/cockroach that referenced this issue Jul 2, 2019
craig bot pushed a commit that referenced this issue Jul 2, 2019
38484: storage: truncate raft log less aggressively when replica is missing r=tbg a=danhhz

Partial mitigation for #37906

Potentially also helps with #36879

Release note (bug fix): Nodes that have been down now recover more
quickly when they rejoin, assuming they weren't down for much more
than the value of the `server.time_until_store_dead` cluster setting
(which defaults to 5 minutes).

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg pushed a commit to tbg/cockroach that referenced this issue Jul 3, 2019
craig bot pushed a commit that referenced this issue Feb 3, 2021
54201: roachtest: unskip cdc/crdb-chaos r=aayushshah15 a=aayushshah15

I ran this test a total of 15 times in parallel and wasn't able to reproduce. Since it's been skipped for 2+ releases, it's hard to know what fixed it, but a good guess is #48561.

Release note: None

Fixes #37716 

Informs #36879

Release justification: testing only

57170: util/log: new experimental integration with Fluentd  r=itsbilal a=knz

Release note (cli change): It is now possible to redirect logging to
[Fluentd](https://www.fluentd.org)-compatible network collectors. See
the documentation for details. This is an alpha-quality feature.


59741: opt: fix panic in GenerateLookupJoin r=mgartner a=mgartner

#### opt: fix panic in GenerateLookupJoin

In #57690 a new code path was introduced that calls `findConstantFilterCols`
from `GenerateLookupJoins`. This new code path made it possible for the
filters passed to `findConstantFilterCols` to contain columns that are not
part of the given table. This violated the assumption that the filter only
contains columns in the given table and caused a panic. This commit fixes the
issue by ignoring constant filters on columns that are not part of the given
table.

Fixes #59738

Release note (bug fix): A bug has been fixed that caused errors when
joining two tables when one of the tables had a computed column. This
bug was present since version 21.1.0-alpha.2 and not present in any
production releases.

#### opt: move findConstantFilterCols to general_funcs.go

Release note: None


59779: flowinfra: deflake a test r=yuzefovich a=yuzefovich

Previously, a unit test could fail in rare circumstances when relocating
a range to a remote node, and now we will use SucceedsSoon to avoid
that. Also unskip the vectorized option.

Fixes: #59712

Release note: None

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
ajwerner (Contributor) commented

I do not know anything about this. I suspect it had to do with catch-up scans. Moving off my board.

ajwerner removed their assignment Mar 23, 2022
miretskiy (Contributor) commented

Stale.
