changefeedccl: Fix initial scan checkpointing #96995

miretskiy · 2023-02-11T00:17:08Z

A change made more than two years ago (#71848), which added support for checkpointing during backfills after schema changes, inadvertently broke initial scan checkpointing.

Exacerbating the problem, the existing test TestChangefeedBackfillCheckpoint continued to pass. It passed because the test looked for a checkpoint whose timestamp matched the backfill timestamp, and the bug involved incorrect initialization and use of a 0 timestamp. It just so happens that after the initial scan completes, the rangefeed starts, and the very first thing it does is generate a 0 timestamp checkpoint. The test observed this event and so continued to pass.
This PR does not have a dedicated test because the existing tests work fine, provided we ignore the 0 timestamp checkpoint, which is what this PR does in addition to addressing the root cause of the bug.
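
To make the false positive concrete, here is a minimal sketch of how a timestamp-matching check can be satisfied by a 0 timestamp checkpoint. It is not the changefeedccl test code: `Timestamp`, `Checkpoint`, and `findCheckpoint` are hypothetical stand-ins, and the assumption that the expected backfill timestamp was itself left at zero is one reading of the description above.

```go
package main

import "fmt"

// Timestamp is a hypothetical stand-in for an HLC timestamp; zero means "unset".
type Timestamp int64

// IsEmpty reports whether the timestamp was never set.
func (t Timestamp) IsEmpty() bool { return t == 0 }

// Checkpoint is a hypothetical, simplified checkpoint record.
type Checkpoint struct {
	Timestamp Timestamp
}

// findCheckpoint returns the first checkpoint whose timestamp equals want.
// When ignoreZero is set, 0 timestamp checkpoints are skipped, which is the
// test-side behavior the PR description mentions.
func findCheckpoint(cps []Checkpoint, want Timestamp, ignoreZero bool) (Checkpoint, bool) {
	for _, cp := range cps {
		if ignoreZero && cp.Timestamp.IsEmpty() {
			continue
		}
		if cp.Timestamp == want {
			return cp, true
		}
	}
	return Checkpoint{}, false
}

func main() {
	// Only the rangefeed's initial 0 timestamp checkpoint is observed.
	observed := []Checkpoint{{Timestamp: 0}}
	var backfillTS Timestamp // assumed to be left at zero by the bug

	_, ok := findCheckpoint(observed, backfillTS, false)
	fmt.Println("naive check finds a match:", ok)

	_, ok = findCheckpoint(observed, backfillTS, true)
	fmt.Println("check ignoring zero finds a match:", ok)
}
```

With the zero checkpoint ignored, the check can only succeed on a checkpoint that was actually produced during the scan.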

Informs #96959

Release note (enterprise change): Fix a bug in changefeeds where long-running initial scans fail to generate checkpoints. Failure to generate checkpoints is particularly bad if the changefeed restarts for whatever reason. Without checkpoints, the changefeed restarts from the beginning, and in the worst case, when exporting large tables, the changefeed initial scan may have a hard time completing.

blathers-crl · 2023-02-11T00:17:13Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

cockroach-teamcity · 2023-02-11T00:17:18Z

This change is

samiskin

Damn, a comedy of errors indeed 🙃

miretskiy · 2023-02-12T14:50:32Z

@samiskin FYI: I had to make additional changes to make scheduled changefeeds work.
This addition ensures that when we are doing initial_scan_only work, we emit an EXIT boundary, as opposed to NONE, as we scan ranges. Prior to the latest version, we would first scan all ranges and emit a NONE boundary; then, if doing initial scan=only, we would also emit an EXIT boundary. However, emitting a boundary at the same timestamp as the previous frontier is pretty much a no-op, and with the original bug fix this became obvious because the tests would time out (they wouldn't flush the EXIT boundary because the frontier wouldn't have advanced).

Furthermore, I removed the "ca.cancel()" call from the kvfeed goroutine. Please take some time to think about this one-line change. I think it's subtle, and I think there was a chance that we would fail to flush the last bits of data prior to completing initial-scan-only work. What I think could happen is that kvfeed waits for the buffer to drain; as soon as we get the last element off of that buffer (in the tick() function), kvFeed exits; it then emits the result to ca.errCh and calls ca.cancel(). However, the context used in many places (all of them, I think) is tied to ca.Ctx(), so it's possible that the very last element, which might have triggered a frontier flush, would not flush successfully due to context cancellation.
Now, as you say, a comedy of errors. I think it would have worked fine when we emitted NONE followed by EXIT frontiers: when we advanced due to NONE, we would emit all pending elements, and by the time we saw the EXIT frontier there wouldn't be anything to flush, so we probably wouldn't see context cancellation. I do think, however, that the ca.cancel() call was in error.
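
The hazard described here can be reduced to a small, deterministic sketch. This is not the change aggregator code: `flush` and `finish` are hypothetical, and the sketch only assumes that the final flush honors the same context that the producer cancels on exit.

```go
package main

import (
	"context"
	"fmt"
)

// flush is a hypothetical sink flush. Like most context-aware I/O, it
// refuses to do any work once its context has been canceled.
func flush(ctx context.Context, pending []string) (int, error) {
	if err := ctx.Err(); err != nil {
		return 0, err
	}
	return len(pending), nil
}

// finish models the end of an initial-scan-only run: the feed has exited
// and the last buffered events still need to be flushed. cancelFirst
// mimics the early cancel call winning the race against that flush.
func finish(cancelFirst bool) (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	pending := []string{"last-row", "exit-boundary"}
	if cancelFirst {
		cancel() // producer tears down the shared context on its way out
	}
	return flush(ctx, pending)
}

func main() {
	n, err := finish(true)
	fmt.Println("cancel before flush:", n, err)

	n, err = finish(false)
	fmt.Println("flush before cancel:", n, err)
}
```

When cancellation wins, nothing is flushed and the only signal is a context error, which is easy to mistake for a normal shutdown.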

miretskiy · 2023-02-12T23:10:44Z

Another test kept failing. The issue was that once an expensive checkpoint was written, subsequent high-watermark checkpoints wouldn't happen for a while. The latest change adds a separate timer so that highwater checkpoints are independent of span-level ones.
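
A rough sketch of the idea of independent timers follows. It is not the actual change: `rateLimiter`, `checkpointer`, and the intervals are hypothetical, chosen only to show that throttling one kind of checkpoint need not delay the other.

```go
package main

import (
	"fmt"
	"time"
)

// rateLimiter permits an action at most once per interval.
type rateLimiter struct {
	interval time.Duration
	last     time.Time
}

// allow reports whether the action may run at now, recording it if so.
func (r *rateLimiter) allow(now time.Time) bool {
	if !r.last.IsZero() && now.Sub(r.last) < r.interval {
		return false
	}
	r.last = now
	return true
}

// checkpointer keeps one limiter per checkpoint kind, so an expensive
// span-level checkpoint does not push back the next highwater checkpoint.
type checkpointer struct {
	spanLevel rateLimiter
	highwater rateLimiter
}

// maybeCheckpoint returns the kind of checkpoint written at now, or "none".
func (c *checkpointer) maybeCheckpoint(now time.Time, highwaterAdvanced bool) string {
	limiter, kind := &c.spanLevel, "span-level"
	if highwaterAdvanced {
		limiter, kind = &c.highwater, "highwater"
	}
	if limiter.allow(now) {
		return kind
	}
	return "none"
}

// demo replays a short, fixed sequence of events against the checkpointer.
func demo() []string {
	start := time.Date(2023, 2, 12, 0, 0, 0, 0, time.UTC)
	c := checkpointer{
		spanLevel: rateLimiter{interval: 10 * time.Minute}, // illustrative values
		highwater: rateLimiter{interval: 30 * time.Second},
	}
	return []string{
		c.maybeCheckpoint(start, false),                     // first span-level checkpoint
		c.maybeCheckpoint(start.Add(time.Second), true),     // not blocked by the span-level timer
		c.maybeCheckpoint(start.Add(2*time.Second), false),  // span-level is throttled
		c.maybeCheckpoint(start.Add(5*time.Second), true),   // highwater is throttled
		c.maybeCheckpoint(start.Add(time.Minute), true),     // highwater interval has elapsed
	}
}

func main() {
	fmt.Println(demo())
}
```

With a single shared limiter, the second call would have been suppressed for the full span-level interval.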

samiskin

What I think could happen is that kvfeed waits for the buffer to drain; as soon as we get the last element off of that buffer, (tick() function), kvFeed exits; it then emits result to ca.errCh, and then called ca.cancel()

Hm yeah, perhaps we should eventually move this initial-scan-only path to not use errors in its happy path, to avoid having to worry about whether something is a "kill everything and that's okay" error or an "everything must flush first" kind of error.

miretskiy · 2023-02-13T15:25:08Z

bors r+

craig · 2023-02-13T16:41:46Z

Build succeeded:

Bazel Essential CI (Cockroach)

blathers-crl · 2023-02-13T16:42:07Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from 495dc98 to blathers/backport-release-22.1-96995: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

miretskiy requested a review from a team as a code owner February 11, 2023 00:17

miretskiy requested review from HonoreDB and removed request for a team February 11, 2023 00:17

miretskiy added backport-22.1.x labels Feb 11, 2023

samiskin approved these changes Feb 11, 2023

miretskiy force-pushed the checkpoint branch from 7441ea5 to b3d89a6 Compare February 12, 2023 03:48

miretskiy force-pushed the checkpoint branch from b3d89a6 to 495dc98 Compare February 12, 2023 22:29

samiskin approved these changes Feb 13, 2023

craig bot merged commit 0c2cc47 into cockroachdb:master Feb 13, 2023

blathers-crl bot mentioned this pull request Feb 13, 2023

release-22.2: changefeedccl: Fix initial scan checkpointing #97049

Merged

miretskiy mentioned this pull request Feb 13, 2023

release-22.1: changefeedccl: Fix initial scan checkpointing #97052

Merged

cockroach-teamcity mentioned this pull request Feb 14, 2023

PR #96995 - changefeedccl: Fix initial scan checkpointing cockroachdb/docs#16248

Closed
