-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changefeedccl: Fix initial scan checkpointing #96995
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Damn, a comedy of errors indeed 🙃
7441ea5
to
b3d89a6
Compare
@samiskin FYI: Had to make additional changes to make scheduled changefeeds work. Furthermore, I removed "ca.cancel()" call from the kvfeed go routine -- please take some time to think about this 1 liner |
An over than 2 year old change (cockroachdb#71848) that added support for checkpointing during backfill after schema change, inadvertently broke initial scan checkpointing funcitonality Exacerbating the problem, the existing test `TestChangefeedBackfillCheckpoint` continued to work fine. Treason why it was passing was because the test was looking for a checkpoint whose timestamp matched bacfill timestamp. The bug involved incorrect initialize/use of 0 timestamp. It just so happens, that after initial scan completes, the rangefeed starts, and the very first thing it does is to generate a 0 timestamp checkpoint. So, the test was observing this event, and continued to work. This PR does not have a dedicated test because the existing tests work fine -- provided we ignore 0 timestamp checkpoint, which is what this PR does in addition to addressing the root cause of the bug. Informs cockroachdb#96959 Release note (enterprise change): Fix a bug in changefeeds, where long running initial scans will fail to generate checkpoint. Failure to generate checkpoint is particularly bad if the changefeed restarts for whatever reason. Without checkpoints, the changefeed will restart from the beginning, and in the worst case, when exporting substantially sized tables, changefeed initial scan may have hard time completing.
b3d89a6
to
495dc98
Compare
Another test kept failing; the issue was that once expensive checkpoint was written, then subsequent high watermark checkpoints won't happen for a while. Latest change adds a separate timer to so that highwater checkpoints are independent from span level ones. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I think could happen is that kvfeed waits for the buffer to drain; as soon as we get the last element off of that buffer, (tick() function), kvFeed exits; it then emits result to ca.errCh, and then called ca.cancel()
Hm yeah, perhaps we should eventually move this initial-scan-only to not use errors in its happy-path to avoid having to worry about this type of "is this a 'kill everything and thats okay' error or a 'everything must flush first' kind of error"
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 495dc98 to blathers/backport-release-22.1-96995: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
An over than 2 year old change
(#71848) that added support for checkpointing during backfill after schema change, inadvertently broke initial scan checkpointing functionality
Exacerbating the problem, the existing test
TestChangefeedBackfillCheckpoint
continued to work fine. The reason why it was passing was because the test was looking for a checkpoint whose timestamp matched backfill timestamp. The bug involved incorrect initialize/use of 0 timestamp. It just so happens, that after initial scan completes, the rangefeed starts, and the very first thing it does is to generate a 0 timestamp checkpoint. So, the test was observing this event, and continued to work.This PR does not have a dedicated test because the existing tests work fine -- provided we ignore 0 timestamp checkpoint, which is what this PR does in addition to addressing the root cause of the bug.
Informs #96959
Release note (enterprise change): Fix a bug in changefeeds, where long running initial scans will fail to generate checkpoint. Failure to generate checkpoint is particularly bad if the changefeed restarts for whatever reason. Without checkpoints, the changefeed will restart from the beginning, and in the worst case, when exporting substantially sized tables, changefeed initial scan may have hard time completing.