Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: Fix initial scan checkpointing #96995

Merged
merged 1 commit into from
Feb 13, 2023

Conversation

miretskiy
Copy link
Contributor

@miretskiy miretskiy commented Feb 11, 2023

An over than 2 year old change
(#71848) that added support for checkpointing during backfill after schema change, inadvertently broke initial scan checkpointing functionality

Exacerbating the problem, the existing test
TestChangefeedBackfillCheckpoint continued to work fine. The reason why it was passing was because the test was looking for a checkpoint whose timestamp matched backfill timestamp. The bug involved incorrect initialize/use of 0 timestamp. It just so happens, that after initial scan completes, the rangefeed starts, and the very first thing it does is to generate a 0 timestamp checkpoint. So, the test was observing this event, and continued to work.
This PR does not have a dedicated test because the existing tests work fine -- provided we ignore 0 timestamp checkpoint, which is what this PR does in addition to addressing the root cause of the bug.

Informs #96959

Release note (enterprise change): Fix a bug in changefeeds, where long running initial scans will fail to generate checkpoint. Failure to generate checkpoint is particularly bad if the changefeed restarts for whatever reason. Without checkpoints, the changefeed will restart from the beginning, and in the worst case, when exporting substantially sized tables, changefeed initial scan may have hard time completing.

@miretskiy miretskiy requested a review from a team as a code owner February 11, 2023 00:17
@miretskiy miretskiy requested review from HonoreDB and removed request for a team February 11, 2023 00:17
@blathers-crl
Copy link

blathers-crl bot commented Feb 11, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@samiskin samiskin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn, a comedy of errors indeed 🙃

@miretskiy
Copy link
Contributor Author

@samiskin FYI: Had to make additional changes to make scheduled changefeeds work.
This addition, essentially, ensures that when we are doing initial_scan_only work, then when we scan ranges, we emit
EXIT boundary, as opposed to NONE. Prior to the latest version, we would first scan all ranges and emit NONE boundary, then if doing initial scan=only, we would also emit EXIT boundary. However, emitting boundary at the same time as the previous frontier is pretty much a no-op; and with the original bug fix it became obvious as the tests would time out
(since they wouldn't flush EXIT boundary because frontier wouldn't have advanced).

Furthermore, I removed "ca.cancel()" call from the kvfeed go routine -- please take some time to think about this 1 liner
change. I think it's subtle; and I think there was a chance that we would fail to flush last bits of data prior to completing initial scan only work; What I think could happen is that kvfeed waits for the buffer to drain; as soon as we get the last
element off of that buffer, (tick() function), kvFeed exits; it then emits result to ca.errCh, and then called ca.cancel(). However, the context used in many places (all of them, I think) is tied to ca.Ctx() -- therefore, it's possible that that very last element, that might have triggered frontier flush, would not flush successfully due to context cancellation.
Now, as you say, comedy of errors.. I think it would have worked fine when we emitted NONE followed by EXIT frontiers; when we advanced due to NONE, we would emit all pending elements, then when we see EXIT frontiers -- there wouldn't be anything to flush so, we probably wouldn't see context cancellation. I do think, however, that that ca.cancel() call was in error.

An over than 2 year old change
(cockroachdb#71848)
that added support for checkpointing during backfill after schema change,
inadvertently broke initial scan checkpointing funcitonality

Exacerbating the problem, the existing test
`TestChangefeedBackfillCheckpoint` continued to work fine.
Treason why it was passing was because the test was looking
for a checkpoint whose timestamp matched bacfill timestamp.
The bug involved incorrect initialize/use of 0 timestamp.
It just so happens, that after initial scan completes, the
rangefeed starts, and the very first thing it does is to
generate a 0 timestamp checkpoint.  So, the test was
observing this event, and continued to work.
This PR does not have a dedicated test because the existing
tests work fine -- provided we ignore 0 timestamp checkpoint,
which is what this PR does in addition to addressing
the root cause of the bug.

Informs cockroachdb#96959

Release note (enterprise change): Fix a bug in changefeeds, where
long running initial scans will fail to generate checkpoint.
Failure to generate checkpoint is particularly bad if the
changefeed restarts for whatever reason.  Without checkpoints,
the changefeed will restart from the beginning, and in the worst
case, when exporting substantially sized tables, changefeed
initial scan may have hard time completing.
@miretskiy
Copy link
Contributor Author

Another test kept failing; the issue was that once expensive checkpoint was written, then subsequent high watermark checkpoints won't happen for a while. Latest change adds a separate timer to so that highwater checkpoints are independent from span level ones.

Copy link
Contributor

@samiskin samiskin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I think could happen is that kvfeed waits for the buffer to drain; as soon as we get the last element off of that buffer, (tick() function), kvFeed exits; it then emits result to ca.errCh, and then called ca.cancel()

Hm yeah, perhaps we should eventually move this initial-scan-only to not use errors in its happy-path to avoid having to worry about this type of "is this a 'kill everything and thats okay' error or a 'everything must flush first' kind of error"

@miretskiy
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 13, 2023

Build succeeded:

@craig craig bot merged commit 0c2cc47 into cockroachdb:master Feb 13, 2023
@blathers-crl
Copy link

blathers-crl bot commented Feb 13, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 495dc98 to blathers/backport-release-22.1-96995: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants