Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[xCluster] : Fix CDC Stream Deletion Race Condition #9877

Closed
nspiegelberg opened this issue Aug 30, 2021 · 3 comments
Closed

[xCluster] : Fix CDC Stream Deletion Race Condition #9877

nspiegelberg opened this issue Aug 30, 2021 · 3 comments
Assignees
Labels
area/docdb YugabyteDB core features

Comments

@nspiegelberg
Copy link
Contributor

Currently, there is a race condition on the Producer between CatalogManager::DeleteCDCStream & CDCServiceImpl::UpdateCheckpoint. The 'cdc_state' table entries are dynamically inserted by UpdateCheckpoint & the CDC stream is only validated on first access. This means that, when DeleteCDCStream runs and removes all entries from the 'cdc_state' table, a subsequent GetChanges request by the Consumer can recreate those entries even though the stream has been deleted on the Producer's Master. This race condition, in addition to leaving dangling entries, also ends up affecting the lag metrics, retention policy, and YW UI display. This was seen in a customer setup when the replication fell too far behind and we needed to recreate the setup.

@nspiegelberg
Copy link
Contributor Author

Talking with @rahuldesirazu , we sketched together 2 possible design solutions:

  1. Refactor CreateCDCStream to insert all the 'cdc_state' entries up front. We will also need a hook so tablet splitting also inserts a new entry. The UpdateCheckpoint entry would be updated to only insert IF_EXISTS, so it would error out on delete CDC stream.
  2. DeleteCDCStream uses RetryingTSRpcTask or TSHeartbeatResponsePB to update all TServers with the new state of the CDC Stream. Only delete 'cdc_state' & move the SysCatalog CDC entry from DELETING=>DELETED after this occurs.

Upon further inspection, we also need to handle cleanup of the in-memory state of CDC Metrics on the Tablet, update the CDCService cache, and Reset the cdc_min_replicated_index. #2 would make that easier.

@nspiegelberg
Copy link
Contributor Author

This race condition is referenced in a comment within #9685

@bmatican bmatican added the area/docdb YugabyteDB core features label Sep 15, 2021
@rahuldesirazu rahuldesirazu self-assigned this Nov 16, 2021
@hulien22
Copy link
Contributor

Upon further inspection, we also need to handle cleanup of the in-memory state of CDC Metrics on the Tablet, update the CDCService cache, and Reset the cdc_min_replicated_index. 2 would make that easier.

Encountered this issue of not handling cleanup of in-memory state on the tservers. This happened as part of a customer interaction where there was a delete_cdc_stream on a live replication setup, which then led to dangling entries in system.cdc_state and then high lag metrics. After deleting the dangling entries, the high lag metric sticks around, and is only reset upon a tserver restart.

rahuldesirazu added a commit that referenced this issue Jan 14, 2022
Summary:
Fix a race condition in xCluster stream deletion where cdc service upserts the checkpoint for cdc_state, causing insert of entries for streams entries. Fix this by making the update checkpoint operation a pure update and handling initial setup insert into the cdc_state table as part of master's CreateCDCStream RPC. Deletion is already handled in DeleteCDCStream so no changes are needed on the delete path.

We still want to let BootstrapProducer modify cdc_state with a unique opid and bootstrap id, so we don't modify cdc_state in CreateCDCStream if the call comes from bootstrap.

Test Plan: ybd release --cxx-test integration-tests_cdc_service-int-test --gtest_filter *TestDeleteCDCStream/1

Reviewers: nicolas, jhe

Reviewed By: jhe

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D14028
rahuldesirazu added a commit that referenced this issue Mar 17, 2022
…ition

Summary:
Fix a race condition in xCluster stream deletion where cdc service upserts the checkpoint for cdc_state, causing insert of entries for streams entries. Fix this by making the update checkpoint operation a pure update and handling initial setup insert into the cdc_state table as part of master's CreateCDCStream RPC. Deletion is already handled in DeleteCDCStream so no changes are needed on the delete path.

We still want to let BootstrapProducer modify cdc_state with a unique opid and bootstrap id, so we don't modify cdc_state in CreateCDCStream if the call comes from bootstrap.

Original Commit: D14028 / 98dab94

Test Plan: ybd release --cxx-test integration-tests_cdc_service-int-test --gtest_filter *TestDeleteCDCStream/1

Reviewers: jhe, nicolas

Reviewed By: jhe, nicolas

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16012
rahuldesirazu added a commit that referenced this issue Mar 17, 2022
Summary:
Fix a race condition in xCluster stream deletion where cdc service upserts the checkpoint for cdc_state, causing insert of entries for streams entries. Fix this by making the update checkpoint operation a pure update and handling initial setup insert into the cdc_state table as part of master's CreateCDCStream RPC. Deletion is already handled in DeleteCDCStream so no changes are needed on the delete path.

We still want to let BootstrapProducer modify cdc_state with a unique opid and bootstrap id, so we don't modify cdc_state in CreateCDCStream if the call comes from bootstrap.

Original Commit: D14028 / 98dab94

Test Plan: ybd release --cxx-test integration-tests_cdc_service-int-test --gtest_filter *TestDeleteCDCStream/1

Reviewers: nicolas, jhe

Reviewed By: nicolas, jhe

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16010
rahuldesirazu added a commit that referenced this issue Mar 17, 2022
…tion

Summary:
Fix a race condition in xCluster stream deletion where cdc service upserts the checkpoint for cdc_state, causing insert of entries for streams entries. Fix this by making the update checkpoint operation a pure update and handling initial setup insert into the cdc_state table as part of master's CreateCDCStream RPC. Deletion is already handled in DeleteCDCStream so no changes are needed on the delete path.

We still want to let BootstrapProducer modify cdc_state with a unique opid and bootstrap id, so we don't modify cdc_state in CreateCDCStream if the call comes from bootstrap.

Original Commit: D14028 / 98dab94

Test Plan: ybd release --cxx-test integration-tests_cdc_service-int-test --gtest_filter *TestDeleteCDCStream/1

Reviewers: nicolas, jhe

Reviewed By: nicolas, jhe

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16011
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docdb YugabyteDB core features
Projects
None yet
Development

No branches or pull requests

4 participants