-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[xCluster] : Fix CDC Stream Deletion Race Condition #9877
Comments
Talking with @rahuldesirazu , we sketched together 2 possible design solutions:
Upon further inspection, we also need to handle cleanup of the in-memory state of CDC Metrics on the Tablet, update the CDCService cache, and Reset the cdc_min_replicated_index. #2 would make that easier. |
This race condition is referenced in a comment within #9685 |
Encountered this issue of not handling cleanup of in-memory state on the tservers. This happened as part of a customer interaction where there was a delete_cdc_stream on a live replication setup, which then led to dangling entries in system.cdc_state and then high lag metrics. After deleting the dangling entries, the high lag metric sticks around, and is only reset upon a tserver restart. |
Summary: Fix a race condition in xCluster stream deletion where cdc service upserts the checkpoint for cdc_state, causing insert of entries for streams entries. Fix this by making the update checkpoint operation a pure update and handling initial setup insert into the cdc_state table as part of master's CreateCDCStream RPC. Deletion is already handled in DeleteCDCStream so no changes are needed on the delete path. We still want to let BootstrapProducer modify cdc_state with a unique opid and bootstrap id, so we don't modify cdc_state in CreateCDCStream if the call comes from bootstrap. Test Plan: ybd release --cxx-test integration-tests_cdc_service-int-test --gtest_filter *TestDeleteCDCStream/1 Reviewers: nicolas, jhe Reviewed By: jhe Subscribers: ybase, bogdan Differential Revision: https://phabricator.dev.yugabyte.com/D14028
…ition Summary: Fix a race condition in xCluster stream deletion where cdc service upserts the checkpoint for cdc_state, causing insert of entries for streams entries. Fix this by making the update checkpoint operation a pure update and handling initial setup insert into the cdc_state table as part of master's CreateCDCStream RPC. Deletion is already handled in DeleteCDCStream so no changes are needed on the delete path. We still want to let BootstrapProducer modify cdc_state with a unique opid and bootstrap id, so we don't modify cdc_state in CreateCDCStream if the call comes from bootstrap. Original Commit: D14028 / 98dab94 Test Plan: ybd release --cxx-test integration-tests_cdc_service-int-test --gtest_filter *TestDeleteCDCStream/1 Reviewers: jhe, nicolas Reviewed By: jhe, nicolas Subscribers: bogdan, ybase Differential Revision: https://phabricator.dev.yugabyte.com/D16012
Summary: Fix a race condition in xCluster stream deletion where cdc service upserts the checkpoint for cdc_state, causing insert of entries for streams entries. Fix this by making the update checkpoint operation a pure update and handling initial setup insert into the cdc_state table as part of master's CreateCDCStream RPC. Deletion is already handled in DeleteCDCStream so no changes are needed on the delete path. We still want to let BootstrapProducer modify cdc_state with a unique opid and bootstrap id, so we don't modify cdc_state in CreateCDCStream if the call comes from bootstrap. Original Commit: D14028 / 98dab94 Test Plan: ybd release --cxx-test integration-tests_cdc_service-int-test --gtest_filter *TestDeleteCDCStream/1 Reviewers: nicolas, jhe Reviewed By: nicolas, jhe Subscribers: bogdan, ybase Differential Revision: https://phabricator.dev.yugabyte.com/D16010
…tion Summary: Fix a race condition in xCluster stream deletion where cdc service upserts the checkpoint for cdc_state, causing insert of entries for streams entries. Fix this by making the update checkpoint operation a pure update and handling initial setup insert into the cdc_state table as part of master's CreateCDCStream RPC. Deletion is already handled in DeleteCDCStream so no changes are needed on the delete path. We still want to let BootstrapProducer modify cdc_state with a unique opid and bootstrap id, so we don't modify cdc_state in CreateCDCStream if the call comes from bootstrap. Original Commit: D14028 / 98dab94 Test Plan: ybd release --cxx-test integration-tests_cdc_service-int-test --gtest_filter *TestDeleteCDCStream/1 Reviewers: nicolas, jhe Reviewed By: nicolas, jhe Subscribers: bogdan, ybase Differential Revision: https://phabricator.dev.yugabyte.com/D16011
Currently, there is a race condition on the Producer between CatalogManager::DeleteCDCStream & CDCServiceImpl::UpdateCheckpoint. The 'cdc_state' table entries are dynamically inserted by UpdateCheckpoint & the CDC stream is only validated on first access. This means that, when DeleteCDCStream runs and removes all entries from the 'cdc_state' table, a subsequent GetChanges request by the Consumer can recreate those entries even though the stream has been deleted on the Producer's Master. This race condition, in addition to leaving dangling entries, also ends up affecting the lag metrics, retention policy, and YW UI display. This was seen in a customer setup when the replication fell too far behind and we needed to recreate the setup.
The text was updated successfully, but these errors were encountered: