[xCluster] Create API to Wait for Replication Drain #10978
Labels: area/docdb (YugabyteDB core features), kind/enhancement (This is an enhancement of an existing feature), priority/medium (Medium priority issue)

Comments
rahuldesirazu added the area/docdb (YugabyteDB core features) and priority/high (High Priority) labels on Jan 3, 2022
This API would need to disable splitting while waiting for the drain or take that into account.
nspiegelberg changed the title from "[xCluster] Create API WaitForReplicationLag == 0" to "[xCluster] Create API to Wait for Replication Drain" on Mar 30, 2022
@hari90 taking this one back for now, as it might be a bit more work than initially thought. I'll pass you a separate one as an intro to producer-side code first, instead.
yugabyte-ci added the kind/bug (This issue is a bug) and priority/medium (Medium priority issue) labels on Jun 8, 2022
LambdaWL added a commit that referenced this issue on Jul 13, 2022:
…logManager and CDCService

Summary:
Issue: #10978
Design Doc: https://docs.google.com/document/d/1HB6zT2MX3NmKnlhhhJYbuCisVV-JKvgSO9gJAhul9OE/edit

As a first step, implemented the API logic on CatalogManager and CDCService.
- CatalogManager: repeatedly sends RPCs to the relevant TServers through the API on CDCService.
- CDCService: checks `cdc_metrics` for tablets in order to determine whether replication on a tablet has caught up (to some user-specified point in time).

**Notes on how the notion of caught-up is decided:**

A new `cdc_metric` named `last_caughtup_physicaltime` is added. This metric records a point in time up to which we can safely assume the consumer has caught up with the producer; in other words, it tracks the progress the consumer has made in the replication. The metric is updated as follows:
- If the consumer currently has the latest record on the producer, i.e. the lag is zero, update the metric to `GetCurrentTimeMicros()`.
- Otherwise, update the metric to its own value or `last_checkpoint_physicaltime`, whichever is larger. This ensures the metric value is always non-decreasing.

The update happens in `GetChanges()`, since that is the only place where the consumer can make progress in the replication. With `last_caughtup_physicaltime`, the logic of the API is greatly simplified: it merely compares the user-specified timestamp (defaulting to the current time if not provided) against this new metric.

**Since the API logic now relies entirely on the new cdc metric, if producer leadership changes and the consumer never sends GetChanges from that point on, the metric would be empty and the API would view the tablet as not drained. For this reason, the API only works while replication is still on.**

Test Plan:
Created three unit tests, command:
```
./yb_build.sh --cxx-test integration-tests_twodc-test --test-timeout-sec 1200 --gtest_filter "*TwoDCTestWaitForReplicationDrain*" -n 20
```
The tests cover three scenarios:
- Consumer is unable to catch up because GetChanges is blocked (added a test flag `block_get_changes` in `cdc_service.cc`)
- Consumer is unable to catch up because tservers are shut down
- User specifies a point in time to check for replication drain

Reviewers: rahuldesirazu, nicolas, jhe
Reviewed By: jhe
Subscribers: ybase, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D17806
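To make the caught-up rule above concrete, the following is a minimal C++ sketch of the metric update and drain check. It is an illustration only, not the actual CatalogManager/CDCService code: the struct, the functions `UpdateCaughtUpTime` and `IsTabletDrained`, and the local `GetCurrentTimeMicros` stand-in are hypothetical; only the metric names `last_caughtup_physicaltime` and `last_checkpoint_physicaltime` come from the commit message above.
```
// Minimal sketch (not YugabyteDB code) of the update rule for the
// last_caughtup_physicaltime metric described in the commit message above.
#include <algorithm>
#include <chrono>
#include <cstdint>

// Stand-in for the real GetCurrentTimeMicros() helper.
int64_t GetCurrentTimeMicros() {
  return std::chrono::duration_cast<std::chrono::microseconds>(
             std::chrono::system_clock::now().time_since_epoch())
      .count();
}

// Hypothetical per-tablet metric holder; only the field names mirror the
// metrics named above.
struct TabletReplicationMetrics {
  int64_t last_caughtup_physicaltime = 0;    // micros; non-decreasing by construction
  int64_t last_checkpoint_physicaltime = 0;  // micros of the last checkpointed record
};

// Invoked on the GetChanges() path, the only place the consumer makes progress.
void UpdateCaughtUpTime(TabletReplicationMetrics* m, bool consumer_has_latest_record) {
  if (consumer_has_latest_record) {
    // Lag is zero: the consumer is caught up as of "now".
    m->last_caughtup_physicaltime = GetCurrentTimeMicros();
  } else {
    // Otherwise keep the larger of the current value and the last checkpoint
    // time, so the metric never moves backwards.
    m->last_caughtup_physicaltime =
        std::max(m->last_caughtup_physicaltime, m->last_checkpoint_physicaltime);
  }
}

// The drain check then reduces to comparing the metric against the target time
// (the user-specified timestamp, defaulting to the current time).
bool IsTabletDrained(const TabletReplicationMetrics& m, int64_t target_time_micros) {
  return m.last_caughtup_physicaltime >= target_time_micros;
}
```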
LambdaWL added a commit that referenced this issue on Jul 15, 2022:
…YB-Admin

Summary:
Issue: #10978
Design Doc: https://docs.google.com/document/d/1HB6zT2MX3NmKnlhhhJYbuCisVV-JKvgSO9gJAhul9OE/edit

As the second step, implemented the API on yb-admin. The CLI command is:
```
yb-admin wait_for_replication_drain <comma_separated_list_of_stream_ids> [<timestamp> | minus <interval>]
```
where `minus <interval>` uses the same format as in PITR (documentation [[ https://docs.yugabyte.com/preview/explore/cluster-management/point-in-time-recovery-ycql/ | here ]], or see `restore_snapshot_schedule` in `yb-admin_cli_ent.cc`).

If all streams are caught up, the API prints `All replications are caught-up.` to the console. Otherwise, it prints the non-caught-up streams in the following format:
```
Found undrained replications:
- Under Stream <stream_id>:
  - Tablet: <tablet_id>
  - Tablet: <tablet_id>
  // ......
// ......
```

Test Plan:
```
./yb_build.sh --cxx-test yb-admin-test_ent --gtest_filter XClusterAdminCliTest.TestWaitForReplicationDrain -n 20
```

Reviewers: rahuldesirazu, nicolas, jhe
Reviewed By: jhe
Subscribers: slingam, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D18363
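For illustration, an invocation might look like the sketch below. This is a hypothetical example, not taken from the issue: the master addresses and stream IDs are placeholders, and the `-master_addresses` flag and `minus` interval syntax are assumed to follow the standard yb-admin and PITR conventions linked above.
```
# Hypothetical example: wait until both streams have drained up to the current time.
# Replace the master addresses and stream IDs with real values from your universe.
yb-admin -master_addresses 172.16.0.1:7100,172.16.0.2:7100,172.16.0.3:7100 \
    wait_for_replication_drain 6143021d44034826a0d1720c1bcbe902,a9fc4f49cc7b4abea6e81a3bca9e4d2d

# Hypothetical example: check that the stream has drained up to one minute in the past,
# using the PITR-style relative interval syntax.
yb-admin -master_addresses 172.16.0.1:7100,172.16.0.2:7100,172.16.0.3:7100 \
    wait_for_replication_drain 6143021d44034826a0d1720c1bcbe902 minus 60s
```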
yugabyte-ci added the kind/enhancement (This is an enhancement of an existing feature) label and removed the kind/bug (This issue is a bug) label on Jul 30, 2022
Jira Link: DB-1155
Description
For XCluster, we don't have an explicit API to verify when the Producer and Consumer are in sync. Some lag is expected in normal operation, since XCluster replication is asynchronous. However, there are some use cases where we want to know whether the Producer has completely sent all operations to the Consumer (a drain).
Additionally, if we extended this "drain" API to accept an OpID or Timestamp, we could use it to wait for a particular catch-up window when bootstrapping XCluster.
Implementation Notes: