[xCluster] Handle Pausing/Resuming Test Clusters with Replication #10084

nspiegelberg · 2021-09-23T01:10:06Z

Jira Link: DB-4612
During local SDET testing, we ran into issues with Pause/Resume of a cluster that was set up for xCluster Replication. The Dev Cluster had been paused for 2 days and the resume failed (https://yugabyte.slack.com/archives/C4141D60H/p1632120259192700). I was pulled in because of log spew. It appears to be a race between [1] the Tablet Log GC, which kicks in on startup and removes all logs, and [2] the Consumer, which calls GetChanges on the next OpID after the point where the pause occurred.

The xCluster GC logic is in LogReader::GetSegmentPrefixNotIncluding, which provides the candidate files for GetSegmentsToGCUnlocked. As far as we care, 4 variables control GC (listed in order of precedence):

MinSpacePolicy. CDC Policy. Include a file for GC if our free disk space drops below a GFLAG, log_stop_retaining_min_disk_mb (100GB).
MaxTimePolicy. CDC Policy. Include all files for GC that are older than a GFLAG, log_max_seconds_to_retain (24hr).
cdc_max_replicated_index. CDC Policy. Include all files for GC that CDC has read past.
wal_retention_secs. Global Policy. This is applied at the higher layer: skip candidate files newer than log_min_seconds_to_retain (15min). xCluster overrides this to cdc_wal_retention_time_secs (4hr).
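
To make the precedence concrete, here is a rough Python model of the four policies as described above. It is only a sketch of my reading of this issue, not the actual C++ in LogReader: the Segment and GcFlags types and the gc_candidates function are hypothetical names, the defaults are the values quoted above, and segments are assumed to be ordered oldest first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    name: str
    max_op_index: int  # highest op index stored in this segment
    age_secs: int      # time since the segment was last written


@dataclass
class GcFlags:
    # Defaults as quoted in this issue; not read from a real cluster.
    log_stop_retaining_min_disk_mb: int = 100 * 1024
    log_max_seconds_to_retain: int = 24 * 3600
    wal_retention_secs: int = 4 * 3600  # the cdc_wal_retention_time_secs override


def gc_candidates(segments: List[Segment], cdc_max_replicated_index: int,
                  free_disk_mb: int, flags: GcFlags) -> List[str]:
    """Return names of the oldest-first prefix of segments eligible for GC."""
    prefix = []
    for seg in segments:  # assumed ordered oldest first
        low_space = free_disk_mb < flags.log_stop_retaining_min_disk_mb  # MinSpacePolicy
        too_old = seg.age_secs > flags.log_max_seconds_to_retain         # MaxTimePolicy
        cdc_done = seg.max_op_index <= cdc_max_replicated_index          # CDC has read past it
        if not (low_space or too_old or cdc_done):
            break  # the first segment that must be kept ends the prefix
        prefix.append(seg)
    # Higher layer: skip candidates newer than wal_retention_secs.
    return [s.name for s in prefix if s.age_secs >= flags.wal_retention_secs]


# Pause scenario: cluster paused for 2 days, consumer still needs index 150 onward.
two_days = 2 * 24 * 3600
segments = [Segment("wal-1", 100, two_days), Segment("wal-2", 200, two_days)]
collected = gc_candidates(segments, cdc_max_replicated_index=149,
                          free_disk_mb=500 * 1024, flags=GcFlags())
print(collected)  # ['wal-1', 'wal-2']
```

In this model wal-2 still holds ops the consumer has not read, yet it is collected because its age alone satisfies MaxTimePolicy, which matches the failure seen after the 2-day pause.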

The above pause triggered MaxTimePolicy, so the logs would’ve been GC’d regardless of cdc_max_replicated_index. MaxTimePolicy probably didn’t seem like a big deal when we initially wrote it, and we likely weren’t thinking of the pause cluster use case. MinSpacePolicy seems like the important heuristic to keep here, and we should get rid of MaxTimePolicy to support this use case. We should probably keep it under a GFLAG in case there are some downsides we're not seeing.

Additionally: cdc_wal_retention_time_secs seems to be necessary because there’s currently no synchronization in setting cdc_max_replicated_index on the TServers from Master during Bootstrap. We raise the limit with an AlterTable, then we write the cdc_state table with the Replicated OpIds. We should be able to consolidate these two steps.


nspiegelberg · 2022-04-04T23:22:34Z

When looking into this, we should also audit the design of FLAGS_cdc_min_replicated_index_considered_stale_secs. This looks like it could accidentally GC valid logs if we're not careful.

nspiegelberg · 2022-08-16T22:03:41Z

One thought on how to support this use case:

Size policy kicks in immediately.
Time policy isn't used for the first N minutes after the process is started.
Do not collect on Time policy if CDC is updating its checkpoint and the overall retention size is decreasing.
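
A minimal sketch of how the Time policy gating in the three rules above might look, written in Python purely for illustration. RetentionSample, time_policy_enabled, and grace_secs (the "first N minutes") are hypothetical names, and comparing two consecutive samples is just one assumed way to detect "CDC is updating its checkpoint" and "retention size is decreasing". The Size policy is not modeled because it is never gated.

```python
from dataclasses import dataclass


@dataclass
class RetentionSample:
    cdc_checkpoint_index: int  # latest checkpoint reported by CDC
    retained_bytes: int        # total WAL bytes currently retained


def time_policy_enabled(uptime_secs: float, grace_secs: float,
                        prev: RetentionSample, curr: RetentionSample) -> bool:
    """Decide whether the Time policy may collect segments right now."""
    # Rule 2: no Time policy during the first N minutes after process start.
    if uptime_secs < grace_secs:
        return False
    # Rule 3: hold off while CDC is catching up and the backlog is shrinking.
    checkpoint_advancing = curr.cdc_checkpoint_index > prev.cdc_checkpoint_index
    backlog_shrinking = curr.retained_bytes < prev.retained_bytes
    if checkpoint_advancing and backlog_shrinking:
        return False
    return True


# Just resumed: still inside a 10 minute grace window.
just_started = time_policy_enabled(
    uptime_secs=60, grace_secs=600,
    prev=RetentionSample(100, 5_000), curr=RetentionSample(100, 5_000))

# Past the grace window, consumer is draining the backlog.
catching_up = time_policy_enabled(
    uptime_secs=3600, grace_secs=600,
    prev=RetentionSample(100, 5_000), curr=RetentionSample(180, 4_000))

# Past the grace window, consumer is stalled.
stalled = time_policy_enabled(
    uptime_secs=3600, grace_secs=600,
    prev=RetentionSample(100, 5_000), curr=RetentionSample(100, 5_200))
```

Under these assumptions the Time policy only fires in the stalled case, so a resumed cluster gets a chance to drain its backlog before old segments are collected.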

nspiegelberg added the area/cdc Change Data Capture label Sep 23, 2021

nspiegelberg mentioned this issue Oct 16, 2021

[docdb] xCluster Roadmap Firehose #10319

Closed

nspiegelberg mentioned this issue Feb 15, 2022

[DocDB][XCluster] Bootstrap Needed API #10645

Open

bmatican mentioned this issue Mar 23, 2022

[xCluster] Roadmap for 2.13 #11015

Closed

bmatican assigned lingamsandeep Apr 5, 2022

bmatican added xCluster Label for xCluster related issues/improvements and removed area/cdc Change Data Capture labels Apr 5, 2022

rthallamko3 added the area/docdb YugabyteDB core features label Dec 29, 2022

yugabyte-ci added kind/bug (This issue is a bug), priority/medium (Medium priority issue), and status/awaiting-triage (Issue awaiting triage) and removed status/awaiting-triage (Issue awaiting triage) labels Dec 29, 2022
