[xCluster] Handle Pausing/Resuming Test Clusters with Replication #10084
Labels: area/docdb, kind/bug, priority/medium, xCluster
Jira Link: DB-4612
During local SDET testing, we ran into issues with Pause/Resume of a cluster that was set up for xCluster replication. The dev cluster had been paused for 2 days and the resume failed (https://yugabyte.slack.com/archives/C4141D60H/p1632120259192700). I was pulled in because of log spew. It appears to be a race between [1] the tablet log GC, which kicks in on startup and removes all logs, and [2] the Consumer, which calls GetChanges for the next OpID after the pause occurred.
The xCluster GC logic is in LogReader::GetSegmentPrefixNotIncluding, which provides the candidate files for GetSegmentsToGCUnlocked. Four variables control GC, as far as we care, applied in order of precedence.

The pause above triggered MaxTimePolicy, so the logs would have been GC'd regardless of cdc_max_replicated_index. It probably didn't seem like a big deal when we initially wrote it, and we likely weren't thinking of the paused-cluster use case. MinSpacePolicy seems like the important heuristic to keep here, and we should get rid of MaxTimePolicy to support this use case. We should probably keep it under a GFLAG in case there are downsides we're not seeing.
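To make the failure mode concrete, here is a minimal Python model of how a time-based policy can override the replicated index after a long pause. It is a sketch under stated assumptions, not the actual LogReader code: the `Segment` type, the `segments_to_gc` function, and the `enable_max_time_policy` switch are hypothetical names, and the retention values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    max_op_index: int       # highest op index stored in this WAL segment
    close_time_secs: float  # wall-clock time at which the segment was closed


def segments_to_gc(segments, now_secs, cdc_replicated_index,
                   retention_secs, min_segments_to_retain,
                   enable_max_time_policy=True):
    """Return the prefix of `segments` (oldest first) that GC may delete.

    Hypothetical model: a segment is a candidate if the consumer has already
    replicated everything in it, or if the time policy is enabled and the
    segment is older than the retention window.
    """
    candidates = []
    for seg in segments:
        fully_replicated = seg.max_op_index <= cdc_replicated_index
        expired = now_secs - seg.close_time_secs > retention_secs
        if fully_replicated or (enable_max_time_policy and expired):
            candidates.append(seg)
        else:
            break  # GC only ever removes a contiguous prefix of the log
    # Space-style floor: always leave at least N segments on disk.
    max_removable = max(0, len(segments) - min_segments_to_retain)
    return candidates[:max_removable]


# Consumer has replicated up to index 100; the cluster was paused for 2 days.
segs = [Segment(100, 0.0), Segment(200, 10.0), Segment(300, 20.0)]
two_days = 2 * 24 * 3600
four_hours = 4 * 3600

with_time_policy = segments_to_gc(segs, two_days, 100, four_hours, 1)
without_time_policy = segments_to_gc(segs, two_days, 100, four_hours, 1,
                                     enable_max_time_policy=False)
print([s.max_op_index for s in with_time_policy])     # [100, 200]
print([s.max_op_index for s in without_time_policy])  # [100]
```

In this model, the time policy deletes the segment ending at index 200 even though the consumer still needs op 101 from it, which mirrors the GetChanges failure after resume. With the switch off, only the fully replicated segment is removed.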
Additionally, cdc_wal_retention_time_secs seems to be necessary because there is currently no synchronization when the Master sets cdc_max_replicated_index on the TServers during Bootstrap. We raise the limit with an AlterTable, then we write the cdc_state table with the replicated OpIds. We should be able to consolidate these steps.