[Segment Replication] Peer recovery checkpoint publication invariants #3923

Bukhtawar · 2022-07-15T15:23:46Z

Is your feature request related to a problem? Please describe.
With segment replication, the primary publishes checkpoints on every OpenSearch refresh/flush. Soon after the relocation hand-off completes, the peer primary source switches to replica mode, so it is no longer responsible for global checkpoint tracking and the peer primary target takes over. However, until the cluster state update changes their respective shard states, the peer primary source may keep performing refreshes and publishing checkpoints concurrently with the RECOVERING peer primary target.

Note: The recovering target has the same primary term as the recovery source, so primary term checks alone during checkpoint publication and checkpoint processing might not be enough.

Describe the solution you'd like
Checkpoint publication needs to check primaryMode so that a shard that is not in primaryMode does not publish checkpoints, and the recovery target needs to discard any checkpoint it receives while it is in primaryMode during the peer recovery process.
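A minimal sketch of the publisher-side guard, to make the proposal concrete. All class and method names here (`Checkpoint`, `SketchShard`, `handOffRelocation`, `publishOnRefresh`) are hypothetical stand-ins for illustration and are not the OpenSearch classes; the only idea taken from the issue is that publication is skipped once the shard has left primaryMode, even though its primary term is unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checkpoint record: just a primary term and a version counter.
final class Checkpoint {
    final long primaryTerm;
    final long version;

    Checkpoint(long primaryTerm, long version) {
        this.primaryTerm = primaryTerm;
        this.version = version;
    }
}

// Hypothetical shard model; not org.opensearch.index.shard.IndexShard.
final class SketchShard {
    private final long primaryTerm;
    private volatile boolean primaryMode;
    private long version = 0;

    // Stands in for the transport call that would notify replicas.
    final List<Checkpoint> published = new ArrayList<>();

    SketchShard(long primaryTerm, boolean primaryMode) {
        this.primaryTerm = primaryTerm;
        this.primaryMode = primaryMode;
    }

    // Models the relocation hand-off: the source leaves primary mode,
    // but its primary term stays the same.
    void handOffRelocation() {
        primaryMode = false;
    }

    // Models a refresh/flush. Returns true only if a checkpoint was published.
    boolean publishOnRefresh() {
        version++;
        if (!primaryMode) {
            // Guard proposed in the issue: never publish outside primary mode.
            return false;
        }
        published.add(new Checkpoint(primaryTerm, version));
        return true;
    }
}
```

In this sketch, a refresh that happens after `handOffRelocation()` still advances the local version but publishes nothing, which is the behavior the term check alone cannot guarantee.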



Rishikesh1159 · 2022-07-27T16:36:59Z

Checklist of things to do:

Understand the current peer recovery process.
Look for existing peer recovery test cases that might help reproduce the issue.
Add a condition to check primaryMode.
Write a test case (unit or integration test) that checks primaryMode before publishing.
Make sure the test case written in the step above passes.

Rishikesh1159 · 2022-08-08T06:23:13Z

@Bukhtawar I have put out a small PR which checks whether the shard is in primaryMode before publishing a checkpoint. Please let me know your thoughts on this, and whether this check is enough to ensure we don't publish checkpoints when not in primaryMode.

I think we might need an integration test to cover the specific scenario of replica shard promotion to primary, but writing an integration test for this may have to wait until this issue is fixed.

Rishikesh1159 · 2022-08-08T06:41:50Z

@Bukhtawar From your suggested solution: "the recovery target to discard any checkpoints if it is in primaryMode during the peer recovery process."

After ensuring we don't publish a checkpoint when the shard is not in primaryMode, I wanted to add the primaryMode check you suggested on the checkpoint receiving end here, before calling onNewCheckpoint(), so that the checkpoint is discarded. However, I cannot check whether the shard is in primaryMode from the PublishCheckpointAction class, because I don't have access to the replication tracker and I am not in the same package.

A workaround is to change the access of getReplicationTracker() from default (package-private) to public. I am not sure whether making this method public is good practice.

Please let me know your thoughts on adding this primaryMode check on the checkpoint receiving end and discarding the checkpoint there. Also, is this check actually necessary, given that we only publish checkpoints in primaryMode?

mch2 · 2022-08-08T20:39:15Z

@Rishikesh1159 I don't think you need to change the visibility of the replication tracker. SegmentReplicationTargetService invokes IndexShard#shouldProcessCheckpoint; could you add the check there?
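A rough sketch of what a receiving-side check could look like, assuming the decision lives in one method on the shard as suggested. `ReceivedCheckpoint` and `ReceivingShard` are hypothetical illustration types, and the term and version comparisons are assumptions; only the method name `shouldProcessCheckpoint` and the idea of discarding checkpoints while in primaryMode come from this thread.

```java
// Hypothetical checkpoint as seen by the receiver.
final class ReceivedCheckpoint {
    final long primaryTerm;
    final long version;

    ReceivedCheckpoint(long primaryTerm, long version) {
        this.primaryTerm = primaryTerm;
        this.version = version;
    }
}

// Hypothetical receiving shard; not the real IndexShard.
final class ReceivingShard {
    private final long primaryTerm;
    private final boolean primaryMode;
    private final long latestVersion;

    ReceivingShard(long primaryTerm, boolean primaryMode, long latestVersion) {
        this.primaryTerm = primaryTerm;
        this.primaryMode = primaryMode;
        this.latestVersion = latestVersion;
    }

    // Decides whether an incoming checkpoint should trigger replication.
    boolean shouldProcessCheckpoint(ReceivedCheckpoint checkpoint) {
        if (primaryMode) {
            // A shard in primary mode (e.g. a relocation target that has
            // taken over) discards checkpoints from the old source.
            return false;
        }
        if (checkpoint.primaryTerm < primaryTerm) {
            // Assumed staleness check; on its own it cannot catch the
            // relocation case, because both shards share the same term.
            return false;
        }
        // Assumed: only checkpoints ahead of the local state are useful.
        return checkpoint.version > latestVersion;
    }
}
```

Keeping the check inside the shard means the caller does not need access to the replication tracker, so no visibility change is required.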

Rishikesh1159 · 2022-08-16T18:58:03Z

The logic and unit tests for this issue have been merged with this PR. Keeping this issue open until an integration test is added to cover it.

dreamer-89 · 2022-09-20T18:21:02Z

@Rishikesh1159 : One idea for mimicking this situation is to mock the transport action responsible for updating the cluster state. There is a sample test in SegmentReplicationIT.

@Bukhtawar : Do you have better ideas around mimicking this scenario ?

Bukhtawar added enhancement Enhancement or improvement to existing feature or request untriaged labels Jul 15, 2022

Bukhtawar mentioned this issue Jul 15, 2022

[Remote Segment Store] Failure handling #3906

Closed

owaiskazi19 added distributed framework and removed untriaged labels Jul 15, 2022

This was referenced Jul 19, 2022

[META] Segment Replication Issue list #2194

Closed

[Segment Replication] Experimental Release Tracking #3969

Closed

mch2 assigned Rishikesh1159 Jul 25, 2022

Rishikesh1159 mentioned this issue Aug 8, 2022

[Segment Replication] Adding PrimaryMode check before publishing checkpoint and processing a received checkpoint. #4157

Merged

Rishikesh1159 mentioned this issue Sep 29, 2022

[Segment Replication ] Add shard routing primary check when processing a checkpoint. #4630

Merged


Rishikesh1159 mentioned this issue Oct 10, 2022

[Backport 2.x] [Segment Replication ] Add shard routing primary check when processing a checkpoint. #4716

Merged


mch2 mentioned this issue Nov 14, 2022

[Segment Replication] [BUG] Primary to primary recovery (relocation) breaks with segment conflicts. #5242

Closed

anasalkouz added Migration:In Progress and removed Migration:In Progress labels Mar 17, 2023

Bukhtawar added the Indexing:Replication Issues and PRs related to core replication framework eg segrep label Jul 27, 2023

anasalkouz removed the distributed framework label Sep 19, 2023
