[Segment Replication] [BUG] Primary to primary recovery (relocation) breaks with segment conflicts. #5242
Comments
Looking into it.
Of the three available fixes, the third (block the old primary from replicating) would cause delays on replicas and is not being considered. The first (bump the SegmentInfo counter) is hacky, and an ideal bump value is hard to identify. Based on the above, I will follow the second approach for this fix. Below is a rough plan for it.
With the integration test below (thanks @Rishikesh1159 for sharing this), the segment conflict occurs after relocation, when the new primary indexes a doc while a segment replication event is triggered on the older replica. The image below shows the steps of the relocation journey. No indexing operation is performed during relocation in this test case. Need to identify how solution 2 handles this case:
Tried out the changes on a 3 GB data set with 3 data nodes. Force merged the segments to 1 and saw the segrep round took
Closing as #5344 is merged.
From #4665 and related. Primary-to-primary recovery operates the same under segment replication as it does today, by using peer recovery.
This process works by copying segments out to the new primary, then replaying any operations received while the copy was in progress, followed by the relocation handoff. With segment replication, the old primary shard continues to copy segments out to the other replicas during the relocation process. Once the new primary is recovered, it reindexes the operations it received. This means the new primary reindexes operations whose segments have already been sent out to the replication group, causing a segment conflict: the replicas fail and must recover again from the new primary.
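The conflict can be reproduced in a small toy model. This is only a sketch, not OpenSearch or Lucene code: `ToyShard`, `find_conflicts`, `simulate`, and the `counter_bump` / `old_flushes` parameters are all hypothetical names made up for illustration. It assumes segment files are named from a per-shard counter rendered in base 36 (as Lucene does with `_0`, `_1`, ...), and that two primaries indexing the same operations independently never produce byte-identical segments.

```python
import hashlib

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def segment_name(counter: int) -> str:
    """Render a counter as a Lucene-style segment name: '_' + base-36 digits."""
    if counter == 0:
        return "_0"
    out = ""
    while counter:
        counter, rem = divmod(counter, 36)
        out = DIGITS[rem] + out
    return "_" + out


class ToyShard:
    """Hypothetical stand-in for a shard: a segment counter plus name -> checksum."""

    def __init__(self, node, counter=0, segments=None):
        self.node = node
        self.counter = counter
        self.segments = dict(segments or {})

    def flush(self, ops):
        """Write one new segment for `ops` and return its name."""
        name = segment_name(self.counter)
        self.counter += 1
        # Assumption: the bytes depend on which node wrote the segment,
        # so the same ops indexed on two nodes give different checksums.
        payload = f"{self.node}:{ops}".encode()
        self.segments[name] = hashlib.sha1(payload).hexdigest()[:8]
        return name


def find_conflicts(replica_segments, incoming_segments):
    """Names present on both sides whose checksums differ."""
    return sorted(
        name
        for name, checksum in incoming_segments.items()
        if name in replica_segments and replica_segments[name] != checksum
    )


def simulate(counter_bump=0, old_flushes=1):
    old_primary = ToyShard("old")
    old_primary.flush(["doc0"])  # segment _0 exists before relocation

    # Peer recovery: the new primary starts from a copy of the old
    # primary's segments and counter (optionally bumped).
    new_primary = ToyShard(
        "new", old_primary.counter + counter_bump, old_primary.segments
    )

    # During relocation the old primary keeps indexing and replicating.
    received = []
    for i in range(old_flushes):
        op = f"doc{i + 1}"
        received.append(op)
        old_primary.flush([op])
    replica = dict(old_primary.segments)  # replica has the old primary's files

    # After handoff the new primary reindexes the received operations.
    new_primary.flush(received)
    return find_conflicts(replica, new_primary.segments)


print(simulate())                               # same name, different bytes
print(simulate(counter_bump=1))                 # bump large enough: no clash
print(simulate(counter_bump=1, old_flushes=2))  # bump too small: clash again
```

In this model the replica already holds `_1` from the old primary when the new primary writes its own `_1`, which is the conflict described above. The last call also hints at why picking a bump value for the counter-bump idea is tricky: the bump has to exceed however many segments the old primary creates during relocation, which is not known in advance.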
Ideas to fix: