Fix bug where replication lag grows post primary relocation #11238

Merged
merged 5 commits into from
Dec 1, 2023

Member

@mch2 mch2 commented Nov 16, 2023

Description

This fixes an issue where replication lag grows after a primary relocation. After a relocation, the new primary publishes a checkpoint to sync with the replica. If the replica has not yet processed the cluster state update indicating that the relocation has completed, it responds to the old primary. If no further action is taken on the shard, the new primary will consider the replica stale. This change fixes that by triggering a round of replication from the replica once it has processed the cluster state update.
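The race described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `RelocationLagSketch`, its fields, and the node names are hypothetical and are not OpenSearch APIs, and "lag" is reduced here to a difference in checkpoint versions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the race; none of these names are OpenSearch APIs.
final class RelocationLagSketch {
    /** What each primary node believes the replica's checkpoint version to be. */
    final Map<String, Long> replicaVersionSeenBy = new HashMap<>();
    /** The primary node according to the replica's (possibly stale) cluster state. */
    String primaryKnownToReplica;
    long replicaVersion = 0;

    RelocationLagSketch(String initialPrimary) {
        this.primaryKnownToReplica = initialPrimary;
    }

    /** The replica syncs to a published checkpoint and reports to the primary it knows of. */
    void onCheckpoint(long version) {
        replicaVersion = Math.max(replicaVersion, version);
        replicaVersionSeenBy.put(primaryKnownToReplica, replicaVersion);
    }

    /** The replica applies a cluster state; with the fix it re-reports to a changed primary. */
    void onClusterState(String newPrimary, boolean withFix) {
        boolean primaryChanged = !newPrimary.equals(primaryKnownToReplica);
        primaryKnownToReplica = newPrimary;
        if (withFix && primaryChanged) {
            replicaVersionSeenBy.put(newPrimary, replicaVersion);
        }
    }

    /** Lag (in checkpoint versions) that the given primary computes for the replica. */
    long lagSeenBy(String primary, long primaryVersion) {
        return primaryVersion - replicaVersionSeenBy.getOrDefault(primary, 0L);
    }

    /** Replays the scenario: checkpoint arrives before the cluster state update. */
    static long run(boolean withFix) {
        RelocationLagSketch replica = new RelocationLagSketch("node-old");
        replica.onCheckpoint(5); // published by the new primary, acknowledged to node-old
        replica.onClusterState("node-new", withFix);
        return replica.lagSeenBy("node-new", 5);
    }

    public static void main(String[] args) {
        System.out.println("lag without fix: " + run(false));
        System.out.println("lag with fix:    " + run(true));
    }
}
```

In this model the replica is fully caught up in both runs; only the new primary's view of it differs, which matches the symptom of lag metrics staying high until the next operation on the shard.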

Related Issues

Resolves #11211

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

github-actions bot commented Nov 16, 2023

Compatibility status:

Checks if related components are compatible with change e4b35a7

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/sql.git]

github-actions bot added the bug (Something isn't working) and Indexing:Replication (Issues and PRs related to core replication framework eg segrep) labels Nov 16, 2023
Contributor

❌ Gradle check result for 49069a3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Member Author

mch2 commented Nov 29, 2023

The current solution breaks down when processing the latest received checkpoint, because the replica cannot tell whether an incoming checkpoint came from the former primary or the new one.

An alternative I've been testing is to add a cluster-changed listener: on a routing table change, we either process the latest checkpoint or update the new primary with the replica's state, like so:

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        if (event.routingTableChanged()) {
            // only consider replicas that have received a checkpoint
            for (ShardId shardId : latestReceivedCheckpoint.keySet()) {
                if (event.indexRoutingTableChanged(shardId.getIndexName())) {
                    final String previousNode = event.previousState().routingTable().shardRoutingTable(shardId).primaryShard().currentNodeId();
                    final String currentNode = event.state().routingTable().shardRoutingTable(shardId).primaryShard().currentNodeId();
                    if (previousNode.equals(currentNode) == false) {
                        IndexShard shard = indicesService.getShardOrNull(shardId);
                        if (shard != null && shard.routingEntry().primary() == false) {
                            processLatestReceivedCheckpoint(shard, Thread.currentThread());
                        }
                    }
                }
            }
        }
    }

This would cover our post-relocation case, guaranteeing that the replica updates the correct node.
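The core check in the listener above, comparing the primary's node before and after a routing change, can be sketched in isolation. This is a minimal illustration, not the merged implementation: routing tables are replaced by plain maps from a shard name to the primary's node id, and the class and method names are hypothetical. The sketch also assumes a primary's node id may be absent (for example while unassigned), so it compares with `Objects.equals` rather than calling `equals` on a possibly null value.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: plain maps stand in for the previous and current routing tables.
final class PrimaryChangeSketch {
    /**
     * Returns the tracked shards whose primary now lives on a different node.
     * A shard with no assigned primary in the current snapshot is skipped.
     */
    static List<String> shardsWithNewPrimary(
        Collection<String> trackedShards,
        Map<String, String> previousPrimaryNode,
        Map<String, String> currentPrimaryNode
    ) {
        List<String> changed = new ArrayList<>();
        for (String shard : trackedShards) {
            String before = previousPrimaryNode.get(shard); // null if absent
            String after = currentPrimaryNode.get(shard);   // null if absent
            if (after != null && !Objects.equals(before, after)) {
                changed.add(shard);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<String> tracked = List.of("idx[0]", "idx[1]", "idx[2]");
        Map<String, String> previous = Map.of("idx[0]", "node-a", "idx[1]", "node-b");
        Map<String, String> current = Map.of("idx[0]", "node-c", "idx[1]", "node-b", "idx[2]", "node-d");
        System.out.println(shardsWithNewPrimary(tracked, previous, current));
    }
}
```

Only shards the replica is already tracking are inspected, mirroring the loop over `latestReceivedCheckpoint.keySet()` in the snippet above.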

Alternatively, we could look up the current primary on the replica and, if its routing entry shows a relocation in progress, send the update to both the old and new nodes. However, this still leaves us dependent on the cluster state arriving at a particular time.
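A rough sketch of that second alternative, for comparison. The `PrimaryRouting` type below is a hypothetical stand-in for the primary's routing entry as seen by the replica; it is not the OpenSearch `ShardRouting` class, and the field names are assumptions made for illustration.

```java
import java.util.List;

// Hypothetical sketch: these types only mimic the shape of a routing entry.
final class CheckpointTargetsSketch {
    /** Minimal stand-in for the primary's routing entry as seen by the replica. */
    static final class PrimaryRouting {
        final String currentNodeId;
        final String relocatingNodeId; // null when no relocation is in progress

        PrimaryRouting(String currentNodeId, String relocatingNodeId) {
            this.currentNodeId = currentNodeId;
            this.relocatingNodeId = relocatingNodeId;
        }
    }

    /** Nodes that should receive the replica's state update. */
    static List<String> nodesToUpdate(PrimaryRouting primary) {
        if (primary.relocatingNodeId != null) {
            // Relocation in progress: notify both the old and the new node.
            return List.of(primary.currentNodeId, primary.relocatingNodeId);
        }
        return List.of(primary.currentNodeId);
    }

    public static void main(String[] args) {
        System.out.println(nodesToUpdate(new PrimaryRouting("node-old", "node-new")));
        System.out.println(nodesToUpdate(new PrimaryRouting("node-new", null)));
    }
}
```

If the replica's view of the routing entry is stale and does not yet show the relocation, only the old node is notified, which is the timing dependence noted above.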

Member Author

mch2 commented Nov 30, 2023

Resolved comments made against removed code as obsolete.

Contributor

✅ Gradle check result for 33a7ec1: SUCCESS


codecov bot commented Nov 30, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (21c0597) 71.28% compared to head (e4b35a7) 71.16%.

| Files | Patch % | Lines |
|---|---|---|
| ...s/replication/SegmentReplicationTargetService.java | 92.85% | 0 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #11238      +/-   ##
============================================
- Coverage     71.28%   71.16%   -0.12%     
+ Complexity    59033    58970      -63     
============================================
  Files          4893     4893              
  Lines        277753   277780      +27     
  Branches      40357    40363       +6     
============================================
- Hits         197989   197681     -308     
- Misses        63288    63659     +371     
+ Partials      16476    16440      -36     

Contributor

❕ Gradle check result for 08d58b7: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.search.SearchWeightedRoutingIT.testStrictWeightedRoutingWithCustomString_FailOpenEnabled

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Contributor

❌ Gradle check result for 6c93efb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

✅ Gradle check result for 6c93efb: SUCCESS

Contributor

✅ Gradle check result for 3aeeaed: SUCCESS

Member Author

mch2 commented Dec 1, 2023

❕ Gradle check result for 08d58b7: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.search.SearchWeightedRoutingIT.testStrictWeightedRoutingWithCustomString_FailOpenEnabled

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

#8030

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
Member Author

mch2 commented Dec 1, 2023

Added a changelog entry here as it's user-facing; needed to rebase to avoid a conflict.

Contributor

github-actions bot commented Dec 1, 2023

❌ Gradle check result for e4b35a7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Dec 1, 2023

❕ Gradle check result for e4b35a7: UNSTABLE

  • TEST FAILURES:
      2 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Member Author

mch2 commented Dec 1, 2023

❕ Gradle check result for e4b35a7: UNSTABLE

  • TEST FAILURES:
      2 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

#9191

mch2 added the backport 2.x (Backport to 2.x branch) label Dec 1, 2023
mch2 merged commit 6fa3a0d into opensearch-project:main Dec 1, 2023
32 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 1, 2023
* Fix bug where replication lag grows post primary relocation

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* Fix broken UT

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* add unit test for cluster state update

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* PR feedback

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* add changelog entry

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

---------

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
(cherry picked from commit 6fa3a0d)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mch2 pushed a commit that referenced this pull request Dec 1, 2023
…11427)

* Fix bug where replication lag grows post primary relocation

* Fix broken UT

* add unit test for cluster state update

* PR feedback

* add changelog entry

---------

(cherry picked from commit 6fa3a0d)

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Comment on lines +166 to +168
protected void doClose() throws IOException {

}
Collaborator

Should doClose replicate doStop?

Member Author

It doesn't need to, as stop is invoked before close when the node is shut down from Node#close. However, I don't see this getting explicitly closed there, along with multiple other services extending AbstractLifecycleComponent, e.g. PeerRecoverySourceService. Unless I'm missing something here, I will raise a PR to get these added.
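The ordering argument in this reply can be sketched with a toy lifecycle. `LifecycleSketch` and `ReplicationServiceSketch` are hypothetical, simplified stand-ins and do not reproduce `AbstractLifecycleComponent`; in particular, having `close()` stop the component first is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of a shutdown sequence: stop first, then close.
final class LifecycleOrderDemo {
    /** Mimics a node shutdown and returns the hooks that ran, in order. */
    static List<String> shutdown() {
        ReplicationServiceSketch service = new ReplicationServiceSketch();
        service.stop();
        service.close();
        return service.events;
    }

    public static void main(String[] args) {
        System.out.println(shutdown());
    }
}

// Simplified stand-in for a lifecycle base class (an assumption, not the real class).
abstract class LifecycleSketch {
    private boolean stopped;
    private boolean closed;

    final void stop() {
        if (!stopped) {
            stopped = true;
            doStop();
        }
    }

    final void close() {
        stop(); // sketch's choice: a component is always stopped before it is closed
        if (!closed) {
            closed = true;
            doClose();
        }
    }

    protected abstract void doStop();

    protected abstract void doClose();
}

final class ReplicationServiceSketch extends LifecycleSketch {
    final List<String> events = new ArrayList<>();

    @Override
    protected void doStop() {
        events.add("doStop"); // cleanup such as cancelling ongoing work would go here
    }

    @Override
    protected void doClose() {
        events.add("doClose"); // nothing left to release once doStop has run
    }
}
```

Because `doStop` always runs before `doClose` in this model, an empty `doClose` does not skip any cleanup, provided the component is actually stopped or closed by its owner.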

deshsidd pushed a commit to deshsidd/OpenSearch that referenced this pull request Dec 11, 2023
…ch-project#11238)

rayshrey pushed a commit to rayshrey/OpenSearch that referenced this pull request Mar 18, 2024
…ch-project#11238)

shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…ch-project#11238)

Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Labels
backport 2.x Backport to 2.x branch bug Something isn't working Indexing:Replication Issues and PRs related to core replication framework eg segrep
Development

Successfully merging this pull request may close these issues.

[BUG] Segment Replication - SegRep bytes behind and lag metrics incorrect post primary relocation
4 participants