
Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation #8889

Merged
merged 8 commits into opensearch-project:main on Aug 3, 2023

Conversation

mch2
Member

@mch2 mch2 commented Jul 26, 2023

Description

This test now occasionally fails with replicas holding 0 documents while they are expected to be caught up to the primary. This occurs in a couple of ways:

  1. After dropping the old primary, the new primary does not publish a checkpoint to replicas unless it indexes docs from the translog after flipping to primary mode. If there is nothing to index, it will not publish a checkpoint, yet the other replica may never have synced with the original primary and is left out of date.
  • This PR fixes this by force-publishing a checkpoint after the new primary flips to primary mode.
  2. The replica receives a checkpoint post failover and, recognizing a primary term bump, cancels its sync with the former primary, which is still active. However, this cancellation is async, and immediately starting a new replication event can fail while the shard is still replicating.
  • This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind. (A hedged sketch of both fixes follows this list.)
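
A minimal, self-contained sketch of both fixes. The types and method names below (Shard, CheckpointPublisher, startReplication) are hypothetical stand-ins for the real IndexShard / SegmentReplicationTargetService plumbing, not the PR's actual diff:

// Simplified checkpoint: the real ReplicationCheckpoint carries more fields.
record ReplicationCheckpoint(long primaryTerm, long segmentInfosVersion) {}

interface Shard {
    boolean isFailed();
    ReplicationCheckpoint localCheckpoint();
    ReplicationCheckpoint latestReceivedCheckpoint();
}

interface CheckpointPublisher {
    void publish(ReplicationCheckpoint checkpoint);
}

final class FailoverFixes {
    private final CheckpointPublisher publisher;

    FailoverFixes(CheckpointPublisher publisher) {
        this.publisher = publisher;
    }

    // Fix 1: after the shard flips to primary mode, force-publish the latest
    // checkpoint even if translog replay indexed nothing, so a replica that
    // never synced with the old primary still catches up.
    void onPrimaryModeActivated(Shard shard) {
        publisher.publish(shard.localCheckpoint());
    }

    // Fix 2: a sync that fails because the previous (cancelled) sync is still
    // winding down is retried against the latest received checkpoint, as long
    // as the shard is healthy and still behind.
    void onReplicationFailure(Shard shard) {
        ReplicationCheckpoint latest = shard.latestReceivedCheckpoint();
        if (!shard.isFailed() && isBehind(shard.localCheckpoint(), latest)) {
            startReplication(shard, latest);
        }
    }

    private static boolean isBehind(ReplicationCheckpoint local, ReplicationCheckpoint received) {
        return received.primaryTerm() > local.primaryTerm()
            || received.segmentInfosVersion() > local.segmentInfosVersion();
    }

    private void startReplication(Shard shard, ReplicationCheckpoint checkpoint) {
        // elided: kick off a new segment replication event for `checkpoint`
    }
}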

This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primaries & replicas.

  • Ensures the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed. This keeps primary term checks accurate instead of relying on a locally computed operationPrimaryTerm.
  • Introduces a refresh listener for both primary & replica that updates the ReplicationCheckpoint and stores it in the replicationTracker post refresh, rather than redundantly computing it on every access.
  • Removes the unnecessary onCheckpointPublished method used to start replication timers manually. This now happens automatically on primaries through the refresh listener.
  • Implements the segmentInfosSnapshot method for NRTReplicationEngine, ensuring required segments are not removed while the ReplicationCheckpoint is computed post refresh. (A sketch of the listener follows this list.)
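
A hedged sketch of the refresh-listener change. It builds on Lucene's real ReferenceManager.RefreshListener interface, but TrackedShard and its methods are assumptions standing in for the IndexShard / ReplicationTracker wiring:

import org.apache.lucene.search.ReferenceManager;

// Simplified checkpoint; the real one carries more fields.
record ReplicationCheckpoint(long primaryTerm, long segmentInfosVersion) {}

// Hypothetical stand-in for the shard-side plumbing.
interface TrackedShard {
    // Assumed to take a segmentInfosSnapshot() first: incRef'ing the
    // SegmentInfos files so none can be deleted mid-computation, then
    // decRef'ing once the checkpoint is derived.
    ReplicationCheckpoint computeCheckpoint();

    // Stores the checkpoint in the replicationTracker for later reads.
    void updateReplicationCheckpoint(ReplicationCheckpoint checkpoint);
}

// Updates the tracked ReplicationCheckpoint once per refresh instead of
// recomputing it every time it is accessed.
final class CheckpointRefreshListener implements ReferenceManager.RefreshListener {
    private final TrackedShard shard;

    CheckpointRefreshListener(TrackedShard shard) {
        this.shard = shard;
    }

    @Override
    public void beforeRefresh() {
        // no-op: the checkpoint only changes after the refresh completes
    }

    @Override
    public void afterRefresh(boolean didRefresh) {
        if (didRefresh) {
            shard.updateReplicationCheckpoint(shard.computeCheckpoint());
        }
    }
}

On a primary this also replaces the manual onCheckpointPublished timer start: once the listener updates the local checkpoint, replication timers can begin from there.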

Related Issues

Resolves #8059

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


@Poojita-Raj
Contributor

#8850
Fixed: #8863
New failure: org.opensearch.index.shard.SegmentReplicationWithRemoteIndexShardTests.testReplicaSyncingFromRemoteStore

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@codecov

codecov bot commented Jul 26, 2023

Codecov Report

Merging #8889 (d0e15b8) into main (9720528) will decrease coverage by 0.03%.
Report is 2 commits behind head on main.
The diff coverage is 69.23%.

@@             Coverage Diff              @@
##               main    #8889      +/-   ##
============================================
- Coverage     71.01%   70.99%   -0.03%     
+ Complexity    57251    57223      -28     
============================================
  Files          4765     4765              
  Lines        270334   270357      +23     
  Branches      39538    39541       +3     
============================================
- Hits         191991   191950      -41     
- Misses        62176    62187      +11     
- Partials      16167    16220      +53     
Files Changed Coverage Δ
...search/index/shard/RemoteStoreRefreshListener.java 83.06% <0.00%> (-1.48%) ⬇️
...ckpoint/SegmentReplicationCheckpointPublisher.java 100.00% <ø> (ø)
...s/replication/SegmentReplicationTargetService.java 60.00% <57.69%> (+0.74%) ⬆️
.../opensearch/index/engine/NRTReplicationEngine.java 78.98% <66.66%> (-0.49%) ⬇️
.../indices/replication/SegmentReplicationTarget.java 89.71% <85.71%> (+0.39%) ⬆️
...in/java/org/opensearch/index/shard/IndexShard.java 69.49% <86.95%> (-0.20%) ⬇️

... and 495 files with indirect coverage changes

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@mch2
Member Author

mch2 commented Jul 27, 2023

Gradle Check (Jenkins) Run Completed with:

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':distribution:bwc:staged:buildBwcLinuxTar'.
> Building 2.9.0 didn't generate expected file /var/jenkins/workspace/gradle-check/search/distribution/bwc/staged/build/bwc/checkout-2.9/distribution/archives/linux-tar/build/distributions/opensearch-min-2.9.0-SNAPSHOT-linux-x64.tar.gz

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@Poojita-Raj
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: FAILURE ❌
  • URL: https://build.ci.opensearch.org/job/gradle-check/21068/
  • CommitID: 8476d5d (https://github.com/opensearch-project/OpenSearch/commit/8476d5d11705d84da690bb8935aca70da593327e)

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test (https://github.com/opensearch-project/OpenSearch/blob/main/DEVELOPER_GUIDE.md#flaky-tests) unrelated to your change?

Execution failed for task ':distribution:bwc:staged:buildBwcLinuxTar'.
Building 2.9.0 didn't generate expected file /var/jenkins/workspace/gradle-check/search/distribution/bwc/staged/build/bwc/checkout-2.9/distribution/archives/linux-tar/build/distributions/opensearch-min-2.9.0-SNAPSHOT-linux-x64.tar.gz

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@mch2
Member Author

mch2 commented Jul 27, 2023

Gradle Check (Jenkins) Run Completed with:

#8932

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@mch2
Member Author

mch2 commented Jul 27, 2023

Gradle Check (Jenkins) Run Completed with:

#8932 again

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadRangeBlobWithRetries

@mch2 mch2 marked this pull request as ready for review July 27, 2023 23:08
Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation.

This test is now occasionally failing with replicas having 0 documents. This occurs in a couple of ways:
1. After dropping the old primary the new primary is not publishing a checkpoint to replicas unless it indexes docs from translog after flipping to primary mode.
If there is nothing to index, it will not publish a checkpoint, but the other replica could have never synced with the original primary and be left out of date.
- This PR fixes this by force publishing a checkpoint after the new primary flips to primary mode.
2. The replica receives a checkpoint post failover and cancels its sync with the former primary that is still active, recognizing a primary term bump.
However this cancellation is async and immediately starting a new replication event could fail as it's still replicating.
- This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind.

This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primary & replicas.
- Ensure the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed. This ensures checks for primary term are accurate and not using a locally computed operationPrimaryTerm.
- Introduces a refresh listener for both primary & replica to update the ReplicationCheckpoint and store it in replicationTracker post refresh rather than redundantly computing when accessed.
- Removes unnecessary onCheckpointPublished method used to start replication timers manually. This will happen automatically on primaries once the local checkpoint is updated.

Signed-off-by: Marc Handalian <handalm@amazon.com>
To avoid divergent logic with remote store, we always incref/decref the segmentInfos.files(true) set, which includes the segments_N file.
A decref to 0 will attempt to delete the file from the store, and it's possible this _N file does not yet exist. This change ignores a NoSuchFileException raised while attempting the delete (see the sketch after this commit list).

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
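
A small sketch of the deletion change described in the commit message above. It uses Lucene's real Directory API; the surrounding Store refcounting logic is elided and deleteIfPresent is a hypothetical helper name:

import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.NoSuchFileException;
import org.apache.lucene.store.Directory;

final class RefCountedDeletes {
    // Called when a file's refcount drops to zero. The segments_N entry from
    // segmentInfos.files(true) may never have been written locally, so a
    // missing file is not an error here.
    static void deleteIfPresent(Directory directory, String fileName) throws IOException {
        try {
            directory.deleteFile(fileName);
        } catch (NoSuchFileException | FileNotFoundException e) {
            // The segments_N file does not exist on disk yet; nothing to delete.
        }
    }
}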

@opensearch-trigger-bot
Contributor

Compatibility status:



> Task :checkCompatibility
Incompatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/security-analytics.git]
Compatible components: [https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

BUILD SUCCESSFUL in 25m 32s


Signed-off-by: Marc Handalian <handalm@amazon.com>
@github-actions
Contributor

github-actions bot commented Aug 3, 2023

Gradle Check (Jenkins) Run Completed with:

@mch2
Member Author

mch2 commented Aug 3, 2023

Gradle Check (Jenkins) Run Completed with:

Execution failed for task ':test:fixtures:krb5kdc-fixture:composeBuild'.

@github-actions
Contributor

github-actions bot commented Aug 3, 2023

Gradle Check (Jenkins) Run Completed with:

@mch2 mch2 merged commit c3acf47 into opensearch-project:main Aug 3, 2023
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Aug 3, 2023
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-8889-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 c3acf47b4d643c3a3ab86dc3b07fe722ac6e4982
# Push it to GitHub
git push --set-upstream origin backport/backport-8889-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-8889-to-2.x.

VachaShah pushed a commit to VachaShah/OpenSearch that referenced this pull request Aug 3, 2023
Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation (opensearch-project#8889)
mch2 added a commit to mch2/OpenSearch that referenced this pull request Aug 3, 2023
Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation (opensearch-project#8889)
(cherry picked from commit c3acf47)
mch2 added a commit to mch2/OpenSearch that referenced this pull request Aug 3, 2023
Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation (opensearch-project#8889)
(cherry picked from commit c3acf47)
mch2 added a commit to mch2/OpenSearch that referenced this pull request Aug 3, 2023
Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation (opensearch-project#8889)
(cherry picked from commit c3acf47)
mch2 added a commit that referenced this pull request Aug 3, 2023
Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation (#8889) (#9095)
(cherry picked from commit c3acf47)
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation (opensearch-project#8889)
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation (opensearch-project#8889)
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation (opensearch-project#8889)