Segment Replication - Fix ShardLockObtained error during corruption cases #10370
Conversation
Segment Replication - Fix ShardLockObtained error during corruption cases

This change fixes a bug where shards could not be recreated locally after corruption. This occurred because the store was not decref'd to 0 if the commit on close failed with a corruption exception.

Signed-off-by: Marc Handalian <handalm@amazon.com>
Codecov Report
@@             Coverage Diff              @@
##               main   #10370      +/-   ##
============================================
- Coverage     71.21%   71.10%    -0.11%
+ Complexity    58337    58265       -72
============================================
  Files          4832     4832
  Lines        274828   274850       +22
  Branches      40043    40045        +2
============================================
- Hits         195708   195422      -286
- Misses        62728    63011      +283
- Partials      16392    16417       +25
Grr, this IT is still showing as flaky; yanking it.
From this trace it looks like multiple nodes are spinning up (node_t0-node_t4). I haven't seen this locally, but it looks like the test is spawning additional data nodes instead of recovering on the original node/shard path.
Ok, found the cause here and fixing it in the next commit. The issue occurs when a flush fails after we cause corruption: during segment replication this throws a FlushFailedException, which is not treated as a corruption case, and we assume the ReplicationTarget threads have already been cancelled. This case needs to be treated as corruption so the shard is failed.
Fixed this by ensuring we fail the engine on corruption if a flush fails in NRTReplicationEngine. Also cleaned up exception handling in the finalizeReplication method.
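The fix described above can be illustrated with a hedged sketch. The class and method names below are simplified stand-ins, not the actual OpenSearch NRTReplicationEngine API: the point is that a flush failure whose cause chain contains corruption must fail the engine before the flush exception is rethrown, rather than being treated as an already-cancelled replication.

```java
// Simplified stand-ins for the engine/exception types; not the real API.
class CorruptStoreException extends RuntimeException {
    CorruptStoreException(String msg) { super(msg); }
}

class FlushFailedSketchException extends RuntimeException {
    FlushFailedSketchException(Throwable cause) { super("flush failed", cause); }
}

class NRTEngineSketch {
    String failureReason; // non-null once the engine has been failed

    void failEngine(String reason) { failureReason = reason; }

    // Before the fix, the corruption underneath a failed flush was never
    // surfaced: the flush exception was rethrown as-is and the shard
    // stayed up with a corrupt store.
    void flush(Runnable commit) {
        try {
            commit.run();
        } catch (RuntimeException e) {
            if (isCorruption(e)) {
                failEngine("corruption detected during flush"); // fail the shard
            }
            throw new FlushFailedSketchException(e);
        }
    }

    // Walk the cause chain looking for a corruption-type exception.
    static boolean isCorruption(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof CorruptStoreException) return true;
        }
        return false;
    }
}
```

A non-corruption flush failure (e.g. disk full) still throws, but deliberately does not fail the engine, preserving the original non-corruption behavior.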
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTarget.java
Just some annoying test nitpicks that can be followed up (or ignored) in a subsequent PR.
...er/src/internalClusterTest/java/org/opensearch/indices/replication/SegmentReplicationIT.java
The backport to 2.x failed.
To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-10370-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 651a9aa2ca05244e74bcaabcc89d878977efe259
# Push it to GitHub
git push --set-upstream origin backport/backport-10370-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-10370-to-2.x.
Segment Replication - Fix ShardLockObtained error during corruption cases (opensearch-project#10370)

* Segment Replication - Fix ShardLockObtained error during corruption cases. This change fixes a bug where shards could not be recreated locally after corruption. This occurred because the store was not decref'd to 0 if the commit on close failed with a corruption exception.
* Remove extra logs
* Remove flaky assertion on store refcount
* Remove flaky test.
* PR feedback. Remove hacky handling of corruption when fetching metadata. This will now check for store corruption when replication has failed and fail the shard accordingly. This commit also fixes logging in NRTReplicationEngine.
* Fix unit test.
* Fix test failure in testSegRepSucceedsOnPreviousCopiedFiles. This test broke because we invoked target.indexShard on a closed replicationTarget. In these cases we can assume the store is not corrupt.
* spotless
* Revert flaky IT
* Fix flakiness failure by expecting RTE when check index fails.
* Reintroduce ITs and use the recoveries API instead of waiting on shard state.
* Fix edge case where flush failures would not get reported as corruption.

Signed-off-by: Marc Handalian <handalm@amazon.com>
The change was backported to 2.x via #10418 and #10429 (cherry picked from commit cdf5e1a), with one additional commit to fix a breaking change present only on main.

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
This change fixes a bug where shards could not be recreated locally after corruption. This occurred because the store was not decref'd to 0 if the commit on close failed with a corruption exception.
This change handles failures during the commit on close and marks the store corrupt when such a failure occurs.
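A minimal, self-contained sketch of this close-path fix, assuming simplified stand-ins for the real OpenSearch Store and Engine classes: even when the final commit during close throws a corruption exception, the engine's store reference must still be released so the refcount can reach 0 and the shard lock can be obtained again when the shard is recreated.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in for a corruption exception thrown by the commit.
class CorruptCommitException extends RuntimeException {
    CorruptCommitException(String msg) { super(msg); }
}

// Illustrative stand-in for a refcounted store.
class StoreSketch {
    private final AtomicInteger refCount = new AtomicInteger(1); // owner's ref
    boolean corrupted;

    void incRef() { refCount.incrementAndGet(); }
    void decRef() { refCount.decrementAndGet(); }
    int refCount() { return refCount.get(); }
    void markStoreCorrupted() { corrupted = true; }
}

class EngineSketch {
    private final StoreSketch store;
    private final boolean commitFails;

    EngineSketch(StoreSketch store, boolean commitFails) {
        this.store = store;
        this.commitFails = commitFails;
        store.incRef(); // the engine holds a reference while open
    }

    void close() {
        try {
            if (commitFails) {
                throw new CorruptCommitException("commit on close hit corruption");
            }
        } catch (CorruptCommitException e) {
            store.markStoreCorrupted(); // record corruption so the shard is failed
            throw e;
        } finally {
            // The bug: without this finally, a failed commit leaked the
            // engine's reference, the refcount never reached 0, and shard
            // re-creation hit ShardLockObtained errors.
            store.decRef();
        }
    }
}
```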
This change also handles failures while fetching local metadata by wrapping them as a corruption exception; without this, shards would indefinitely retry and fail replication events. Rethrowing the exception as a corruption exception ensures the shard is properly failed and re-recovered. For this to happen, a file would have to be wiped directly from the store, which would otherwise require a node reboot to correct.
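The metadata-fetch handling can be sketched as follows. The names are illustrative stand-ins (the real code deals with Lucene's CorruptIndexException and Store metadata snapshots): a read failure is rethrown as a corruption-style exception so the replication failure handler fails the shard instead of scheduling endless retries.

```java
import java.io.IOException;

class MetadataSketch {
    // Hypothetical stand-in for reading a local segment-metadata snapshot.
    static String readLocalMetadata(boolean segmentsFileMissing) {
        try {
            if (segmentsFileMissing) {
                throw new IOException("segments file unreadable");
            }
            return "metadata-snapshot";
        } catch (IOException e) {
            // Rethrow as a corruption-style exception so the shard is
            // failed and re-recovered rather than retrying indefinitely.
            throw new IllegalStateException(
                "local metadata unreadable; treating store as corrupted", e);
        }
    }
}
```

On the happy path the snapshot is returned unchanged; only the failure path is rewrapped, preserving the original IOException as the cause for diagnostics.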
Added some ITs that test these corruption cases: the primary sending faulty data to the replica, and a file being wiped out from underneath the shard.
Related Issues
Resolves #8286
Resolves #9966
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.