
Segment Replication - Fix ShardLockObtained error during corruption cases #10370

Merged: 12 commits, Oct 5, 2023

Conversation

@mch2 (Member) commented Oct 4, 2023

Description

This change fixes a bug where shards could not be recreated locally after corruption. This occurred because the store was not decref'd to 0 if the commit on close failed with a corruption exception.

This change handles failures during the commit on close and marks the store corrupt when that happens.
This change also handles failures while fetching local metadata by wrapping them in a corruption exception; without this, shards would indefinitely retry and fail replication events. Rethrowing the exception as a corruption exception ensures the shard is properly failed and re-recovered. For this to happen, a file would have to be wiped directly from the store, and previously a node reboot was required to correct it.
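A minimal Java sketch of the shape described above (illustrative only; StoreRef, markCorrupted, and closeEngine are stand-ins for the real store and engine close path, not OpenSearch's API): if the final commit hits corruption, record it on the store and still release the reference so the ref count can reach zero and the shard lock can be re-obtained when the shard is recreated.

import java.io.IOException;
import org.apache.lucene.index.CorruptIndexException;

// Illustrative sketch only: StoreRef stands in for the relevant parts of the store,
// and closeEngine stands in for the engine close path described in this PR.
class CommitOnCloseSketch {

    interface StoreRef {
        void markCorrupted(IOException reason) throws IOException; // record corruption so the shard can be failed and re-recovered
        void decRef();                                              // release one reference to the underlying directory
    }

    interface Commit {
        void run() throws IOException; // the commit performed while closing the engine
    }

    static void closeEngine(StoreRef store, Commit commitOnClose) throws IOException {
        try {
            commitOnClose.run(); // may fail if segment files on disk are corrupted
        } catch (CorruptIndexException e) {
            try {
                store.markCorrupted(e); // mark the store corrupt so recovery reacts to it
            } catch (IOException inner) {
                e.addSuppressed(inner);
            }
            throw e;
        } finally {
            // Always release the reference: if the count never reaches zero, the shard lock
            // is never freed and local re-creation fails with ShardLockObtained errors.
            store.decRef();
        }
    }
}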

Added some ITs that test these corruption cases: the primary sending faulty data to the replica, and a file being wiped out from under the shard.

Related Issues

Resolves #8286
Resolves #9966

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • GitHub issue/PR created in OpenSearch documentation repo for the required public documentation changes (#[Issue/PR number])

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions bot added labels: bug (Something isn't working), Indexing (Indexing, Bulk Indexing and anything related to indexing), Indexing:Replication (Issues and PRs related to core replication framework eg segrep), v2.11.0 (Issues and PRs related to version 2.11.0) on Oct 4, 2023
…ases

This change fixes a bug where shards could not be recreated locally after corruption.
This occurred because the store was not decref'd to 0 if the commit on close would fail
with a corruption exception.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Signed-off-by: Marc Handalian <handalm@amazon.com>
…ate.

Signed-off-by: Marc Handalian <handalm@amazon.com>
@github-actions bot (Contributor) commented Oct 5, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions bot (Contributor) commented Oct 5, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testWipeSegmentBetweenSyncs

@codecov bot commented Oct 5, 2023

Codecov Report

Merging #10370 (2b80149) into main (a0cb344) will decrease coverage by 0.11%.
Report is 14 commits behind head on main.
The diff coverage is 50.00%.

@@             Coverage Diff              @@
##               main   #10370      +/-   ##
============================================
- Coverage     71.21%   71.10%   -0.11%     
+ Complexity    58337    58265      -72     
============================================
  Files          4832     4832              
  Lines        274828   274850      +22     
  Branches      40043    40045       +2     
============================================
- Hits         195708   195422     -286     
- Misses        62728    63011     +283     
- Partials      16392    16417      +25     
Files Coverage Δ
.../indices/replication/SegmentReplicationTarget.java 84.87% <33.33%> (ø)
...rc/main/java/org/opensearch/index/store/Store.java 81.10% <40.00%> (+1.10%) ⬆️
...s/replication/SegmentReplicationTargetService.java 53.87% <50.00%> (-3.08%) ⬇️
.../opensearch/index/engine/NRTReplicationEngine.java 78.19% <57.14%> (-0.58%) ⬇️

... and 482 files with indirect coverage changes

@mch2 (Member, Author) commented Oct 5, 2023

grr, this IT is still showing as flaky; yanking it.

@mch2 (Member, Author) commented Oct 5, 2023

From this trace it looks like multiple nodes are spinning up (node_t0 through node_t4). I haven't seen this locally, but it looks like the test is spawning additional data nodes instead of recovering on the original node/shard path.

@mch2 (Member, Author) commented Oct 5, 2023

OK, found the cause and fixing it in the next commit. The issue occurs when a flush fails after we cause corruption: during segment replication this throws a FlushFailedException, which is not treated as a corruption case, and we assume the ReplicationTarget threads have already been cancelled. This case needs to be treated as corruption so the shard is failed.

Signed-off-by: Marc Handalian <handalm@amazon.com>
@mch2 (Member, Author) commented Oct 5, 2023

Fixed this by ensuring we fail the engine on corruption when a flush fails in NRTReplicationEngine. Also cleaned up exception handling in the finalizeReplication method.
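As a rough sketch of that flush handling (illustrative only; Flush, unwrapCorruption, and failShard are hypothetical stand-ins, not the actual NRTReplicationEngine methods): inspect the cause chain of a failed flush and, if corruption is found, fail the shard instead of treating the failure as an ordinary cancellation.

import java.io.IOException;
import java.util.function.Consumer;
import org.apache.lucene.index.CorruptIndexException;

// Sketch of the flush-failure handling described above; names are hypothetical.
class FlushFailureSketch {

    interface Flush {
        void run() throws IOException; // the engine flush/commit that may fail
    }

    // Walk the cause chain looking for a Lucene corruption exception.
    static CorruptIndexException unwrapCorruption(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof CorruptIndexException) {
                return (CorruptIndexException) cur;
            }
        }
        return null;
    }

    static void flushOrFailShard(Flush flush, Consumer<Exception> failShard) throws IOException {
        try {
            flush.run();
        } catch (IOException e) {
            if (unwrapCorruption(e) != null) {
                // Corruption hiding behind the flush failure: fail the shard so it is
                // re-recovered, instead of assuming replication was simply cancelled.
                failShard.accept(e);
            }
            throw e; // still surface the flush failure to the caller
        }
    }
}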

@mch2 mch2 requested a review from andrross October 5, 2023 19:26
@github-actions bot (Contributor) commented Oct 5, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testMultipleShards

@mch2 (Member, Author) commented Oct 5, 2023

#9712

@andrross (Member) left a comment


Just some annoying test nitpicks that can be followed up (or ignored) in a subsequent PR

@andrross andrross merged commit 651a9aa into opensearch-project:main Oct 5, 2023
12 of 15 checks passed
@opensearch-trigger-bot (Contributor)

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-10370-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 651a9aa2ca05244e74bcaabcc89d878977efe259
# Push it to GitHub
git push --set-upstream origin backport/backport-10370-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-10370-to-2.x.

mch2 added a commit to mch2/OpenSearch that referenced this pull request Oct 5, 2023
…ases (opensearch-project#10370)

* Segment Replication - Fix ShardLockObtained error during corruption cases

This change fixes a bug where shards could not be recreated locally after corruption.
This occurred because the store was not decref'd to 0 if the commit on close would fail
with a corruption exception.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove extra logs

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove flaky assertion on store refcount

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove flaky test.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* PR Feedback.

Remove hacky handling of corruption when fetching metadata.  This will now check for store corruption
when replication has failed and fail the shard accordingly.

This commit also fixes logging in NRTReplicationEngine.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix unit test.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix test failure testSegRepSucceedsOnPreviousCopiedFiles.

This test broke because we invoked target.indexShard on a closed replicationTarget.
In these cases we can assume the store is not corrupt.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* spotless

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Revert flaky IT

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix flakiness failure by expecting RTE when check index fails.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* reintroduce ITs and use recoveries API instead of waiting on shard state.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix edge case where flush failures would not get reported as corruption.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
kotwanikunal pushed a commit that referenced this pull request Oct 5, 2023
…ng corruption cases #10370 (#10418)

* Segment Replication - Fix ShardLockObtained error during corruption cases (#10370)

* Fix breaking change only on main.

Signed-off-by: Marc Handalian <handalm@amazon.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 5, 2023
…ng corruption cases #10370 (#10418)

* Segment Replication - Fix ShardLockObtained error during corruption cases (#10370)

* Fix breaking change only on main.

Signed-off-by: Marc Handalian <handalm@amazon.com>
(cherry picked from commit cdf5e1a)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kotwanikunal pushed a commit that referenced this pull request Oct 5, 2023
…ng corruption cases #10370 (#10418) (#10429)

* Segment Replication - Fix ShardLockObtained error during corruption cases (#10370)

* Fix breaking change only on main.

(cherry picked from commit cdf5e1a)

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Oct 6, 2023
…ases (opensearch-project#10370)

* Segment Replication - Fix ShardLockObtained error during corruption cases

Signed-off-by: Marc Handalian <handalm@amazon.com>
deshsidd pushed a commit to deshsidd/OpenSearch that referenced this pull request Oct 9, 2023
…ases (opensearch-project#10370)

* Segment Replication - Fix ShardLockObtained error during corruption cases

Signed-off-by: Marc Handalian <handalm@amazon.com>
austintlee pushed a commit to austintlee/OpenSearch that referenced this pull request Oct 23, 2023
…ases (opensearch-project#10370)

* Segment Replication - Fix ShardLockObtained error during corruption cases

Signed-off-by: Marc Handalian <handalm@amazon.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…ases (opensearch-project#10370)

* Segment Replication - Fix ShardLockObtained error during corruption cases

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Labels
backport 2.x (Backport to 2.x branch), backport-failed, bug (Something isn't working), Indexing:Replication (Issues and PRs related to core replication framework eg segrep), Indexing (Indexing, Bulk Indexing and anything related to indexing), v2.11.0 (Issues and PRs related to version 2.11.0)
Projects
None yet
3 participants