Optimise snapshot deletion to speed up snapshot deletion and creation #15568

Merged


Member

@ashking94 ashking94 commented Sep 2, 2024

Description

Snapshot creation is distributed in nature: the data node holding each primary shard performs the snapshot work for that shard, so the total snapshot creation work is shared among all the data nodes in the cluster. Snapshot deletion, by contrast, is handled solely by the active cluster manager. This can make snapshot deletion excessively slow when the cluster has a relatively high number of primary shards.

In this PR, we address this problem by creating a separate thread pool that is responsible for performing snapshot deletion and for cleaning up old shard generations during snapshot creation. The thread count is set to 4x the number of allocated processors and is bounded between 64 and 256. The lower bound ensures there are enough threads to get the deletion done; the upper bound keeps the pool from eating into the connections needed by other remote store operations that may run on the same cluster.
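
The sizing rule described above can be sketched as a small clamp. This is a minimal illustration, not the code from this PR: the class name `SnapshotDeletionThreads`, the method `threadCount`, and the constant names are hypothetical, and only the 4x multiplier and the 64 to 256 bounds come from the description.

```java
/**
 * Hypothetical helper illustrating the sizing rule from the PR description.
 * The names here are assumptions made for this sketch, not OpenSearch APIs.
 */
final class SnapshotDeletionThreads {
    // Values taken from the PR description.
    static final int MULTIPLIER = 4;
    static final int MIN_THREADS = 64;
    static final int MAX_THREADS = 256;

    private SnapshotDeletionThreads() {}

    /**
     * Returns 4x the allocated processors, clamped to [64, 256].
     */
    static int threadCount(int allocatedProcessors) {
        if (allocatedProcessors < 1) {
            throw new IllegalArgumentException("allocatedProcessors must be >= 1");
        }
        int scaled = MULTIPLIER * allocatedProcessors;
        // Raise to the lower bound first, then cap at the upper bound.
        return Math.min(MAX_THREADS, Math.max(MIN_THREADS, scaled));
    }
}
```

Under this sketch, a node with 8 allocated processors gets 64 threads (4 x 8 = 32 is raised to the lower bound), a node with 32 gets 128, and a node with 96 gets 256 (384 is capped at the upper bound).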

Check List

  • [x] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@ashking94
Member Author

ashking94 commented Sep 2, 2024

Existing UTs and ITs cover the changed code.

Contributor

github-actions bot commented Sep 2, 2024

❌ Gradle check result for 33b0dd1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Contributor

github-actions bot commented Sep 2, 2024

❌ Gradle check result for 05757e2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…imisations

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Contributor

github-actions bot commented Sep 2, 2024

❌ Gradle check result for 7329e67: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…imisations

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: Ashish Singh <ssashish@amazon.com>
Contributor

github-actions bot commented Sep 3, 2024

❌ Gradle check result for 345a277: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Sep 3, 2024

❌ Gradle check result for e777412: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ashking94
Member Author

❌ Gradle check result for 345a277: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Flaky tests - #15600

Contributor

github-actions bot commented Sep 3, 2024

❌ Gradle check result for 739557a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…imisations

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Contributor

github-actions bot commented Sep 3, 2024

❕ Gradle check result for b9c0e4e: UNSTABLE

  • TEST FAILURES:
      2 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=snapshot.status/10_basic/Get missing snapshot status throws an exception}
      1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=snapshot.status/10_basic/Get snapshot status}

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Sep 3, 2024

Codecov Report

Attention: Patch coverage is 83.92857% with 9 lines in your changes missing coverage. Please review.

Project coverage is 71.87%. Comparing base (7a9cb35) to head (b9c0e4e).
Report is 35 commits behind head on main.

Files with missing lines | Patch % | Lines
...ch/repositories/blobstore/BlobStoreRepository.java | 82.69% | 9 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #15568      +/-   ##
============================================
+ Coverage     71.83%   71.87%   +0.04%     
- Complexity    63932    64008      +76     
============================================
  Files          5258     5258              
  Lines        299329   299398      +69     
  Branches      43259    43264       +5     
============================================
+ Hits         215010   215182     +172     
+ Misses        66587    66559      -28     
+ Partials      17732    17657      -75     


@sachinpkale sachinpkale merged commit 3fc0139 into opensearch-project:main Sep 3, 2024
34 checks passed
@ashking94 ashking94 deleted the snapshot-delete-optimisations branch September 3, 2024 18:13
Contributor

github-actions bot commented Sep 4, 2024

❌ Gradle check result for 739557a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ashking94 ashking94 added the backport 2.x Backport to 2.x branch label Sep 5, 2024
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-15568-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 3fc0139ca68a1ff843ec1492c3cd52c2c4c67f02
# Push it to GitHub
git push --set-upstream origin backport/backport-15568-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-15568-to-2.x.

ashking94 added a commit to ashking94/OpenSearch that referenced this pull request Sep 5, 2024
ashking94 added a commit to ashking94/OpenSearch that referenced this pull request Sep 5, 2024
sachinpkale pushed a commit that referenced this pull request Sep 5, 2024
…#15568) (#15725)

---------

Signed-off-by: Ashish Singh <ssashish@amazon.com>
ashking94 added a commit that referenced this pull request Sep 5, 2024
…#15568) (#15724)

---------

Signed-off-by: Ashish Singh <ssashish@amazon.com>
akolarkunnu pushed a commit to akolarkunnu/OpenSearch that referenced this pull request Sep 10, 2024
3 participants