Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigation for remote snapshot filecache overflow #15077

Merged
merged 6 commits into from
Sep 5, 2024

Conversation

finnegancarroll
Copy link
Contributor

@finnegancarroll finnegancarroll commented Aug 2, 2024

Description

Remote snapshot file_cache does not strictly limit it's disk usage to its configured capacity. This change causes subsequent requests to the file cache to fail when the cache overflows. #11676 for more details.

Related Issues

Mitigation for #11676

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

github-actions bot commented Aug 2, 2024

❌ Gradle check result for 9b494b6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@sandervandegeijn
Copy link

Thanks for the effort Finn!

@sandervandegeijn
Copy link

Looking at the code:

if (fileCache.usage().usage() > fileCache.capacity()) {
                System.gc();

Would it be possible to do this at 70-90% of capacity and at > 100%? At >100% you are already too late right?

Copy link
Contributor

github-actions bot commented Aug 3, 2024

❌ Gradle check result for 6f86ed4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@finnegancarroll
Copy link
Contributor Author

> Task :server:forbiddenApisMain FAILED
Forbidden method invocation: java.lang.System#gc() [Please do not try to stop the world]

@andrross tests which exceed the cache capacity have drastically fewer failed requests with 'System.gc()' and a small sleep. Is this appropriate for a mitigation or is it preferable to just fail and make no attempt to prune the cache here?

@andrross
Copy link
Member

andrross commented Aug 5, 2024

> Task :server:forbiddenApisMain FAILED
Forbidden method invocation: java.lang.System#gc() [Please do not try to stop the world]

@andrross tests which exceed the cache capacity have drastically fewer failed requests with 'System.gc()' and a small sleep. Is this appropriate for a mitigation or is it preferable to just fail and make no attempt to prune the cache here?

@finnegancarroll Yeah we shouldn't be trying to invoke GC manually. The better way to do this, if we need to do this at all, is to do the tracking of the unclosed cloned objects in the parent via weak references, and then deterministically close everything when the parent is closed. I think that would have the same effect, because anything that is eligible for garbage collection must have already had the originating input stream closed.

Copy link
Contributor

github-actions bot commented Aug 5, 2024

❌ Gradle check result for 38b3b95: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Aug 5, 2024

❌ Gradle check result for f72d436: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@finnegancarroll
Copy link
Contributor Author

Reverted to just failing requests on cache overflow.

Copy link
Contributor

github-actions bot commented Sep 5, 2024

❕ Gradle check result for d3d77d1: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@andrross andrross merged commit 8f34ce5 into opensearch-project:main Sep 5, 2024
38 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 5, 2024
TransferManager fails BlobFetchRequest on full cache

Signed-off-by: Finn Carroll <carrofin@amazon.com>
(cherry picked from commit 8f34ce5)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 5, 2024
TransferManager fails BlobFetchRequest on full cache

Signed-off-by: Finn Carroll <carrofin@amazon.com>
(cherry picked from commit 8f34ce5)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
andrross pushed a commit that referenced this pull request Sep 5, 2024
TransferManager fails BlobFetchRequest on full cache

Signed-off-by: Finn Carroll <carrofin@amazon.com>
(cherry picked from commit 8f34ce5)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
andrross pushed a commit that referenced this pull request Sep 5, 2024
TransferManager fails BlobFetchRequest on full cache


(cherry picked from commit 8f34ce5)

Signed-off-by: Finn Carroll <carrofin@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
andrross pushed a commit that referenced this pull request Sep 5, 2024
TransferManager fails BlobFetchRequest on full cache

Signed-off-by: Finn Carroll <carrofin@amazon.com>
(cherry picked from commit 8f34ce5)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Andrew Ross <andrross@amazon.com>
akolarkunnu pushed a commit to akolarkunnu/OpenSearch that referenced this pull request Sep 10, 2024
…#15077)

TransferManager fails BlobFetchRequest on full cache

Signed-off-by: Finn Carroll <carrofin@amazon.com>
@sandervandegeijn
Copy link

Maybe I'm not looking in the right place, but did this make it into the 2.17.0 release? Can't find it in the release notes?

@dblock
Copy link
Member

dblock commented Sep 19, 2024

The backport was merged looks like, #15761, maybe a release notes wasn't updated? Where did you look?

@sandervandegeijn
Copy link

Ah I do see it now: https://github.com/opensearch-project/OpenSearch/releases/tag/2.17.0

Might have made a typo while searching, sorry!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch backport 2.17
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants