Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a soft reference based shared cache for S3 reads #5357

Merged
merged 29 commits into from
May 14, 2024

Conversation

malhotrashivam
Copy link
Contributor

@malhotrashivam malhotrashivam commented Apr 12, 2024

This PR adds a soft-reference based cache for recently fetched fragments from S3 for faster lookup.
This cache would be especially useful for smaller files which have fewer fragments and can fit in the cache.

Documentation:

  • Removed the config parameter S3.maxFragmentSize since we won't use a buffer pool internally anymore.
  • Updated defaults for following S3 specific instructions: maxConcurrentRequests, readAheadCount, fragmentSize, maxCacheSize.

@malhotrashivam malhotrashivam added parquet Related to the Parquet integration NoDocumentationNeeded NoReleaseNotesNeeded No release notes are needed. s3 labels Apr 12, 2024
@malhotrashivam malhotrashivam added this to the 2. April 2024 milestone Apr 12, 2024
@malhotrashivam malhotrashivam self-assigned this Apr 12, 2024
Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard for me to evaluate this in a reasonable amount of time. I worry we're adding complexity and it's not obvious to me that things are correct. I'm posting this now, but I estimate I would need at least another half a day to go through it all.

* {@link #cleanup() cleanup} once the reference count reaches zero.
*/
// TODO Move the release method lower in file after the fill method. Kept it here for ease of review.
void release() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to figure out what the advantage of acquire/release is as opposed to relying on java reachability to Request. Should we instead have a SoftReference<Request> structure somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main challenge we were facing was to control when to cancel the request. This uses CleanupReference for that and the request holds a soft reference to the corresponding ByteBuffer. Once the ByteBuffer goes out of scope, the request can be cleared. This new acquire/release methodology was so that each context can control the lifecycle of the buffer and thus the corresponding request.

We have slightly changed the design now so that each context will hold {Request, ownershipToken}, where internally the ownershipToken will be the Buffer only.
So now the reachability of the buffer will directly control the lifecycle of the Request. I might not be able to explain the new design in words but the code is much cleaner and should be easier to follow.

@malhotrashivam malhotrashivam removed the NoReleaseNotesNeeded No release notes are needed. label May 8, 2024
Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

if (existingRequest == null) {
// Ideally, we could have used ".replace" in this case as well, but KeyedObjectHashMap.replace currently
// has a bug when the key is not present in the map.
added = requests.putIfAbsent(key, newAcquiredRequest.request) == null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this fails, we can avoid get.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You found this more complicated. I still prefer to avoid an extra lookup. Maybe moot anyway, since we can't seem to trust replace right now.

        Request existingRequest = requests.get(key);
        while (true) {
            if (existingRequest != null) {
                final Request.AcquiredRequest acquired = existingRequest.tryAcquire();
                if (acquired != null) {
                    return acquired;
                }
            }
            if (newAcquiredRequest == null) {
                newAcquiredRequest = Request.createAndAcquire(fragmentIndex, context);
            }
            final boolean added;
            if (existingRequest == null) {
                added = (existingRequest = requests.putIfAbsent(key, newAcquiredRequest.request)) == null;
            } else {
                if (!(added = requests.replace(key, existingRequest, newAcquiredRequest.request))) {
                    existingRequest = requests.get(key);
                }
            }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted back to remove + putIfAbsent pattern and added a TODO for this with issue #5486.

chipkent
chipkent previously approved these changes May 14, 2024
Copy link
Member

@chipkent chipkent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python LGTM

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can approve like this, but I had one suggestion/question.

@malhotrashivam malhotrashivam merged commit a5ca66f into deephaven:main May 14, 2024
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators May 14, 2024
@malhotrashivam malhotrashivam changed the title Added a thread-safe shared cache for S3 reads Added a soft reference based shared cache for S3 reads May 14, 2024
@deephaven-internal
Copy link
Contributor

Labels indicate documentation is required. Issues for documentation have been opened:

Community: deephaven/deephaven-docs-community#211

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
DocumentationNeeded NoReleaseNotesNeeded No release notes are needed. parquet Related to the Parquet integration s3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants