
Fix S3InputStream's handling of large skips #24521

Merged: 1 commit into trinodb:master on Dec 19, 2024

Conversation

alexjo2144
Member

alexjo2144 commented Dec 18, 2024

Description

When the skip(n) method is called, the MAX_SKIP_BYTES check is bypassed, so the call can potentially block for a long time.

Instead of delegating to the underlying stream, set the nextReadPosition value. This allows the next read to decide whether it is best to keep the existing S3 object stream or open a new one.

This behavior matches the implementations for Azure and GCS.
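The lazy-seek approach described above can be sketched as follows. This is a hypothetical illustration of the technique (the name nextReadPosition comes from the PR description), not the actual Trino S3InputStream code, and it uses an in-memory byte array where the real class wraps a remote S3 object stream:

```java
import java.io.InputStream;

// Sketch: skip() only advances a logical position; the next read() is
// responsible for reconciling that position with the underlying stream.
class LazySeekInputStream extends InputStream {
    private final byte[] data; // stands in for the remote S3 object
    private long nextReadPosition;

    LazySeekInputStream(byte[] data) {
        this.data = data;
    }

    @Override
    public long skip(long n) {
        // Instead of draining the underlying stream (which can block for a
        // long time on large n), just advance the logical position. The next
        // read decides whether to reuse the current object stream or open a
        // new one at nextReadPosition.
        long skipped = Math.min(n, data.length - nextReadPosition);
        nextReadPosition += skipped;
        return skipped;
    }

    @Override
    public int read() {
        if (nextReadPosition >= data.length) {
            return -1;
        }
        // A real implementation would compare nextReadPosition with the
        // current stream position here and reopen or drain as appropriate.
        return data[(int) nextReadPosition++] & 0xFF;
    }
}
```

With this shape, a skip far past the current position costs nothing until the next read actually happens.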

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@alexjo2144
Member Author

alexjo2144 commented Dec 18, 2024

I was able to reproduce this issue, but not in a way that makes it easy to write a regression test. What I did was upload a 10GB uncompressed JSON file to S3 and set up the TextLineReader to read a small split. Before this change, that reader took much longer to load a split at the end of the file than at the start. After this change, load time is constant regardless of the split's start offset.
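The asymmetry described above follows from the default InputStream.skip, which skips by reading and discarding bytes, so skipping to a late offset transfers (and throws away) everything before it. A small sketch that counts how many bytes the default skip actually pulls from the source (the class and field names here are illustrative, not from the PR):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Counts every byte the default InputStream.skip drains from the source,
// illustrating why a large skip over a remote stream is O(offset) work.
class CountingStream extends InputStream {
    long bytesServed;
    private final long length;
    private long pos;

    CountingStream(long length) {
        this.length = length;
    }

    @Override
    public int read() {
        if (pos >= length) {
            return -1;
        }
        pos++;
        bytesServed++; // every skipped byte still passes through here
        return 0;
    }

    @Override
    public long skip(long n) {
        try {
            // Deliberately use the inherited read-and-discard implementation.
            return super.skip(n);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Against a network-backed stream, those discarded bytes are real transfer time, which matches the slow splits observed at the end of the 10GB file.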

alexjo2144 force-pushed the ajo/s3-input-stream-skip branch from 11fba15 to 5fe42db on December 18, 2024 22:02
@wendigo
Contributor

wendigo commented Dec 19, 2024

I'm testing this with secrets now

@findinpath
Contributor

resulting in the call potentially blocking for a long time.

What is the further consequence of this?
Please enhance the original description of the issue.

@wendigo
Contributor

wendigo commented Dec 19, 2024

@findinpath I think the existing description is exhaustive enough. For the S3 filesystem, any delayed request makes planning/execution take longer than necessary.

@wendigo
Contributor

wendigo commented Dec 19, 2024

@alexjo2144 thanks, merging

wendigo merged commit 6c52253 into trinodb:master on Dec 19, 2024
128 checks passed
github-actions bot added this to the 469 milestone on Dec 19, 2024