Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to buffered reading for parquet #5611

Merged
merged 4 commits into from
Jun 15, 2024

Conversation

malhotrashivam
Copy link
Contributor

@malhotrashivam malhotrashivam commented Jun 12, 2024

Currently, while reading bytes from parquet files, we read in chunks of 8K bytes.
So for cases, where we need fewer bytes (like reading page headers), this leads to extra bytes read.
And for cases where we need to read bulk of data (like reading actual data bytes), this can lead to repeated reads in 8k chunks, till we get the required number of bytes.

This PR adds a size hint to the stream creation method, so that we can accurately created an internal buffered input stream based on how much data the user actually want to read from the stream.

This PR leads to minor parquet read performance improvements.

@malhotrashivam malhotrashivam added parquet Related to the Parquet integration NoDocumentationNeeded ReleaseNotesNeeded Release notes are needed labels Jun 12, 2024
@malhotrashivam malhotrashivam added this to the June 2024 milestone Jun 12, 2024
@malhotrashivam malhotrashivam self-assigned this Jun 12, 2024
private Dictionary readDictionary(long dictionaryPageOffset, SeekableChannelContext channelContext) {
// Use the context object provided by the caller, or create (and close) a new one
try (
final ContextHolder holder = SeekableChannelContext.ensureContext(channelsProvider, channelContext);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original code, we used to make the channel and stream in the calling method and this method would just use the same stream and not touch the underlying channel.
Now we make two streams, one for header and one for data. And we use the same channel.

Note that the channel's position gets updated after reading the header.
So I wanted to make the channel's lifecycle limited to this method so that no one else should depend on or use this channel. That is why I moved the logic for making the channel inside this method.

@malhotrashivam malhotrashivam changed the title Added limits on number of bytes read from parquet file to prevent excess reads Improvements to buffered reading for parquet Jun 14, 2024
@malhotrashivam malhotrashivam merged commit fdd491f into deephaven:main Jun 15, 2024
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jun 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NoDocumentationNeeded parquet Related to the Parquet integration ReleaseNotesNeeded Release notes are needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants