Improvements to buffered reading for parquet #5611

malhotrashivam · 2024-06-12T21:39:16Z

Currently, when reading bytes from parquet files, we read in chunks of 8K bytes.
When we need fewer bytes (for example, when reading page headers), this reads more bytes than necessary.
When we need to read data in bulk (for example, the actual data bytes), this can lead to repeated reads in 8K chunks until we have the required number of bytes.

This PR adds a size hint to the stream creation method, so that we can create an internal buffered input stream sized according to how much data the user actually wants to read from the stream.
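To illustrate the idea of sizing the buffer from a hint, here is a minimal sketch using only `java.io`. It is not the code in this PR: the class `SizeHintSketch`, its method names, and the 1 MiB cap are all hypothetical and chosen just for the example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Illustrative only: names and constants here are hypothetical, not the Deephaven API.
final class SizeHintSketch {
    static final int DEFAULT_BUFFER_SIZE = 8192; // the 8K chunk size described above
    static final int MAX_BUFFER_SIZE = 1 << 20;  // assumed cap (1 MiB), for illustration

    // Pick a buffer size from the hint; a non-positive hint means "unknown".
    static int bufferSizeFor(int sizeHint) {
        if (sizeHint <= 0) {
            return DEFAULT_BUFFER_SIZE;
        }
        return Math.min(sizeHint, MAX_BUFFER_SIZE);
    }

    // Wrap the raw stream in a buffer sized from the hint.
    static InputStream buffered(InputStream in, int sizeHint) {
        return new BufferedInputStream(in, bufferSizeFor(sizeHint));
    }

    // Reads one byte through the buffered stream and reports how many bytes
    // were pulled from the underlying source to satisfy that read.
    static int bytesPulledForOneByteRead(int sizeHint) {
        final int total = 100_000;
        ByteArrayInputStream source = new ByteArrayInputStream(new byte[total]);
        try {
            buffered(source, sizeHint).read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total - source.available();
    }
}
```

In this sketch, a single-byte read with a hint of 16 pulls 16 bytes from the source, while the same read with no hint pulls the full 8192-byte default. That is the over-read that a small hint avoids for things like page headers.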

This PR leads to minor parquet read performance improvements.

Util/channel/src/main/java/io/deephaven/util/channel/SeekableChannelsProvider.java

...dfile/src/main/java/io/deephaven/extensions/trackedfile/TrackedSeekableChannelsProvider.java

...mpression/src/main/java/io/deephaven/parquet/compress/DeephavenCompressorAdapterFactory.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnChunkReaderImpl.java

malhotrashivam · 2024-06-14T19:52:53Z

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnChunkReaderImpl.java

+    private Dictionary readDictionary(long dictionaryPageOffset, SeekableChannelContext channelContext) {
+        // Use the context object provided by the caller, or create (and close) a new one
+        try (
+                final ContextHolder holder = SeekableChannelContext.ensureContext(channelsProvider, channelContext);


In the original code, we created the channel and stream in the calling method, and this method just used that stream without touching the underlying channel.
Now we create two streams, one for the header and one for the data, both backed by the same channel.

Note that the channel's position gets updated after reading the header.
So I wanted to limit the channel's lifecycle to this method, so that no other code depends on or uses this channel. That is why I moved the channel creation logic into this method.
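The pattern described here can be sketched with plain `java.nio`, as a rough illustration rather than the actual `readDictionary` implementation. The class `HeaderThenDataSketch`, the fixed header and data lengths, and the file layout are assumptions made up for the example. A real reader would parse the header to learn the data size.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: hypothetical names and layout, not the Deephaven API.
final class HeaderThenDataSketch {

    // Assumed layout: headerLen header bytes at pageOffset, then dataLen data bytes.
    // Both lengths must be positive.
    static byte[] readData(Path file, long pageOffset, int headerLen, int dataLen) {
        // The channel is opened and closed inside this method, so nothing else
        // can observe or depend on its position.
        try (SeekableByteChannel ch = Files.newByteChannel(file, StandardOpenOption.READ)) {
            ch.position(pageOffset);
            // First stream: small buffer sized for the header.
            InputStream headerIn = new BufferedInputStream(Channels.newInputStream(ch), headerLen);
            byte[] header = headerIn.readNBytes(headerLen);
            if (header.length != headerLen) {
                throw new IOException("Truncated header");
            }
            // Reading the header moved the channel's position, so set it
            // explicitly before creating the second stream.
            ch.position(pageOffset + headerLen);
            // Second stream: buffer sized for the data we are about to read.
            InputStream dataIn = new BufferedInputStream(Channels.newInputStream(ch), dataLen);
            // The streams are not closed individually, because closing a stream from
            // Channels.newInputStream also closes the channel; try-with-resources does that.
            return dataIn.readNBytes(dataLen);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a tiny temp file (2 padding bytes, 3 header bytes, 4 data bytes) and reads the data back.
    static byte[] demo() {
        try {
            Path tmp = Files.createTempFile("header-then-data", ".bin");
            try {
                Files.write(tmp, new byte[] {9, 9, 1, 2, 3, 4, 5, 6, 7});
                return readData(tmp, 2, 3, 4);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the channel inside the try-with-resources block is what scopes its lifecycle to the method: the position changes made while reading the header are never visible to callers.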

Util/channel/src/main/java/io/deephaven/util/channel/SeekableChannelsProvider.java

Added optional limits on maximum bytes read from file streams

0be0031

malhotrashivam added the parquet, NoDocumentationNeeded, and ReleaseNotesNeeded labels Jun 12, 2024

malhotrashivam added this to the June 2024 milestone Jun 12, 2024

malhotrashivam self-assigned this Jun 12, 2024

rcaudy reviewed Jun 14, 2024

malhotrashivam added 2 commits June 14, 2024 14:08

Resolving review comments

61a1872

Minor changes to code layout for readability

056d706

malhotrashivam commented Jun 14, 2024

malhotrashivam changed the title ~~Added limits on number of bytes read from parquet file to prevent excess reads~~ Improvements to buffered reading for parquet Jun 14, 2024

rcaudy reviewed Jun 14, 2024

Util/channel/src/main/java/io/deephaven/util/channel/SeekableChannelsProvider.java (two resolved review threads)

Javadoc updates

9086154

rcaudy approved these changes Jun 15, 2024

malhotrashivam merged commit fdd491f into deephaven:main Jun 15, 2024
15 checks passed

github-actions bot locked and limited conversation to collaborators Jun 15, 2024
