We should investigate what the release notes mean for:
PARQUET-1822 - Parquet without Hadoop dependencies
It would be quite nice if we no longer needed some of the Hadoop deps.
I was looking at the linked changes, and the main commit seems to be this one: https://github.com/apache/parquet-mr/pull/1141/files#diff-b044ae9879a94e2b8a49d6e6911ea5498ef162df1373cc049ded6256980a7248
One interesting class they have now added is org.apache.parquet.conf.PlainParquetConfiguration, to replace org.apache.hadoop.conf.Configuration.
Some other interesting classes that I noticed are:
- org.apache.parquet.hadoop.CodecFactory, which can potentially replace the usage of org.apache.hadoop.io.compress.CompressionCodecFactory
- org.apache.parquet.hadoop.CodecFactory.HeapBytesCompressor, which can replace org.apache.hadoop.io.compress.Compressor
- org.apache.parquet.hadoop.CodecFactory.HeapBytesDecompressor, which can replace org.apache.hadoop.io.compress.Decompressor
I think there is enough here for us to get rid of the hadoop-common dependencies. We would still need parquet-hadoop, though, which would then be the only thing using hadoop-common internally.
Do you think that would be beneficial?
I think so, yes.
Sounds good. Is it okay if I make a separate issue for this?
This would require some dedicated effort for the change as well as for the benchmarking.
Created this issue: #5517