
PARQUET-2196: Support LZ4_RAW codec #1000

Merged: 10 commits into apache:master on Nov 2, 2022

Conversation

@wgtmac (Member) commented Sep 27, 2022

This PR implements the LZ4_RAW codec, which was introduced in Parquet format v2.9.0. Since there is a lot of logic common to the LZ4_RAW and SNAPPY codecs, this patch moves it into NonBlockedCompressor and NonBlockedDecompressor and makes each specific codec extend them.

A TestLz4RawCodec test has been added to make sure the new codec itself is correct.
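
To make the refactoring concrete, below is a minimal sketch of what a codec built on the shared base class could look like. The hook names (maxCompressedLength, compress) and the use of aircompressor's pure-Java Lz4Compressor are assumptions for illustration and may not match the merged code exactly.

  // Sketch only: hook signatures and the aircompressor dependency are assumed here.
  package org.apache.parquet.hadoop.codec;

  import java.io.IOException;
  import java.nio.ByteBuffer;
  import io.airlift.compress.lz4.Lz4Compressor;

  public class Lz4RawCompressorSketch extends NonBlockedCompressor {

    private final Lz4Compressor compressor = new Lz4Compressor();

    @Override
    protected int maxCompressedLength(int byteSize) {
      // Worst-case output size for a raw LZ4 block of the given input size.
      return compressor.maxCompressedLength(byteSize);
    }

    @Override
    protected int compress(ByteBuffer compressed, ByteBuffer uncompressed) throws IOException {
      // Compress the input buffer into the output buffer and report bytes written;
      // all buffering and stream plumbing stays in NonBlockedCompressor.
      int start = compressed.position();
      compressor.compress(uncompressed, compressed);
      return compressed.position() - start;
    }
  }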

@wgtmac (Member, Author) commented Sep 27, 2022

@pitrou @shangxinli Can you please take a look? Thanks in advance!

@pitrou (Member) commented Sep 27, 2022

cc @lidavidm @emkornfield : could you give this a look?

@pitrou (Member) left a comment

@wgtmac Did you try to read an actual file produced by Parquet C++?

Note you can find such files in https://github.com/apache/parquet-testing/

@wgtmac (Member, Author) commented Sep 27, 2022

> @wgtmac Did you try to read an actual file produced by Parquet C++?
>
> Note you can find such files in https://github.com/apache/parquet-testing/

Yes, I have tried that. I will add some parquet files for compatibility tests as well.

@shangxinli (Contributor) commented Sep 27, 2022

Thanks Gang for contributing! Are there any benchmarking numbers? Any comparison with ZSTD? These are non-blocking questions for review and merging.

@shangxinli (Contributor) commented

Nice implementation! For the tests, can you add more coverage for interop with lz4?

@wgtmac (Member, Author) commented Sep 28, 2022

@shangxinli Thanks for the review.

I haven't been able to run the benchmark, since the Hadoop Lz4Codec is implemented via JNI and I cannot simply run it on my Mac M1 laptop.

Generally this is the same LZ4 algorithm as the one shipped by Hadoop. The key difference is that LZ4_RAW removes the redundant block headers introduced by the Hadoop Lz4Codec, and its implementation is pure Java.

I will add the interop test in a follow-up commit.
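
To illustrate the framing difference: Hadoop's Lz4Codec writes each compressed chunk behind two 4-byte big-endian length prefixes (uncompressed size and compressed size), while LZ4_RAW stores the bare LZ4 block as the Parquet format specifies. A simplified single-chunk sketch (not parquet-mr code) follows.

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;

  public class Lz4FramingDemo {
    // Hadoop-style framing: prepend the two length headers to a raw LZ4 block.
    static byte[] hadoopFrame(byte[] lz4Block, int uncompressedLength) throws IOException {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(uncompressedLength); // header #1: original (uncompressed) size
      out.writeInt(lz4Block.length);    // header #2: compressed chunk size
      out.write(lz4Block);              // the raw LZ4 block itself
      return bytes.toByteArray();
    }

    // LZ4_RAW: the raw LZ4 block is stored as-is, with no extra framing.
    static byte[] lz4Raw(byte[] lz4Block) {
      return lz4Block;
    }
  }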

@wgtmac (Member, Author) commented Sep 29, 2022

The interop test has been added. Please take a look again. Thanks! @shangxinli @pitrou

@pitrou (Member) commented Oct 18, 2022

It would probably be easier to use a git submodule (I wonder why that's not the approach used in parquet-mr), but in any case reference files should go into the parquet-testing repo, not be privately maintained in parquet-mr.

@ggershinsky (Contributor) commented
Yep, we test encryption interop using binary files in the parquet-testing repo. @wgtmac Please have a look at this code: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestEncryptionOptions.java#L130

@wgtmac replied: @ggershinsky @emkornfield Yes, I did that in the previous commit (6e02cb2#diff-61de8043597197e5da4b42ebfe2d123b9c42e56fcb520e293c1a10ba55fa93cc) but switched to local files to avoid an unstable test.

@ggershinsky replied: What was the source of the instability? (Encryption interop has been running with files from the parquet-testing module for ~3 years now, with no problems as far as I remember.)

@wgtmac wgtmac requested review from pitrou and removed request for shangxinli October 22, 2022 14:40
@wgtmac (Member, Author) commented Oct 24, 2022

I have changed the approach of the interop test to follow the encryption tests: it downloads test files from the parquet-testing repo and verifies the data decompressed and decoded from them.

In addition, the codec test has been made generic so that it supports both LZ4_RAW and SNAPPY. It should be easy to support more codec types.

Please take a look when you have time and let me know if there is any feedback, thanks! @pitrou @emkornfield @ggershinsky @shangxinli
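
For reference, a rough sketch of such an interop check follows. The file name is an assumption (the parquet-testing repo contains LZ4_RAW files such as lz4_raw_compressed.parquet, written by Parquet C++), and the real test's download logic and assertions may differ.

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.example.data.Group;
  import org.apache.parquet.hadoop.ParquetReader;
  import org.apache.parquet.hadoop.example.GroupReadSupport;

  public class Lz4RawInteropCheck {
    public static void main(String[] args) throws IOException {
      // Assumed local copy of a reference file from the parquet-testing repo.
      Path file = new Path("/tmp/lz4_raw_compressed.parquet");
      try (ParquetReader<Group> reader =
          ParquetReader.builder(new GroupReadSupport(), file).build()) {
        long count = 0;
        // Reading every record forces every LZ4_RAW page to be decompressed.
        while (reader.read() != null) {
          count++;
        }
        System.out.println("Read " + count + " records");
      }
    }
  }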

@wgtmac wgtmac requested review from shangxinli and pitrou and removed request for pitrou and shangxinli October 24, 2022 04:08
Comment on lines 32 to 34:

     LZ4("org.apache.hadoop.io.compress.Lz4Codec", CompressionCodec.LZ4, ".lz4"),
  -  ZSTD("org.apache.parquet.hadoop.codec.ZstandardCodec", CompressionCodec.ZSTD, ".zstd");
  +  ZSTD("org.apache.parquet.hadoop.codec.ZstandardCodec", CompressionCodec.ZSTD, ".zstd"),
  +  LZ4_RAW("org.apache.parquet.hadoop.codec.Lz4RawCodec", CompressionCodec.LZ4_RAW, ".lz4raw");
@pitrou (Member) commented on the diff

I think both file extensions for LZ4 and LZ4RAW should be removed.
The reason is that the lz4 command-line utilities expect the LZ4 frame format.
In any case, the .lz4 extension should certainly not be used for the "Hadoop LZ4" compression codec.

@wgtmac (Member, Author) replied

I searched the parquet-mr repo and found that the only usage of the codec extension is as below:

https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L420

  public RecordWriter<Void, T> getRecordWriter(TaskAttemptContext taskAttemptContext, Mode mode)
      throws IOException, InterruptedException {

    final Configuration conf = getConfiguration(taskAttemptContext);

    CompressionCodecName codec = getCodec(taskAttemptContext);
    // The codec's extension is baked into the output file name here.
    String extension = codec.getExtension() + ".parquet";
    Path file = getDefaultWorkFile(taskAttemptContext, extension);
    return getRecordWriter(conf, file, codec, mode);
  }

The extension is part of the enum definition and cannot simply be removed. I propose renaming .lz4 to .lz4hadoop to explicitly differentiate this LZ4 variant from standard LZ4. @pitrou
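
If the rename were applied, the enum entries would look roughly like this (a sketch of the proposal, not the merged code):

  LZ4("org.apache.hadoop.io.compress.Lz4Codec", CompressionCodec.LZ4, ".lz4hadoop"),
  ZSTD("org.apache.parquet.hadoop.codec.ZstandardCodec", CompressionCodec.ZSTD, ".zstd"),
  LZ4_RAW("org.apache.parquet.hadoop.codec.Lz4RawCodec", CompressionCodec.LZ4_RAW, ".lz4raw");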

@pitrou (Member) replied

Sounds good to me.

@pitrou (Member) commented Oct 25, 2022

@lwhite1 Could you perhaps review this? I can't really comment on the details of the Java code, myself.

@wgtmac wgtmac requested a review from shangxinli October 26, 2022 03:12
@shangxinli (Contributor) commented
LGTM; let's see if there are comments from others, otherwise we can merge.

@lwhite1 left a comment

Looks good to me.

@shangxinli shangxinli merged commit dd55330 into apache:master Nov 2, 2022
vkorukanti added a commit to vkorukanti/delta that referenced this pull request Oct 6, 2023
The following are the changes needed:
* PySpark 3.5 has deprecated support for Python 3.7. This required changes to the Delta test infra to install the appropriate Python version and other packages. The `Dockerfile` used for running tests is also updated to have the required Python version and packages, and uses the same base image as the PySpark test infra in Apache Spark.
* `StructType.toAttributes` and `StructType.fromAttributes` methods are moved into a utility class `DataTypeUtils`.
* The `iceberg` module is disabled, as there is no released version of `iceberg` that works with Spark 3.5 yet.
* Remove the URI path hack used in `DMLWithDeletionVectorsHelper` to get around a bug in Spark 3.4.
* Remove the unrelated tutorial in `delta/examples/tutorials/saiseu19`.
* Test failure fixes:
   * `org.apache.spark.sql.delta.DeltaHistoryManagerSuite` - The error message has changed.
   * `org.apache.spark.sql.delta.DeltaOptionSuite` - The Parquet file name used with the LZ4 codec has changed due to apache/parquet-java#1000 in the `parquet-mr` dependency.
   * `org.apache.spark.sql.delta.deletionvectors.DeletionVectorsSuite` - Parquet now generates a `row-index` whenever the `_metadata` column is selected; however, Spark 3.5 has a bug where a row group containing more than 2bn rows fails. For now, don't return any `row-index` column in `_metadata` by overriding `metadataSchemaFields: Seq[StructField]` in `DeltaParquetFileFormat`.
   * `org.apache.spark.sql.delta.perf.OptimizeMetadataOnlyDeltaQuerySuite` - A behavior change by apache/spark#40922: in Spark plans, a new function called `ToPrettyString` is used instead of `cast(aggExpr To STRING)` when `Dataset.show()` is used.
   * `org.apache.spark.sql.delta.DeltaCDCStreamDeletionVectorSuite` and `org.apache.spark.sql.delta.DeltaCDCStreamSuite` - Regression in the Spark 3.5 RC, fixed by apache/spark#42774 before the Spark 3.5 release.

Closes delta-io#1986

GitOrigin-RevId: b0e4a81b608a857e45ecba71b070309347616a30
scottsand-db pushed a commit to delta-io/delta that referenced this pull request Oct 6, 2023 (same commit message as above)
xupefei pushed a commit to xupefei/delta that referenced this pull request Oct 31, 2023 (same commit message as above)