[SPARK-48177][BUILD] Upgrade Apache Parquet to 1.14.1
### What changes were proposed in this pull request?

Upgrades Apache Parquet from 1.13.1 to 1.14.1.

### Why are the changes needed?

The new release fixes a number of bugs on the Parquet side; see the changelog: https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1140

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Using the existing unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#46447 from Fokko/fd-bump-parquet.

Authored-by: Fokko Driesprong <fokko@tabular.io>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Fokko authored and dongjoon-hyun committed Jul 2, 2024
1 parent 4ee37ed commit db9e1ac
Showing 9 changed files with 647 additions and 646 deletions.
13 changes: 7 additions & 6 deletions dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -108,6 +108,7 @@ jackson-core/2.17.1//jackson-core-2.17.1.jar
jackson-databind/2.17.1//jackson-databind-2.17.1.jar
jackson-dataformat-cbor/2.17.1//jackson-dataformat-cbor-2.17.1.jar
jackson-dataformat-yaml/2.17.1//jackson-dataformat-yaml-2.17.1.jar
+jackson-datatype-jdk8/2.17.0//jackson-datatype-jdk8-2.17.0.jar
jackson-datatype-jsr310/2.17.1//jackson-datatype-jsr310-2.17.1.jar
jackson-mapper-asl/1.9.13//jackson-mapper-asl-1.9.13.jar
jackson-module-scala_2.13/2.17.1//jackson-module-scala_2.13-2.17.1.jar
@@ -235,12 +236,12 @@ orc-shims/2.0.1//orc-shims-2.0.1.jar
oro/2.0.8//oro-2.0.8.jar
osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.13.1//parquet-column-1.13.1.jar
-parquet-common/1.13.1//parquet-common-1.13.1.jar
-parquet-encoding/1.13.1//parquet-encoding-1.13.1.jar
-parquet-format-structures/1.13.1//parquet-format-structures-1.13.1.jar
-parquet-hadoop/1.13.1//parquet-hadoop-1.13.1.jar
-parquet-jackson/1.13.1//parquet-jackson-1.13.1.jar
+parquet-column/1.14.1//parquet-column-1.14.1.jar
+parquet-common/1.14.1//parquet-common-1.14.1.jar
+parquet-encoding/1.14.1//parquet-encoding-1.14.1.jar
+parquet-format-structures/1.14.1//parquet-format-structures-1.14.1.jar
+parquet-hadoop/1.14.1//parquet-hadoop-1.14.1.jar
+parquet-jackson/1.14.1//parquet-jackson-1.14.1.jar
pickle/1.5//pickle-1.5.jar
py4j/0.10.9.7//py4j-0.10.9.7.jar
remotetea-oncrpc/1.1.2//remotetea-oncrpc-1.1.2.jar
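
Each line in the dependency manifest above follows the pattern `name/version/<field>/jar-file`. As a minimal sketch (not part of Spark's tooling), the Python snippet below splits such lines and collects the distinct versions for a given artifact prefix, which is one way to confirm that all `parquet-*` artifacts moved to the same version. The helper names are hypothetical, and reading the third (here empty) field as a classifier is an assumption.

```python
def parse_manifest_line(line):
    """Split one manifest line into its four slash-separated fields."""
    # Assumption: the third field is an optional classifier; it is empty
    # in every line shown in this diff.
    name, version, classifier, jar = line.strip().split("/")
    return {"name": name, "version": version,
            "classifier": classifier, "jar": jar}


def distinct_versions(lines, prefix):
    """Return the set of versions used by artifacts whose name starts with prefix."""
    return {
        parse_manifest_line(line)["version"]
        for line in lines
        if line.startswith(prefix)
    }


# Sample lines copied from the updated manifest in this commit.
manifest = [
    "jackson-databind/2.17.1//jackson-databind-2.17.1.jar",
    "jackson-datatype-jdk8/2.17.0//jackson-datatype-jdk8-2.17.0.jar",
    "parquet-column/1.14.1//parquet-column-1.14.1.jar",
    "parquet-common/1.14.1//parquet-common-1.14.1.jar",
    "parquet-hadoop/1.14.1//parquet-hadoop-1.14.1.jar",
]

parquet_versions = distinct_versions(manifest, "parquet-")
jackson_versions = distinct_versions(manifest, "jackson-data")
print(parquet_versions, jackson_versions)
```

A single-element set for `parquet-` indicates a consistent upgrade; the two-element set for the Jackson prefix reflects that `jackson-datatype-jdk8` is listed at 2.17.0 while the neighbouring Jackson artifacts are at 2.17.1.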
2 changes: 1 addition & 1 deletion pom.xml
@@ -137,7 +137,7 @@
<kafka.version>3.7.0</kafka.version>
<!-- After 10.17.1.0, the minimum required version is JDK19 -->
<derby.version>10.16.1.1</derby.version>
-<parquet.version>1.13.1</parquet.version>
+<parquet.version>1.14.1</parquet.version>
<orc.version>2.0.1</orc.version>
<orc.classifier>shaded-protobuf</orc.classifier>
<jetty.version>11.0.21</jetty.version>
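
The version bump itself is a one-line change to a Maven property. As a rough sketch for checking which value a given `pom.xml` carries, the Python snippet below reads a named property from the `<properties>` element. The function name `read_property` and the trimmed-down sample POM are illustrative assumptions; a real Spark `pom.xml` is much larger.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by Maven POM files.
POM_NS = "http://maven.apache.org/POM/4.0.0"


def read_property(pom_xml, name):
    """Return the text of <properties>/<name>, or None if it is absent."""
    root = ET.fromstring(pom_xml)
    props = root.find(f"{{{POM_NS}}}properties")
    if props is None:
        return None
    for child in props:
        # Tags are namespace-qualified, e.g. "{ns}parquet.version".
        if child.tag == f"{{{POM_NS}}}{name}":
            return child.text
    return None


# Hypothetical, trimmed-down POM using the property values from the hunk above.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <properties>
    <!-- After 10.17.1.0, the minimum required version is JDK19 -->
    <derby.version>10.16.1.1</derby.version>
    <parquet.version>1.14.1</parquet.version>
    <orc.version>2.0.1</orc.version>
  </properties>
</project>"""

print(read_property(SAMPLE_POM, "parquet.version"))
```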
@@ -2,69 +2,69 @@
Parquet writer benchmark
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
Parquet(PARQUET_1_0) writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1839 1907 96 8.6 116.9 1.0X
-Output Single Double Column 1832 1841 13 8.6 116.5 1.0X
-Output Int and String Column 4356 4494 195 3.6 277.0 0.4X
-Output Partitions 3233 3303 99 4.9 205.5 0.6X
-Output Buckets 4393 4506 160 3.6 279.3 0.4X
+Output Single Int Column 1732 1745 19 9.1 110.1 1.0X
+Output Single Double Column 1754 1758 7 9.0 111.5 1.0X
+Output Int and String Column 4309 4363 76 3.7 273.9 0.4X
+Output Partitions 3252 3350 139 4.8 206.8 0.5X
+Output Buckets 4487 4575 124 3.5 285.3 0.4X

-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
Parquet(PARQUET_2_0) writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 2057 2066 13 7.6 130.8 1.0X
-Output Single Double Column 1805 1813 11 8.7 114.8 1.1X
-Output Int and String Column 4771 4775 6 3.3 303.3 0.4X
-Output Partitions 3337 3339 3 4.7 212.2 0.6X
-Output Buckets 4441 4463 31 3.5 282.3 0.5X
+Output Single Int Column 1938 1978 55 8.1 123.2 1.0X
+Output Single Double Column 1762 1769 10 8.9 112.0 1.1X
+Output Int and String Column 4920 4932 17 3.2 312.8 0.4X
+Output Partitions 3385 3389 7 4.6 215.2 0.6X
+Output Buckets 4528 4538 14 3.5 287.9 0.4X


================================================================================================
ORC writer benchmark
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
ORC writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1144 1168 35 13.8 72.7 1.0X
-Output Single Double Column 1612 1628 22 9.8 102.5 0.7X
-Output Int and String Column 3911 3915 7 4.0 248.6 0.3X
-Output Partitions 2600 2648 67 6.0 165.3 0.4X
-Output Buckets 3449 3477 40 4.6 219.3 0.3X
+Output Single Int Column 1137 1142 7 13.8 72.3 1.0X
+Output Single Double Column 1700 1705 6 9.3 108.1 0.7X
+Output Int and String Column 4028 4096 97 3.9 256.1 0.3X
+Output Partitions 2562 2582 28 6.1 162.9 0.4X
+Output Buckets 3524 3530 9 4.5 224.1 0.3X


================================================================================================
JSON writer benchmark
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
JSON writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1627 1636 13 9.7 103.4 1.0X
-Output Single Double Column 2389 2390 1 6.6 151.9 0.7X
-Output Int and String Column 4283 4299 22 3.7 272.3 0.4X
-Output Partitions 3171 3192 29 5.0 201.6 0.5X
-Output Buckets 4120 4124 6 3.8 261.9 0.4X
+Output Single Int Column 1618 1645 37 9.7 102.9 1.0X
+Output Single Double Column 2398 2399 1 6.6 152.5 0.7X
+Output Int and String Column 3766 3778 17 4.2 239.5 0.4X
+Output Partitions 3162 3164 3 5.0 201.0 0.5X
+Output Buckets 4015 4028 18 3.9 255.3 0.4X


================================================================================================
CSV writer benchmark
================================================================================================

-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
CSV writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 3536 3557 31 4.4 224.8 1.0X
-Output Single Double Column 3863 3894 44 4.1 245.6 0.9X
-Output Int and String Column 6363 6377 19 2.5 404.5 0.6X
-Output Partitions 5128 5148 29 3.1 326.0 0.7X
-Output Buckets 6613 6626 18 2.4 420.5 0.5X
+Output Single Int Column 3985 3993 11 3.9 253.4 1.0X
+Output Single Double Column 4148 4210 88 3.8 263.7 1.0X
+Output Int and String Column 6728 6741 18 2.3 427.8 0.6X
+Output Partitions 5431 5447 23 2.9 345.3 0.7X
+Output Buckets 6927 6942 22 2.3 440.4 0.6X
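
The `Rate(M/s)`, `Per Row(ns)` and `Relative` columns in these tables appear to be derived from `Best Time(ms)`. The Python sketch below reproduces them under one loudly stated assumption: each case writes 15 × 1024 × 1024 rows. That row count is inferred from the printed numbers, not stated anywhere in this commit, and the function names are hypothetical.

```python
# Assumption: rows written per case, inferred from the table values.
ROWS = 15 * 1024 * 1024


def derived_columns(best_ms, baseline_best_ms, rows=ROWS):
    """Recompute (Rate(M/s), Per Row(ns), Relative) from a best time in ms."""
    rate_m_per_s = rows / (best_ms / 1000.0) / 1e6  # million rows per second
    per_row_ns = best_ms * 1e6 / rows               # nanoseconds per row
    relative = baseline_best_ms / best_ms           # vs. the first row of the table
    return round(rate_m_per_s, 1), round(per_row_ns, 1), round(relative, 1)


def pct_change(old_ms, new_ms):
    """Percentage change in best time between two runs (positive means slower)."""
    return (new_ms - old_ms) / old_ms * 100.0


# CSV writer, updated results: "Single Int Column" is the table's baseline row.
single_int = derived_columns(3985, 3985)
buckets = derived_columns(6927, 3985)
print(single_int, buckets)

# Old vs. new best time for the CSV "Single Int Column" row.
print(round(pct_change(3536, 3985), 1))
```

The recomputed tuples match the printed columns for the two rows used here, which supports the assumed row count without proving it. The `pct_change` helper applies to any before/after pair of rows in the tables.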


60 changes: 30 additions & 30 deletions sql/core/benchmarks/BuiltInDataSourceWriteBenchmark-results.txt
@@ -2,69 +2,69 @@
Parquet writer benchmark
================================================================================================

-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
Parquet(PARQUET_1_0) writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1778 1861 116 8.8 113.1 1.0X
-Output Single Double Column 1750 1757 10 9.0 111.2 1.0X
-Output Int and String Column 4290 4408 167 3.7 272.8 0.4X
-Output Partitions 3089 3259 240 5.1 196.4 0.6X
-Output Buckets 4269 4289 29 3.7 271.4 0.4X
+Output Single Int Column 1813 1881 96 8.7 115.3 1.0X
+Output Single Double Column 1976 1977 1 8.0 125.6 0.9X
+Output Int and String Column 4403 4438 50 3.6 279.9 0.4X
+Output Partitions 3388 3421 46 4.6 215.4 0.5X
+Output Buckets 4670 4680 15 3.4 296.9 0.4X

-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
Parquet(PARQUET_2_0) writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1731 1744 19 9.1 110.0 1.0X
-Output Single Double Column 1803 1804 2 8.7 114.6 1.0X
-Output Int and String Column 4665 4672 10 3.4 296.6 0.4X
-Output Partitions 3290 3308 26 4.8 209.2 0.5X
-Output Buckets 4261 4327 93 3.7 270.9 0.4X
+Output Single Int Column 1903 1926 33 8.3 121.0 1.0X
+Output Single Double Column 1998 1998 0 7.9 127.0 1.0X
+Output Int and String Column 4916 4936 29 3.2 312.6 0.4X
+Output Partitions 3366 3375 13 4.7 214.0 0.6X
+Output Buckets 4560 4583 33 3.4 289.9 0.4X


================================================================================================
ORC writer benchmark
================================================================================================

-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
ORC writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1072 1075 4 14.7 68.1 1.0X
-Output Single Double Column 1579 1580 0 10.0 100.4 0.7X
-Output Int and String Column 3815 3875 85 4.1 242.5 0.3X
-Output Partitions 2510 2511 1 6.3 159.6 0.4X
-Output Buckets 3441 3471 43 4.6 218.7 0.3X
+Output Single Int Column 1034 1039 7 15.2 65.8 1.0X
+Output Single Double Column 1687 1691 7 9.3 107.2 0.6X
+Output Int and String Column 3941 3955 20 4.0 250.6 0.3X
+Output Partitions 2553 2674 172 6.2 162.3 0.4X
+Output Buckets 3544 3548 6 4.4 225.3 0.3X


================================================================================================
JSON writer benchmark
================================================================================================

-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
JSON writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1635 1639 5 9.6 104.0 1.0X
-Output Single Double Column 2218 2230 17 7.1 141.0 0.7X
-Output Int and String Column 3948 3997 68 4.0 251.0 0.4X
-Output Partitions 3165 3240 105 5.0 201.2 0.5X
-Output Buckets 4132 4142 15 3.8 262.7 0.4X
+Output Single Int Column 1669 1686 24 9.4 106.1 1.0X
+Output Single Double Column 2342 2369 37 6.7 148.9 0.7X
+Output Int and String Column 3776 3805 42 4.2 240.0 0.4X
+Output Partitions 3060 3064 7 5.1 194.5 0.5X
+Output Buckets 4009 4052 60 3.9 254.9 0.4X


================================================================================================
CSV writer benchmark
================================================================================================

-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
CSV writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 3680 3696 22 4.3 234.0 1.0X
-Output Single Double Column 3554 3559 7 4.4 225.9 1.0X
-Output Int and String Column 6396 6402 9 2.5 406.6 0.6X
-Output Partitions 4937 4942 7 3.2 313.9 0.7X
-Output Buckets 6288 6300 17 2.5 399.8 0.6X
+Output Single Int Column 3877 3889 18 4.1 246.5 1.0X
+Output Single Double Column 4079 4086 10 3.9 259.3 1.0X
+Output Int and String Column 6266 6269 4 2.5 398.4 0.6X
+Output Partitions 5432 5438 8 2.9 345.4 0.7X
+Output Buckets 6528 6530 4 2.4 415.0 0.6X

