[WIP]Speed up parquet reading with Java Vector API #40719

jiangjiguang · 2023-04-10T03:26:11Z

What changes were proposed in this pull request?

Parquet added vectorized decoding based on the Java Vector API in apache/parquet-java#1011.
According to the Parquet microbenchmark, the performance gain is 4x to 8x.
With the Parquet vector optimization integrated into Apache Spark, TPC-H (SF100) Q6 shows an 11% performance improvement.

Why are the changes needed?

This PR is used to support the Parquet vector optimization in Spark.

Does this PR introduce any user-facing change?

Yes. It adds the configuration spark.sql.parquet.vector512.read.enabled. If it is true and the CPU supports the avx512vbmi and avx512_vbmi2 instruction sets, Parquet decodes using the Java Vector API. On Intel CPUs, Ice Lake or newer provides the required instruction sets.
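
The gating rule described above (config flag on, and both CPU flags present) can be sketched as follows. This is a minimal illustration, not the code in this PR: the class `Vector512Gate` and the method `useVector512` are hypothetical names, and it assumes the CPU flags are available as a whitespace-separated string such as the `flags` line of /proc/cpuinfo on Linux. How the PR itself detects the instruction sets is not shown in this conversation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of this PR: decides whether the vectorized
// decode path may be used, given the config value and a CPU flag list.
final class Vector512Gate {
    // Flag names as quoted in the PR description.
    private static final Set<String> REQUIRED =
        new HashSet<>(Arrays.asList("avx512vbmi", "avx512_vbmi2"));

    // cpuFlags: whitespace-separated CPU flags (assumed input format).
    static boolean useVector512(boolean configEnabled, String cpuFlags) {
        if (!configEnabled || cpuFlags == null) {
            return false;
        }
        Set<String> present =
            new HashSet<>(Arrays.asList(cpuFlags.trim().split("\\s+")));
        // Both required flags must be present; one alone is not enough.
        return present.containsAll(REQUIRED);
    }

    public static void main(String[] args) {
        String withBoth = "fma avx2 avx512f avx512vbmi avx512_vbmi2";
        String withoutVbmi = "fma avx2 avx512f";
        System.out.println(useVector512(true, withBoth));
        System.out.println(useVector512(true, withoutVbmi));
        System.out.println(useVector512(false, withBoth));
    }
}
```

With this shape, a machine that lacks either flag falls back to the existing decode path even when the configuration is set to true.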

How was this patch tested?

There are some problems to fix before this can be covered by test cases:

The parquet-mr community needs to release a new Java version before the Parquet vector optimization can be used.
The Parquet vector optimization is not released by default, so users have to build Parquet manually with mvn clean install -P vector-plugins to get parquet-encoding-vector-{VERSION}.jar and put it in the {SPARK_HOME}/jars directory.
GitHub does not support selecting runners with a specific instruction set, so the optimization cannot be verified on GitHub-hosted runners (a self-hosted runner could do it).
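
Because the parquet-encoding-vector jar is optional and has to be placed in {SPARK_HOME}/jars by hand, a caller would need to cope with it being absent. A minimal sketch of such a classpath probe, assuming nothing about the PR's actual implementation: `OptionalClassProbe` and `isClassAvailable` are hypothetical names, and the class name to probe for is left as a parameter because the conversation does not name one.

```java
// Hypothetical helper, not part of this PR: checks whether an optional class
// can be loaded, without running its static initializers.
final class OptionalClassProbe {
    static boolean isClassAvailable(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = OptionalClassProbe.class.getClassLoader();
        }
        try {
            // initialize=false: only locate and load the class, do not initialize it.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Jar not on the classpath, or it cannot be linked on this JVM.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isClassAvailable("java.lang.String"));
        System.out.println(isClassAvailable("example.missing.VectorDecoder"));
    }
}
```

A reader could call such a probe once and fall back to the non-vectorized path when it returns false.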

jiangjiguang · 2023-04-10T03:29:40Z

@LuciferYang @wangyum @frankliee Since parquet-mr has released 1.13.0, I have resubmitted the PR. The original PR is #40646.

jiangjiguang · 2023-04-10T03:33:48Z

@frankliee Sorry for the delay. The PR only supports AVX512; it does not support AVX256.
Regarding your question "Do we need to create SparkContext in static code?": I did that because I want to read the SQL configuration sql.parquet.vector512.read.enabled.

jiangjiguang · 2023-04-10T06:42:19Z

@LuciferYang @wangyum @frankliee I have added a benchmark.

This is the result:

Java HotSpot(TM) 64-Bit Server VM 17.0.5+9-LTS-191 on Linux 5.15.0-60-generic
Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
Selection:                                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Without Java Vector API                            4696           4802          89         21.3          47.0       1.0X
With Java Vector API                               3742           3927         230         26.7          37.4       1.3X
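
The derived columns in the table can be cross-checked from the Best Time column. The sketch below assumes that Relative, Rate and Per Row are all computed from the best time, and that the benchmark processes 100,000,000 rows. The row count is not stated in this conversation; it is inferred from 4696 ms / 47.0 ns per row. `BenchmarkCheck` and its methods are hypothetical names for illustration.

```java
// Hypothetical helper: recomputes the derived benchmark columns from best times.
final class BenchmarkCheck {
    // Speedup relative to the baseline case.
    static double relative(double baselineBestMs, double bestMs) {
        return baselineBestMs / bestMs;
    }

    // Throughput in millions of rows per second.
    static double rateMillionsPerSec(long rows, double bestMs) {
        return rows / (bestMs * 1000.0);
    }

    // Time per row in nanoseconds.
    static double perRowNs(long rows, double bestMs) {
        return bestMs * 1_000_000.0 / rows;
    }

    public static void main(String[] args) {
        long rows = 100_000_000L; // assumed, inferred from the table
        System.out.printf("%.1f %.1f %.1fX%n",
            rateMillionsPerSec(rows, 4696), perRowNs(rows, 4696), relative(4696, 4696));
        System.out.printf("%.1f %.1f %.1fX%n",
            rateMillionsPerSec(rows, 3742), perRowNs(rows, 3742), relative(4696, 3742));
    }
}
```

Under these assumptions the unrounded speedup is 4696 / 3742 ≈ 1.25, which the table reports as 1.3X after rounding to one decimal place.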

.../main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java

github-actions · 2023-07-21T00:21:07Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Speed up parquet reading with Java Vector API

4345afd

github-actions bot added BUILD SQL labels Apr 10, 2023

add benchmark

dff1338

LuciferYang reviewed Apr 11, 2023

.../main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java

github-actions bot added the Stale label Jul 21, 2023

github-actions bot closed this Jul 22, 2023
