[WIP]Speed up parquet reading with Java Vector API #40719

Closed
wants to merge 2 commits into from


jiangjiguang

What changes were proposed in this pull request?

Parquet added vectorized read support in apache/parquet-java#1011.
According to the Parquet microbenchmark, the performance gain is 4x to 8x.
With the Parquet vector optimization integrated into Apache Spark, TPC-H (SF100) Q6 shows an 11% performance increase.

Why are the changes needed?

This PR adds support for the Parquet vector optimization in Spark.

Does this PR introduce any user-facing change?

Adds the configuration spark.sql.parquet.vector512.read.enabled. If it is true and the CPU supports the avx512vbmi and avx512_vbmi2 instruction sets, Parquet decodes using the Java Vector API. On Intel CPUs, Ice Lake and newer include the required instruction sets.
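To check whether a Linux machine exposes the two CPU flags named above, a minimal sketch along these lines may help. It is an illustration only, not the detection logic used by Parquet or by this PR: the class name Vector512Check and the helper hasRequiredFlags are hypothetical, and the approach assumes the flags are reported on the "flags" line of /proc/cpuinfo.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Spark or Parquet.
public class Vector512Check {
    // Flag names exactly as listed in the PR description.
    static final List<String> REQUIRED_FLAGS = List.of("avx512vbmi", "avx512_vbmi2");

    /** Returns true if the first "flags" line of the cpuinfo text lists every required flag. */
    static boolean hasRequiredFlags(String cpuInfo) {
        for (String line : cpuInfo.split("\n")) {
            if (!line.startsWith("flags")) {
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            // Flags are whitespace-separated after the colon.
            String[] tokens = line.substring(colon + 1).trim().split("\\s+");
            Set<String> flags = new HashSet<>(Arrays.asList(tokens));
            return flags.containsAll(REQUIRED_FLAGS);
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: Linux, where /proc/cpuinfo exists; elsewhere this reports false.
        Path cpuInfo = Path.of("/proc/cpuinfo");
        boolean supported = Files.exists(cpuInfo) && hasRequiredFlags(Files.readString(cpuInfo));
        System.out.println("avx512vbmi and avx512_vbmi2 available: " + supported);
    }
}
```

If this reports false, enabling the configuration would presumably have no effect on that machine, since the description states that both conditions must hold.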

How was this patch tested?

For the test case, there are some problems to fix:

  1. The Parquet-mr community needs to release a new Java version before the Parquet vector optimization can be used.
  2. The Parquet vector optimization is not released by default, so users have to build Parquet manually with mvn clean install -P vector-plugins to get parquet-encoding-vector-{VERSION}.jar and put it on the {SPARK_HOME}/jars path.
  3. GitHub does not support selecting runners with a specific instruction set, so the optimization cannot be verified on GitHub-hosted runners (a self-hosted runner could do it).

@jiangjiguang
Author

jiangjiguang commented Apr 10, 2023

@LuciferYang @wangyum @frankliee parquet-mr has released 1.13.0, so I have resubmitted the PR. The original PR is #40646.

@jiangjiguang
Author

@frankliee Sorry for the delay. This PR supports only AVX512; it does not support AVX256.
Regarding your question "Do we need to create SparkContext in static code?": I create it there because I need to read the SQL configuration sql.parquet.vector512.read.enabled.

@jiangjiguang
Author

@LuciferYang @wangyum @frankliee I have added a benchmark.

This is the result:

Java HotSpot(TM) 64-Bit Server VM 17.0.5+9-LTS-191 on Linux 5.15.0-60-generic
Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
Selection:                                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Without Java Vector API                            4696           4802          89         21.3          47.0       1.0X
With Java Vector API                               3742           3927         230         26.7          37.4       1.3X
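
As a reading aid for the table above, the derived columns can be recomputed from the raw numbers. The sketch below assumes that Relative is the ratio of the baseline best time to each row's best time and that Rate(M/s) is the reciprocal of Per Row(ns) scaled to millions of rows per second. That interpretation matches the printed values but is not stated in the PR, and the class BenchmarkMath is hypothetical.

```java
// Hypothetical helper for re-deriving the benchmark columns; not part of Spark.
public class BenchmarkMath {
    /** Speedup relative to the baseline, assuming it is computed from best times. */
    static double relative(double baselineBestMs, double candidateBestMs) {
        return baselineBestMs / candidateBestMs;
    }

    /** Millions of rows per second from nanoseconds per row (1000 ns per microsecond). */
    static double rateMillionsPerSec(double perRowNs) {
        return 1000.0 / perRowNs;
    }

    public static void main(String[] args) {
        // Numbers copied from the benchmark table in this comment.
        System.out.printf("Relative speedup: %.2fX%n", relative(4696, 3742));
        System.out.printf("Rate without Vector API: %.1f M/s%n", rateMillionsPerSec(47.0));
        System.out.printf("Rate with Vector API: %.1f M/s%n", rateMillionsPerSec(37.4));
    }
}
```

Under these assumptions the speedup is roughly 1.25x, which the table rounds to 1.3X. The standard deviation for the vectorized run (230 ms) is also noticeably higher than for the baseline (89 ms).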

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 21, 2023
@github-actions github-actions bot closed this Jul 22, 2023