[WIP] Speed up Parquet reading with Java Vector API #40719
Closed
What changes were proposed in this pull request?
Parquet added support for vectorized reads in apache/parquet-java#1011.
According to the Parquet microbenchmark, the performance gain is 4x to 8x.
TPC-H (SF100) Q6 shows an 11% performance improvement when Apache Spark integrates the Parquet vector optimization.
Why are the changes needed?
This PR enables Spark to use the Parquet vector optimization.
Does this PR introduce any user-facing change?
Adds the configuration spark.sql.parquet.vector512.read.enabled. If it is true and the CPU supports the avx512vbmi and avx512_vbmi2 instruction sets, Parquet decoding uses the Java Vector API. On Intel CPUs, Ice Lake or newer provides the required instruction sets.
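To make the gating condition above concrete, here is a minimal sketch of the "config is true AND both CPU flags are present" check. It is not the code from this PR or from parquet-java: the class `Vector512Check` and its methods are hypothetical names, and it assumes the flags are supplied as a whitespace-separated string such as the `flags` line of /proc/cpuinfo on Linux.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Spark or parquet-java.
public class Vector512Check {
    // Flag names taken from the PR description (avx512vbmi, avx512_vbmi2).
    static final Set<String> REQUIRED_FLAGS =
        new HashSet<>(Arrays.asList("avx512vbmi", "avx512_vbmi2"));

    // Returns true only if every required flag appears in the
    // whitespace-separated flags string.
    static boolean hasRequiredFlags(String cpuFlags) {
        Set<String> present =
            new HashSet<>(Arrays.asList(cpuFlags.trim().split("\\s+")));
        return present.containsAll(REQUIRED_FLAGS);
    }

    // Vectorized decoding is chosen only when the configuration is enabled
    // AND the CPU advertises both instruction sets; otherwise fall back.
    static boolean useVectorDecoding(boolean configEnabled, String cpuFlags) {
        return configEnabled && hasRequiredFlags(cpuFlags);
    }
}
```

With this shape, a CPU that reports only one of the two flags, or a session with the configuration set to false, falls back to the non-vectorized decoding path.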
How was this patch tested?
For the test case, there are some problems to fix: