-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI #1011
Conversation
a42553d
to
54237fa
Compare
@wangyum @gszadovszky This PR always fails to build, I do not know why. Is the reason of failure "1 workflow awaiting approval" ? please help me, this is my first PR to parquet-mr community, thanks! |
54237fa
to
78259fd
Compare
@wangyum Could you review the PR ? thanks |
8139fc8
to
72d868c
Compare
@jiangjiguang, Apache introduced a process to require approvals for test executions to keep resources under control. I've approved it so the tests are running. Not sure though when I will have time to properly review it. |
72d868c
to
6a8cff6
Compare
@gszadovszky I resubmitted the code, can you approve the workflow again ? |
@gszadovszky Could you review the PR ? thanks |
@jiangjiguang, however I agree on upgrading to newer java versions (and to have performance benefits like this one) it is not always an easy thing to do. For example our main historical user Hadoop is still on java 1.8 which means they would not be able to upgrade parquet after releasing this one. I think it worth a discussion on the dev list and maybe even a formal vote on it. |
@gszadovszky I agree on it. As far as I know some products(such as hadoop presto flink ) are working towards upgrading to java17. Spark already supports java17, trino requires a minimum java version is java17.0.3, So I think parquet should support java17 as soon as possible and be compatible with java8, because of java17 vector can bring 4x ~ 8x performance gain for parquet encode/decode. The PR uses "maven profile -P vector" and "code gen" to support java17 and is compatible with java8 |
What do you mean by this change is compatible with java8? The vectorized encoding/decoding requires java17 runtime, right? If yes, we will need to have a way release artifacts for java17 and keep java8 compatibility with others. We need to have an agreement (at community level) how we want to achieve this. By running through the change related to the compile it seems that if you have a jdk17 in the compile environment it will be compiled for a java17 runtime automatically. Since we do not have a specific environment for making a release (shame on us) it highly depends on the environment of the dev how creates the release. So, we shall either add a |
|
With you current implementation if parquet-mr is compiled with jdk17 than it will not be compatible with java8 while any jdk under 17 will generate a code compatible with java8 since About starting the discussion about java17 compatibility. I am currently not really active in the community and not even following the dev list. Maybe, @shangxinli or @ggershinsky would like to drive this topic? |
Thank @gszadovszky a lot for helping with this PR! +1 for what @gszadovszky said. The mainstream runtime JDK is still 1.8. Parquet is one of the underlying building blocks for many big data applications. The bare minimum, for now, is to keep java8 compatible. Otherwise forcing applications to upgrade to jdk17 because of Parquet is disruptive and impacts adoptions. @jiangjiguang, I am very happy to see you have this PR to help the Parquet community. Would you mind starting an email discussion to dev@parquet.apache.org for this topic? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This work looks promising! It would be great if you can add some micro-benchmark to parquet-benchmarks.
parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ParquetReadRouter.java
Outdated
Show resolved
Hide resolved
parquet-encoding/src/main/java/org/apache/parquet/column/values/bitpacking/BytePacker.java
Outdated
Show resolved
Hide resolved
|
||
/** | ||
* This class generates vector bit packers that pack the most significant bit first. | ||
* The result of the generation is checked in. To regenerate the code run this class and check in the result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is used to generate ByteBitPackingVectorLE class, which unpackValues/packValues with java17 vector api.
If use java17 vector api directly, errors will occur when compiling or developing with java8(<java17).
parquet-generator/src/main/java/org/apache/parquet/encoding/GeneratorVector.java
Outdated
Show resolved
Hide resolved
parquet-generator/src/main/java/org/apache/parquet/encoding/GeneratorVector.java
Outdated
Show resolved
Hide resolved
Co-authored-by: Fang-Xie <fang.xie@intel.com> Co-authored-by: guangzegu <guangze.gu@intel.com> Co-authored-by: jiyu1021 <yu.ji@intel.com> Co-authored-by: jatin-bhateja <jatin.bhateja@intel.com>
6a8cff6
to
2d8cd72
Compare
@wgtmac I have add the micro-benchmark to parquet-benchmarks, this is the result: |
parquet-encoding/src/main/java/org/apache/parquet/column/values/bitpacking/BytePacker.java
Outdated
Show resolved
Hide resolved
...t-benchmarks/src/main/java/org/apache/parquet/benchmarks/ByteBitPackingVectorBenchmarks.java
Outdated
Show resolved
Hide resolved
@wgtmac I have resubmitted the PR, can you review it again ? |
@gszadovszky @shangxinli @wgtmac I have started the discussion about how to upgrade java17 over a month, but nobody involved! So I have updated the PR, it does not involve how to upgrade java17. |
Sorry for the bad experience, @jiangjiguang. Unfortunately, I don't have too much time for the parquet community. |
@jiangjiguang, please check test failures. |
32606a9
to
c0732b4
Compare
@gszadovszky Thank you for your attention about the PR, I have fixed it and the test successes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. I have left some comments and the implementation is overall looking good. Thanks @jiangjiguang for your effort!
My main concern is the extensibility to support other instruction sets. In addition, it seems to me that the java vector api is still incubating. As I am not a java expert, do we have the risk of unstable API?
parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ParquetReadRouter.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ParquetReadRouter.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ParquetReadRouter.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ParquetReadRouter.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ParquetReadRouter.java
Outdated
Show resolved
Hide resolved
parquet-encoding/src/main/java/org/apache/parquet/column/values/bitpacking/BytePacker.java
Outdated
Show resolved
Hide resolved
parquet-encoding/src/main/java/org/apache/parquet/column/values/bitpacking/Packer.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGeneratorVector.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGeneratorVector.java
Outdated
Show resolved
Hide resolved
…bitpacking/ParquetReadRouter.java Co-authored-by: Gang Wu <ustcwg@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. A minor comment mainly on naming.
pom.xml
Outdated
@@ -659,5 +662,13 @@ | |||
</plugins> | |||
</build> | |||
</profile> | |||
|
|||
<profile> | |||
<id>plugins</id> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plugins
is a little bit generic to me. Rename it to encoding-plugin
or vector-plugins
? Any suggestion? @gszadovszky
In addition, please rename plugins
folder to parquet-plugins
to follow naming of other sub-directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac @gszadovszky vector-plugins +1, and it can show the feature of this PR.
I have renamed plugins to parquet-plugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me. I've added a note for the README, though.
One more thing. We need to add testing of this profile automatically with github actions. (The current test results for this PR have no meaning since these tests are not executed.)
@gszadovszky @wgtmac This feature need avx512vbmi and avx512_vbmi2 instruction set, so it needs github action runners with intel ice lake. I do not know how to select runners with Intel Ice Lake ? So I have submitted the help (actions/runner#2467). |
@gszadovszky @wgtmac I have added a new workflow named Vector-plugins, can you run it ? |
It seems that this PR is closed. Could you please reopen it and see if it can run automatically? |
@wgtmac sorry, may be my wrong click, I have reopened it |
NP. Seems it is running. https://github.com/apache/parquet-mr/actions/runs/4300429406/jobs/7496638049 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @jiangjiguang !
I'd request sign off from @gszadovszky @shangxinli |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some comments about testing. @jiangjiguang, do you have any info about selecting the proper CPU environment for these tests?
...src/test/java/org/apache/parquet/column/values/bitpacking/TestByteBitPacking512VectorLE.java
Outdated
Show resolved
Hide resolved
...-vector/src/test/java/org/apache/parquet/column/values/bitpacking/TestParquetReadRouter.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me.
Thank you for working on it, @jiangjiguang!
Hi @gszadovszky , @wgtmac Thanks for your reviews and guidance during this long review process. May we request to you please merge this PR by adding following people as co-authors ( @jiangjiguang @jiyu1021 @guangzegu @Fang-Xie @jatin-bhateja) alternatively make @jiangjiguang a committer so that he can merge on our behalf. |
@jatin-bhateja I have added them as co-authors on the first commit. |
I have merged it. Thanks @jiangjiguang @jatin-bhateja for the contribution and @gszadovszky for the review! |
@jiangjiguang Sorry to bother you, I am not sure if the use of the AVX 512 will still make other CPU cores to downshift frequency. If so, is there a way to manually turn off this feature now? Otherwise, the new version may be difficult to promote and use in the enterprise. |
The PR includes 3 aspects:
Jira
Tests
Commits
Documentation