Dynamically support Spark native engine in Iceberg #9721
Conversation
also cc @rdblue
I am not sure I agree with the current proposal, given that it exposes lots of internal and evolving classes. Also, it may be too late to plug in another reader, as Iceberg makes some assumptions about when vectorized reads can happen much earlier. This means that even if the external library supports vectorized reads for nested data, we can't benefit from it because of the existing logic in Iceberg. Have we considered allowing a custom partition reader factory to be injected instead? I am not suggesting switching right away, but rather thinking about this option. Will it even work? Can external libraries ship a custom partition reader factory, assuming they have access to the necessary Iceberg internals?
@aokolnychyi Thanks a lot for your feedback! I agree that the current approach exposes a lot of internal classes, and injecting a custom partition reader factory could be a cleaner option.
@huaxingao, we can open up some utilities on the Iceberg side, if needed. Unfortunately, the logic will be fairly coupled either way. I kind of hope we can offset some of the duplication by having access to the delegate. It is not an ideal approach either; let me know if any other ideas come up. Does Comet support vectorized reads with nested data?
I tried the customized approach. Comet doesn't support nested types yet.
cc @aokolnychyi @rdblue @RussellSpitzer @flyrain Per our offline discussion with @aokolnychyi, we will not take the approach proposed in this PR.
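For context on the custom partition reader factory idea discussed above, here is a minimal sketch of what an injected factory from an external engine could look like, assuming Spark's DataSource V2 PartitionReaderFactory interface. The class name, the delegate wiring, and the method bodies are illustrative only and are not part of this PR:

```java
// Hypothetical sketch: a pluggable reader factory from an external engine that wraps
// the built-in factory as a delegate. Not actual Iceberg or Comet code.
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class CometPartitionReaderFactory implements PartitionReaderFactory {
  // Iceberg's built-in factory, used as a fallback for cases the native engine cannot handle
  private final PartitionReaderFactory delegate;

  public CometPartitionReaderFactory(PartitionReaderFactory delegate) {
    this.delegate = delegate;
  }

  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    // Row-based reads stay on the existing Iceberg/Spark path
    return delegate.createReader(partition);
  }

  @Override
  public PartitionReader<ColumnarBatch> createColumnarReader(InputPartition partition) {
    // A real implementation would build a native (e.g. Comet) columnar reader here;
    // this sketch simply delegates
    return delegate.createColumnarReader(partition);
  }

  @Override
  public boolean supportColumnarReads(InputPartition partition) {
    // e.g. only claim columnar support for flat schemas until nested types are supported
    return delegate.supportColumnarReads(partition);
  }
}
```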
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
This PR introduces a dynamic plugin mechanism to support Spark native execution engines, e.g. Apache Comet.
Currently in Iceberg, when vectorization is activated, Iceberg employs the VectorizedReaderBuilder to generate VectorizedArrowReader and ColumnarBatchReader, which are then used for batch reading. I propose to introduce a customized VectorizedReaderBuilder and a customized ColumnarBatchReader. At runtime, if the customized VectorizedReaderBuilder and ColumnarBatchReader are accessible, the system will leverage the native vectorized execution engine. In cases where these customized components are not available, Iceberg's standard VectorizedReaderBuilder and ColumnarBatchReader will be utilized for batch reading.
A new SparkSQLProperties.CUSTOMIZED_VECTORIZATION_IMPL is added to specify the customized vectorization implementation. If CUSTOMIZED_VECTORIZATION_IMPL is not set, the default Iceberg SparkVectorizedReaderBuilder and ColumnarBatchReader are used for batch reading. If CUSTOMIZED_VECTORIZATION_IMPL is set, the customized SparkVectorizedReaderBuilder and ColumnarBatchReader are used for batch reading. In addition, a new SparkSQLProperties.CUSTOMIZED_VECTORIZATION_PROPERTY_PREFIX is added to specify the prefix of the customized vectorization property keys. Using Apache Comet as an example:
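For illustration, the configuration might look like the following sketch. The property key strings and the implementation class name below are assumed placeholders; the exact values are defined by SparkSQLProperties and the external engine:

```java
// Illustrative only: the property keys and class name are placeholders, not the
// exact strings defined in SparkSQLProperties or Comet.
import org.apache.spark.sql.SparkSession;

public class CometVectorizationExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-comet-example")
        .getOrCreate();

    // Point Iceberg at the external (native) vectorization implementation (assumed key/class)
    spark.conf().set(
        "spark.sql.iceberg.customized-vectorization-impl",
        "org.apache.comet.iceberg.CometVectorizedReaderBuilder");

    // Properties carrying this prefix are treated as customized vectorization properties (assumed key)
    spark.conf().set(
        "spark.sql.iceberg.customized-vectorization-property-prefix",
        "spark.comet.");
  }
}
```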
A VectorizedUtil class is added to dynamically load SparkVectorizedReaderBuilder and BaseColumnarBatchReader.
The customized VectorizedReaderBuilder and a customized ColumnarBatchReader need to be implemented in the native engine (e.g. Comet).
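A rough sketch of how this dynamic loading and fallback could work is shown below. The VectorizedUtil name comes from this PR, but the method shapes, the use of plain reflection, and the fallback behavior are assumptions rather than the PR's actual code:

```java
// Hypothetical sketch of reflection-based loading with a fallback to the built-in
// reader builder. Method names and construction details are illustrative only.
import java.lang.reflect.Constructor;

public class VectorizedUtil {

  private VectorizedUtil() {
  }

  // Returns true if the configured customized implementation is on the classpath.
  public static boolean customizedImplAvailable(String implClassName) {
    if (implClassName == null || implClassName.isEmpty()) {
      return false;
    }
    try {
      Class.forName(implClassName, false, Thread.currentThread().getContextClassLoader());
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  // Instantiates the configured customized reader builder, or returns the given
  // default builder when the customized class cannot be loaded.
  public static Object loadReaderBuilder(String implClassName, Object defaultBuilder) {
    if (!customizedImplAvailable(implClassName)) {
      return defaultBuilder; // fall back to Iceberg's built-in VectorizedReaderBuilder
    }
    try {
      Class<?> implClass = Class.forName(implClassName);
      Constructor<?> ctor = implClass.getDeclaredConstructor();
      return ctor.newInstance();
    } catch (ReflectiveOperationException e) {
      return defaultBuilder;
    }
  }
}
```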