Support bucket table for Iceberg #430
Comments
This would be a great feature to have. I think Spark might have to be adapted as well. |
@rdblue @aokolnychyi I wrote a simple design doc about bucketing support in Iceberg. Could you please help review it? I appreciate your time. https://docs.google.com/document/d/1X3tpcJFz8Fd9m2SixHP4psBFHc39Y21TXev4i26ve-I/edit?usp=sharing |
@jerryshao, thanks for posting this! I'll take a look as soon as I can, but I'm going to be at a conference next week so it may not be quick. |
Sure, no problem, take your time :). |
I'm roughly dividing this issue into 3 ongoing PRs:
Currently I'm working on the first task. |
Thanks @jerryshao! I had a look at the doc and made some comments. The main thing is that Iceberg already supports bucketing and has solved many of the challenges you identified, like schema evolution. There are two remaining problems:
For problem 1, we need to allow Iceberg to control the We also need a FunctionCatalog that allows us to return Iceberg transforms as UDFs that Spark can use.

For problem 2, we are planning to add support for Spark to be able to use bucket values to speed up joins. We aren't quite sure how to do this yet, but we know that Spark will need to recognize that a table is bucketed (using the Table's partitioning), get the bucket function from the table's catalog (using FunctionCatalog) and use that function to prepare data for the other side of the join. If the other side of the join uses the same partition function, then we can avoid a shuffle for that side of the join as well.

Hopefully this short write-up and the comments I left on the doc give you an idea of the current status of bucketed joins. Thanks for working on this! |
@jerryshao, yes that's correct. That's why we need to expose the transformation functions to Spark via FunctionCatalog, and add the ability for DSv2 sources to set distribution and ordering requirements with those functions. |
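As a rough illustration of that division of labor, the toy Python sketch below has a table declare the distribution it requires by function name, while the engine looks the function up in a catalog and clusters rows before writing. Everything here is hypothetical: the `FunctionCatalog` and `BucketedTable` classes, their methods, and the CRC32-based `crc_bucket` are simplified stand-ins and do not mirror the real Spark DSv2 or Iceberg interfaces.

```python
import zlib
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, str]


class FunctionCatalog:
    """Hypothetical catalog that hands out functions by name."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable[[int, str], int]] = {}

    def register(self, name: str, fn: Callable[[int, str], int]) -> None:
        self._functions[name] = fn

    def load(self, name: str) -> Callable[[int, str], int]:
        if name not in self._functions:
            raise KeyError(f"unknown function: {name}")
        return self._functions[name]


def crc_bucket(num_buckets: int, value: str) -> int:
    """Stand-in bucket function; an assumption, not Iceberg's transform."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


class BucketedTable:
    """Hypothetical source that states how incoming rows must be clustered."""

    def __init__(self, num_buckets: int, key_index: int) -> None:
        self.num_buckets = num_buckets
        self.key_index = key_index

    def required_distribution(self) -> Tuple[str, int, int]:
        # (function name, bucket count, column index of the bucketing key)
        return ("bucket", self.num_buckets, self.key_index)


def prepare_for_write(rows: List[Row], table: BucketedTable,
                      catalog: FunctionCatalog) -> List[Row]:
    """Engine side: resolve the function by name, then cluster and sort rows."""
    name, num_buckets, key_index = table.required_distribution()
    fn = catalog.load(name)
    return sorted(rows, key=lambda row: (fn(num_buckets, row[key_index]),
                                         row[key_index]))


catalog = FunctionCatalog()
catalog.register("bucket", crc_bucket)
table = BucketedTable(num_buckets=8, key_index=0)

rows = [("u3", "lamp"), ("u1", "book"), ("u2", "pen"), ("u1", "desk")]
clustered = prepare_for_write(rows, table, catalog)
bucket_ids = [crc_bucket(8, row[0]) for row in clustered]
```

The design point is that the engine never hard-codes the bucketing logic: it only knows a function name and arguments, so the source stays in control of how its data is laid out.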
@dbtsai, FYI |
I am preparing a few optimizations for metadata compaction and will work on sort spec next. |
Is there any progress on this? |
Nice! Let me know if you need extra eyes. I would love to help, as we recently ran into the issue of shuffling big records. |
@yupbank sure, here is the design doc. It'd be great to get more comments & feedback on it! |
Yes, I'm working on this right now, and the bulk of the work is on the Spark side. Please track https://issues.apache.org/jira/browse/SPARK-37375 for progress. BTW good to see you here @SinghAsDev! |
Great to e-see you too, Chao!
On Thu, Dec 9, 2021 at 2:43 PM Chao Sun ***@***.***> wrote:
Yes I'm working on this right now, and the bulk of the work is on Spark
side. Please track https://issues.apache.org/jira/browse/SPARK-37375 for
progress.
BTW good to see you here @SinghAsDev <https://github.com/SinghAsDev> !
--
- Ashish
|
Any updates on this? Storage Partitioned Join landed in Spark v3.3. |
I think this is ongoing. First, Iceberg needs to support the function catalog, which is tracked (partially) by #5305. |
I have I will open an issue for the FunctionCatalog and link it to this. 👍 |
Since these were merged, is this working now in 0.14.1? |
At least the following work still needs to be done:
Will update here once the feature is fully available. |
Currently, Iceberg doesn't support "bucket" semantics in either read or write, so we cannot leverage it to do bucketed joins. We should add such support to Iceberg.