Add support for delta atomic commit read #1079
base: master
Conversation
Co-authored-by: Liqiang Guo <guoliqiang2006@gmail.com> Signed-off-by: Lizhezheng Zhang <hzzlzz@hotmail.com>
I wonder if there is an API to manually create multi-action commits with custom data.
Hi @hzzlzz thanks for making this PR and writing up the PIP. We'll review this as soon as possible.
I think testing it the same way we test …
That is good. Please go ahead and review. The reason for asking is that I need multi-action commits to test this option, and I'm relying on repartition to create them right now. But it doesn't seem guaranteed to produce the exact number of files (for instance, 2) if the data volume is small in my offline tests, so I write 50 items each time in the unit test.
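The file-count issue above can be illustrated without Spark: Delta writes one data file per non-empty partition, so a commit can end up with fewer files than requested when some partitions receive no rows. A minimal plain-Python sketch of that idea (the hash bucketing here is an assumption standing in for Spark's partitioner, which behaves differently in detail):

```python
def files_in_commit(keys, num_partitions):
    """One output file per non-empty partition, as Delta/Spark writes them.

    Models partitioning as hash(key) % num_partitions; empty buckets
    produce no file, so small inputs may yield fewer files than requested.
    """
    buckets = {hash(k) % num_partitions for k in keys}
    return len(buckets)

# Two rows whose keys collide into one bucket produce a single file,
# while 50 rows almost surely fill both partitions.
small = files_in_commit([0, 2], 2)      # both keys hash to bucket 0
large = files_in_commit(range(50), 2)   # both buckets non-empty
```

This is why writing 50 items per micro-batch in the unit test is a pragmatic way to make the expected file count reliable.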
@hzzlzz If I understand correctly, the behavior of …
Exactly! And in the case of isStartingVersion=true, it will read the whole snapshot in one micro-batch. |
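To make those semantics concrete, here is a minimal plain-Python sketch (the function name and shapes are hypothetical, not the PR's actual API) of how an atomic-commit read interacts with a file-count rate limit: micro-batches may end only at commit boundaries, so a commit larger than the cap, or the whole snapshot when isStartingVersion=true, is still admitted as one oversized batch.

```python
def plan_batches(commits, max_files_per_trigger):
    """Split a list of commits (each a list of files) into micro-batches.

    Commits are never split across batches: a batch is closed when adding
    the next commit would exceed the cap, but a single commit larger than
    the cap still becomes one oversized batch on its own.
    """
    batches, current = [], []
    for files in commits:
        if current and len(current) + len(files) > max_files_per_trigger:
            batches.append(current)
            current = []
        current = current + list(files)
    if current:
        batches.append(current)
    return batches

# With a cap of 3 files, the 4-file commit is not split; it simply
# produces a batch larger than the cap.
plan = plan_batches([["a1", "a2"], ["b1"], ["c1", "c2", "c3", "c4"]], 3)
```

The oversized-batch case in the last commit above is exactly what makes the stability question about large snapshots relevant.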
I'm a bit concerned about this. Loading the entire snapshot likely will cause stability issues. You mentioned that you built a service for Delta Lake: #1026 (comment) How do you plan to handle the case that a big snapshot may take down your service? |
This is a totally reasonable concern.
We adopt option 2 in our system because we have small/medium tables for which it is OK to read the whole snapshot, and this is our cold-start use case too.
I feel neither is ideal. We don't want to build a feature with which a user can easily shoot themselves in the foot. On second thought, would your problem be solved by using Merge Into? We have a few examples in our docs showing how to use Merge Into to track updates and make idempotent changes to a Delta table: https://docs.delta.io/latest/delta-update.html#merge-examples
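For readers unfamiliar with it, Merge Into is an idempotent upsert: source rows are matched against the target on a key, matched rows are updated, and unmatched rows are inserted, so replaying the same source batch leaves the table unchanged. A conceptual plain-Python model of that behavior (rows as dicts; this is an illustration, not Delta's implementation):

```python
def merge_into(target, source, key):
    """Upsert `source` rows into `target` in place, matching on `key`."""
    index = {row[key]: i for i, row in enumerate(target)}
    for row in source:
        if row[key] in index:
            target[index[row[key]]] = row   # WHEN MATCHED THEN UPDATE
        else:
            index[row[key]] = len(target)   # WHEN NOT MATCHED THEN INSERT
            target.append(row)
    return target

# Applying the same source twice yields the same target: the merge
# is idempotent, which is what makes replayed batches safe.
table = merge_into([{"id": 1, "v": "a"}],
                   [{"id": 1, "v": "b"}, {"id": 2, "v": "c"}],
                   "id")
```

Idempotency is the property that matters for the streaming discussion above: re-processing a batch with Merge Into does not duplicate data.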
Thanks @zsxwing. Thinking further, I believe only a Delta table stream source could provide such an ability; Kafka/Event Hubs has no straightforward way to do this.
Yeah. If you would like to provide a general solution for data sources other than Delta, changing just Delta would not be sufficient. It looks like you would have to add something to the data itself to achieve this. Do you still want to add this to Delta Lake? Otherwise, I'm inclined to close this one, as I prefer not to add an unscalable solution if possible.
Thanks @zsxwing |
@zsxwing , what's the next step for figuring out what to do with these two options? |
@scottsand-db since you are working on CDF and it actually needs to read a commit entirely, could you think about how to unify CDF and the normal streaming Delta source together? |
Hi @hzzlzz - as @zsxwing mentioned, I am working on CDF (Change Data Feed) #1105. The next PR I'm working on is CDF + Streaming. One thing that CDF generates for … So, I think it makes sense to wait for that CDF PR to be merged first (it's a WIP), and then you can add your read-atomic-commits support to the latter part of CDF (for Add and Remove files). Does this sound good?
Update: that CDF PR is here: #1154 |
Description
This PR adds support for reading atomic commits when streaming data from a Delta table.
PIP
Resolves #1026
How was this patch tested?
Unit tests
Does this PR introduce any user-facing changes?
Adds a new config option when reading a Delta table as a streaming source.