refactor: combine normal and cdf plan until write for merge #3142
Conversation
Codecov Report
Attention: Patch coverage is

@@            Coverage Diff            @@
##             main    #3142     +/-  ##
=========================================
  Coverage   71.88%   71.89%   +0.01%
=========================================
  Files         134      135       +1
  Lines       43479    43629     +150
  Branches    43479    43629     +150
=========================================
+ Hits        31257    31367     +110
- Misses      10201    10219      +18
- Partials     2021     2043      +22
One possible optimization:
Regarding min/max: I haven't fully understood it yet, but each stream should actually only get one part of the data, so only this part would be cached. Or do you mean that the problem occurs because the cache is not cleared between the calls?
What do you mean by splitting them in the write plan?
The issue is that the early filter uses min_max stats, which requires an aggregation execution plan. This execution plan will consume the source stream. Once consumed, the RecordBatchGenerator returns 0 batches. The only way to resolve that is caching the result in a memtable prior to executing, but then all of your data is in memory and not streamed anymore. Take a look at the code here, where I disabled the min_max stats gathering when streaming is on: #3145. But if you have any ideas on how to do this while streaming, I am open for input :D
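For context, a minimal sketch of that memtable fallback (my illustration, not the code in #3145; cache_source is a hypothetical helper built on DataFusion's MemTable):

```rust
use std::sync::Arc;

use datafusion::arrow::datatypes::Schema;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};

/// Hypothetical helper: drain the one-shot source into a MemTable so the
/// resulting DataFrame can be scanned more than once (min/max stats plus the
/// merge itself), at the cost of holding the whole source in memory.
async fn cache_source(ctx: &SessionContext, source: DataFrame) -> Result<DataFrame> {
    let schema = Arc::new(Schema::from(source.schema()));
    let batches = source.collect().await?; // consumes the stream exactly once
    let table = MemTable::try_new(schema, vec![batches])?;
    ctx.read_table(Arc::new(table))
}
```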
Instead of doing the split into normal_df and cdf_df by null filtering, you can filter by change value. Then you only need to pass the df once with all changes. Something like:
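A sketch of that idea (my illustration, not the commenter's original snippet; batch_df and CDC_COLUMN_NAME are the variables from the surrounding merge code, and the literal change types are assumptions):

```rust
use datafusion::prelude::{col, lit};

// Keep a single DataFrame that carries every change row and derive both
// outputs from it by filtering on the change-type column, instead of
// splitting by null filtering.
let all_changes_df = batch_df; // one pass over the source

// Rows that belong in the main table: copies, inserts and post-images.
let normal_df = all_changes_df.clone().filter(col(CDC_COLUMN_NAME).in_list(
    vec![lit("copy"), lit("insert"), lit("update_postimage")],
    false,
))?;

// The CDF output gets every change row as-is.
let cdf_df = all_changes_df;
```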
Ah right, I see what you mean now! Let me check!
very interesting, let's see what happens 😄
```rust
let normal_df = batch_df
    .clone()
    .filter(col(CDC_COLUMN_NAME).in_list(
        vec![lit("delete"), lit("source_delete"), lit("update_preimage")],
        true,
    ))?;
```
Is the copy missing here?
@JonasDev1 this filter does col(CDC_COL) NOT IN [delete, source_delete, update_preimage], so we want to keep copy, insert, update_postimage
I could have probably just filtered on that now that I think about it haha
Ahhh the true means negation :D
Then it makes sense, thank you
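To spell out the semantics discussed above, a small illustration (assuming the CDC column only takes the change types mentioned in this thread; col, lit and in_list come from datafusion::prelude):

```rust
use datafusion::prelude::{col, lit};

// CDC_COLUMN_NAME is the change-type column constant from the merge code.
// `true` as the second in_list argument negates the membership test:
// col(CDC_COLUMN_NAME) NOT IN ("delete", "source_delete", "update_preimage").
let keep_for_main_table = col(CDC_COLUMN_NAME).in_list(
    vec![lit("delete"), lit("source_delete"), lit("update_preimage")],
    true, // negated
);

// Selecting the same rows with a non-negated IN list over the kept change types.
let keep_for_main_table_alt = col(CDC_COLUMN_NAME).in_list(
    vec![lit("copy"), lit("insert"), lit("update_postimage")],
    false,
);
```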
Description
During some exploration to make merge streamable, the complication is that an ArrowArrayStreamReader can only be consumed once, so when we materialize (execute) a physical plan that holds this LazyTableProvider twice, the second run will have no data.
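A minimal standalone illustration of that one-shot behaviour, using arrow's RecordBatchIterator to stand in for the ArrowArrayStreamReader behind the LazyTableProvider (names and setup are mine, not the PR's code):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch, RecordBatchIterator};
use arrow_schema::{DataType, Field, Schema};

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )
    .unwrap();

    // A one-shot reader over the batch, playing the role of the stream
    // wrapped by the LazyTableProvider.
    let mut reader = RecordBatchIterator::new(vec![Ok(batch)], schema);

    // The first pass (executing the first physical plan) drains it...
    assert_eq!(reader.by_ref().count(), 1);
    // ...so a second pass over the same reader sees no data at all.
    assert_eq!(reader.count(), 0);
}
```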
This PR combines the MERGE and MERGE_CDF plans into a single plan and splits the data out during the write. A side benefit is that we now have just one function that does the writing and returns all actions.
@JonasDev1 your work in MERGE to use min_max pruning from the source for the target scan also complicates things a bit, since we consume the stream as a whole. We can solve this by caching the df beforehand, but then everything stays in memory, defeating the streamed execution. I'm curious if you have any ideas on how we could do this without full materialization? I couldn't find you on Slack, but feel free to ping me there to discuss it a bit more.
Other thoughts
I moved this writer temporarily under the merge module, but the idea is to also use it later for the normal write operation. For that I need the logical plan refactor to be merged first.