
refactor: combine normal and cdf plan until write for merge #3142

Merged
merged 2 commits into from
Jan 20, 2025

Conversation

ion-elgreco
Collaborator

@ion-elgreco ion-elgreco commented Jan 18, 2025

Description

During some exploration to make merge streamable, the complication is that an ArrowArrayStreamReader can only be consumed once: when we materialize (execute) a physical plan that holds this LazyTableProvider twice, the second run has no data.
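The single-consumption constraint can be illustrated with a minimal sketch. The types below (OneShotSource, execute) are hypothetical stand-ins, not the actual delta-rs or Arrow code; the point is just that a stream-backed source drains on its first full scan, so a second execution of the same plan sees nothing.

```rust
// Minimal sketch of the single-consumption problem. `OneShotSource` and
// `execute` are hypothetical stand-ins for a stream-backed table provider:
// "executing a plan" drains the underlying stream, so a second execution
// observes zero batches, mirroring ArrowArrayStreamReader behavior.
struct OneShotSource {
    batches: std::vec::IntoIter<Vec<i32>>, // stand-in for Arrow record batches
}

impl OneShotSource {
    fn new(batches: Vec<Vec<i32>>) -> Self {
        Self { batches: batches.into_iter() }
    }

    // Drains whatever is left in the stream.
    fn execute(&mut self) -> Vec<Vec<i32>> {
        self.batches.by_ref().collect()
    }
}

fn main() {
    let mut source = OneShotSource::new(vec![vec![1, 2], vec![3]]);
    let first_run = source.execute();
    let second_run = source.execute();
    assert_eq!(first_run.len(), 2);  // first materialization sees all batches
    assert_eq!(second_run.len(), 0); // second materialization sees nothing
}
```

This is why the PR unions the MERGE and MERGE_CDF plans: one plan, one pass over the stream.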

This PR makes the MERGE + MERGE_CDF plan a single combined plan and splits the data out during the write. A side benefit is that we now have just one function that does the writing and returns all actions.

@JonasDev1, your work in MERGE to use min_max pruning from the source for the target scan also complicates things a bit, since we consume the stream as a whole. We could solve this by caching the df beforehand, but then everything stays in memory, defeating the streamed execution. I'm curious if you have any ideas on how we could do this without full materialization. I couldn't find you on Slack, but feel free to ping me there to discuss it a bit more.

Other thoughts

I moved this writer temporarily under the merge module, but the idea is to also use it later for the normal write operation. That requires the logical plan refactor to be merged first.

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Jan 18, 2025
@ion-elgreco ion-elgreco changed the title refactor: combine normal and cdf plan until write refactor: combine normal and cdf plan until write for merge Jan 18, 2025

codecov bot commented Jan 18, 2025

Codecov Report

Attention: Patch coverage is 79.75709% with 50 lines in your changes missing coverage. Please review.

Project coverage is 71.89%. Comparing base (a73d646) to head (9c01a9e).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
crates/core/src/operations/merge/writer.rs 80.31% 13 Missing and 24 partials ⚠️
crates/core/src/operations/merge/mod.rs 77.96% 2 Missing and 11 partials ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main    #3142    +/-   ##
========================================
  Coverage   71.88%   71.89%            
========================================
  Files         134      135     +1     
  Lines       43479    43629   +150     
  Branches    43479    43629   +150     
========================================
+ Hits        31257    31367   +110     
- Misses      10201    10219    +18     
- Partials     2021     2043    +22     


@ion-elgreco ion-elgreco marked this pull request as draft January 19, 2025 17:18
@ion-elgreco ion-elgreco marked this pull request as ready for review January 19, 2025 17:29
@JonasDev1
Contributor

One possible optimization:
You could also manage the two writes without the union and the data split during the write, if you add the cdc column to the original df with all possible states (copy, delete, insert, update) and then split them in the write plan. This should be faster and save memory.

@JonasDev1
Contributor

Regarding min max: I haven't understood it yet, but each stream should actually only get one part of the data, so only that part would be cached. Or do you mean that the problem occurs because the cache is not cleared between the calls?

@ion-elgreco
Collaborator Author

ion-elgreco commented Jan 20, 2025

> One possible optimization: You could also manage the two writes without the union and the data split during the write, if you add the cdc column to the original df with all possible states (copy, delete, insert, update) and then split them in the write plan. This should be faster and save memory.

What do you mean with split them in the write plan?

@ion-elgreco
Collaborator Author

ion-elgreco commented Jan 20, 2025

> Regarding min max: I haven't understood it yet, but each stream should actually only get one part of the data, so only that part would be cached. Or do you mean that the problem occurs because the cache is not cleared between the calls?

The issue is that the early filter uses min_max stats, which requires an aggregation execution plan. That execution plan consumes the source stream; once consumed, the RecordBatchGenerator returns 0 batches.

The only way to resolve that is to cache the result in a memtable prior to executing, but then all of your data is in memory and no longer streamed.
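The memtable trade-off can be sketched in plain Rust (hypothetical stand-in types, not the actual delta-rs or DataFusion code): buffering the one-shot stream into a Vec lets both the min/max aggregation and the write scan the data, at the cost of fully materializing it.

```rust
// Sketch of the caching trade-off (hypothetical stand-ins for the memtable
// approach). Collecting the one-shot stream into a Vec allows two passes --
// one for min/max pruning stats, one for the write -- but the entire
// dataset now lives in memory, so execution is no longer streamed.
fn min_max(cached: &[Vec<i64>]) -> Option<(i64, i64)> {
    let mut it = cached.iter().flatten().copied();
    let first = it.next()?;
    Some(it.fold((first, first), |(lo, hi), v| (lo.min(v), hi.max(v))))
}

fn main() {
    // One-shot stream of batches; collecting it is the "cache in a memtable" step.
    let stream = vec![vec![4_i64, 1], vec![9, 7]].into_iter();
    let cached: Vec<Vec<i64>> = stream.collect(); // full materialization

    // Pass 1: gather min/max stats for target-scan pruning.
    assert_eq!(min_max(&cached), Some((1, 9)));

    // Pass 2: the cached batches are still available for the write.
    let rows_written: usize = cached.iter().map(|b| b.len()).sum();
    assert_eq!(rows_written, 4);
}
```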

Take a look at the code here; I disabled the min_max stats gathering when streaming is on: #3145

But if you have any ideas on how to do this while streaming, I am open to input :D

@JonasDev1
Contributor

> One possible optimization: You could also manage the two writes without the union and the data split during the write, if you add the cdc column to the original df with all possible states (copy, delete, insert, update) and then split them in the write plan. This should be faster and save memory.

> What do you mean with split them in the write plan?

Instead of splitting into normal_df and cdf_df by null filtering, you can filter by change value. Then you only need to pass the df once, with all the changes.

Something like:

let normal_df = batch_df
    .clone()
    .filter(col(CDC_COLUMN_NAME).in_list(vec![lit("copy"), lit("insert"), lit("update")], false))?
    .drop_columns(&[CDC_COLUMN_NAME])?;
let cdf_df = batch_df
    .filter(col(CDC_COLUMN_NAME).in_list(vec![lit("delete"), lit("insert"), lit("copy")], false))?;

@ion-elgreco
Collaborator Author

> One possible optimization: You could also manage the two writes without the union and the data split during the write, if you add the cdc column to the original df with all possible states (copy, delete, insert, update) and then split them in the write plan. This should be faster and save memory.

> What do you mean with split them in the write plan?

> Instead of splitting into normal_df and cdf_df by null filtering, you can filter by change value. Then you only need to pass the df once, with all the changes.
>
> Something like:
>
>     let normal_df = batch_df
>         .clone()
>         .filter(col(CDC_COLUMN_NAME).in_list(vec![lit("copy"), lit("insert"), lit("update")], false))?
>         .drop_columns(&[CDC_COLUMN_NAME])?;
>     let cdf_df = batch_df
>         .filter(col(CDC_COLUMN_NAME).in_list(vec![lit("delete"), lit("insert"), lit("copy")], false))?;

Ah right, I see what you mean now! Let me check!

@ion-elgreco ion-elgreco marked this pull request as draft January 20, 2025 19:58
@ion-elgreco ion-elgreco force-pushed the refactor--combine_execution_plans branch from 8df3d14 to 027e369 Compare January 20, 2025 21:46
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
@ion-elgreco ion-elgreco force-pushed the refactor--combine_execution_plans branch from 027e369 to 9c01a9e Compare January 20, 2025 21:46
@ion-elgreco ion-elgreco marked this pull request as ready for review January 20, 2025 21:46
@rtyler rtyler enabled auto-merge January 20, 2025 21:58
Member

@rtyler rtyler left a comment


very interesting, let's see what happens 😄

@rtyler rtyler added this pull request to the merge queue Jan 20, 2025
Merged via the queue into delta-io:main with commit 0cf4ff4 Jan 20, 2025
23 checks passed
let normal_df = batch_df
    .clone()
    .filter(col(CDC_COLUMN_NAME).in_list(
        vec![lit("delete"), lit("source_delete"), lit("update_preimage")],
        true,
    ))?;
Contributor


Is the copy missing here?

Collaborator Author


@JonasDev1 this filter does col(CDC_COL) NOT IN [delete, source_delete, update_preimage], so we want to keep copy, insert, update_postimage
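The negated membership semantics can be illustrated with a small plain-Rust sketch (not DataFusion; `filter_not_in` is a hypothetical helper standing in for `in_list(..., true)`): rows whose change value is in the exclusion list are dropped, so copy, insert and update_postimage survive.

```rust
// Plain-Rust illustration of the negated in_list filter (the `true` flag):
// keep only rows whose CDC value is NOT in the excluded list.
// `filter_not_in` is a hypothetical helper, not a DataFusion API.
fn filter_not_in<'a>(rows: &[&'a str], excluded: &[&str]) -> Vec<&'a str> {
    rows.iter()
        .copied()
        .filter(|v| !excluded.contains(v))
        .collect()
}

fn main() {
    let rows = [
        "copy", "delete", "insert",
        "source_delete", "update_preimage", "update_postimage",
    ];
    let excluded = ["delete", "source_delete", "update_preimage"];
    let normal = filter_not_in(&rows, &excluded);
    // The CDC-only states are filtered out of the normal write path.
    assert_eq!(normal, vec!["copy", "insert", "update_postimage"]);
}
```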

Collaborator Author


I could have probably just filtered on that now that I think about it haha

Contributor


Ahhh the true means negation :D
Then it makes sense, thank you
