Add empty source handling for `delta` table format on `filesystem` destination #1617
Conversation
```python
dt := try_get_deltatable(
    self.client.make_remote_uri(self.make_remote_path()),
    storage_options=_deltalake_storage_options(self.client.config),
)
```

```python
arrow_table = pa.dataset.dataset(file_paths).to_table()
```
on large loads this will probably fail or be super slow, no? There must be ways to cheat here; for example, just load the first file, and if there is something in there we know that we have rows and can write the table the same way as before.
Thanks for the work on this! I think we should have another think about the mechanism we use to determine whether a table is empty. If I understood you correctly, delta-rs uses some kind of multithreaded loading under the hood, so we kill this speed benefit when loading the full table eagerly into memory. I'm sure we can cheat here somehow. A few ideas:
- Only load the first file for a table, and if there is something there we can use the old way of loading.
- Work with the file sizes of the jobs. I'm not sure how predictable they are, but a 1 MB parquet file very likely has data, right?
- Arrow might have some way to load only the first row of a parquet file or a list of parquet files; that should be enough for peeking into the files to see if there is anything.

The main point is to avoid a situation where we have 10 GB of files for one table and load them all into memory.
I've refactored the code to use […]. I had to upgrade […]. I don't know how to elegantly work around this conflict: Poetry does not support dependency overriding (python-poetry/poetry#697). This problem will soon solve itself, because […]. This change hasn't been released yet. I disabled the […]. @sh-rp can you: […]
This reverts commit 5bbfba4.
The part about […]. I had to extend the test setup a bit: […]
Description
This PR adds support for the "empty source" case for the `delta` table format on the `filesystem` destination. Specifically, it enables:
- `dlt.mark.materialize_table_schema()`

These cases need explicit handling because `delta-rs` throws errors in case of empty tables/datasets: delta-io/delta-rs#2686

Related Issues
- `delta` table format on `filesystem` destination #1613