
Add empty source handling for delta table format on filesystem destination #1617

Merged: 14 commits merged into devel from fix/1613-empty-table-delta-table-formatfilesystem on Aug 1, 2024

Conversation

@jorritsandbrink (Collaborator) commented Jul 19, 2024

Description

This PR adds support for the "empty source" case for delta table format on filesystem destination.

Specifically, it enables:

  • use of resources that yield empty Arrow tables
  • use of dlt.mark.materialize_table_schema()

These cases need explicit handling because delta-rs throws errors in case of empty tables/datasets: delta-io/delta-rs#2686
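For context, a minimal sketch of the first bullet (a resource yielding an empty Arrow table) against the filesystem destination; the resource name, columns, and pipeline settings are illustrative and not from the PR:

```python
import dlt
import pyarrow as pa

@dlt.resource(table_format="delta")
def empty_events():
    # An Arrow table that carries a schema but zero rows
    yield pa.Table.from_pylist(
        [], schema=pa.schema([("id", pa.int64()), ("name", pa.string())])
    )

# Assumes a filesystem destination with bucket_url configured elsewhere.
# The second case, dlt.mark.materialize_table_schema(), is the other empty-source
# scenario this PR covers.
pipeline = dlt.pipeline("delta_demo", destination="filesystem")
pipeline.run(empty_events())
```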

Related Issues

netlify bot commented Jul 19, 2024

Deploy Preview for dlt-hub-docs canceled.

Latest commit: 6be8683
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/66aa2f5e2ca8e40008cfe69e

@jorritsandbrink jorritsandbrink marked this pull request as ready for review July 20, 2024 11:01
Reviewed lines (diff excerpt):

```python
dt := try_get_deltatable(
    self.client.make_remote_uri(self.make_remote_path()),
    storage_options=_deltalake_storage_options(self.client.config),
)
arrow_table = pa.dataset.dataset(file_paths).to_table()
```
A collaborator commented on this code:

On large loads this will probably fail or be super slow, no? There must be ways to cheat here, for example just load the first file, and if there is something in there we know that we have rows and can write the table the same way as before.

@sh-rp (Collaborator) left a comment:

Thanks for the work on this! I think we should have another think about the mechanism we use to determine whether a table is empty. If I understood you correctly, delta-rs uses some kind of multithreaded loading under the hood, so we kill that speed benefit when loading the full table eagerly into memory. I'm sure we can cheat here somehow. A few ideas:

  • Only load the first file for a table, and if there is something there we can use the old way of loading
  • Work with the file sizes of the jobs. I'm not sure how predictable they are, but a 1 MB parquet file very likely has data, right?
  • Arrow might have some way to only load the first row of a parquet file or list of parquet files; that should be enough for peeking into the files to see if there is anything.

The main point is to avoid a situation where we have 10 GB of files for one table and load them all into memory.
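For illustration, one way to do the cheap peek suggested above, as a sketch rather than the approach the PR settled on: parquet footers record row counts, so emptiness can be checked without reading any row data.

```python
import pyarrow.parquet as pq

def all_files_empty(file_paths: list[str]) -> bool:
    # Reads only the parquet footer metadata of each file; no row data is loaded.
    return all(pq.ParquetFile(path).metadata.num_rows == 0 for path in file_paths)
```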

@jorritsandbrink (Collaborator, Author) commented:

I've refactored the code to use pyarrow's Dataset and RecordBatchReader to prevent loading all data into memory at once.
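A rough sketch of that pattern; variable names like file_paths and remote_uri are assumed here, not the PR's actual code:

```python
import pyarrow.dataset as ds

dataset = ds.dataset(file_paths)        # parquet files belonging to one table
reader = dataset.scanner().to_reader()  # lazy RecordBatchReader, yields batches on demand

# deltalake's write_deltalake() accepts a RecordBatchReader, so data is streamed
# in batches instead of being materialized up front with .to_table().
# write_deltalake(remote_uri, reader, mode="append", storage_options=storage_options)
```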

I had to upgrade pyarrow to 17.0.0, which led to a dependency conflict with pylance:
[screenshot: Poetry dependency conflict between pyarrow 17.0.0 and pylance]

I don't know how to elegantly work around this conflict. Poetry does not support dependency overriding: python-poetry/poetry#697.

This problem will soon solve itself: pylance recently loosened its pyarrow dependency (lancedb/lance@ac3f75a). That change hasn't been released yet, but pylance releases pretty often, so I expect a release soon.

I disabled the lancedb dependency in pyproject.toml to be able to make progress. Of course, the lancedb-dependent pipelines are now failing on CI, so this can't be merged as is.

@sh-rp can you:

  • review
  • suggest an approach: wait for pylance release or do something else

@jorritsandbrink (Collaborator, Author) commented:

The part about pyarrow versioning in my comment above no longer applies. As discussed with @rudolfix on Slack, we don't upgrade pyarrow in pyproject.toml. Instead, we expect the user to run dlt in an environment with pyarrow>=17.0.0 when they're using the delta table format on the filesystem destination. This requirement is asserted at runtime, and a DependencyVersionException is raised if it is not satisfied.
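A minimal sketch of what such a runtime check could look like; the helper name is hypothetical, and only DependencyVersionException comes from the comment above (the sketch raises a plain RuntimeError instead):

```python
from importlib.metadata import version
from packaging.version import Version

def ensure_pyarrow_for_delta() -> None:
    # Hypothetical helper: dlt raises DependencyVersionException in this situation.
    installed = Version(version("pyarrow"))
    if installed < Version("17.0.0"):
        raise RuntimeError(
            f"delta table format on the filesystem destination requires pyarrow>=17.0.0, found {installed}"
        )
```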

I had to extend the test setup a bit:

  • delta tests are now marked as needspyarrow17
  • needspyarrow17 tests are automatically skipped if pyarrow<17.0.0 (a sketch of this skip mechanism follows the list)
  • a new GitHub workflow, test_pyarrow17, explicitly installs pyarrow==17.0.0 to make sure the needspyarrow17 tests run on CI

@rudolfix rudolfix merged commit b07dddc into devel Aug 1, 2024
53 of 54 checks passed
@rudolfix rudolfix deleted the fix/1613-empty-table-delta-table-formatfilesystem branch August 1, 2024 11:15
Successfully merging this pull request may close these issues.

Can't load empty table when using delta table format on filesystem destination