
Add empty source handling for delta table format on filesystem destination #1617

Merged: 14 commits merged into devel from fix/1613-empty-table-delta-table-formatfilesystem on Aug 1, 2024

Conversation

@jorritsandbrink (Collaborator) commented Jul 19, 2024

Description

This PR adds support for the "empty source" case for delta table format on filesystem destination.

Specifically, it enables:

  • use of resources that yield empty Arrow tables
  • use of dlt.mark.materialize_table_schema()

These cases need explicit handling because delta-rs throws errors in case of empty tables/datasets: delta-io/delta-rs#2686
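For context, a minimal sketch of the first bullet (a resource yielding an empty Arrow table) against the filesystem destination; the resource name, columns, and pipeline settings are illustrative and not from the PR:

```python
import dlt
import pyarrow as pa

@dlt.resource(table_format="delta")
def empty_events():
    # An Arrow table that carries a schema but zero rows
    yield pa.Table.from_pylist(
        [], schema=pa.schema([("id", pa.int64()), ("name", pa.string())])
    )

# Assumes a filesystem destination with bucket_url configured elsewhere.
# The second case, dlt.mark.materialize_table_schema(), is the other empty-source
# scenario this PR covers.
pipeline = dlt.pipeline("delta_demo", destination="filesystem")
pipeline.run(empty_events())
```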

Related Issues

netlify bot commented Jul 19, 2024

Deploy Preview for dlt-hub-docs canceled.

Latest commit: 6be8683
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/66aa2f5e2ca8e40008cfe69e

@jorritsandbrink jorritsandbrink marked this pull request as ready for review July 20, 2024 11:01
Reviewed lines (diff excerpt):

```python
dt := try_get_deltatable(
    self.client.make_remote_uri(self.make_remote_path()),
    storage_options=_deltalake_storage_options(self.client.config),
)
arrow_table = pa.dataset.dataset(file_paths).to_table()
```
A collaborator commented on this code:

On large loads this will probably fail or be super slow, no? There must be ways to cheat here, for example just load the first file, and if there is something in there we know that we have rows and can write the table the same way as before.

@sh-rp (Collaborator) left a comment:

Thanks for the work on this! I think we should have another think about the mechanism we use to determine whether a table is empty. If I understood you correctly, delta-rs uses some kind of multithreaded loading under the hood, so we kill that speed benefit when loading the full table eagerly into memory. I'm sure we can cheat here somehow. A few ideas:

  • Only load the first file for a table, and if there is something there we can use the old way of loading
  • Work with the file sizes of the jobs. I'm not sure how predictable they are, but a 1 MB parquet file very likely has data, right?
  • Arrow might have some way to only load the first row of a parquet file or list of parquet files; that should be enough for peeking into the files to see if there is anything.

The main point is to avoid a situation where we have 10 GB of files for one table and load them all into memory.
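For illustration, one way to do the cheap peek suggested above, as a sketch rather than the approach the PR settled on: parquet footers record row counts, so emptiness can be checked without reading any row data.

```python
import pyarrow.parquet as pq

def all_files_empty(file_paths: list[str]) -> bool:
    # Reads only the parquet footer metadata of each file; no row data is loaded.
    return all(pq.ParquetFile(path).metadata.num_rows == 0 for path in file_paths)
```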

@jorritsandbrink (Collaborator, Author) commented:

I've refactored the code to use pyarrow's Dataset and RecordBatchReader to prevent loading all data into memory at once.
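A rough sketch of that pattern; variable names like file_paths and remote_uri are assumed here, not the PR's actual code:

```python
import pyarrow.dataset as ds

dataset = ds.dataset(file_paths)        # parquet files belonging to one table
reader = dataset.scanner().to_reader()  # lazy RecordBatchReader, yields batches on demand

# deltalake's write_deltalake() accepts a RecordBatchReader, so data is streamed
# in batches instead of being materialized up front with .to_table().
# write_deltalake(remote_uri, reader, mode="append", storage_options=storage_options)
```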

I had to upgrade pyarrow to 17.0.0, which led to a dependency conflict with pylance:
[screenshot: Poetry dependency conflict between pyarrow 17.0.0 and pylance]

I don't know how to elegantly work around this conflict. Poetry does not support dependency overriding: python-poetry/poetry#697.

This problem will soon solve itself: pylance recently loosened its pyarrow dependency (lancedb/lance@ac3f75a). That change hasn't been released yet, but pylance releases pretty often, so I expect a release soon.

I disabled the lancedb dependency in pyproject.toml to be able to make progress. Of course, the lancedb-dependent pipelines are now failing on CI, so this can't be merged as is.

@sh-rp can you:

  • review
  • suggest an approach: wait for pylance release or do something else

@jorritsandbrink (Collaborator, Author) commented:

The part about pyarrow versioning in my comment above no longer applies. As discussed with @rudolfix on Slack, we don't upgrade pyarrow in pyproject.toml. Instead, we expect the user to run dlt in an environment with pyarrow>=17.0.0 when they're using the delta table format on the filesystem destination. This requirement is asserted at runtime, and a DependencyVersionException is raised if it is not satisfied.
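A minimal sketch of what such a runtime check could look like; the helper name is hypothetical, and only DependencyVersionException comes from the comment above (the sketch raises a plain RuntimeError instead):

```python
from importlib.metadata import version
from packaging.version import Version

def ensure_pyarrow_for_delta() -> None:
    # Hypothetical helper: dlt raises DependencyVersionException in this situation.
    installed = Version(version("pyarrow"))
    if installed < Version("17.0.0"):
        raise RuntimeError(
            f"delta table format on the filesystem destination requires pyarrow>=17.0.0, found {installed}"
        )
```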

I had to extend the test setup a bit:

  • delta tests are now marked as needspyarrow17
  • needspyarrow17 tests are automatically skipped if pyarrow<17.0.0 (a sketch of this skip mechanism follows the list)
  • a new GitHub workflow, test_pyarrow17, explicitly installs pyarrow==17.0.0 to make sure the needspyarrow17 tests run on CI

@rudolfix rudolfix merged commit b07dddc into devel Aug 1, 2024
53 of 54 checks passed
@rudolfix rudolfix deleted the fix/1613-empty-table-delta-table-formatfilesystem branch August 1, 2024 11:15
Successfully merging this pull request may close these issues.

Can't load empty table when using delta table format on filesystem destination