Ability to pass empty Arrow tables/datasets to write_deltalake with rust engine #2686

Open
jorritsandbrink opened this issue Jul 19, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@jorritsandbrink

Description

It's currently not possible to pass empty Arrow tables or datasets to write_deltalake when using the rust engine.

Write empty Arrow table:

import pyarrow as pa
from deltalake import write_deltalake

arrow_table = pa.Table.from_pydict(
    {"foo": [1, 2], "bar": [True, False]}
)
empty_arrow_table = arrow_table.schema.empty_table()

write_deltalake("my_delta_table", arrow_table, mode="append", engine="rust")  # this creates the Delta table
write_deltalake("my_delta_table", empty_arrow_table, mode="append", engine="rust")  # this errors (but works with `pyarrow` engine)

Error:

DeltaError: Generic error: No data source supplied to write command.

Write empty dataset:

import pyarrow.dataset  # explicit import so that `pa.dataset` is available
import pyarrow.parquet as pq

pq.write_table(empty_arrow_table, "my_empty_parquet_file.parquet")
empty_arrow_dataset = pa.dataset.dataset("my_empty_parquet_file.parquet")
write_deltalake("my_delta_table", empty_arrow_dataset, mode="append", engine="rust")  # this errors on both `pyarrow` and `rust` engine

Error (with the rust engine; the pyarrow engine raises a different error):

PanicException: called `Result::unwrap()` on an `Err` value: CDataInterface("Index error: list index out of range. Detail: Python exception: IndexError")

Use Case
I now use this flow to handle an empty table:

if arrow_table.num_rows > 0:
    write_deltalake(...)

And this flow to handle an empty dataset:

arrow_table = arrow_dataset.to_table()

if arrow_table.num_rows > 0:
    write_deltalake(...)

The dataset case is particularly unpleasant, because you need to eagerly materialize the dataset to a table in memory just to check if it's empty.

It would be nice if we could simply use

write_deltalake(..., data=potentially_empty_arrow_table_or_dataset, ...)

and leave handling of the "empty case" to delta-rs.
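Until that is supported, a workaround along the following lines might avoid materializing a dataset just to check for emptiness. This is only a sketch under assumptions: `is_empty` and `write_if_nonempty` are hypothetical helper names (not part of deltalake), and the check relies on Arrow tables exposing `num_rows` and Arrow datasets exposing `count_rows()`, which counts rows without building an in-memory table. Inputs that expose neither (for example a record batch reader) are treated as non-empty and passed through unchanged.

```python
def is_empty(data):
    """Return True only when `data` is known to contain zero rows.

    Hypothetical helper: duck-typed so it works for Arrow tables
    (`num_rows` attribute) and Arrow datasets (`count_rows()` method).
    """
    num_rows = getattr(data, "num_rows", None)
    if isinstance(num_rows, int):
        return num_rows == 0
    count_rows = getattr(data, "count_rows", None)
    if callable(count_rows):
        return count_rows() == 0
    # Unknown input type: assume it may contain data.
    return False


def write_if_nonempty(write_fn, table_uri, data, **kwargs):
    """Call `write_fn` (e.g. `write_deltalake`) unless `data` is empty.

    Returns True if the write was attempted, False if it was skipped.
    """
    if is_empty(data):
        return False
    write_fn(table_uri, data, **kwargs)
    return True
```

With deltalake installed, usage would look like `write_if_nonempty(write_deltalake, "my_delta_table", potentially_empty_arrow_table_or_dataset, mode="append", engine="rust")`. This only hides the check on the caller's side; it does not change how delta-rs itself treats empty input.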

Related Issue(s)

@jorritsandbrink jorritsandbrink added the enhancement New feature or request label Jul 19, 2024
@sherlockbeard
Contributor

Can you share a full code example for the pyarrow dataset case?

@jorritsandbrink
Author

@sherlockbeard yes, here it is:

import pyarrow as pa
import pyarrow.dataset  # explicit import so that `pa.dataset` is available
import pyarrow.parquet as pq
from deltalake import write_deltalake

arrow_table = pa.Table.from_pydict(
    {"foo": [1, 2], "bar": [True, False]}
)
empty_arrow_table = arrow_table.schema.empty_table()

pq.write_table(empty_arrow_table, "my_empty_parquet_file.parquet")
empty_arrow_dataset = pa.dataset.dataset("my_empty_parquet_file.parquet")

write_deltalake("my_delta_table", arrow_table, mode="append", engine="rust")  # this creates the Delta table
write_deltalake("my_delta_table", empty_arrow_dataset, mode="append", engine="rust")  # this errors on both `pyarrow` and `rust` engine

ion-elgreco pushed a commit that referenced this issue Jul 21, 2024
# Description
Part of #2686: fix writing an empty Arrow dataset with the pyarrow engine.

# Related Issue(s)
part of #2686 

@ion-elgreco ion-elgreco self-assigned this Aug 19, 2024