Cannot read delta table partitioned by a date typed column #563

burakyilmaz321 · 2022-02-25T11:50:53Z

Environment

Delta-rs version: 0.5.6

Binding: python

Environment: any

Cloud provider: any
OS: any
Other: n/a

Bug

What happened: Cannot read delta table partitioned by a date type column.

What you expected to happen: It should be able to read tables partitioned by date.

How to reproduce it:

Generate fake data and save it

from pyspark.sql import functions as F

df = spark.range(5).withColumn("date", F.lit("2021-01-01").cast("date"))
df.write.partitionBy("date").format("delta").save("partitioned_df")
print(df.schema())
# StructType(List(StructField(id,LongType,false),StructField(date,DateType,true)))

Read this with deltalake

from deltalake import DeltaTable

DeltaTable("partitioned_df").to_pandas()

This raises ArrowNotImplementedError: Unsupported cast from string to date32 using function cast_date32

Full trace:

ArrowNotImplementedError                  Traceback (most recent call last)
/Users/buraky/delta/sketch.ipynb Cell 29' in <module>
----> 1 DeltaTable("partitioned_df").to_pandas()

File ~/Environments/delta/lib/python3.8/site-packages/deltalake/table.py:321, in DeltaTable.to_pandas(self, partitions, columns, filesystem)
    307 def to_pandas(
    308     self,
    309     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    310     columns: Optional[List[str]] = None,
    311     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    312 ) -> "pandas.DataFrame":
    313     """
    314     Build a pandas dataframe using data from the DeltaTable.
    315 
   (...)
    319     :return: a pandas dataframe
    320     """
--> 321     return self.to_pyarrow_table(
    322         partitions=partitions, columns=columns, filesystem=filesystem
    323     ).to_pandas()

File ~/Environments/delta/lib/python3.8/site-packages/deltalake/table.py:303, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
    289 def to_pyarrow_table(
    290     self,
    291     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    292     columns: Optional[List[str]] = None,
    293     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    294 ) -> pyarrow.Table:
    295     """
    296     Build a PyArrow Table using data from the DeltaTable.
    297 
   (...)
    301     :return: the PyArrow table
    302     """
--> 303     return self.to_pyarrow_dataset(
    304         partitions=partitions, filesystem=filesystem
    305     ).to_table(columns=columns)

File ~/Environments/delta/lib/python3.8/site-packages/pyarrow/_dataset.pyx:323, in pyarrow._dataset.Dataset.to_table()

File ~/Environments/delta/lib/python3.8/site-packages/pyarrow/_dataset.pyx:2311, in pyarrow._dataset.Scanner.to_table()

File ~/Environments/delta/lib/python3.8/site-packages/pyarrow/error.pxi:143, in pyarrow.lib.pyarrow_internal_check_status()

File ~/Environments/delta/lib/python3.8/site-packages/pyarrow/error.pxi:120, in pyarrow.lib.check_status()

ArrowNotImplementedError: Unsupported cast from string to date32 using function cast_date32

More details:

As the exception message says, it seems like there is no implementation for string to date32 in arrow. I checked arrow and saw that it eventually calls this one I guess, and string to date32 casting is not implemented.

Question: Is this a concern for deltalake project, or should this be handled within pyarrow?

The text was updated successfully, but these errors were encountered:

wjones127 · 2022-02-25T19:28:25Z

@burakyilmaz321 Thanks for reporting this. I think we should be able to fix this in delta-rs. PyArrow does seem capable of parsing the partition column as a date:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> from datetime import date
>>> from tempfile import mkdtemp
>>>
>>> tab = pa.table({
...     'd': pa.array([date(2020, 1, x) for x in range(1, 4)]),
...     'x': pa.array([1, 2, 3]),
... })
>>>
>>> tmp_dir = mkdtemp()
>>> part = ds.partitioning(
...     pa.schema([("d", pa.date32())]), flavor="hive"
... )
>>> ds.write_dataset(tab, tmp_dir, partitioning=part, format='parquet')
>>> ds.dataset(tmp_dir, partitioning="hive").to_table()
pyarrow.Table
x: int64
d: string
----
x: [[1],[2],[3]]
d: [["2020-01-01"],["2020-01-02"],["2020-01-03"]]
>>> ds.dataset(tmp_dir, partitioning=part).to_table()
pyarrow.Table
x: int64
d: date32[day]
----
x: [[1],[2],[3]]
d: [[2020-01-01],[2020-01-02],[2020-01-03]]

burakyilmaz321 · 2022-02-25T21:29:41Z

@wjones127 I made a patch here and it works for me. I can create a PR if you are interested.

wjones127 · 2022-02-25T22:21:19Z

@burakyilmaz321 That looks like it will solve your case, but will break the pass through of file statistics to datasets. (In particular, I think this test will fail with this change: https://github.com/delta-io/delta-rs/blob/main/python/tests/test_table_read.py#L148).

I will likely look at fixing this issue this weekend.

wjones127 · 2022-02-26T17:19:50Z

@burakyilmaz321 Thanks for reporting this! 🙌

It looks like this was a regression introduced in 0.5.5; if you downgrade to 0.5.4 you should be able to read the table. I have an open PR to fix this and add tests to prevent this issue from coming up again.

burakyilmaz321 · 2022-02-26T19:40:41Z

Great news! Thanks ✋

juancruzruizdiaz · 2023-08-29T01:57:02Z

Hi! I am facing exactly the same issue on version 0.10.1. Does anybody knows why they have removed this fix from version 0.5.5? Thank you! 🙏

wjones127 · 2023-08-29T05:00:40Z

Does anybody knows why they have removed this fix from version 0.5.5?

Are you using PyArrow 13.0.0? There was a regression in the PyArrow library that may cause this to fail.

apache/arrow#37411

Working on fixing those soon: #1602

juancruzruizdiaz · 2023-08-29T13:00:56Z

I had downgrade pyarrow to 12.0.0 and it works.
Thank you! @wjones127

burakyilmaz321 added the bug Something isn't working label Feb 25, 2022

wjones127 mentioned this issue Feb 26, 2022

Parse partition values before handing to PyArrow #565

Merged

fvaleye closed this as completed in #565 Mar 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cannot read delta table partitioned by a date typed column #563

Cannot read delta table partitioned by a date typed column #563

burakyilmaz321 commented Feb 25, 2022

wjones127 commented Feb 25, 2022

burakyilmaz321 commented Feb 25, 2022

wjones127 commented Feb 25, 2022

wjones127 commented Feb 26, 2022

burakyilmaz321 commented Feb 26, 2022

juancruzruizdiaz commented Aug 29, 2023

wjones127 commented Aug 29, 2023 •

edited

Loading

juancruzruizdiaz commented Aug 29, 2023

Cannot read delta table partitioned by a date typed column #563

Cannot read delta table partitioned by a date typed column #563

Comments

burakyilmaz321 commented Feb 25, 2022

Environment

Bug

wjones127 commented Feb 25, 2022

burakyilmaz321 commented Feb 25, 2022

wjones127 commented Feb 25, 2022

wjones127 commented Feb 26, 2022

burakyilmaz321 commented Feb 26, 2022

juancruzruizdiaz commented Aug 29, 2023

wjones127 commented Aug 29, 2023 • edited Loading

juancruzruizdiaz commented Aug 29, 2023

wjones127 commented Aug 29, 2023 •

edited

Loading