Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot read delta table partitioned by a date typed column #563

Closed
burakyilmaz321 opened this issue Feb 25, 2022 · 8 comments · Fixed by #565
Closed

Cannot read delta table partitioned by a date typed column #563

burakyilmaz321 opened this issue Feb 25, 2022 · 8 comments · Fixed by #565
Labels
bug Something isn't working

Comments

@burakyilmaz321
Copy link

Environment

Delta-rs version: 0.5.6

Binding: python

Environment: any

  • Cloud provider: any
  • OS: any
  • Other: n/a

Bug

What happened: Cannot read delta table partitioned by a date type column.

What you expected to happen: It should be able to read tables partitioned by date.

How to reproduce it:

Generate fake data and save it

from pyspark.sql import functions as F

df = spark.range(5).withColumn("date", F.lit("2021-01-01").cast("date"))
df.write.partitionBy("date").format("delta").save("partitioned_df")
print(df.schema())
# StructType(List(StructField(id,LongType,false),StructField(date,DateType,true)))

Read this with deltalake

from deltalake import DeltaTable

DeltaTable("partitioned_df").to_pandas()

This raises ArrowNotImplementedError: Unsupported cast from string to date32 using function cast_date32

Full trace:

ArrowNotImplementedError                  Traceback (most recent call last)
/Users/buraky/delta/sketch.ipynb Cell 29' in <module>
----> 1 DeltaTable("partitioned_df").to_pandas()

File ~/Environments/delta/lib/python3.8/site-packages/deltalake/table.py:321, in DeltaTable.to_pandas(self, partitions, columns, filesystem)
    307 def to_pandas(
    308     self,
    309     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    310     columns: Optional[List[str]] = None,
    311     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    312 ) -> "pandas.DataFrame":
    313     """
    314     Build a pandas dataframe using data from the DeltaTable.
    315 
   (...)
    319     :return: a pandas dataframe
    320     """
--> 321     return self.to_pyarrow_table(
    322         partitions=partitions, columns=columns, filesystem=filesystem
    323     ).to_pandas()

File ~/Environments/delta/lib/python3.8/site-packages/deltalake/table.py:303, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
    289 def to_pyarrow_table(
    290     self,
    291     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    292     columns: Optional[List[str]] = None,
    293     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    294 ) -> pyarrow.Table:
    295     """
    296     Build a PyArrow Table using data from the DeltaTable.
    297 
   (...)
    301     :return: the PyArrow table
    302     """
--> 303     return self.to_pyarrow_dataset(
    304         partitions=partitions, filesystem=filesystem
    305     ).to_table(columns=columns)

File ~/Environments/delta/lib/python3.8/site-packages/pyarrow/_dataset.pyx:323, in pyarrow._dataset.Dataset.to_table()

File ~/Environments/delta/lib/python3.8/site-packages/pyarrow/_dataset.pyx:2311, in pyarrow._dataset.Scanner.to_table()

File ~/Environments/delta/lib/python3.8/site-packages/pyarrow/error.pxi:143, in pyarrow.lib.pyarrow_internal_check_status()

File ~/Environments/delta/lib/python3.8/site-packages/pyarrow/error.pxi:120, in pyarrow.lib.check_status()

ArrowNotImplementedError: Unsupported cast from string to date32 using function cast_date32

More details:

As the exception message says, it seems like there is no implementation for string to date32 in arrow. I checked arrow and saw that it eventually calls this one I guess, and string to date32 casting is not implemented.

Question: Is this a concern for deltalake project, or should this be handled within pyarrow?

@burakyilmaz321 burakyilmaz321 added the bug Something isn't working label Feb 25, 2022
@wjones127
Copy link
Collaborator

@burakyilmaz321 Thanks for reporting this. I think we should be able to fix this in delta-rs. PyArrow does seem capable of parsing the partition column as a date:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> from datetime import date
>>> from tempfile import mkdtemp
>>>
>>> tab = pa.table({
...     'd': pa.array([date(2020, 1, x) for x in range(1, 4)]),
...     'x': pa.array([1, 2, 3]),
... })
>>>
>>> tmp_dir = mkdtemp()
>>> part = ds.partitioning(
...     pa.schema([("d", pa.date32())]), flavor="hive"
... )
>>> ds.write_dataset(tab, tmp_dir, partitioning=part, format='parquet')
>>> ds.dataset(tmp_dir, partitioning="hive").to_table()
pyarrow.Table
x: int64
d: string
----
x: [[1],[2],[3]]
d: [["2020-01-01"],["2020-01-02"],["2020-01-03"]]
>>> ds.dataset(tmp_dir, partitioning=part).to_table()
pyarrow.Table
x: int64
d: date32[day]
----
x: [[1],[2],[3]]
d: [[2020-01-01],[2020-01-02],[2020-01-03]]

@burakyilmaz321
Copy link
Author

@wjones127 I made a patch here and it works for me. I can create a PR if you are interested.

@wjones127
Copy link
Collaborator

@burakyilmaz321 That looks like it will solve your case, but will break the pass through of file statistics to datasets. (In particular, I think this test will fail with this change: https://github.com/delta-io/delta-rs/blob/main/python/tests/test_table_read.py#L148).

I will likely look at fixing this issue this weekend.

@wjones127
Copy link
Collaborator

@burakyilmaz321 Thanks for reporting this! 🙌

It looks like this was a regression introduced in 0.5.5; if you downgrade to 0.5.4 you should be able to read the table. I have an open PR to fix this and add tests to prevent this issue from coming up again.

@burakyilmaz321
Copy link
Author

Great news! Thanks ✋

@juancruzruizdiaz
Copy link

Hi! I am facing exactly the same issue on version 0.10.1. Does anybody knows why they have removed this fix from version 0.5.5? Thank you! 🙏

@wjones127
Copy link
Collaborator

wjones127 commented Aug 29, 2023

Does anybody knows why they have removed this fix from version 0.5.5?

Are you using PyArrow 13.0.0? There was a regression in the PyArrow library that may cause this to fail.

apache/arrow#37411

Working on fixing those soon: #1602

@juancruzruizdiaz
Copy link

I had downgrade pyarrow to 12.0.0 and it works.
Thank you! @wjones127

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants