Could not read partition values correctly with special characters in partition column name #495
Comments
@zijie0 could you check if
@houqp
I think we pass the correct file paths and schema to Pyarrow, but the partition column values are inferred by Pyarrow (see delta-rs/python/deltalake/table.py, Lines 282 to 288 in 36d56ec).
I'm not sure if we need to modify the partitioning parameter in this situation.
Looking at the upstream arrow code, I think it's only doing URL decoding for the partition value, but not the column name: https://github.com/apache/arrow/blob/e5f3e04b4b80c9b9c53f1f0f71f39d9f8308dced/cpp/src/arrow/dataset/partition.cc#L593-L596. @zijie0 can you try creating the dataset without providing the pyarrow schema to confirm whether this is the case?

Ideally, we shouldn't depend on the file path for partition discovery, because the Delta spec doesn't care how file paths are generated. In theory, a writer could use a random value as the file path without any partition value encoding. It's the Delta table reader's responsibility to populate the partition column info based on the partition values from the corresponding add action. That would be more of a long-term fix for this problem.
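The asymmetry described above can be mimicked in a few lines of plain Python. This is a sketch of the assumed parsing behavior, not arrow's actual code: only the value side of a `key=value` path segment is percent-decoded, so an encoded column name never matches the schema field.

```python
# Minimal sketch (assumed behavior, not arrow's actual code): Hive-style
# partition parsing that percent-decodes the value but not the column name.
from urllib.parse import unquote

def parse_partition_segment(segment: str) -> tuple[str, str]:
    # Split "name=value" and decode only the value, mirroring the
    # arrow code linked above.
    key, _, value = segment.partition("=")
    return key, unquote(value)

# A column literally named "x%20x" would appear in the path as "x%2520x"
# ("%" itself encoded as "%25"), so the parsed key never matches the
# schema field:
key, value = parse_partition_segment("x%2520x=a%20b")
print(key)    # "x%2520x" stays encoded, mismatching the field "x%20x"
print(value)  # "a b" is decoded
```

The hypothetical `parse_partition_segment` helper here only illustrates the decode-value-but-not-key behavior; the real parsing lives in the C++ code linked above.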
@houqp you are right, arrow doesn't decode the column name:
Thanks @zijie0 for the deep dive :) Unfortunately, there is no quick and easy way to fix this. The two options that come to mind are:
At the very minimum, I think we should file an upstream arrow GitHub issue :)
@houqp I just tested the case on arrow version 7.0.0, and it is fixed by apache/arrow#11858. Thanks for following up on this issue.
I am having a similar issue with special characters in Azure Data Lake Gen2 on Databricks. In the code below, pyarrow encodes spaces in the partition value paths as `%20` (for example, `column%20value`). Running this:

```python
from deltalake import DeltaTable

dt = DeltaTable('file:///dbfs/<path-to-delta-table>')
df = dt.to_pyarrow_table(partitions=[('key', '=', 'spaced value')])
```

results in:

```
PyDeltaTableError: Object at location /dbfs/<path-to-table>/key=spaced%20value/<parquet-name>.parquet not found: No such file or directory (os error 2)
```

Any help would be appreciated. delta-rs==0.6.1
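The `%20` in the error path can be reproduced with plain Python's URL quoting. This is a sketch of the assumed mismatch, not delta-rs internals: the directory on disk holds a literal space, while the lookup path is percent-encoded.

```python
# Sketch of the mismatch above (assumed mechanics, not delta-rs internals):
# the directory on disk holds a literal space, but the lookup path is
# percent-encoded, so the object store reports the file as not found.
from urllib.parse import quote

actual_dir = "key=spaced value"              # directory as written on dbfs
looked_up = f"key={quote('spaced value')}"   # path used for the lookup
print(looked_up)                 # key=spaced%20value
print(actual_dir == looked_up)   # False -> "No such file or directory"
```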
After doing a deep dive into the docs and the codebases of deltalake and arrow, I found a very simple solution: just pass a local pyarrow filesystem instance. Sharing below in case it helps anyone else.

```python
from pyarrow import fs
from deltalake import DeltaTable

local = fs.LocalFileSystem()
dt = DeltaTable('file:///dbfs/<path-to-delta-table>')
df = dt.to_pyarrow_table(partitions=[('key', '=', 'spaced value')], filesystem=local)
```
Environment
Delta-rs version: 0.4.1
Binding: Python
Environment:
Bug
What happened:
When reading partitioned tables with special characters in partition column name, the partition column would get NaN values.
What you expected to happen:
Should be able to read these tables and get the same results as native Spark.
How to reproduce it:
Create a partitioned table with special character in the partition column name.
Then try to read the table:
It seems that we are leveraging pyarrow to handle the partition folders, but Delta generates file paths differently than Hive does.
We can see from _delta_log:
The field name is 'x%20x'. But the file path generated by Delta is:
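The encoding step can be illustrated with Python's urllib. This is an illustration under the assumption that Delta percent-encodes partition path segments the way `urllib.parse.quote` does; the exact writer behavior may differ.

```python
# Illustration (assumption: Delta percent-encodes partition path segments
# roughly the way urllib.parse.quote does).
from urllib.parse import quote

field = "x%20x"  # partition column name as it appears in the schema
segment = f"{quote(field, safe='')}=2021-01-01"
print(segment)   # x%2520x=2021-01-01 -- the "%" itself is encoded as "%25"
```

A reader that compares the raw path segment `x%2520x` against the schema field `x%20x` without decoding will find no match, which is consistent with the partition column coming back as NaN.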
Could anyone take a look? Thanks.