
Could not read partition values correctly with special characters in partition column name #495

Closed
zijie0 opened this issue Nov 11, 2021 · 8 comments
Labels
bug Something isn't working

Comments

@zijie0
Contributor

zijie0 commented Nov 11, 2021

Environment

Delta-rs version: 0.4.1

Binding: Python

Environment:

  • Cloud provider: N/A
  • OS: Windows/Ubuntu

Bug

What happened:

When reading a partitioned table with special characters in the partition column name, the partition column gets NaN values.

What you expected to happen:

Reading these tables should succeed and return the same results as native Spark.

How to reproduce it:

Create a partitioned table with a special character in the partition column name:

sdf = spark.createDataFrame([[1, 10], [2, 20]], schema=['x%20x', 'y'])
sdf.write.format("delta").partitionBy(["x%20x"]).save("partition_table")
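For reference, Spark itself reads the table back correctly (output reconstructed from the data above):

spark.read.format("delta").load("partition_table").show()
# +-----+---+
# |x%20x|  y|
# +-----+---+
# |    1| 10|
# |    2| 20|
# +-----+---+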

Then try to read the table:

In [1]: from deltalake import DeltaTable

In [2]: dt = DeltaTable('partition_table')

In [3]: dt.to_pandas()
Out[3]:
   x%20x   y
0    NaN  10
1    NaN  20

It seems that we are leveraging pyarrow to handle the partition folders, but does Delta generate the file paths differently than Hive?

We can see from _delta_log:

{"metaData":{"id":"7e6ec837-c032-4cec-adab-d2299042e622","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"x%20x\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"y\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":["x%20x"],"configuration":{},"createdTime":1636641415873}}

The field name is 'x%20x'. But the file path generated by Delta is:

├─x%2520x=1
├─x%2520x=2
└─_delta_log

Could anyone take a look? Thanks.

@zijie0 zijie0 added the bug Something isn't working label Nov 11, 2021
@houqp
Member

houqp commented Nov 14, 2021

@zijie0 could you check whether dt.pyarrow_schema contains the right partition column name? The path generated by Spark looks correct to me. The path needs to be URI-encoded, which is why %20 is saved as %2520. Pyarrow's hive partitioning implementation does URI decoding by default as well: https://github.com/apache/arrow/blob/caf1e1ed459e2c6ff66255fc915a5a7b16c0f5c5/python/pyarrow/_dataset.pyx#L2283.
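For illustration, one round of URI encoding is enough to turn the literal %20 in the column name into %2520, because the % character itself gets escaped:

from urllib.parse import quote, unquote

quote("x%20x")      # 'x%2520x' -- '%' is escaped to '%25'
unquote("x%2520x")  # 'x%20x'   -- one decode round-trips correctly
unquote("x%20x")    # 'x x'     -- decoding a second time would over-decode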

@zijie0
Contributor Author

zijie0 commented Nov 14, 2021

@houqp dt.pyarrow_schema contains the correct partition column name:

In [10]: dt.pyarrow_schema()
Out[10]:
x%20x: int64
y: int64

I think we pass the correct file paths and schema to pyarrow, but the partition column values are inferred by pyarrow:

return dataset(
    file_paths,
    schema=self.pyarrow_schema(),
    format="parquet",
    filesystem=filesystem,
    partitioning=partitioning(flavor="hive"),
)

I'm not sure if we need to do some modification to the partitioning parameter in this situation.
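One pyarrow-level workaround might be to declare the partition field under its encoded on-disk name and rename it afterwards; a sketch (not verified against the DeltaTable code path, names taken from the repro above):

import pyarrow as pa
from pyarrow.dataset import dataset, partitioning

# Declare the partition field under the name that actually appears on disk.
part = partitioning(pa.schema([("x%2520x", pa.int64())]), flavor="hive")
# pyarrow skips '_'-prefixed directories like _delta_log by default.
ds = dataset("partition_table", format="parquet", partitioning=part)

# Rename the column back to the logical name after reading.
tbl = ds.to_table()
tbl = tbl.rename_columns(
    ["x%20x" if name == "x%2520x" else name for name in tbl.column_names]
)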

@houqp
Member

houqp commented Nov 15, 2021

Looking at the upstream arrow code (https://github.com/apache/arrow/blob/e5f3e04b4b80c9b9c53f1f0f71f39d9f8308dced/cpp/src/arrow/dataset/partition.cc#L593-L596), I think it only URL-decodes the partition value, not the column name?

@zijie0 can you try creating the dataset without providing the pyarrow schema to confirm whether this is the case?

Ideally, we shouldn't depend on the file path for partition discovery, because the delta spec doesn't care how the file paths are generated. In theory, a writer could use a random value as the file path without any partition value encoding. It's the delta table reader's responsibility to populate the partition columns based on the partition value info from the corresponding add action. That would be more of a long-term fix for this problem.
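A minimal sketch of that reader-side idea (assumptions: local filesystem, JSON commits only with no checkpoints, remove actions ignored, and partition values kept as strings):

import json
import os
from urllib.parse import unquote

import pyarrow as pa
import pyarrow.parquet as pq

def read_with_partition_values(table_path):
    # Collect add actions from the JSON commit files; each one carries the
    # partition values for its data file, so the file path never needs parsing.
    adds = []
    log_dir = os.path.join(table_path, "_delta_log")
    for name in sorted(os.listdir(log_dir)):
        if name.endswith(".json"):
            with open(os.path.join(log_dir, name)) as f:
                for line in f:
                    action = json.loads(line)
                    if "add" in action:
                        adds.append(action["add"])

    pieces = []
    for add in adds:
        # 'path' is percent-encoded per the delta spec, so decode it once.
        tbl = pq.read_table(os.path.join(table_path, unquote(add["path"])))
        # Attach each partition column as a constant column; a real reader
        # would also cast the string values to the schema's types.
        for col, value in add["partitionValues"].items():
            tbl = tbl.append_column(col, pa.array([value] * len(tbl)))
        pieces.append(tbl)
    return pa.concat_tables(pieces)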

@zijie0
Contributor Author

zijie0 commented Nov 16, 2021

@houqp you are right, arrow doesn't decode the column name:

In [3]: ds = dataset(paths, format="parquet", partitioning=partitioning(flavor="hive"))

In [4]: ds.to_table().to_pandas()
Out[4]:
    y  x%2520x
0  20        2
1  10        1

@houqp
Member

houqp commented Nov 16, 2021

Thanks @zijie0 for the deep dive :)

Unfortunately, there is no quick and easy way to fix this. The two options that come to mind are:

  1. Fix this in upstream pyarrow by adding partition name URL decoding to its C++ core.
  2. Do it properly in delta-rs by performing the IO and arrow record batch conversion entirely in Rust, where we will be able to populate partition values from table metadata instead of relying on file paths.

At the very minimum, I think we should file an upstream arrow github issue :)

@zijie0
Contributor Author

zijie0 commented Feb 24, 2022

@houqp I just tested this case on arrow 7.0.0; it is fixed by apache/arrow#11858. Thanks for following up on this issue.
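A quick local check of the fixed behavior (directory names recreate the layout from this issue):

import pathlib

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.dataset import dataset, partitioning

root = pathlib.Path("partition_table_check")
for value in (1, 2):
    part_dir = root / f"x%2520x={value}"  # 'x%20x' with its '%' percent-encoded
    part_dir.mkdir(parents=True, exist_ok=True)
    pq.write_table(pa.table({"y": [value * 10]}), part_dir / "part-0.parquet")

ds = dataset(str(root), format="parquet", partitioning=partitioning(flavor="hive"))
print(ds.to_table().to_pandas())  # on pyarrow >= 7.0.0 the column reads back as 'x%20x'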

@zijie0 zijie0 closed this as completed Feb 24, 2022
@rando-brando

rando-brando commented Sep 25, 2022

I am having a similar issue with special characters in Azure Data Lake Gen2 on Databricks.

In the code below, pyarrow encodes spaces in the partition value paths as %20, for example column%20value.
If I pass the encoded value from the error message (with %20) to the partitions argument, the result is an empty arrow table.
I would like to be able to resolve the encoding so I can use the package.
Note: it appears that the dbfs filesystem is interpreted as local rather than Azure Gen2.

Running this:

from deltalake import DeltaTable
dt = DeltaTable('file:///dbfs/<path-to-delta-table>')
df = dt.to_pyarrow_table(partitions=[('key', '=', 'spaced value')])

Results in:

PyDeltaTableError: Object at location /dbfs/<path-to-table>/key=spaced%20value/<parquet-name>.parquet not found: No such file or directory (os error 2).

Any help would be appreciated.

delta-rs==0.6.1
pyarrow==9.0.0

@rando-brando

rando-brando commented Sep 26, 2022

After doing a deep dive into the docs and codebases of deltalake and arrow, I found a very simple solution: just pass a local pyarrow filesystem instance. Sharing below in case it helps anyone else.

from pyarrow import fs
from deltalake import DeltaTable

local = fs.LocalFileSystem()

dt = DeltaTable('file:///dbfs/<path-to-delta-table>')
df = dt.to_pyarrow_table(partitions=[('key', '=', 'spaced value')], filesystem=local)
