
Could not read partition values correctly with special characters in partition column name #495

Closed
zijie0 opened this issue Nov 11, 2021 · 8 comments
Labels
bug Something isn't working

Comments

@zijie0
Contributor

zijie0 commented Nov 11, 2021

Environment

Delta-rs version: 0.4.1

Binding: Python

Environment:

  • Cloud provider: N/A
  • OS: Windows/Ubuntu

Bug

What happened:

When reading a partitioned table with special characters in the partition column name, the partition column gets NaN values.

What you expected to happen:

Reading these tables should succeed and return the same results as native Spark.

How to reproduce it:

Create a partitioned table with a special character in the partition column name:

sdf = spark.createDataFrame([[1, 10], [2, 20]], schema=['x%20x', 'y'])
sdf.write.format("delta").partitionBy(["x%20x"]).save("partition_table")
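For reference, Spark itself reads the table back correctly (output reconstructed from the data above):

spark.read.format("delta").load("partition_table").show()
# +-----+---+
# |x%20x|  y|
# +-----+---+
# |    1| 10|
# |    2| 20|
# +-----+---+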

Then try to read the table:

In [1]: from deltalake import DeltaTable

In [2]: dt = DeltaTable('partition_table')

In [3]: dt.to_pandas()
Out[3]:
   x%20x   y
0    NaN  10
1    NaN  20

It seems that we are leveraging pyarrow to handle the partition folders, but does Delta generate the file paths differently than Hive?

We can see from _delta_log:

{"metaData":{"id":"7e6ec837-c032-4cec-adab-d2299042e622","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"x%20x\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"y\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":["x%20x"],"configuration":{},"createdTime":1636641415873}}

The field name is 'x%20x'. But the file path generated by Delta is:

├─x%2520x=1
├─x%2520x=2
└─_delta_log

Could anyone take a look? Thanks.

@zijie0 zijie0 added the bug Something isn't working label Nov 11, 2021
@houqp
Member

houqp commented Nov 14, 2021

@zijie0 could you check whether dt.pyarrow_schema contains the right partition column name? The path generated by Spark looks correct to me. The path needs to be URI-encoded, which is why %20 is saved as %2520. Pyarrow's hive partitioning implementation does URI decoding by default as well: https://github.com/apache/arrow/blob/caf1e1ed459e2c6ff66255fc915a5a7b16c0f5c5/python/pyarrow/_dataset.pyx#L2283.
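For illustration, one round of URI encoding is enough to turn the literal %20 in the column name into %2520, because the % character itself gets escaped:

from urllib.parse import quote, unquote

quote("x%20x")      # 'x%2520x' -- '%' is escaped to '%25'
unquote("x%2520x")  # 'x%20x'   -- one decode round-trips correctly
unquote("x%20x")    # 'x x'     -- decoding a second time would over-decode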

@zijie0
Contributor Author

zijie0 commented Nov 14, 2021

@houqp dt.pyarrow_schema contains the correct partition column name:

In [10]: dt.pyarrow_schema()
Out[10]:
x%20x: int64
y: int64

I think we pass the correct file paths and schema to pyarrow, but the partition column values are inferred by pyarrow:

return dataset(
    file_paths,
    schema=self.pyarrow_schema(),
    format="parquet",
    filesystem=filesystem,
    partitioning=partitioning(flavor="hive"),
)

I'm not sure if we need to do some modification to the partitioning parameter in this situation.
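One pyarrow-level workaround might be to declare the partition field under its encoded on-disk name and rename it afterwards; a sketch (not verified against the DeltaTable code path, names taken from the repro above):

import pyarrow as pa
from pyarrow.dataset import dataset, partitioning

# Declare the partition field under the name that actually appears on disk.
part = partitioning(pa.schema([("x%2520x", pa.int64())]), flavor="hive")
# pyarrow skips '_'-prefixed directories like _delta_log by default.
ds = dataset("partition_table", format="parquet", partitioning=part)

# Rename the column back to the logical name after reading.
tbl = ds.to_table()
tbl = tbl.rename_columns(
    ["x%20x" if name == "x%2520x" else name for name in tbl.column_names]
)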

@houqp
Member

houqp commented Nov 15, 2021

Looking at the upstream arrow code (https://github.com/apache/arrow/blob/e5f3e04b4b80c9b9c53f1f0f71f39d9f8308dced/cpp/src/arrow/dataset/partition.cc#L593-L596), I think it only URL-decodes the partition value, not the column name?

@zijie0 can you try creating the dataset without providing the pyarrow schema to confirm whether this is the case?

Ideally, we shouldn't depend on the file path for partition discovery, because the delta spec doesn't care how the file paths are generated. In theory, a writer could use a random value as the file path without any partition value encoding. It's the delta table reader's responsibility to populate the partition columns based on the partition value info from the corresponding add action. That would be more of a long-term fix for this problem.
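A minimal sketch of that reader-side idea (assumptions: local filesystem, JSON commits only with no checkpoints, remove actions ignored, and partition values kept as strings):

import json
import os
from urllib.parse import unquote

import pyarrow as pa
import pyarrow.parquet as pq

def read_with_partition_values(table_path):
    # Collect add actions from the JSON commit files; each one carries the
    # partition values for its data file, so the file path never needs parsing.
    adds = []
    log_dir = os.path.join(table_path, "_delta_log")
    for name in sorted(os.listdir(log_dir)):
        if name.endswith(".json"):
            with open(os.path.join(log_dir, name)) as f:
                for line in f:
                    action = json.loads(line)
                    if "add" in action:
                        adds.append(action["add"])

    pieces = []
    for add in adds:
        # 'path' is percent-encoded per the delta spec, so decode it once.
        tbl = pq.read_table(os.path.join(table_path, unquote(add["path"])))
        # Attach each partition column as a constant column; a real reader
        # would also cast the string values to the schema's types.
        for col, value in add["partitionValues"].items():
            tbl = tbl.append_column(col, pa.array([value] * len(tbl)))
        pieces.append(tbl)
    return pa.concat_tables(pieces)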

@zijie0
Contributor Author

zijie0 commented Nov 16, 2021

@houqp you are right, arrow doesn't decode the column name:

In [3]: ds = dataset(paths, format="parquet", partitioning=partitioning(flavor="hive"))

In [4]: ds.to_table().to_pandas()
Out[4]:
    y  x%2520x
0  20        2
1  10        1

@houqp
Member

houqp commented Nov 16, 2021

Thanks @zijie0 for the deep dive :)

Unfortunately, there is no quick and easy way to fix this. The two options that come to mind are:

  1. Fix this in upstream pyarrow by adding partition name URL decoding to its C++ core.
  2. Do it properly in delta-rs by performing the IO and arrow record batch conversion entirely in Rust, where we will be able to populate partition values from table metadata instead of relying on file paths.

At the very minimum, I think we should file an upstream arrow github issue :)

@zijie0
Contributor Author

zijie0 commented Feb 24, 2022

@houqp I just tested this case on arrow 7.0.0; it is fixed by apache/arrow#11858. Thanks for following up on this issue.
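A quick local check of the fixed behavior (directory names recreate the layout from this issue):

import pathlib

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.dataset import dataset, partitioning

root = pathlib.Path("partition_table_check")
for value in (1, 2):
    part_dir = root / f"x%2520x={value}"  # 'x%20x' with its '%' percent-encoded
    part_dir.mkdir(parents=True, exist_ok=True)
    pq.write_table(pa.table({"y": [value * 10]}), part_dir / "part-0.parquet")

ds = dataset(str(root), format="parquet", partitioning=partitioning(flavor="hive"))
print(ds.to_table().to_pandas())  # on pyarrow >= 7.0.0 the column reads back as 'x%20x'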

@zijie0 zijie0 closed this as completed Feb 24, 2022
@rando-brando

rando-brando commented Sep 25, 2022

I am having a similar issue with special characters in Azure Data Lake Gen2 on Databricks.

In the code below, pyarrow encodes spaces in the partition value paths as %20, for example column%20value.
If I pass the encoded value from the error message (with %20) to the partitions argument, the result is an empty arrow table.
I would like to be able to resolve the encoding so I can use the package.
Note: it appears that the dbfs filesystem is interpreted as local rather than Azure Gen2.

Running this:

from deltalake import DeltaTable
dt = DeltaTable('file:///dbfs/<path-to-delta-table>')
df = dt.to_pyarrow_table(partitions=[('key', '=', 'spaced value')])

Results in:

PyDeltaTableError: Object at location /dbfs/<path-to-table>/key=spaced%20value/<parquet-name>.parquet not found: No such file or directory (os error 2).

Any help would be appreciated.

delta-rs==0.6.1
pyarrow==9.0.0

@rando-brando

rando-brando commented Sep 26, 2022

After doing a deep dive into the docs and codebases of deltalake and arrow, I found a very simple solution: just pass a local pyarrow filesystem instance. Sharing below in case it helps anyone else.

from pyarrow import fs
from deltalake import DeltaTable

local = fs.LocalFileSystem()

dt = DeltaTable('file:///dbfs/<path-to-delta-table>')
df = dt.to_pyarrow_table(partitions=[('key', '=', 'spaced value')], filesystem=local)
