to_pyarrow_dataset() ignores dictionary_columns in read_options #938

Closed
Kuhlwein opened this issue Nov 15, 2022 · 1 comment · Fixed by #941
Labels
bug Something isn't working

Comments

@Kuhlwein
Contributor

Environment

Delta-rs version: 0.6.3

Binding: python

Environment:

  • Cloud provider: Azure
  • OS: Windows/linux
  • Other:

Bug

When passing parquet_read_options to to_pyarrow_dataset, it is possible to use dictionary_columns to control which columns should be dictionary encoded as they are read.

What happened:
Such columns are not dictionary encoded when they are read. Instead they come back as type string, exactly as they would if dictionary_columns were empty.

What you expected to happen:
I expect the particular columns to be dictionary encoded.

How to reproduce it:

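import pyarrow.dataset as ds
from deltalake import DeltaTable

path = "path/to/delta_table"  # placeholder: an existing Delta table with a string column "test"

dt = DeltaTable(path)
# Ask the Parquet reader to dictionary encode the "test" column as it is read
read_options = ds.ParquetReadOptions(dictionary_columns=["test"])
data = dt.to_pyarrow_dataset(parquet_read_options=read_options)
table = data.to_table()
print(table.schema)  # "test" is reported as string instead of a dictionary type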

I expect the column "test" to be dictionary encoded, rather than just of type string.

More details:
I believe the problem is in table.py at line 338: the schema returned by self.schema.to_pyarrow() has the wrong type for the columns that are to be dictionary encoded.

Maybe it is possible to use the physical_schema property of the fragments defined just above to get the right schema, or otherwise parse the read options to modify the schema?

Kuhlwein added the bug label Nov 15, 2022
@wjones127
Collaborator

Delta tables have a specific schema, and when reading we always enforce that exact schema. Dictionary types are distinct types in Arrow, not just a minor detail of the array, so reading a column as a dictionary array would mean a different schema.

But perhaps we could parse the read_dictionary and have special handling for that. It does seem desirable to be able to read columns as dictionaries.

wjones127 pushed a commit that referenced this issue Nov 17, 2022

# Description
When passing `parquet_read_options` to `to_pyarrow_dataset` it is now
possible to use `dictionary_columns` to control which columns should be
dictionary encoded as they are read.

# Related Issue(s)
- closes #938 
