read_parquet(use_pyarrow=True) does not work for parquet datasets (directory) #6462

Closed
2 tasks done
karenkewitsch opened this issue Jan 26, 2023 · 4 comments
Labels: bug, python

Comments

@karenkewitsch

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Reading a partitioned Parquet dataset directory with read_parquet(use_pyarrow=True) worked up to 0.15.14, but it is now broken.

Reproducible example

from pathlib import Path

import pyarrow
import pyarrow.parquet as pq
import polars

parquet_path = Path(__file__).parent / "my_dataset"
test_table = pyarrow.table([["foo", "bar"], [1, 2]], names=["name", "age"])
pq.write_to_dataset(test_table, str(parquet_path), partition_cols=["age"])

polars_df = polars.read_parquet(str(parquet_path), use_pyarrow=True)

Expected behavior

I would expect it to behave as in previous versions and load all the partitions in the directory.

Installed versions

---Version info---
Polars: 0.15.17
Index type: UInt32
Platform: macOS-13.1-arm64-arm-64bit
Python: 3.10.9 (main, Jan 11 2023, 08:18:22) [Clang 14.0.0 (clang-1400.0.29.202)]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.23.5
fsspec: 2022.11.0
connectorx: <not installed>
xlsx2csv: <not installed>
deltalake: <not installed>
matplotlib: 3.6.3

@karenkewitsch added the bug and python labels Jan 26, 2023
@ritchie46
Member

read_parquet was never intended to do that. You can use globbing patterns to read multiple files, e.g. pl.read_parquet("mydir/*.parquet").

@karenkewitsch
Author

@ritchie46, that seems to contradict the past behaviour and the documentation of the function, which says of the 'source' argument: "... If the path is a directory, that directory will be used as partition aware scan."

@ritchie46
Member

Hmm.. that shouldn't be there I believe. 🤔

@ritchie46
Member

Does it work if use_pyarrow is False?
