Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Change API for writing partitioned Parquet to reduce code duplication #17586

Merged
merged 6 commits into from
Jul 14, 2024

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented Jul 12, 2024

Move the partition parameters to existing scan_* functions instead of creating an entirely new function per scan type. There is also a bit of pre-work for getting IPC support in.

Before

df.write_parquet_partitioned(root, ["a", "b"])

After

df.write_parquet("docs/data/hive_write/", partition_by=["a", "b"])

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jul 12, 2024
@@ -258,7 +155,7 @@ where
.par_iter()
.map(|group| {
let df = unsafe {
df._take_unchecked_slice_sorted(group, false, IsSorted::Ascending)
df._take_unchecked_slice_sorted(group, true, IsSorted::Ascending)
Copy link
Collaborator Author

@nameexhaustion nameexhaustion Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive-by - we can enable parallel gather now that we are processing in chunks.

Copy link

codecov bot commented Jul 12, 2024

Codecov Report

Attention: Patch coverage is 86.04651% with 6 lines in your changes missing coverage. Please review.

Project coverage is 80.47%. Comparing base (64b45a8) to head (e0123c3).
Report is 8 commits behind head on main.

Files Patch % Lines
crates/polars-io/src/ipc/write.rs 0.00% 3 Missing ⚠️
crates/polars-io/src/partition.rs 72.72% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17586      +/-   ##
==========================================
+ Coverage   80.45%   80.47%   +0.02%     
==========================================
  Files        1484     1484              
  Lines      195341   195485     +144     
  Branches     2778     2776       -2     
==========================================
+ Hits       157154   157322     +168     
+ Misses      37675    37652      -23     
+ Partials      512      511       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Julian-J-S
Copy link
Contributor

Very nice to consolidate those 2 functions!
My 2 cents on the api:

  • Using pl.PartitionedWriteOptions(...) in a function is very "rusty" but imo also very "un-pythonic" and does not fit with any other polars function. Usually python api tries to solver everything using basic types like dicts/lists and avoid additional classes
  • imo a partition_cols (arrow, pandas) / partition_by (polars) / partition (my fav.) as list[str] | None would be a much cleaner and simples api for python users

@etiennebacher
Copy link

* imo a `partition_cols` (arrow, pandas) / `partition_by` (polars) / `partition` (my fav.) as `list[str] | None` would be a much cleaner and simples api for python users

I also think this would be more readable. Maybe even a hive_partitioning argument for the symmetry with the read/scan functions?

@ritchie46
Copy link
Member

The reason we are considering the Options route is because you will have mutual exclusive arguments. E.g. if you pass a bytes object, which is a valid argument for writing a single parquet file, doesn't make sense if you write a partitioned dataset. With these options you ensure that those combinations don't occur.

@Julian-J-S
Copy link
Contributor

yeah, totally get that!

I think python and rust have very different approaches here

  • python
    • easy, concise, short
    • raise on bad input
    • join for exmaple also has left_on, right_on and on which are exclusive
  • rust
    • verbose, strict, umambiguous
    • prevent bad input by requirering "perfect" input

@stinodego
Copy link
Member

We can just raise when path is a BytesIO and the other parameters indicate partitioning. We do that often enough in other places in the API.

I'd say add a TypedDict parameter partitioning is most sensible.

@nameexhaustion nameexhaustion added the needs decision Awaiting decision by a maintainer label Jul 12, 2024
@ritchie46
Copy link
Member

I'd say add a TypedDict parameter partitioning is most sensible.

If we go for mutual excl params (which I am willing to go when properly raised), I do really think we shouldn't have nested parameters. That's just worst of both options, as you still have the worse/"unpythonic" ergonomics, but no type guarantees.

@nameexhaustion can you create the needed parameters and mark them as unstable? Then we can feel out the waters.

@nameexhaustion
Copy link
Collaborator Author

Updated

@nameexhaustion nameexhaustion changed the title feat: Change API for writing partitioned Parquet feat: Reduce duplication in API for writing partitioned Parquet Jul 12, 2024
@nameexhaustion nameexhaustion changed the title feat: Reduce duplication in API for writing partitioned Parquet feat: Change API for writing partitioned Parquet to reduce code duplication Jul 12, 2024
@ritchie46 ritchie46 merged commit d43df08 into pola-rs:main Jul 14, 2024
33 checks passed
@nameexhaustion nameexhaustion removed the needs decision Awaiting decision by a maintainer label Jul 14, 2024
@c-peters c-peters added the accepted Ready for implementation label Jul 15, 2024
@nameexhaustion nameexhaustion deleted the hive-write-refactor branch July 19, 2024 12:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants