Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(output-formats): add support for to_parquet_dir #9781

Merged
merged 17 commits into from
Aug 8, 2024

Conversation

ncclementi
Copy link
Contributor

@ncclementi ncclementi commented Aug 6, 2024

Initial implementation to add support to write parquet files to a directory as a single file or writing each batch to a different file.

Partially resolves #8584 (comment) (it doesn't cover writing csv to dirs)

TODOs:

  • Implement DuckDB case (EDIT) -> looks like duckdb only supports multiple files writing when partitioned_by which is already supported via to_parquet, unless I'm missing something.
  • remove? limit= from pyspark, not sure why is there. Looks like this was a request for to_delta (see bug(pyspark): export functions don't take param or limit #8898), and it was replicated for to_parquet_dir
  • Understand why in ibis/expr/types/core.py the method to_parquet in the class Expr doesn't pass through the params?

@ncclementi ncclementi marked this pull request as ready for review August 8, 2024 14:19
ibis/backends/__init__.py Outdated Show resolved Hide resolved
ibis/backends/__init__.py Outdated Show resolved Hide resolved
@ncclementi ncclementi changed the title feat: add support for to_parquet_dir [WIP] feat: add support for to_parquet_dir Aug 8, 2024
Copy link
Contributor

github-actions bot commented Aug 8, 2024

ACTION NEEDED

Ibis follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message.

Please update your PR title and description to match the specification.

@ncclementi
Copy link
Contributor Author

ncclementi commented Aug 8, 2024

So I'm forcing the test to write more than one file, and I have assertion errors because of sorting problems. But it looks like I can't provide force_sort directly to assert_frames and for the base backend level this is set to False.

I can do the sort by columns in the test directly (see latest push), unless there is a better way to approach this.
EDIT: sort globs

ncclementi and others added 2 commits August 8, 2024 12:37
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
@ncclementi ncclementi changed the title feat: add support for to_parquet_dir feat: add support for to_parquet_dir Aug 8, 2024
ibis/backends/__init__.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_export.py Show resolved Hide resolved
ibis/expr/types/core.py Outdated Show resolved Hide resolved
ibis/backends/pyspark/__init__.py Show resolved Hide resolved
ibis/backends/__init__.py Outdated Show resolved Hide resolved
ibis/backends/__init__.py Outdated Show resolved Hide resolved
ncclementi and others added 3 commits August 8, 2024 13:08
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for adding this!

@cpcloud
Copy link
Member

cpcloud commented Aug 8, 2024

Let's run the clouds to make sure this works there too!

@cpcloud cpcloud added the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Aug 8, 2024
@cpcloud cpcloud added the feature Features or general enhancements label Aug 8, 2024
@cpcloud cpcloud changed the title feat: add support for to_parquet_dir feat(output-formats): add support for to_parquet_dir Aug 8, 2024
@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Aug 8, 2024
@ncclementi
Copy link
Contributor Author

It looks like the cloud ones were skipped, not sure why.

@cpcloud
Copy link
Member

cpcloud commented Aug 8, 2024

Oh, probably because you can't have two PRs running them simultaneously, it's annoying.

@cpcloud
Copy link
Member

cpcloud commented Aug 8, 2024

I will run them locally.

ibis/expr/types/core.py Outdated Show resolved Hide resolved
ibis/expr/types/core.py Outdated Show resolved Hide resolved
ncclementi and others added 3 commits August 8, 2024 15:26
Co-authored-by: Gil Forsyth <gforsyth@users.noreply.github.com>
Co-authored-by: Gil Forsyth <gforsyth@users.noreply.github.com>
@cpcloud
Copy link
Member

cpcloud commented Aug 8, 2024

Clouds look good

cloud in 🌐 falcon in …/ibis on  to_pq_dir is 📦 v9.2.0 via 🐍 v3.10.14 via ❄️   impure (ibis-3.10.14-env)
❯ pytest -m 'bigquery or snowflake' -n 8 --dist loadgroup --snapshot-update -q
bringing up nodes...
......x.....x.......................x......x.................xxx..xxxx.x.x.xxxx.........x...........x......x..x..................x..............xx.x.x..x....xx...x.x......x..x....x.xx.x.x.x.........x.x..x.x....x..x.xxx.xx.x.xx...x.xxx.x.x..xx.x...xx...xxxx.x..x...x.....x.x.x.....xx.xx.xxxxx.x.xxxxxxxxx...x..x...x.xx....x.x...x...x.....sx........................xx...x.... [  9%]
....x........x......x........................................x.........x.......x.x........x..........x..xx..x....................x..............x.................................x...................................x..x.....x.........x..............x.....x..x........x...x.......x.............x....x.x.....................................s..........x........................ [ 19%]
.....s....s................................x...............sx....x................x.......................x..............................x..x.............................x...............s.ssssssssssssssssssss.........................................................x....x...xx..........x...xx............x...xx.........x.......x...x....sssssssssssssssssssssssssssssssssssss [ 28%]
ssssssssssssssssssssxsssssxsx...x.......x...x.......x..x.x.x...x......x....x......x...........x........x...........x..x.........x..x...xxxxxxxxxx.x...........xx.............x............x.......x....xx...x..x.....x....................x...x..x........................................x......x............xx...x....x...........xx......x...............x........................ [ 38%]
...x.x..............................................................................................x...........................x.....x...x...................x......................x............x........xx.......x.....s........x............xx.....xx.x..xxxxxx.x.x...x......xxxxx.x...xxxxxx.xxxxxxxxx.xxx..........x....x.x...xxx...............x.......x................x..... [ 47%]
....x....xxx..xxxxxxxxxx.xxxxx.xx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.x...............x..........................................x.........................................................................................x..................x.......................x..........x..x....x.x..x.........x.....x..x [ 57%]
............x...x...x.....x....x..............x..........x...x.............xx.................xx.x..x.x......x....s........................s............x................s.........................x...................x..................................x..x.............s....................x..x...x.....................x........x.....x.......................................x [ 66%]
...................................x.x..x.......x......................x..............x......xx......x......x.....x....x......x....x.....x.............x....x...x....x........x....x..x.xx........xx.x.....xxxxxxxxxxxx..................xx.x.x.xx.....x...x.........................x........x...x....xx......x................x......xx....x........x.....x..x.x......x............ [ 76%]
..x....xx........x.x...............x........x.x.....x....x.....x..x..x........x.....x..xx...x..x.x..x..............x.......x.x..........xx............x...x.xx..x...xx............x..x..............xx.xx...x......x.x.xxx..xx...x...sxx..x.x....x..x..x....x.x..............sx..s..x....x....s.......x...x....s.........s................................x....................x..x.. [ 86%]
.........................................................................................................................................................................................x...........................................................s..............................................s................................................................................ [ 95%]
...........................................................................................................................................................................                                                                                                                                                                                                           [100%]
3238 passed, 102 skipped, 561 xfailed in 703.77s (0:11:43)

@cpcloud cpcloud merged commit 80dfbe2 into ibis-project:main Aug 8, 2024
82 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(parquet): figure out convention for multi-file parquet writing
3 participants