feat: Split output into multiple files #466

Open
2 tasks
bjchambers opened this issue Jun 30, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@bjchambers
Collaborator

Summary
Rather than producing a single Parquet file containing the entire result set, we should split the results into multiple files.

There are two reasons: separate partitions should be able to write separate files, and large results should be able to roll over into multiple files, which allows the index columns to be written out and dropped, among other things.

The API already supports this, but the Python client (and possibly other components) likely lacks some of the necessary plumbing.

For an initial pass, it is likely OK to have the Python client download all files and combine them into a single data frame. This can (and should) evolve over time to allow paging over the files (e.g., fetch the first file and turn it into a data frame) and/or streaming (fetching files as they become available).
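
As a rough illustration of the two client behaviors described above, here is a minimal sketch using pandas. The function names `iter_result_frames` and `load_all_results` are hypothetical and are not part of the existing Python client; the sketch also assumes the output files have already been downloaded to local paths.

```python
from typing import Callable, Iterable, Iterator

import pandas as pd


def iter_result_frames(
    paths: Iterable[str],
    read_file: Callable[[str], pd.DataFrame] = pd.read_parquet,
) -> Iterator[pd.DataFrame]:
    """Paging: lazily yield one data frame per output file."""
    for path in paths:
        # Each file is only read when the caller asks for the next frame.
        yield read_file(path)


def load_all_results(
    paths: Iterable[str],
    read_file: Callable[[str], pd.DataFrame] = pd.read_parquet,
) -> pd.DataFrame:
    """Initial pass: read every output file and combine into one data frame."""
    frames = list(iter_result_frames(paths, read_file))
    if not frames:
        # No output files means an empty result.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

With this shape, `next(iter_result_frames(paths))` fetches only the first file, while `load_all_results(paths)` gives the combine-everything behavior. The `read_file` parameter is an assumption added so a different reader (for example `pd.read_csv`) can be swapped in.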

  • Have the Parquet sink rotate files every N rows (N ≈ 1,000,000 or so)
  • Verify everything works when producing multiple files
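
The row-count rotation in the first task could look roughly like the sketch below. This is not the actual sink: it uses Python's standard `csv` module as a stand-in for a Parquet writer, and the function name `write_rotating`, the `part-NNNNN` file naming, and the `max_rows_per_file` parameter are all assumptions made for illustration.

```python
import csv
from pathlib import Path
from typing import Iterable, List, Sequence


def write_rotating(
    rows: Iterable[Sequence[object]],
    header: Sequence[str],
    out_dir: Path,
    max_rows_per_file: int = 1_000_000,
) -> List[Path]:
    """Write rows to numbered files, starting a new file every max_rows_per_file rows."""
    if max_rows_per_file <= 0:
        raise ValueError("max_rows_per_file must be positive")
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    handle = None
    writer = None
    rows_in_file = 0
    try:
        for row in rows:
            # Rotate: open the first file, or a new one once the current file is full.
            if handle is None or rows_in_file >= max_rows_per_file:
                if handle is not None:
                    handle.close()
                path = out_dir / f"part-{len(paths):05d}.csv"
                handle = path.open("w", newline="")
                writer = csv.writer(handle)
                writer.writerow(header)
                paths.append(path)
                rows_in_file = 0
            writer.writerow(row)
            rows_in_file += 1
    finally:
        # Always close the last open file, even if the row source raises.
        if handle is not None:
            handle.close()
    return paths
```

The returned list of paths is what a client would then need to handle, which is why the second task (verifying that everything works with multiple files) follows from the first.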
@bjchambers bjchambers added the enhancement New feature or request label Jun 30, 2023
@bjchambers
Collaborator Author

This is likely necessary to make maximal use of partitioned execution.

@kevinjnguyen
Contributor

I wrote up a Python Client proposal design doc here: https://docs.google.com/document/d/1CHTiyLDD52FpwSI-SEhqft9HT1bYB-2WFTrCNrxqC0w/edit?usp=sharing

github-merge-queue bot pushed a commit that referenced this issue Jul 7, 2023
Updates the Python client to support multiple files on output. Writing
the tests for this proved difficult, so I went with manual testing: a
single result ✅ and multiple results ✅
bjchambers added a commit that referenced this issue Jul 10, 2023
Also introduces paged file output.

This does a hacky CSV write by buffering a batch at a time.

This is related to #486.
This is part of #465.
This is part of #466.
@bjchambers
Collaborator Author

My latest PR (#495) should write multiple files. @kevinjnguyen once that goes in, would you be able to verify that everything works with the Python client support?

github-merge-queue bot pushed a commit that referenced this issue Jul 12, 2023
Also introduces paged file output.

This does a hacky CSV write by buffering a batch at a time.

This is related to #486.
This is part of #465.
This is part of #466.