feat: Split output into multiple files #466

bjchambers · 2023-06-30T16:42:06Z

Summary
Rather than producing a single Parquet file containing the entire result set, we should split the results into multiple files.

There are two reasons: separate partitions should be able to write separate files, and large results should be able to roll over into multiple files, allowing the index columns to be written out and dropped, etc.

The API already supports this, but the Python client (and other places) likely doesn't have all the plumbing in place.

For an initial pass, it is likely OK to have the Python client download all files and combine them into a single data frame, but this can (and should) evolve over time to allow paging over the files (e.g., fetch the first file and turn that into a data frame) and/or streaming support (fetch files as they become available), etc.
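To make the two client-side options concrete, here is a minimal sketch, not the actual Python client code. The names `iter_frames`, `load_all`, and the `read` parameter are hypothetical, and it assumes the output files have already been downloaded (or are reachable by path) and that pandas is available.

```python
from typing import Callable, Iterable, Iterator

import pandas as pd

# Hypothetical reader type: maps a file location to a data frame.
Reader = Callable[[str], pd.DataFrame]


def iter_frames(paths: Iterable[str], read: Reader = pd.read_parquet) -> Iterator[pd.DataFrame]:
    """Paging: lazily yield one data frame per output file."""
    for path in paths:
        yield read(path)


def load_all(paths: Iterable[str], read: Reader = pd.read_parquet) -> pd.DataFrame:
    """Initial pass: read every output file and combine into one data frame."""
    frames = list(iter_frames(paths, read))
    if not frames:
        # No output files means an empty result.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

With this shape, `next(iter_frames(paths))` fetches only the first file, while `load_all(paths)` gives the combine-everything behavior. Streaming would need a source of paths that yields files as they are produced, which is not shown here.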

Have the Parquet sink rotate files every N rows (~1,000,000 or so)
Verify everything works when producing multiple files
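The rotation rule in the first task can be illustrated with a small sketch. This is only an illustration of the row-count logic, not the Parquet sink itself: it writes CSV with the standard library as a stand-in for Parquet, and `write_rotating`, the `part-NNNNN.csv` naming, and the `max_rows` parameter are all hypothetical.

```python
import csv
from pathlib import Path
from typing import Iterable, List, Sequence


def write_rotating(rows: Iterable[Sequence], out_dir: Path, max_rows: int = 1_000_000) -> List[Path]:
    """Write rows to numbered files, starting a new file every max_rows rows."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    handle = None
    writer = None
    rows_in_file = 0
    try:
        for row in rows:
            # Open the next file lazily so an empty input produces no files.
            if handle is None or rows_in_file == max_rows:
                if handle is not None:
                    handle.close()
                path = out_dir / f"part-{len(paths):05d}.csv"
                handle = path.open("w", newline="")
                writer = csv.writer(handle)
                paths.append(path)
                rows_in_file = 0
            writer.writerow(row)
            rows_in_file += 1
    finally:
        # Close the last open file even if the input iterator raises.
        if handle is not None:
            handle.close()
    return paths
```

The returned list of paths is the kind of value the second task would check: a result larger than `max_rows` should yield more than one file, and a smaller result exactly one.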

bjchambers · 2023-06-30T16:42:50Z

This is likely necessary to make maximal use of partitioned execution.

kevinjnguyen · 2023-07-05T22:52:14Z

I wrote up a Python Client proposal design doc here: https://docs.google.com/document/d/1CHTiyLDD52FpwSI-SEhqft9HT1bYB-2WFTrCNrxqC0w/edit?usp=sharing

Updates the Python client to support multiple files on output. Writing the tests for this proved difficult, so I went with manual testing: a single result ✅ and multiple results ✅

Also introduces paged file output. This does a hacky CSV write by buffering a batch at a time. This is related to #486. This is part of #465. This is part of #466.

bjchambers · 2023-07-10T21:57:58Z

My latest PR (#495) should write multiple files. @kevinjnguyen once that goes in, would you be able to verify that everything works with the Python client support?

bjchambers added the enhancement (New feature or request) label Jun 30, 2023

bjchambers added a commit that referenced this issue Jul 10, 2023

feat: Use object_store to write files

12109b5

Also introduces paged file output. This does a hacky CSV write by buffering a batch at a time. This is related to #486. This is part of #465. This is part of #466.

bjchambers added a commit that referenced this issue Jul 10, 2023

feat: Use object_store to write files

7b81dc4

bjchambers mentioned this issue Jul 10, 2023

feat: Use object_store to write files #492

Merged
