
Support writing to Arrow files #8504

Closed
alamb opened this issue Dec 11, 2023 · 6 comments · Fixed by #8608
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

alamb (Contributor) commented Dec 11, 2023

Is your feature request related to a problem or challenge?

We currently support reading Arrow files:

https://github.com/apache/arrow-datafusion/blob/95ba48bd2291dd5c303bdaf88cbb55c79d395930/datafusion/sqllogictest/test_files/arrow_files.slt#L1

However, we do not support writing them:

❯ copy (values (1)) to '/tmp/data.arrow';
This feature is not implemented: Writer not implemented for this format

Describe the solution you'd like

I would like to be able to write to Arrow files using the COPY command and EXTERNAL TABLEs.

The idea would be to implement create_writer_physical_plan

https://github.com/apache/arrow-datafusion/blob/95ba48bd2291dd5c303bdaf88cbb55c79d395930/datafusion/core/src/datasource/file_format/arrow.rs#L51

Following the model of the CSV file format:
https://github.com/apache/arrow-datafusion/blob/95ba48bd2291dd5c303bdaf88cbb55c79d395930/datafusion/core/src/datasource/file_format/csv.rs#L262-L290

Then add tests in copy.slt and arrow_files.slt.

Describe alternatives you've considered

No response

Additional context

No response

alamb added the enhancement label on Dec 11, 2023
alamb (Contributor, Author) commented Dec 11, 2023

@devinjdangelo are there any "gotchas" you know of for implementing this support? If not, I think it would be a good first issue for someone (and I will mark it as such).

devinjdangelo (Contributor) commented

I'm not expecting any gotchas. If the Arrow format can serialize each RecordBatch independently (as CSV and JSON can), then it can reuse most of the CSV/JSON write code and get parallelism for free. If not, it still shouldn't require much custom code modeled after the non-parallelized Parquet write code.

tustvold (Contributor) commented

The major gotcha will be dictionaries, which I don't have a good solution for

alamb (Contributor, Author) commented Dec 12, 2023

🤔 Maybe initially we can error out with "not supported" if someone tries to write out Arrow data containing dictionaries.

alamb added the good first issue label on Dec 12, 2023
alamb (Contributor, Author) commented Dec 12, 2023

I am tentatively marking this as a good first issue (probably for an intermediately skilled Rust developer). It may be that we can't parallelize this quite as efficiently as Parquet (at least at first), but I think we can at least support basic writing.

The writer is here: https://docs.rs/arrow-ipc/49.0.0/arrow_ipc/writer/struct.FileWriter.html

devinjdangelo (Contributor) commented

I'm interested in getting more familiar with the Arrow file format, and I expect to have some time to work on this around the middle of next week. If no one gets to it by then, I'll work on it.
