-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support writing to Arrow files #8504
Comments
@devinjdangelo is there any "gotchas" you know about implementing this support? If not I think it would be a good first issue for someone (and I will mark it as such) |
I'm not expecting any gotchas. If the Arrow format can serialize each RecordBatch independently (as in CSV and JSON), then it can reuse most of the CSV/JSON write code to get parallelism for free. If not, it still shouldn't require too much custom code to model after the non parallelized Parquet write code. |
The major gotcha will be dictionaries, which I don't have a good solution for |
🤔 maybe initially we can error out with "not supported" if someone tries to write out arrow data with dictionaries. |
I am tentatively marking this as a good first issue (probably for a intermediately skilled Rust developer). It may be the case that we can't parallelize this quite as efficiently (at least at first) as parquet, but I think we can at least support basic writing The writer is here: https://docs.rs/arrow-ipc/49.0.0/arrow_ipc/writer/struct.FileWriter.html |
I'm interested in getting more familiar with the arrow file format, and expect to have some time to work on this around the middle of next week. If no one can get to it by then, I'll work on it. |
Is your feature request related to a problem or challenge?
We currently support reading Arrow files:
https://github.com/apache/arrow-datafusion/blob/95ba48bd2291dd5c303bdaf88cbb55c79d395930/datafusion/sqllogictest/test_files/arrow_files.slt#L1
However, we do not support writing them:
Describe the solution you'd like
I would like to be able to write to arrow files using the
COPY
command andEXTERNAL TABLE
sThe idea would be to implement
create_writer_physical_plan
https://github.com/apache/arrow-datafusion/blob/95ba48bd2291dd5c303bdaf88cbb55c79d395930/datafusion/core/src/datasource/file_format/arrow.rs#L51
Following the model of the CSV file format and
https://github.com/apache/arrow-datafusion/blob/95ba48bd2291dd5c303bdaf88cbb55c79d395930/datafusion/core/src/datasource/file_format/csv.rs#L262-L290
Then add tests in copy.slt and arrow_file.slt
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: