
Support writing to Arrow files #8504

Closed
alamb opened this issue Dec 11, 2023 · 6 comments · Fixed by #8608
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

alamb (Contributor) commented Dec 11, 2023

Is your feature request related to a problem or challenge?

We currently support reading Arrow files:

https://github.com/apache/arrow-datafusion/blob/95ba48bd2291dd5c303bdaf88cbb55c79d395930/datafusion/sqllogictest/test_files/arrow_files.slt#L1

However, we do not support writing them:

❯ copy (values (1)) to '/tmp/data.arrow';
This feature is not implemented: Writer not implemented for this format

Describe the solution you'd like

I would like to be able to write to Arrow files using the COPY command and EXTERNAL TABLEs.

The idea would be to implement create_writer_physical_plan

https://github.com/apache/arrow-datafusion/blob/95ba48bd2291dd5c303bdaf88cbb55c79d395930/datafusion/core/src/datasource/file_format/arrow.rs#L51

Following the model of the CSV file format:
https://github.com/apache/arrow-datafusion/blob/95ba48bd2291dd5c303bdaf88cbb55c79d395930/datafusion/core/src/datasource/file_format/csv.rs#L262-L290

Then add tests in copy.slt and arrow_files.slt.

Describe alternatives you've considered

No response

Additional context

No response

alamb added the enhancement label on Dec 11, 2023
alamb (Contributor, Author) commented Dec 11, 2023

@devinjdangelo are there any "gotchas" you know of for implementing this support? If not, I think it would be a good first issue for someone (and I will mark it as such).

devinjdangelo (Contributor) commented

I'm not expecting any gotchas. If the Arrow format can serialize each RecordBatch independently (as CSV and JSON can), then it can reuse most of the CSV/JSON write code and get parallelism for free. If not, it still shouldn't require much custom code modeled after the non-parallelized Parquet write code.

tustvold (Contributor) commented

The major gotcha will be dictionaries, which I don't have a good solution for

alamb (Contributor, Author) commented Dec 12, 2023

🤔 Maybe initially we can error out with "not supported" if someone tries to write out Arrow data containing dictionaries.

alamb added the good first issue label on Dec 12, 2023
alamb (Contributor, Author) commented Dec 12, 2023

I am tentatively marking this as a good first issue (probably for an intermediately skilled Rust developer). It may be that we can't parallelize this quite as efficiently as Parquet (at least at first), but I think we can at least support basic writing.

The writer is here: https://docs.rs/arrow-ipc/49.0.0/arrow_ipc/writer/struct.FileWriter.html

devinjdangelo (Contributor) commented

I'm interested in getting more familiar with the Arrow file format, and I expect to have some time to work on this around the middle of next week. If no one gets to it by then, I'll work on it.
