Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement support for FileFormat Options for COPY and Create External Table #7298

Closed
devinjdangelo opened this issue Aug 16, 2023 · 2 comments · Fixed by #7390
Closed

Implement support for FileFormat Options for COPY and Create External Table #7298

devinjdangelo opened this issue Aug 16, 2023 · 2 comments · Fixed by #7390
Labels
enhancement New feature or request

Comments

@devinjdangelo
Copy link
Contributor

Is your feature request related to a problem or challenge?

Currently, the only way to customize how files are written as the result of a COPY or INSERT query is via session level defaults. E.g.

set datafusion.execution.parquet.max_row_group_size=9999;

INSERT INTO my_table values (1,2), (3,4);
COPY my_table to mytable.parquet;

We should implement statement and table level options so individual statements can customize the write behavior as desired. E.g.:

COPY my_table to mytable.parquet (max_row_group_size 9999)

Or to set default options for a specific table, rather than globally in a session:

CREATE EXTERNAL TABLE my_table 
STORED AS PARQUET
LOCATION 'my_table/' 
OPTIONS (max_row_group_size 9999)

Describe the solution you'd like

We could implement a WriteOptions struct with a WriteOptions::from(Vec<(String, String)>) method so the struct can be created from arbitrary string tuples passed like in the statements above. FileSinks could then accept a WriteOptions struct and use it to construct a serializer with the desired settings. DataFrame API can be refactored to accept WriteOptions directly.

The existing code which creates a parquet::WriterProperties from session configs should be refactored to reduce code duplication / share implementation details with parsing statement level overrides.

Describe alternatives you've considered

Rather than just a generic WriteOptions struct, we may want a WriteOptions trait with specific structs for each file format, i.e. CsvWriteOptions. Each file format can decide how to handle each option and if desired emit a warning/error if invalid options are passed (e.g. row_group_size is passed to Csv writer).

Additional context

Relevant recent PRs for supporting writes: #7244 #7283

@devinjdangelo devinjdangelo added the enhancement New feature or request label Aug 16, 2023
@alamb
Copy link
Contributor

alamb commented Aug 16, 2023

Relevant comment from @devinjdangelo on #7283 (comment):

This raises an interesting question about the desired behavior in this scenario. If the options specify an irrelevant setting (row_group_size for a json setting), should DataFusion:

Ignore the irrelevant setting (current behavior)
Ignore the irrelevant setting but emit a warning
Raise an error and refuse to execute the query entirely

I personally think returning an error and refuse the execute the query is the most user friendly thing to do -- that way user mistakes or typo/misspellings of the option will be caught quickly rather than being masked.

A warning would also be ok, but many places DataFusion is used don't necessarily have a way to sending warnings back to the user (e.g. the warnings may end up in a server log somewhere)

@devinjdangelo
Copy link
Contributor Author

Thanks @alamb . I agree that throwing an error seems like the best approach to start. I plan to work on this over the next few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
2 participants