This raises an interesting question about the desired behavior in this scenario. If the options specify an irrelevant setting (e.g. row_group_size for JSON output), should DataFusion:
Ignore the irrelevant setting (current behavior)
Ignore the irrelevant setting but emit a warning
Raise an error and refuse to execute the query entirely
I personally think returning an error and refusing to execute the query is the most user-friendly thing to do -- that way user mistakes or typos/misspellings of the option will be caught quickly rather than being masked.
A warning would also be ok, but many places where DataFusion is used don't necessarily have a way to send warnings back to the user (e.g. the warnings may end up in a server log somewhere).
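The three behaviors listed above could be modeled as an explicit policy. The following is only a minimal sketch using the Rust standard library, not DataFusion code: `IrrelevantOptionPolicy`, `check_options`, and the list of supported keys passed in by the caller are all hypothetical names introduced for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical policy for options that do not apply to the target format.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum IrrelevantOptionPolicy {
    /// Silently skip the option.
    Ignore,
    /// Keep going, but hand a warning message back to the caller.
    Warn,
    /// Refuse to execute the statement.
    Error,
}

/// Checks user-supplied option keys against the keys a format understands.
/// Returns the collected warnings (always empty under `Ignore`), or an error
/// message under `Error` as soon as an unsupported key is found.
pub fn check_options(
    format: &str,
    supported: &[&str],
    options: &[(String, String)],
    policy: IrrelevantOptionPolicy,
) -> Result<Vec<String>, String> {
    let supported: HashSet<&str> = supported.iter().copied().collect();
    let mut warnings = Vec::new();
    for (key, _value) in options {
        if supported.contains(key.as_str()) {
            continue;
        }
        let msg = format!("option '{key}' is not supported by the {format} writer");
        match policy {
            IrrelevantOptionPolicy::Ignore => {}
            IrrelevantOptionPolicy::Warn => warnings.push(msg),
            IrrelevantOptionPolicy::Error => return Err(msg),
        }
    }
    Ok(warnings)
}
```

Returning the warnings as values, rather than only logging them, leaves it to the embedding application to decide whether they can be shown to the user at all, which is the concern raised about the warning approach.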
Is your feature request related to a problem or challenge?
Currently, the only way to customize how files are written as the result of a COPY or INSERT query is via session level defaults. E.g.:
We should implement statement and table level options so individual statements can customize the write behavior as desired. E.g.:
Or to set default options for a specific table, rather than globally in a session:
Describe the solution you'd like
We could implement a WriteOptions struct with a WriteOptions::from(Vec<(String, String)>) method so the struct can be created from arbitrary string tuples passed like in the statements above. FileSinks could then accept a WriteOptions struct and use it to construct a serializer with the desired settings. DataFrame API can be refactored to accept WriteOptions directly.
The existing code which creates a parquet::WriterProperties from session configs should be refactored to reduce code duplication / share implementation details with parsing statement level overrides.
Describe alternatives you've considered
Rather than just a generic WriteOptions struct, we may want a WriteOptions trait with specific structs for each file format, i.e. CsvWriteOptions. Each file format can decide how to handle each option and if desired emit a warning/error if invalid options are passed (e.g. row_group_size is passed to Csv writer).
Additional context
Relevant recent PRs for supporting writes: #7244 #7283
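As a rough illustration of the proposal, here is a minimal sketch of a generic WriteOptions built from string tuples, together with a format-specific CsvWriteOptions derived from it. It uses only the Rust standard library. The field names, the `delimiter` and `has_header` option keys, the defaults, and the choice to lowercase keys are assumptions made for the example, not existing DataFusion APIs.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Hypothetical generic bag of statement-level options, keyed by option name.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct WriteOptions {
    options: HashMap<String, String>,
}

impl From<Vec<(String, String)>> for WriteOptions {
    fn from(pairs: Vec<(String, String)>) -> Self {
        // Lowercase keys so lookups are case-insensitive (an assumption);
        // a later duplicate key overwrites an earlier one.
        let options = pairs
            .into_iter()
            .map(|(k, v)| (k.to_lowercase(), v))
            .collect();
        Self { options }
    }
}

/// Hypothetical format-specific options for a CSV writer.
#[derive(Debug, Clone, PartialEq)]
pub struct CsvWriteOptions {
    pub delimiter: u8,
    pub has_header: bool,
}

impl TryFrom<&WriteOptions> for CsvWriteOptions {
    type Error = String;

    fn try_from(generic: &WriteOptions) -> Result<Self, Self::Error> {
        // Defaults used when the statement does not override them.
        let mut csv = CsvWriteOptions { delimiter: b',', has_header: true };
        for (key, value) in &generic.options {
            match key.as_str() {
                "delimiter" => match value.as_bytes() {
                    // Accept exactly one byte as the delimiter.
                    [byte] => csv.delimiter = *byte,
                    _ => return Err(format!("invalid delimiter '{value}'")),
                },
                "has_header" => {
                    csv.has_header = value
                        .parse::<bool>()
                        .map_err(|_| format!("invalid has_header '{value}'"))?;
                }
                // Anything else (e.g. row_group_size) is rejected for CSV.
                other => return Err(format!("option '{other}' is not valid for CSV")),
            }
        }
        Ok(csv)
    }
}
```

With this shape, `CsvWriteOptions::try_from(&WriteOptions::from(pairs))` is where a CSV writer would decide what to do with an option such as row_group_size. Here it returns an error, but the same match arm could just as well skip the key or collect a warning.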