-
Notifications
You must be signed in to change notification settings - Fork 123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implements :streaming
option for DataFrame.to_csv/3
#889
Implements :streaming
option for DataFrame.to_csv/3
#889
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! 🚀
Regarding the options Polars provides, I think we would could expose then in a option that accepts a keyword list, like we do with :config
. We could pass through, and let the backend validate it.
lib/explorer/data_frame.ex
Outdated
""" | ||
@doc type: :io | ||
@spec to_csv(df :: DataFrame.t(), filename :: fs_entry() | String.t(), opts :: Keyword.t()) :: | ||
:ok | {:error, Exception.t()} | ||
def to_csv(df, filename, opts \\ []) do | ||
opts = Keyword.validate!(opts, header: true, delimiter: ",", config: nil) | ||
opts = Keyword.validate!(opts, header: true, delimiter: ",", streaming: false, config: nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably default to true as in the other operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@josevalim - Will do! Do you think it's worth mentioning in the docs that streaming set to true for a DataFrame with an S3 entry will fail explicitly? I could also design it so that the streaming option is ignored for a LazyFrame with an S3 entry, but for now I used the same behavior as to_parquet/3
prior to having streaming implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryancurtin the most important is to default to true, so I would say "defaults to true on supported filesystems, ignored on all others".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I just added that and tweaked the behavior so as not to raise an error when using the streaming option on an S3 filesystem given that true
is now the default. Tests are also updated to reflect that and try both cases.
@ryancurtin thank you! |
Closes #888
Description
I have added a streaming option for the
to_csv/3
that uses the Polarssink_csv
function under the hood. Polars does not have asink_csv_cloud
function similar tosink_parquet_cloud
, so for now, I implemented it only for a LazyFrame using a local entry.Notes
As far as my implementation goes, I've kept the DataFrame API clean since I'm really just adding an option, but it does beg the question of how we'd want to expose some other Polars serialization options for users of the
to_csv
function. For now I won't worry about that, though.