Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements :streaming option for DataFrame.to_csv/3 #889

Merged
merged 5 commits into from
Mar 27, 2024

Conversation

ryancurtin
Copy link
Contributor

Closes #888

Description

I have added a streaming option for the to_csv/3 that uses the Polars sink_csv function under the hood. Polars does not have a sink_csv_cloud function similar to sink_parquet_cloud, so for now, I implemented it only for a LazyFrame using a local entry.

Notes

As far as my implementation goes, I've kept the DataFrame API clean since I'm really just adding an option, but it does beg the question of how we'd want to expose some other Polars serialization options for users of the to_csv function. For now I won't worry about that, though.

Copy link
Member

@philss philss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! 🚀

Regarding the options Polars provides, I think we would could expose then in a option that accepts a keyword list, like we do with :config. We could pass through, and let the backend validate it.

"""
@doc type: :io
@spec to_csv(df :: DataFrame.t(), filename :: fs_entry() | String.t(), opts :: Keyword.t()) ::
:ok | {:error, Exception.t()}
def to_csv(df, filename, opts \\ []) do
opts = Keyword.validate!(opts, header: true, delimiter: ",", config: nil)
opts = Keyword.validate!(opts, header: true, delimiter: ",", streaming: false, config: nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably default to true as in the other operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josevalim - Will do! Do you think it's worth mentioning in the docs that streaming set to true for a DataFrame with an S3 entry will fail explicitly? I could also design it so that the streaming option is ignored for a LazyFrame with an S3 entry, but for now I used the same behavior as to_parquet/3 prior to having streaming implemented.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryancurtin the most important is to default to true, so I would say "defaults to true on supported filesystems, ignored on all others".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I just added that and tweaked the behavior so as not to raise an error when using the streaming option on an S3 filesystem given that true is now the default. Tests are also updated to reflect that and try both cases.

@philss philss merged commit 5b6eaf3 into elixir-explorer:main Mar 27, 2024
4 checks passed
@philss
Copy link
Member

philss commented Mar 27, 2024

@ryancurtin thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add :streaming option to DataFrame.to_csv/3
3 participants