BigQuery: Allow choice of compression when loading from dataframe #8938
Conversation
@@ -1527,7 +1537,7 @@ def load_table_from_dataframe(
            PendingDeprecationWarning,
            stacklevel=2,
        )
-        dataframe.to_parquet(tmppath)
+        dataframe.to_parquet(tmppath, compression=parquet_compression)
We do need to modify `_pandas_helpers.dataframe_to_parquet` as well. See the call it makes: `pyarrow.parquet.write_table(arrow_table, filepath)`.
Long-term, I expect the `_pandas_helpers.dataframe_to_parquet` function to get used more often than the `dataframe.to_parquet` method. We'll want to start fetching the table schema if not provided and use that for pandas-to-BigQuery type conversions (#8142).
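A minimal sketch of how the compression choice might be threaded through that helper, assuming a `dataframe_to_arrow` conversion step as discussed above (the exact signatures here are illustrative, not the final implementation):

```python
# _pandas_helpers.py (sketch)
import pyarrow.parquet


def dataframe_to_parquet(dataframe, bq_schema, filepath, parquet_compression="SNAPPY"):
    """Serialize a pandas DataFrame to parquet, forwarding the compression choice.

    ``dataframe_to_arrow`` stands in for the module's pandas-to-Arrow
    conversion helper; its signature is assumed for illustration.
    """
    arrow_table = dataframe_to_arrow(dataframe, bq_schema)
    pyarrow.parquet.write_table(
        arrow_table, filepath, compression=parquet_compression
    )
```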
There is a mismatch between the two methods: pyarrow's accepts a richer range of compression methods.
Having either two different compression parameters, or a single parameter that accepts different values depending on the context, could be confusing to end users, so I will only allow the compression methods supported by both.
However, if there are use cases that need specific support for LZO, LZ4, or ZSTD as well, please let me know. There probably aren't, since the compression method has not been exposed at all until now.
It's good that we are marking the parameter as beta, as I can see how this could change in the future. 👍
Update:
I changed my mind after realizing that we should probably document the underlying serialization methods and link to their original docs. Since we are already exposing that detail, it makes less sense to hide the compression options behind the lowest common denominator.
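To illustrate the mismatch, a short sketch (the file paths are placeholders, and the exact set of accepted codecs depends on the installed pandas and pyarrow versions):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": [1, 2, 3]})

# pandas documents 'snappy', 'gzip', 'brotli', or None for to_parquet().
df.to_parquet("/tmp/via_pandas.parquet", compression="gzip")

# pyarrow.parquet.write_table additionally accepts codecs such as 'zstd' and 'lz4'.
pq.write_table(
    pa.Table.from_pandas(df), "/tmp/via_pyarrow.parquet", compression="zstd"
)
```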
Thanks. I agree that it's confusing to expose this level of internal detail, but I think your docstring description is clear.
Closes #7701.
This PR adds an optional parameter to the `Client.load_table_from_dataframe()` method that allows selecting the compression method when directly serializing a dataframe to a parquet file.
How to test
The issue description should be self-explanatory, but mind one of the comments:
I opted for the name `parquet_compression` instead of `compression` to indicate that the parameter is specific to parquet serialization, and not applicable in all use cases of the method.
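A hedged usage sketch of the new parameter (the project, dataset, and table names are placeholders; `"gzip"` is one value accepted by both serialization paths):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_ref = bigquery.TableReference.from_string("my-project.my_dataset.my_table")

df = pd.DataFrame({"id": [1, 2], "name": ["alpha", "beta"]})

# parquet_compression is forwarded to the parquet serializer used under the
# hood (dataframe.to_parquet or pyarrow.parquet.write_table).
job = client.load_table_from_dataframe(
    df, table_ref, parquet_compression="gzip"
)
job.result()  # Wait for the load job to complete.
```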