
BigQuery: Allow choice of compression when loading from dataframe #8938

Merged: 3 commits into googleapis:master, Aug 6, 2019

Conversation

plamut (Contributor) commented on Aug 5, 2019:

Closes #7701.

This PR adds an optional parameter to the Client.load_table_from_dataframe() method that allows selecting the compression method when directly serializing a dataframe to a parquet file.

How to test

The issue description should be self-explanatory, but note one of the comments there:

Also I'm actually a bit wary of exposing compression, since I'd like to keep the option open to change the file format we serialize to. Parquet happens to be the best match to BigQuery's data types right now, but it'd be good to keep the option open to move to something else in the future.

I opted for the name parquet_compression instead of compression to indicate that the parameter is specific to parquet serialization and is not applicable to all use cases of the method.
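
For reference, a minimal sketch of how the new parameter is used (the client setup and the dataset/table names here are hypothetical):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
dataframe = pd.DataFrame({"name": ["alpha", "beta"], "value": [1, 2]})

# Hypothetical destination; parquet_compression selects the codec used
# when the dataframe is serialized to a parquet file before upload.
table_ref = client.dataset("my_dataset").table("my_table")
load_job = client.load_table_from_dataframe(
    dataframe, table_ref, parquet_compression="gzip"
)
load_job.result()  # wait for the load job to finish
```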

@plamut plamut added the api: bigquery Issues related to the BigQuery API. label Aug 5, 2019
@plamut plamut requested a review from a team August 5, 2019 15:02
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Aug 5, 2019
bigquery/google/cloud/bigquery/client.py (outdated diff)
```diff
@@ -1527,7 +1537,7 @@ def load_table_from_dataframe(
                 PendingDeprecationWarning,
                 stacklevel=2,
             )
-        dataframe.to_parquet(tmppath)
+        dataframe.to_parquet(tmppath, compression=parquet_compression)
```
Contributor commented on the diff:

We do need to modify _pandas_helpers.dataframe_to_parquet as well. See:

pyarrow.parquet.write_table(arrow_table, filepath)
The underlying pyarrow.parquet.write_table function also takes a compression argument.

Long-term, I expect the _pandas_helpers.dataframe_to_parquet function to be used more often than the dataframe.to_parquet method. We'll want to start fetching the table schema if it's not provided and use it for pandas-to-BigQuery type conversions (#8142).
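
For illustration, a rough sketch of threading the codec through that helper (simplified: the real helper also uses the provided BigQuery schema for the pandas-to-Arrow conversion, which is omitted here):

```python
import pyarrow
import pyarrow.parquet

def dataframe_to_parquet(dataframe, filepath, parquet_compression="SNAPPY"):
    # Simplified sketch: the actual helper builds the Arrow table using
    # the BigQuery schema for type conversions.
    arrow_table = pyarrow.Table.from_pandas(dataframe)
    # pyarrow.parquet.write_table accepts codecs such as SNAPPY, GZIP,
    # BROTLI, LZ4, and ZSTD in addition to NONE.
    pyarrow.parquet.write_table(
        arrow_table, filepath, compression=parquet_compression
    )
```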

plamut (Contributor, Author) replied on Aug 6, 2019:

There is a mismatch between the two methods: pyarrow's accepts a richer range of compression methods.

Having either two different compression parameters, or a single parameter that accepts different values depending on the context, could be confusing to end users, so I will only allow the compression methods supported by both.

However, if there are use cases that need specific support for LZO, LZ4, and ZSTD as well, please let me know. There probably aren't, since the compression method has not been exposed at all to date.

It's good that we are marking the parameter as beta, as I can see how this can change in the future. 👍

Update:
Changed my mind after realizing that we probably should document the underlying serialization methods and link to their original docs. Since we are already exposing that detail, it makes less sense to try hiding compression options behind the lowest common denominator.
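
To illustrate the outcome, a sketch of the resulting dispatch (not the exact merged code): the richer pyarrow codec set applies only on the pyarrow path, while the fallback path is constrained to what DataFrame.to_parquet accepts.

```python
if pyarrow is not None and job_config.schema:
    # pyarrow path: the full pyarrow codec range (e.g. LZ4, ZSTD) is usable.
    _pandas_helpers.dataframe_to_parquet(
        dataframe,
        job_config.schema,
        tmppath,
        parquet_compression=parquet_compression,
    )
else:
    # Fallback path: pandas serializes directly, so only the codecs that
    # DataFrame.to_parquet accepts (snappy, gzip, brotli, None) apply.
    dataframe.to_parquet(tmppath, compression=parquet_compression)
```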

tswast (Contributor) left a comment:

Thanks.

I agree that it's confusing to expose this level of internal details, but I think your docstring description is clear.

@plamut plamut added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 6, 2019
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 6, 2019
@plamut plamut merged commit a015978 into googleapis:master Aug 6, 2019
@plamut plamut deleted the iss-7701 branch August 6, 2019 17:57