
BigQuery: Allow choice of compression when loading from dataframe #8938

Merged: 3 commits into googleapis:master, Aug 6, 2019

Conversation

plamut (Contributor) commented on Aug 5, 2019:

Closes #7701.

This PR adds an optional parameter to the Client.load_table_from_dataframe() method that allows selecting the compression method when directly serializing a dataframe to a parquet file.

How to test

The issue description should be self-explanatory, but note one of the comments there:

Also I'm actually a bit wary of exposing compression, since I'd like to keep the option open to change the file format we serialize to. Parquet happens to be the best match to BigQuery's data types right now, but it'd be good to keep the option open to move to something else in the future.

I opted for the name parquet_compression instead of compression to indicate that the parameter is specific to parquet serialization and is not applicable to all use cases of the method.
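
For reference, a minimal sketch of how the new parameter is used (the client setup and the dataset/table names here are hypothetical):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
dataframe = pd.DataFrame({"name": ["alpha", "beta"], "value": [1, 2]})

# Hypothetical destination; parquet_compression selects the codec used
# when the dataframe is serialized to a parquet file before upload.
table_ref = client.dataset("my_dataset").table("my_table")
load_job = client.load_table_from_dataframe(
    dataframe, table_ref, parquet_compression="gzip"
)
load_job.result()  # wait for the load job to finish
```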

@plamut plamut added the api: bigquery Issues related to the BigQuery API. label Aug 5, 2019
@plamut plamut requested a review from a team August 5, 2019 15:02
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Aug 5, 2019
bigquery/google/cloud/bigquery/client.py (outdated diff)
```diff
@@ -1527,7 +1537,7 @@ def load_table_from_dataframe(
                 PendingDeprecationWarning,
                 stacklevel=2,
             )
-        dataframe.to_parquet(tmppath)
+        dataframe.to_parquet(tmppath, compression=parquet_compression)
```
Contributor commented on the diff:

We do need to modify _pandas_helpers.dataframe_to_parquet as well. See:

pyarrow.parquet.write_table(arrow_table, filepath)
The underlying pyarrow.parquet.write_table function also takes a compression argument.

Long-term, I expect the _pandas_helpers.dataframe_to_parquet function to be used more often than the dataframe.to_parquet method. We'll want to start fetching the table schema if it's not provided and use it for pandas-to-BigQuery type conversions (#8142).
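
For illustration, a rough sketch of threading the codec through that helper (simplified: the real helper also uses the provided BigQuery schema for the pandas-to-Arrow conversion, which is omitted here):

```python
import pyarrow
import pyarrow.parquet

def dataframe_to_parquet(dataframe, filepath, parquet_compression="SNAPPY"):
    # Simplified sketch: the actual helper builds the Arrow table using
    # the BigQuery schema for type conversions.
    arrow_table = pyarrow.Table.from_pandas(dataframe)
    # pyarrow.parquet.write_table accepts codecs such as SNAPPY, GZIP,
    # BROTLI, LZ4, and ZSTD in addition to NONE.
    pyarrow.parquet.write_table(
        arrow_table, filepath, compression=parquet_compression
    )
```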

plamut (Contributor, Author) replied on Aug 6, 2019:

There is a mismatch between the two methods: pyarrow's accepts a richer range of compression methods.

Having either two different compression parameters, or a single parameter that accepts different values depending on the context, could be confusing to end users, so I will only allow the compression methods supported by both.

However, if there are use cases that need specific support for LZO, LZ4, and ZSTD as well, please let me know. There probably aren't, since the compression method has not been exposed at all to date.

It's good that we are marking the parameter as beta, as I can see how this can change in the future. 👍

Update:
Changed my mind after realizing that we probably should document the underlying serialization methods and link to their original docs. Since we are already exposing that detail, it makes less sense to try hiding compression options behind the lowest common denominator.
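
To illustrate the outcome, a sketch of the resulting dispatch (not the exact merged code): the richer pyarrow codec set applies only on the pyarrow path, while the fallback path is constrained to what DataFrame.to_parquet accepts.

```python
if pyarrow is not None and job_config.schema:
    # pyarrow path: the full pyarrow codec range (e.g. LZ4, ZSTD) is usable.
    _pandas_helpers.dataframe_to_parquet(
        dataframe,
        job_config.schema,
        tmppath,
        parquet_compression=parquet_compression,
    )
else:
    # Fallback path: pandas serializes directly, so only the codecs that
    # DataFrame.to_parquet accepts (snappy, gzip, brotli, None) apply.
    dataframe.to_parquet(tmppath, compression=parquet_compression)
```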

tswast (Contributor) left a comment:

Thanks.

I agree that it's confusing to expose this level of internal details, but I think your docstring description is clear.

@plamut plamut added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 6, 2019
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 6, 2019
@plamut plamut merged commit a015978 into googleapis:master Aug 6, 2019
@plamut plamut deleted the iss-7701 branch August 6, 2019 17:57