Use job_config.schema for data type conversion if specified in load_table_from_dataframe. #8105

Merged

Conversation


@tswast tswast commented May 22, 2019

Use the BigQuery schema to inform the encoding of the file used in the load job.
This fixes an issue where a dataframe with ambiguous types (such as an
`object` column containing all `None` values) could not be appended to
an existing table, since the schemas wouldn't match in most cases.

Closes #7370.
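For context, a minimal end-to-end sketch of the call path this PR targets. The project, dataset, table, and column names below are made up, and the exact call style may vary slightly by library version:

import pandas
from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical destination table.
table_ref = bigquery.TableReference.from_string("my_project.my_dataset.my_table")

# An all-None object column has an ambiguous pandas dtype, so an explicit
# BigQuery schema pins its type for the load job.
dataframe = pandas.DataFrame({"name": ["alice", "bob"], "nickname": [None, None]})

job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("nickname", "STRING", mode="NULLABLE"),
]
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

load_job = client.load_table_from_dataframe(dataframe, table_ref, job_config=job_config)
load_job.result()  # wait for the load to complete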

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label May 22, 2019
@tswast tswast marked this pull request as ready for review May 22, 2019 23:36
@tswast tswast requested a review from a team May 22, 2019 23:36
@tswast tswast added the api: bigquery Issues related to the BigQuery API. label May 22, 2019
@tswast tswast changed the title Use job_config.schema if specified in load_table_from_dataframe. Use job_config.schema for data type conversion if specified in load_table_from_dataframe. May 22, 2019
@tswast tswast force-pushed the issue7370-b132658518-load-dataframe-nulls branch from 4c90553 to 2e8290e Compare May 22, 2019 23:45
@tswast tswast requested a review from shollyman May 23, 2019 14:17
raise ValueError("pyarrow is required for BigQuery schema conversion")

if len(bq_schema) != len(dataframe.columns):
raise ValueError(
tswast (Contributor Author):

Note from chat: Maybe we want to allow the bq_schema to be used as an override? Any unmentioned columns get the default pandas type inference.

This is how pandas-gbq works. The schema argument is used more as an override for when a particular column is ambiguous.
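Roughly what such an override behaviour might look like. This is a sketch only; merge_schema_overrides is not part of the library, and the feature was filed separately:

from google.cloud import bigquery

# Hypothetical helper sketching the pandas-gbq style behaviour described above:
# columns named in bq_schema keep the user's type, all other columns fall back
# to default inference (stubbed out as STRING here for brevity).
def merge_schema_overrides(dataframe, bq_schema):
    overrides = {field.name: field for field in bq_schema}
    merged = []
    for column in dataframe.columns:
        if column in overrides:
            merged.append(overrides[column])
        else:
            # Real code would inspect dataframe[column].dtype here.
            merged.append(bigquery.SchemaField(column, "STRING"))
    return merged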

tswast (Contributor Author):

On second thought, let's leave this as-is and fixup later. Filed #8140 as a feature request.

@@ -1296,7 +1304,10 @@ def load_table_from_dataframe(
    os.close(tmpfd)

    try:
        dataframe.to_parquet(tmppath)
        if job_config.schema:
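An illustrative sketch of what the new branch above does, writing the Parquet file with Arrow types derived from the BigQuery schema. The helper and variable names here are mine, not necessarily the PR's:

import pyarrow
import pyarrow.parquet

def dataframe_to_parquet(dataframe, bq_schema, filepath):
    # Build one Arrow array per column, using the BigQuery field to pick the
    # Arrow type; a None type lets Arrow fall back to normal inference.
    arrays = []
    names = []
    for field in bq_schema:
        arrow_type = bq_to_arrow_data_type(field)  # mapping discussed below
        arrays.append(pyarrow.array(dataframe[field.name], type=arrow_type))
        names.append(field.name)
    arrow_table = pyarrow.Table.from_arrays(arrays, names=names)
    pyarrow.parquet.write_table(arrow_table, filepath)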
tswast (Contributor Author):

Note from chat: if schema isn't populated, we might want to call get_table and use the table's schema if the table already exists and we're appending to it. (This is what pandas-gbq does.)
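A rough sketch of that pandas-gbq style fallback (not what this PR does, and the helper name is made up):

from google.api_core import exceptions

def schema_for_append(client, table_ref, job_config):
    # Prefer an explicit schema from the job config.
    if job_config.schema:
        return job_config.schema
    try:
        # Reuse the destination table's schema when appending to an existing table.
        return client.get_table(table_ref).schema
    except exceptions.NotFound:
        return None  # table doesn't exist yet; fall back to dtype inference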

tswast (Contributor Author):

Ditto. Filed #8142. I think this would make a good feature, but shouldn't block this PR.

@shollyman shollyman (Contributor) left a comment:

Everything seems reasonable, but the multiple conversions make me a bit twitchy. The part that I didn't verify is the parquet-to-bq type mappings: is there anything special we need to do, similar to the Avro logicalType annotations, to get the correct type mappings there?



BQ_TO_ARROW_SCALARS = {}
if pyarrow is not None: # pragma: NO COVER
Reviewer:

Consider this format, which is more idiomatic:

if pyarrow:
    MY_CONST = {
        ...
    }
else:
    MY_CONST = {}

Note two things: (1) all initialization is inside a branch and (2) we no longer use "is not None" or "is None".
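Applied to the constant under review, the suggested shape would look roughly like this (abbreviated; only a few of the scalar types are shown):

try:
    import pyarrow
except ImportError:
    pyarrow = None

if pyarrow:
    BQ_TO_ARROW_SCALARS = {
        "BOOLEAN": pyarrow.bool_,
        "FLOAT": pyarrow.float64,
        "STRING": pyarrow.string,
        # ... remaining scalar types elided ...
    }
else:
    BQ_TO_ARROW_SCALARS = {}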


if len(bq_schema) != len(dataframe.columns):
    raise ValueError(
        "Number of columns in schema must match number of columns in dataframe"
Reviewer:
Period?

tswast (Contributor Author):
Done.

"STRING": pyarrow.string,
"TIME": pyarrow_time,
"TIMESTAMP": pyarrow_timestamp,
}
Reviewer:

Is there a list somewhere that defines BQ types? I wonder if we can add an assertion here that BQ_TO_ARROW_SCALARS.keys() == BQ_TYPES.keys(), so we have a better guarantee that all types are accounted for.
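A sketch of the kind of guard being suggested. BQ_STANDARD_SQL_TYPES is hypothetical here and stands in for the canonical list the question asks about; no such constant exists in the library at the time of this review (see the reply below):

BQ_STANDARD_SQL_TYPES = frozenset({
    "BOOLEAN", "BYTES", "DATE", "DATETIME", "FLOAT", "GEOGRAPHY",
    "INTEGER", "NUMERIC", "STRING", "TIME", "TIMESTAMP",
})

assert set(BQ_TO_ARROW_SCALARS) == BQ_STANDARD_SQL_TYPES, (
    "every scalar BigQuery type needs an Arrow mapping"
)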

tswast (Contributor Author):

Not yet. There's an open FR at #7632. I've been hesitant to add such a list, since it's yet another thing to keep in sync manually, but I agree it'd be useful for cases such as this.


    Returns None if default Arrow type inspection should be used.
    """
    # TODO: Use pyarrow.list_(item_type) for repeated (array) fields.
Reviewer:

It would be good to include a little more context in the TODO comment as to why we are not adding support for these in this change.

tswast (Contributor Author):

There wasn't a good reason before, so I implemented this.

I tried adding it to the system tests, but now I see there are some open issues in pyarrow that are preventing this support. I think REPEATED support may get fixed when #8093 is fixed, since there's a mode mismatch right now (fields are always set to nullable in the parquet file).

Struct support depends on https://jira.apache.org/jira/browse/ARROW-2587. I've filed https://github.com/googleapis/google-cloud-python/issues/8191 to track this as an open issue.
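For reference, the REPEATED handling described above might look roughly like this. This is a sketch; the actual helper in _pandas_helpers may differ in naming and structure, and it relies on the BQ_TO_ARROW_SCALARS mapping shown earlier:

import pyarrow
from google.cloud.bigquery import schema

def bq_to_arrow_data_type(field):
    """Return the Arrow type for a BigQuery SchemaField, or None to let
    default Arrow type inference take over."""
    if field.mode is not None and field.mode.upper() == "REPEATED":
        # An ARRAY<T> column maps to a list of the scalar Arrow type for T.
        inner = schema.SchemaField(field.name, field.field_type)
        inner_type = bq_to_arrow_data_type(inner)
        return pyarrow.list_(inner_type) if inner_type is not None else None
    constructor = BQ_TO_ARROW_SCALARS.get(field.field_type.upper())
    return constructor() if constructor is not None else None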

)
num_rows = 100
nulls = [None] * num_rows
dataframe = pandas.DataFrame(
Reviewer:
Nit: I would suggest putting in non-null values for the sample data to make the test more complete.

tswast (Contributor Author):

The bug actually only shows up when the whole column contains nulls, because when at least one non-null value is present, pandas' auto-detect code works correctly. I do include non-nulls in the unit tests.

table = retry_403(Config.CLIENT.create_table)(
    Table(table_id, schema=table_schema)
)
self.to_delete.insert(0, table)
Reviewer:
Is there a reason why we prepend the table ref to to_delete instead of appending it?

tswast (Contributor Author):

So that the table gets deleted before the dataset does.

try:
    import pandas
except ImportError:  # pragma: NO COVER
    pandas = None
Reviewer:
If you ever get tired of the try/except pattern, you can write a maybe_import(*args) function that returns a tuple of modules:

def maybe_import(*args):
    modules = []
    for arg in args:
        try:
            modules.append(__import__(arg))
        except ImportError:
            # If any module is missing, return None placeholders for all of them.
            return tuple([None] * len(args))
    return tuple(modules)
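Usage would then collapse the separate try/except blocks into one line, e.g.:

pandas, pyarrow = maybe_import("pandas", "pyarrow")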

@pytest.mark.skipIf(pyarrow is None, "Requires `pyarrow`")
def test_bq_to_arrow_data_type(module_under_test, bq_type, bq_mode, is_correct_type):
    field = schema.SchemaField("ignored_name", bq_type, mode=bq_mode)
    got = module_under_test.bq_to_arrow_data_type(field)
Reviewer:
s/got/actual/?

tswast (Contributor Author):
Done.

…d_table_from_dataframe`.

Use the BigQuery schema to inform encoding of file used in load job.
This fixes an issue where a dataframe with ambiguous types (such as an
`object` column containing all `None` values) could not be appended to
an existing table, since the schemas wouldn't match in most cases.
@tswast tswast force-pushed the issue7370-b132658518-load-dataframe-nulls branch from 2e8290e to aa38e42 Compare May 30, 2019 00:38
@tswast tswast (Contributor Author) left a comment:

Thanks for the review @aryann !

I tried to add support for REPEATED and RECORD columns, but hit some roadblocks. I'll follow up with those types.

Note: Since I did add partial support, I know test coverage will fail. I'll add a commit with additional tests before submitting.


@tswast tswast requested a review from shollyman May 30, 2019 21:06
@tswast tswast (Contributor Author) commented May 30, 2019

@aryann Just pushed some commits that get unit test coverage back to 100%. Added system test for non-null scalar values + explicit schema.

@pytest.fixture
def module_under_test():
    from google.cloud.bigquery import _pandas_helpers

Reviewer:
Nit: rm empty line?

tswast (Contributor Author):

Thanks. We actually don't have much control over this, since black (the Python code formatter) will just add it back.
