Use job_config.schema for data type conversion if specified in load_table_from_dataframe. #8105

Merged

Conversation


@tswast tswast commented May 22, 2019

Use the BigQuery schema to inform the encoding of the file used in the load job.
This fixes an issue where a dataframe with ambiguous types (such as an
`object` column containing all `None` values) could not be appended to
an existing table, since the schemas wouldn't match in most cases.

Closes #7370.
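For context, a minimal end-to-end sketch of the call path this PR targets. The project, dataset, table, and column names below are made up, and the exact call style may vary slightly by library version:

import pandas
from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical destination table.
table_ref = bigquery.TableReference.from_string("my_project.my_dataset.my_table")

# An all-None object column has an ambiguous pandas dtype, so an explicit
# BigQuery schema pins its type for the load job.
dataframe = pandas.DataFrame({"name": ["alice", "bob"], "nickname": [None, None]})

job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("nickname", "STRING", mode="NULLABLE"),
]
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

load_job = client.load_table_from_dataframe(dataframe, table_ref, job_config=job_config)
load_job.result()  # wait for the load to complete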

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label May 22, 2019
@tswast tswast marked this pull request as ready for review May 22, 2019 23:36
@tswast tswast requested a review from a team May 22, 2019 23:36
@tswast tswast added the api: bigquery Issues related to the BigQuery API. label May 22, 2019
@tswast tswast changed the title Use job_config.schema if specified in load_table_from_dataframe. Use job_config.schema for data type conversion if specified in load_table_from_dataframe. May 22, 2019
@tswast tswast force-pushed the issue7370-b132658518-load-dataframe-nulls branch from 4c90553 to 2e8290e Compare May 22, 2019 23:45
@tswast tswast requested a review from shollyman May 23, 2019 14:17
raise ValueError("pyarrow is required for BigQuery schema conversion")

if len(bq_schema) != len(dataframe.columns):
raise ValueError(
tswast (Contributor Author):

Note from chat: Maybe we want to allow the bq_schema to be used as an override? Any unmentioned columns get the default pandas type inference.

This is how pandas-gbq works. The schema argument is used more as an override for when a particular column is ambiguous.
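Roughly what such an override behaviour might look like. This is a sketch only; merge_schema_overrides is not part of the library, and the feature was filed separately:

from google.cloud import bigquery

# Hypothetical helper sketching the pandas-gbq style behaviour described above:
# columns named in bq_schema keep the user's type, all other columns fall back
# to default inference (stubbed out as STRING here for brevity).
def merge_schema_overrides(dataframe, bq_schema):
    overrides = {field.name: field for field in bq_schema}
    merged = []
    for column in dataframe.columns:
        if column in overrides:
            merged.append(overrides[column])
        else:
            # Real code would inspect dataframe[column].dtype here.
            merged.append(bigquery.SchemaField(column, "STRING"))
    return merged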

tswast (Contributor Author):

On second thought, let's leave this as-is and fixup later. Filed #8140 as a feature request.

@@ -1296,7 +1304,10 @@ def load_table_from_dataframe(
    os.close(tmpfd)

    try:
        dataframe.to_parquet(tmppath)
        if job_config.schema:
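An illustrative sketch of what the new branch above does, writing the Parquet file with Arrow types derived from the BigQuery schema. The helper and variable names here are mine, not necessarily the PR's:

import pyarrow
import pyarrow.parquet

def dataframe_to_parquet(dataframe, bq_schema, filepath):
    # Build one Arrow array per column, using the BigQuery field to pick the
    # Arrow type; a None type lets Arrow fall back to normal inference.
    arrays = []
    names = []
    for field in bq_schema:
        arrow_type = bq_to_arrow_data_type(field)  # mapping discussed below
        arrays.append(pyarrow.array(dataframe[field.name], type=arrow_type))
        names.append(field.name)
    arrow_table = pyarrow.Table.from_arrays(arrays, names=names)
    pyarrow.parquet.write_table(arrow_table, filepath)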
tswast (Contributor Author):

Note from chat: if schema isn't populated, we might want to call get_table and use the table's schema if the table already exists and we're appending to it. (This is what pandas-gbq does.)
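A rough sketch of that pandas-gbq style fallback (not what this PR does, and the helper name is made up):

from google.api_core import exceptions

def schema_for_append(client, table_ref, job_config):
    # Prefer an explicit schema from the job config.
    if job_config.schema:
        return job_config.schema
    try:
        # Reuse the destination table's schema when appending to an existing table.
        return client.get_table(table_ref).schema
    except exceptions.NotFound:
        return None  # table doesn't exist yet; fall back to dtype inference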

tswast (Contributor Author):

Ditto. Filed #8142. I think this would make a good feature, but shouldn't block this PR.

@shollyman shollyman (Contributor) left a comment:

Everything seems reasonable, but the multiple conversions make me a bit twitchy. The part that I didn't verify is the parquet-to-bq type mappings: is there anything special we need to do, similar to the Avro logicalType annotations, to get the correct type mappings there?



BQ_TO_ARROW_SCALARS = {}
if pyarrow is not None: # pragma: NO COVER
Reviewer:

Consider this format, which is more idiomatic:

if pyarrow:
    MY_CONST = {
        ...
    }
else:
    MY_CONST = {}

Note two things: (1) all initialization is inside a branch and (2) we no longer use "is not None" or "is None".
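Applied to the constant under review, the suggested shape would look roughly like this (abbreviated; only a few of the scalar types are shown):

try:
    import pyarrow
except ImportError:
    pyarrow = None

if pyarrow:
    BQ_TO_ARROW_SCALARS = {
        "BOOLEAN": pyarrow.bool_,
        "FLOAT": pyarrow.float64,
        "STRING": pyarrow.string,
        # ... remaining scalar types elided ...
    }
else:
    BQ_TO_ARROW_SCALARS = {}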


if len(bq_schema) != len(dataframe.columns):
    raise ValueError(
        "Number of columns in schema must match number of columns in dataframe"
Reviewer:
Period?

tswast (Contributor Author):
Done.

"STRING": pyarrow.string,
"TIME": pyarrow_time,
"TIMESTAMP": pyarrow_timestamp,
}
Reviewer:

Is there a list somewhere that defines BQ types? I wonder if we can add an assertion here that BQ_TO_ARROW_SCALARS.keys() == BQ_TYPES.keys(), so we have a better guarantee that all types are accounted for.
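A sketch of the kind of guard being suggested. BQ_STANDARD_SQL_TYPES is hypothetical here and stands in for the canonical list the question asks about; no such constant exists in the library at the time of this review (see the reply below):

BQ_STANDARD_SQL_TYPES = frozenset({
    "BOOLEAN", "BYTES", "DATE", "DATETIME", "FLOAT", "GEOGRAPHY",
    "INTEGER", "NUMERIC", "STRING", "TIME", "TIMESTAMP",
})

assert set(BQ_TO_ARROW_SCALARS) == BQ_STANDARD_SQL_TYPES, (
    "every scalar BigQuery type needs an Arrow mapping"
)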

tswast (Contributor Author):

Not yet. There's an open FR at #7632. I've been hesitant to add such a list, since it's yet another thing to keep in sync manually, but I agree it'd be useful for cases such as this.


    Returns None if default Arrow type inspection should be used.
    """
    # TODO: Use pyarrow.list_(item_type) for repeated (array) fields.
Reviewer:

It would be good to include a little more context in the TODO comment as to why we are not adding support for these in this change.

tswast (Contributor Author):

There wasn't a good reason before, so I implemented this.

I tried adding it to the system tests, but now I see there are some open issues in pyarrow that are preventing this support. I think REPEATED support may get fixed when #8093 is fixed, since there's a mode mismatch right now (fields are always set to nullable in the parquet file).

Struct support depends on https://jira.apache.org/jira/browse/ARROW-2587. I've filed https://github.com/googleapis/google-cloud-python/issues/8191 to track this as an open issue.
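For reference, the REPEATED handling described above might look roughly like this. This is a sketch; the actual helper in _pandas_helpers may differ in naming and structure, and it relies on the BQ_TO_ARROW_SCALARS mapping shown earlier:

import pyarrow
from google.cloud.bigquery import schema

def bq_to_arrow_data_type(field):
    """Return the Arrow type for a BigQuery SchemaField, or None to let
    default Arrow type inference take over."""
    if field.mode is not None and field.mode.upper() == "REPEATED":
        # An ARRAY<T> column maps to a list of the scalar Arrow type for T.
        inner = schema.SchemaField(field.name, field.field_type)
        inner_type = bq_to_arrow_data_type(inner)
        return pyarrow.list_(inner_type) if inner_type is not None else None
    constructor = BQ_TO_ARROW_SCALARS.get(field.field_type.upper())
    return constructor() if constructor is not None else None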

)
num_rows = 100
nulls = [None] * num_rows
dataframe = pandas.DataFrame(
Reviewer:
Nit: I would suggest putting in non-null values for the sample data to make the test more complete.

tswast (Contributor Author):

The bug actually only shows up when the whole column contains nulls, because when at least one non-null value is present, pandas' auto-detect code works correctly. I do include non-nulls in the unit tests.

table = retry_403(Config.CLIENT.create_table)(
    Table(table_id, schema=table_schema)
)
self.to_delete.insert(0, table)
Reviewer:
Is there a reason why we prepend the table ref to to_delete instead of appending it?

tswast (Contributor Author):

So that the table gets deleted before the dataset does.

try:
    import pandas
except ImportError:  # pragma: NO COVER
    pandas = None
Reviewer:
If you ever get tired of the try/except pattern, you can write a maybe_import(*args) function that returns a tuple of modules:

def maybe_import(*args):
    modules = []
    for arg in args:
        try:
            modules.append(__import__(arg))
        except ImportError:
            # If any module is missing, return None placeholders for all of them.
            return tuple([None] * len(args))
    return tuple(modules)
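Usage would then collapse the separate try/except blocks into one line, e.g.:

pandas, pyarrow = maybe_import("pandas", "pyarrow")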

@pytest.mark.skipIf(pyarrow is None, "Requires `pyarrow`")
def test_bq_to_arrow_data_type(module_under_test, bq_type, bq_mode, is_correct_type):
    field = schema.SchemaField("ignored_name", bq_type, mode=bq_mode)
    got = module_under_test.bq_to_arrow_data_type(field)
Reviewer:
s/got/actual/?

tswast (Contributor Author):
Done.

…d_table_from_dataframe`.

Use the BigQuery schema to inform encoding of file used in load job.
This fixes an issue where a dataframe with ambiguous types (such as an
`object` column containing all `None` values) could not be appended to
an existing table, since the schemas wouldn't match in most cases.
@tswast tswast force-pushed the issue7370-b132658518-load-dataframe-nulls branch from 2e8290e to aa38e42 Compare May 30, 2019 00:38
@tswast tswast (Contributor Author) left a comment:

Thanks for the review @aryann !

I tried to add support for REPEATED and RECORD columns, but hit some roadblocks. I'll follow up with those types.

Note: Since I did add partial support, I know test coverage will fail. I'll add a commit with additional tests before submitting.


@tswast tswast requested a review from shollyman May 30, 2019 21:06
@tswast tswast (Contributor Author) commented May 30, 2019

@aryann Just pushed some commits that get unit test coverage back to 100%. Added system test for non-null scalar values + explicit schema.

@pytest.fixture
def module_under_test():
    from google.cloud.bigquery import _pandas_helpers

Reviewer:
Nit: rm empty line?

tswast (Contributor Author):

Thanks. We actually don't have much control over this, since black (the Python code formatter) will just add it back.
