Use `job_config.schema` for data type conversion if specified in `load_table_from_dataframe`. #8105
Conversation
Force-pushed from 4c90553 to 2e8290e.
raise ValueError("pyarrow is required for BigQuery schema conversion") | ||
|
||
if len(bq_schema) != len(dataframe.columns): | ||
raise ValueError( |
Note from chat: Maybe we want to allow the bq_schema to be used as an override? Any unmentioned columns get the default pandas type inference.
This is how pandas-gbq works. The schema argument is used more as an override for cases where a particular column's type is ambiguous.
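A rough sketch of what that override behavior could look like; the helper name and the `None`-as-sentinel convention are illustrative only, not part of this PR:

def merge_schema_with_inference(dataframe, bq_schema):
    # Hypothetical: columns named in bq_schema use the given field;
    # a None entry means "fall back to default pandas/pyarrow inference".
    overrides = {field.name: field for field in bq_schema}
    return [overrides.get(column) for column in dataframe.columns]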
On second thought, let's leave this as-is and fixup later. Filed #8140 as a feature request.
@@ -1296,7 +1304,10 @@ def load_table_from_dataframe(
        os.close(tmpfd)

        try:
            dataframe.to_parquet(tmppath)
            if job_config.schema:
Note from chat: if schema isn't populated, we might want to call get_table and use the table's schema if the table already exists and we're appending to it. (This is what pandas-gbq does.)
Ditto. Filed #8142. I think this would make a good feature, but shouldn't block this PR.
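A sketch of the pandas-gbq-style fallback described above, assuming hypothetical `client` and `destination` variables in scope:

from google.api_core import exceptions

if job_config.schema is None and job_config.write_disposition == "WRITE_APPEND":
    try:
        # Reuse the existing table's schema so the parquet file matches it.
        job_config.schema = client.get_table(destination).schema
    except exceptions.NotFound:
        pass  # Table doesn't exist yet; fall back to type inference.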
Everything seems reasonable, but the multiple conversions make me a bit twitchy. The part that I didn't verify is the parquet-to-bq type mappings: is there anything special we need to do similar to the avro logicaltype annotations to get the correct type mappings from there?
BQ_TO_ARROW_SCALARS = {}
if pyarrow is not None:  # pragma: NO COVER
Consider this format, which is more idiomatic:

if pyarrow:
    MY_CONST = {
        ...
    }
else:
    MY_CONST = {}

Note two things: (1) all initialization is inside a branch, and (2) we no longer use "is not None" or "is None".
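(One note on why the suggestion is safe: imported modules are always truthy, so `if pyarrow:` is equivalent to `if pyarrow is not None:` when the name is bound to either a module or `None`.)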
if len(bq_schema) != len(dataframe.columns):
    raise ValueError(
        "Number of columns in schema must match number of columns in dataframe"
Period?
Done.
"STRING": pyarrow.string, | ||
"TIME": pyarrow_time, | ||
"TIMESTAMP": pyarrow_timestamp, | ||
} |
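For context, the `TIME` and `TIMESTAMP` entries point at small wrapper functions rather than bare `pyarrow` attributes, because those Arrow types take a unit (and timezone) argument. A sketch, assuming microsecond precision and UTC:

def pyarrow_time():
    # BigQuery TIME has microsecond precision.
    return pyarrow.time64("us")

def pyarrow_timestamp():
    # BigQuery TIMESTAMP is a microsecond-precision, UTC-based instant.
    return pyarrow.timestamp("us", tz="UTC")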
Is there a list somewhere that defines BQ types? I wonder if we can add an assertion here that BQ_TO_ARROW_SCALARS.keys() == BQ_TYPES.keys(), so we have a better guarantee that all types are accounted for.
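If such a canonical registry existed (call it `BQ_TYPES`, hypothetically), the guard could be a short module-level assertion:

# Hypothetical: BQ_TYPES would be the canonical mapping of BigQuery types.
missing = BQ_TYPES.keys() - BQ_TO_ARROW_SCALARS.keys()
assert not missing, "BQ_TO_ARROW_SCALARS lacks entries for: %s" % sorted(missing)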
Not yet. There's an open FR at #7632. I've been hesitant to add such a list, since it's yet another thing to keep in sync manually, but I agree it'd be useful for cases such as this.
    Returns None if default Arrow type inspection should be used.
    """
    # TODO: Use pyarrow.list_(item_type) for repeated (array) fields.
It would be good to include a little more context in the TODO comment as to why we are not adding support for these in this change.
There wasn't a good reason before, so I implemented this.
I tried adding it to the system tests, but now I see there are some open issues in pyarrow that are preventing this support. I think REPEATED support may get fixed when #8093 is fixed, since there's a mode mismatch right now (fields are always set to nullable in the parquet file).
Struct support depends on https://jira.apache.org/jira/browse/ARROW-2587. I've filed https://github.com/googleapis/google-cloud-python/issues/8191 to track this as an open issue.
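For reference, the REPEATED branch is small once the scalar lookup exists. A simplified sketch of the shape it takes, assuming `schema.SchemaField` is imported from `google.cloud.bigquery`:

def bq_to_arrow_data_type(field):
    if field.mode is not None and field.mode.upper() == "REPEATED":
        # Wrap the item type: ARRAY<INT64> becomes pyarrow.list_(pyarrow.int64()).
        inner = bq_to_arrow_data_type(
            schema.SchemaField(field.name, field.field_type)
        )
        return pyarrow.list_(inner) if inner is not None else None
    constructor = BQ_TO_ARROW_SCALARS.get(field.field_type)
    return constructor() if constructor is not None else None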
)
num_rows = 100
nulls = [None] * num_rows
dataframe = pandas.DataFrame(
Nit: I would suggest putting in non-null values for the sample data to make the test more complete.
The bug actually only shows up when the whole column contains nulls, because when at least one non-null value is present, pandas auto-detect code works correctly. I do include non-nulls in the unit tests.
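To make the failure mode concrete, here is a minimal reproduction of the ambiguity (the column name is illustrative):

import pandas

# An all-None object column gives pandas nothing to infer a type from, so
# the parquet file's column type won't match the destination table's type
# unless an explicit schema is supplied.
dataframe = pandas.DataFrame({"geo_col": [None] * 100})
print(dataframe.dtypes)  # geo_col    object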
table = retry_403(Config.CLIENT.create_table)(
    Table(table_id, schema=table_schema)
)
self.to_delete.insert(0, table)
Is there a reason why we prepend the table ref to to_delete instead of appending it?
So that the table gets deleted before the dataset does.
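In other words, cleanup walks `to_delete` front to back, so prepending yields last-created, first-deleted:

to_delete = []
to_delete.insert(0, "dataset")  # created first, deleted last
to_delete.insert(0, "table")    # created last, deleted first
assert to_delete == ["table", "dataset"]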
try:
    import pandas
except ImportError:  # pragma: NO COVER
    pandas = None
If you ever get tired of the try/except pattern, you can write a maybe_import(*args) function that returns a tuple of modules:

def maybe_import(*args):
    modules = []
    for arg in args:
        try:
            modules.append(__import__(arg))
        except ImportError:
            return tuple([None] * len(args))
    return tuple(modules)
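Usage would then be a single unpacking line (note that `__import__` on a dotted name returns the top-level package, so this suits flat module names best):

pandas, pyarrow = maybe_import("pandas", "pyarrow")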
@pytest.mark.skipif(pyarrow is None, reason="Requires `pyarrow`")
def test_bq_to_arrow_data_type(module_under_test, bq_type, bq_mode, is_correct_type):
    field = schema.SchemaField("ignored_name", bq_type, mode=bq_mode)
    got = module_under_test.bq_to_arrow_data_type(field)
s/got/actual/?
Done.
…d_table_from_dataframe`. Use the BigQuery schema to inform encoding of file used in load job. This fixes an issue where a dataframe with ambiguous types (such as an `object` column containing all `None` values) could not be appended to an existing table, since the schemas wouldn't match in most cases.
Force-pushed from 2e8290e to aa38e42.
Thanks for the review, @aryann!
I tried to add support for REPEATED and RECORD columns, but hit some roadblocks. I'll follow up with those types.
Note: Since I did add partial support, I know test coverage will fail. I'll add a commit with additional tests before submitting.
@aryann Just pushed some commits that get unit test coverage back to 100%. Added a system test for non-null scalar values + explicit schema.
@pytest.fixture
def module_under_test():
    from google.cloud.bigquery import _pandas_helpers
Nit: rm empty line?
Thanks. We actually don't have much control over this, since black (the Python code formatter) will just add it back.
Use the BigQuery schema to inform encoding of the file used in the load job. This fixes an issue where a dataframe with ambiguous types (such as an `object` column containing all `None` values) could not be appended to an existing table, since the schemas wouldn't match in most cases.

Closes #7370.
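For reference, the behavior this PR adds is exercised roughly as follows with a current version of the client library; the project, dataset, and table names are placeholders:

import pandas
from google.cloud import bigquery

client = bigquery.Client()
# The all-None object column is ambiguous without an explicit schema.
dataframe = pandas.DataFrame({"geo_col": [None, None, None]})

job_config = bigquery.LoadJobConfig()
job_config.schema = [bigquery.SchemaField("geo_col", "GEOGRAPHY")]
job_config.write_disposition = "WRITE_APPEND"

load_job = client.load_table_from_dataframe(
    dataframe, "my-project.my_dataset.my_table", job_config=job_config
)
load_job.result()  # Wait for the load to complete.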