BigQuery: Use BigQuery schema (from LoadJobConfig) if available when converting to Parquet in load_table_from_dataframe #7370

Closed
Unprocessable opened this issue Feb 15, 2019 · 9 comments · Fixed by #8105

@Unprocessable

Environment details

OS = Windows 7
Python = 3.7.2
google-cloud-bigquery 1.8.1

Steps to reproduce

  1. Try to upload a DataFrame whose column contains only pd.np.nan into a table column that normally uses STRING

Code example

import pandas as pd
from google.cloud import bigquery

JSON = "service-account.json"  # insert path to the JSON access credentials here
dataset_name = "my_dataset"  # insert dataset name here
client = bigquery.Client.from_service_account_json(JSON)

# column_1 mixes NaN values with a string; column_2 contains only strings.
test1 = pd.DataFrame({'column_1': [pd.np.nan, pd.np.nan, 'b'],
                      'column_2': ['1', '1', '1']})

# test2 keeps only the first two rows, so column_1 contains nothing but NaN.
test2 = test1.copy()
test2 = test2[:2]

def upload(data, name):
    dataset = client.dataset(dataset_name)
    table = dataset.table(name)
    return client.load_table_from_dataframe(data, table).result()

print('uploading 1: nan with string in column')
result = upload(test1, 'test1')

print('uploading 2: nan without string in column (fails)')
result2 = upload(test2, 'test1')

Error

google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table. Field column_1 has changed type from STRING to INTEGER

So I've been in contact with enterprise support, but they were unable to help me. Apparently there is a bug in google.cloud.bigquery.Client around the dataframe.to_parquet(buffer) call that causes columns containing only NaN values to be interpreted as FLOAT or INTEGER instead of STRING. This prevents the DataFrame from being uploaded, and there is no other way to introduce NULLs into the table in BigQuery. The issue does not occur in pandas-gbq; support (ticket 18371705) advised me to use that as a workaround until this is fixed and to report the issue here. If you have any questions or need more information, feel free to ask.
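
For reference, a minimal sketch (independent of BigQuery, column names illustrative) of the type inference that appears to be at fault: when a column holds nothing but NaN at upload time, pyarrow has no values to look at and infers a floating-point type rather than a string type.

import pandas as pd
import pyarrow as pa

# An object/float column that happens to contain only NaN at upload time.
df = pd.DataFrame({'column_1': [float('nan'), float('nan')],
                   'column_2': ['1', '1']})

table = pa.Table.from_pandas(df, preserve_index=False)
print(table.schema)  # column_1 comes out as double, not string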

Unprocessable changed the title from "BigQuery: Enterprize support ticket 18371705, Pandas DataFrame with a column of NaNs causes 400 STRING to FLOAT error" to "BigQuery: Enterprise support ticket 18371705, Pandas DataFrame with a column of NaNs causes 400 STRING to FLOAT error" on Feb 15, 2019
tseaver added the "type: question" and "api: bigquery" labels on Feb 15, 2019
@tseaver (Contributor) commented Feb 15, 2019

NaN is not a null value, AFAIK: it is a float. @tswast, can you please clarify here?

@tswast (Contributor) commented Feb 16, 2019

It is somewhat common in the pandas community to use NaN for even non-float values. See: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#values-considered-missing The google-cloud-bigquery library uses None in the case of missing strings, though.

The google-cloud-bigquery library doesn't do any dtype conversions itself to create the Parquet file (which embeds a schema as part of the file format). Instead, it relies on to_parquet, which in turn uses pyarrow.parquet.write_table via the pyarrow dependency.

As I follow the breadcrumbs through that stack, I think the problem is likely in pyarrow's schema inference.

That said, in this case we know the desired BigQuery schema. As a workaround for issues with type inference, this library should probably look at the schema in the load job config passed to the load_table_from_dataframe method and convert to the right pyarrow Schema before serializing to parquet.
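
A rough sketch of that idea, outside the library: if the desired types are known, an explicit pyarrow schema can be passed when building the Arrow table, which overrides the inference for the all-null column. The field names and the in-memory buffer here are illustrative, not the library's code.

import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'column_1': [None, None], 'column_2': ['1', '1']})

# Types taken from the intended BigQuery schema rather than inferred from the data.
arrow_schema = pa.schema([
    ('column_1', pa.string()),
    ('column_2', pa.string()),
])

arrow_table = pa.Table.from_pandas(df, schema=arrow_schema, preserve_index=False)

buffer = io.BytesIO()
pq.write_table(arrow_table, buffer)  # the Parquet file now records column_1 as a string column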

@tswast (Contributor) commented Feb 16, 2019

The reason that this issue does not occur in pandas-gbq is that (1) pandas-gbq serializes to CSV and (2) pandas-gbq does its own dtype-to-BigQuery-schema translation.
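
For context, a simplified sketch of that kind of dtype-to-schema translation (illustrative only, not pandas-gbq's actual implementation): the BigQuery type is derived from the pandas dtype, and missing values simply become empty fields in the CSV, so an all-NaN object column can still be declared as STRING.

import pandas as pd

# Simplified dtype -> BigQuery type mapping, for illustration only.
DTYPE_TO_BQ = {
    'int64': 'INTEGER',
    'float64': 'FLOAT',
    'bool': 'BOOLEAN',
    'datetime64[ns]': 'TIMESTAMP',
    'object': 'STRING',
}

def infer_bq_schema(df):
    # Returns (column name, BigQuery type) pairs based on the pandas dtypes.
    return [(name, DTYPE_TO_BQ.get(str(dtype), 'STRING')) for name, dtype in df.dtypes.items()]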

@tswast (Contributor) commented Feb 16, 2019

Possibly related: https://issues.apache.org/jira/browse/ARROW-2298 and https://issues.apache.org/jira/browse/ARROW-2135 (I believe this means NaNs are/were always treated as floats in pyarrow). As a workaround, try using None for missing string values.
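
A sketch of that workaround applied to the reproduction above (the follow-up below notes it did not fully resolve the problem in practice): cast to object dtype and replace NaN with None before uploading.

import pandas as pd

# Cast to object first so None is not coerced back to NaN, then swap NaN for None.
test2 = test2.astype(object).where(test2.notnull(), None)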

tswast added the "type: feature request" label and removed the "type: question" label on Feb 16, 2019
tswast changed the title from "BigQuery: Enterprise support ticket 18371705, Pandas DataFrame with a column of NaNs causes 400 STRING to FLOAT error" to "BigQuery: Use BigQuery schema (from LoadJobConfig) if available when converting to Parquet in load_table_from_dataframe" on Feb 16, 2019
@max-sixty

tangentially related: apache/arrow#3479

I think the best design would be to use the Pandas schema to build a schema object and send that to BigQuery. That still relies on arrow encoding the df correctly.

@Unprocessable (Author) commented Feb 18, 2019

tangentially related: apache/arrow#3479

I think the best design would be to use the Pandas schema to build a schema object and send that to BigQuery. That still relies on arrow encoding the df correctly.

Except that internally, a Pandas column with NaN will probably default to FLOAT if I understand it correctly.

Possibly related: https://issues.apache.org/jira/browse/ARROW-2298 and https://issues.apache.org/jira/browse/ARROW-2135 (I believe this means NaNs are/were always treated as floats in pyarrow). As a workaround, try using None for missing string values.

Setting a column to None changes the issue to "google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table. Field application has changed type from STRING to INTEGER", even after first setting the whole column to strings.

This fix probably works best from our end, though not the easiest to build:

As a workaround for issues with type inference, this library should probably look at the schema in the load job config passed to the load_table_from_dataframe method and convert to the right pyarrow Schema before serializing to parquet.

However, I am not sure if there is a monetary cost to requesting that data.

@max-sixty

Agree on the resolution of using the LoadJobConfig schema, whether or not that's generated from the pandas DataFrame's schema.

However, I am not sure if there is a monetary cost to requesting that data.

To be clear, the schema that should be looked at is the Python schema object on the LoadJobConfig passed to load_table_from_dataframe, so people can override the schema from the client.
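
In usage terms, the proposal amounts to something like the following (a sketch of the intended call pattern, not behavior that existed at the time of this comment), reusing the client, table reference, and DataFrame from the reproduction above:

from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField('column_1', 'STRING'),
    bigquery.SchemaField('column_2', 'STRING'),
]

# With the proposed change, load_table_from_dataframe would honor this schema
# when converting the DataFrame to Parquet, instead of relying on inference.
client.load_table_from_dataframe(test2, table, job_config=job_config).result()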

@tswast (Contributor) commented May 17, 2019

Some thoughts on possible implementation.

When using the pyarrow engine for to_parquet, we basically convert to an Arrow table first. It's likely at this conversion that the object dtype for an all-None column gets treated as a nullable float column, since there's no other information given to pyarrow about the types besides the dtype and the actual values. Perhaps we should always use pyarrow and explicitly convert to the table ourselves, giving ourselves the chance to adjust dtypes.

If we do end up with an intermediate pyarrow table, we could also use this to determine if there are any array or struct columns to deal with. If not, we could serialize to CSV, instead of parquet. Anecdotally, CSV is faster to serialize and for BigQuery to parse and load, so there could be performance benefits.

If it turns out to be difficult to actually change the dtype during the conversion to a pyarrow table, then using CSV would make the resulting dtype matter less, since null values serialize the same either way. That means this bug would only surface in the rarer case where structs / arrays are used in the same DataFrame as a completely null column.
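
A sketch of that decision point, assuming an explicit intermediate Arrow table (illustrative only, not the library's code): build the Arrow table once, then check whether any column became a struct or list type before choosing CSV or Parquet.

import pyarrow as pa

def has_nested_columns(df):
    # Check whether any column would become a struct or list type in Arrow.
    arrow_table = pa.Table.from_pandas(df, preserve_index=False)
    return any(
        pa.types.is_struct(field.type) or pa.types.is_list(field.type)
        for field in arrow_table.schema
    )

# if has_nested_columns(df): serialize to Parquet; otherwise CSV could be used.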

@tswast (Contributor) commented May 17, 2019

I've investigated this approach (manually converting to Arrow and specifying needed column types via BigQuery column type). I think it will work. I've come up with the Arrow types we'll need to use for a given column type.

pandas -> Arrow -> Parquet -> BigQuery

We need to be able to use BQ schema to override behavior of the pandas -> Arrow + Parquet conversion, especially when the pandas type is ambiguous (object dtype containing all None values, for example).

What happens if the number of columns (in LoadJobConfig and DataFrame) doesn't match? Error! But don't forget about indexes; those could be the difference between a schema matching and not. Maybe don't support writing indexes if a schema is supplied?

Can we tell what the desired schema is if they don't provide it? Not always. If it's an append job, maybe we can make a GET call to compare schemas if we get ambiguity? Let's do that as a follow-up if explicitly passing in a schema doesn't work.

pandas -> Arrow type conversions:

https://arrow.apache.org/docs/python/pandas.html#pandas-arrow-conversion

Arrow -> Parquet type conversions:

https://arrow.apache.org/docs/python/parquet.html#data-type-handling

Parquet -> BigQuery type conversions:

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#parquet_conversions

How should we change the Arrow types if a BigQuery schema is available in the LoadJobConfig?

https://cloud.google.com/bigquery/data-types#legacy_sql_data_types

BigQuery Legacy SQL Data Type → Arrow Array Data Type

STRING → pyarrow.string()
BYTES → pyarrow.binary()
INTEGER → pyarrow.int64()
FLOAT → pyarrow.float64()
NUMERIC → pyarrow.decimal128(38, 9)
BOOLEAN → pyarrow.bool_()
TIMESTAMP → pyarrow.timestamp("us", tz="UTC")
DATE → pyarrow.date32()
TIME → Parquet format for the BigQuery loader not documented. Not supported by the BigQuery Parquet loader; pyarrow.time64("us") loads as INTEGER in BigQuery.
DATETIME → Parquet format for the BigQuery loader not documented. Not supported by the BigQuery Parquet loader; pyarrow.timestamp("us", tz=None) loads as TIMESTAMP.
GEOGRAPHY → Parquet format for the BigQuery loader not documented. Not clear which pyarrow type we'd use for this, either.
RECORD (STRUCT) → Leave type unspecified for now; default to pyarrow inference. Need recursive type mapping for this feature.
MODE=REPEATED (ARRAY) → Leave type unspecified for now; default to pyarrow inference. Need more complex type mapping for this feature.
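
As a rough illustration of how the mapping above might be applied (a sketch assuming google.cloud.bigquery.SchemaField inputs and only the scalar types listed; not the code that eventually landed in #8105):

import pyarrow as pa

BQ_TO_ARROW_SCALARS = {
    'STRING': pa.string(),
    'BYTES': pa.binary(),
    'INTEGER': pa.int64(),
    'FLOAT': pa.float64(),
    'NUMERIC': pa.decimal128(38, 9),
    'BOOLEAN': pa.bool_(),
    'TIMESTAMP': pa.timestamp('us', tz='UTC'),
    'DATE': pa.date32(),
}

def bq_to_arrow_schema(bq_schema):
    # bq_schema: a list of google.cloud.bigquery.SchemaField with scalar types only.
    fields = []
    for field in bq_schema:
        arrow_type = BQ_TO_ARROW_SCALARS.get(field.field_type)
        if arrow_type is None:
            return None  # unknown or complex type: fall back to pyarrow inference
        fields.append(pa.field(field.name, arrow_type, nullable=field.mode != 'REQUIRED'))
    return pa.schema(fields)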

I've manually constructed a parquet file with both non-null values and all nulls and confirmed I'm able to append a file with null columns to an existing table.

What about pandas indexes?

For now: Leave them off if schema is supplied.

How could we use the schema to match DataFrame indexes + columns? Maybe assume the indexes are the first "columns".
