BigQuery: Use BigQuery schema (from LoadJobConfig) if available when converting to Parquet in load_table_from_dataframe #7370

Closed
Unprocessable opened this issue Feb 15, 2019 · 9 comments · Fixed by #8105

@Unprocessable

Environment details

OS = Windows 7
Python = 3.7.2
google-cloud-bigquery 1.8.1

Steps to reproduce

  1. Try to upload a DataFrame whose column contains only pd.np.nan into a table column that normally uses STRING

Code example

import pandas as pd
from google.cloud import bigquery

JSON = "service-account.json"  # insert path to the JSON access credentials here
dataset_name = "my_dataset"  # insert dataset name here
client = bigquery.Client.from_service_account_json(JSON)

# column_1 mixes NaN values with a string; column_2 contains only strings.
test1 = pd.DataFrame({'column_1': [pd.np.nan, pd.np.nan, 'b'],
                      'column_2': ['1', '1', '1']})

# test2 keeps only the first two rows, so column_1 contains nothing but NaN.
test2 = test1.copy()
test2 = test2[:2]

def upload(data, name):
    dataset = client.dataset(dataset_name)
    table = dataset.table(name)
    return client.load_table_from_dataframe(data, table).result()

print('uploading 1: nan with string in column')
result = upload(test1, 'test1')

print('uploading 2: nan without string in column (fails)')
result2 = upload(test2, 'test1')

Error

google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table. Field column_1 has changed type from STRING to INTEGER

So I've been in contact with enterprise support, but they were unable to help me. Apparently there is a bug in google.cloud.bigquery.Client around the dataframe.to_parquet(buffer) call that causes columns containing only NaN values to be interpreted as FLOAT or INTEGER instead of STRING. This prevents the DataFrame from being uploaded, and there is no other way to introduce NULLs into the table in BigQuery. The issue does not occur in pandas-gbq; support (ticket 18371705) advised me to use that as a workaround until this is fixed and to report the issue here. If you have any questions or need more information, feel free to ask.
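
For reference, a minimal sketch (independent of BigQuery, column names illustrative) of the type inference that appears to be at fault: when a column holds nothing but NaN at upload time, pyarrow has no values to look at and infers a floating-point type rather than a string type.

import pandas as pd
import pyarrow as pa

# An object/float column that happens to contain only NaN at upload time.
df = pd.DataFrame({'column_1': [float('nan'), float('nan')],
                   'column_2': ['1', '1']})

table = pa.Table.from_pandas(df, preserve_index=False)
print(table.schema)  # column_1 comes out as double, not string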

Unprocessable changed the title from "BigQuery: Enterprize support ticket 18371705, Pandas DataFrame with a column of NaNs causes 400 STRING to FLOAT error" to "BigQuery: Enterprise support ticket 18371705, Pandas DataFrame with a column of NaNs causes 400 STRING to FLOAT error" on Feb 15, 2019
tseaver added the "type: question" and "api: bigquery" labels on Feb 15, 2019
@tseaver (Contributor) commented Feb 15, 2019

NaN is not a null value, AFAIK: it is a float. @tswast, can you please clarify here?

@tswast (Contributor) commented Feb 16, 2019

It is somewhat common in the pandas community to use NaN for even non-float values. See: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#values-considered-missing The google-cloud-bigquery library uses None in the case of missing strings, though.

The google-cloud-bigquery library doesn't do any dtype conversions itself to create the Parquet file (which embeds a schema as part of the file format). Instead, it relies on to_parquet, which in turn uses pyarrow.parquet.write_table via the pyarrow dependency.

As I follow the breadcrumbs through that stack, I think the problem is likely in pyarrow's schema inference.

That said, in this case we know the desired BigQuery schema. As a workaround for issues with type inference, this library should probably look at the schema in the load job config passed to the load_table_from_dataframe method and convert to the right pyarrow Schema before serializing to parquet.
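
A rough sketch of that idea, outside the library: if the desired types are known, an explicit pyarrow schema can be passed when building the Arrow table, which overrides the inference for the all-null column. The field names and the in-memory buffer here are illustrative, not the library's code.

import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'column_1': [None, None], 'column_2': ['1', '1']})

# Types taken from the intended BigQuery schema rather than inferred from the data.
arrow_schema = pa.schema([
    ('column_1', pa.string()),
    ('column_2', pa.string()),
])

arrow_table = pa.Table.from_pandas(df, schema=arrow_schema, preserve_index=False)

buffer = io.BytesIO()
pq.write_table(arrow_table, buffer)  # the Parquet file now records column_1 as a string column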

@tswast (Contributor) commented Feb 16, 2019

The reason that this issue does not occur in pandas-gbq is that (1) pandas-gbq serializes to CSV and (2) pandas-gbq does its own dtype-to-BigQuery-schema translation.
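
For context, a simplified sketch of that kind of dtype-to-schema translation (illustrative only, not pandas-gbq's actual implementation): the BigQuery type is derived from the pandas dtype, and missing values simply become empty fields in the CSV, so an all-NaN object column can still be declared as STRING.

import pandas as pd

# Simplified dtype -> BigQuery type mapping, for illustration only.
DTYPE_TO_BQ = {
    'int64': 'INTEGER',
    'float64': 'FLOAT',
    'bool': 'BOOLEAN',
    'datetime64[ns]': 'TIMESTAMP',
    'object': 'STRING',
}

def infer_bq_schema(df):
    # Returns (column name, BigQuery type) pairs based on the pandas dtypes.
    return [(name, DTYPE_TO_BQ.get(str(dtype), 'STRING')) for name, dtype in df.dtypes.items()]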

@tswast (Contributor) commented Feb 16, 2019

Possibly related: https://issues.apache.org/jira/browse/ARROW-2298 and https://issues.apache.org/jira/browse/ARROW-2135 (I believe this means NaNs are/were always treated as floats in pyarrow). As a workaround, try using None for missing string values.
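
A sketch of that workaround applied to the reproduction above (the follow-up below notes it did not fully resolve the problem in practice): cast to object dtype and replace NaN with None before uploading.

import pandas as pd

# Cast to object first so None is not coerced back to NaN, then swap NaN for None.
test2 = test2.astype(object).where(test2.notnull(), None)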

tswast added the "type: feature request" label and removed the "type: question" label on Feb 16, 2019
tswast changed the title from "BigQuery: Enterprise support ticket 18371705, Pandas DataFrame with a column of NaNs causes 400 STRING to FLOAT error" to "BigQuery: Use BigQuery schema (from LoadJobConfig) if available when converting to Parquet in load_table_from_dataframe" on Feb 16, 2019
@max-sixty

tangentially related: apache/arrow#3479

I think the best design would be to use the Pandas schema to build a schema object and send that to BigQuery. That still relies on arrow encoding the df correctly.

@Unprocessable (Author) commented Feb 18, 2019

tangentially related: apache/arrow#3479

I think the best design would be to use the Pandas schema to build a schema object and send that to BigQuery. That still relies on arrow encoding the df correctly.

Except that internally, a Pandas column with NaN will probably default to FLOAT if I understand it correctly.

Possibly related: https://issues.apache.org/jira/browse/ARROW-2298 and https://issues.apache.org/jira/browse/ARROW-2135 (I believe this means NaNs are/were always treated as floats in pyarrow). As a workaround, try using None for missing string values.

Setting a column to None changes the issue to "google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table. Field application has changed type from STRING to INTEGER", even after first setting the whole column to strings.

This fix probably works best from our end, though not the easiest to build:

As a workaround for issues with type inference, this library should probably look at the schema in the load job config passed to the load_table_from_dataframe method and convert to the right pyarrow Schema before serializing to parquet.

However, I am not sure if there is a monetary cost to requesting that data.

@max-sixty

Agree on the resolution of using the LoadJobConfig schema, whether or not that's generated from the pandas DataFrame's schema.

However, I am not sure if there is a monetary cost to requesting that data.

To be clear, the schema that should be looked at is the Python schema object on the LoadJobConfig passed to load_table_from_dataframe, so people can override the schema from the client.
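
In usage terms, the proposal amounts to something like the following (a sketch of the intended call pattern, not behavior that existed at the time of this comment), reusing the client, table reference, and DataFrame from the reproduction above:

from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField('column_1', 'STRING'),
    bigquery.SchemaField('column_2', 'STRING'),
]

# With the proposed change, load_table_from_dataframe would honor this schema
# when converting the DataFrame to Parquet, instead of relying on inference.
client.load_table_from_dataframe(test2, table, job_config=job_config).result()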

@tswast (Contributor) commented May 17, 2019

Some thoughts on possible implementation.

When using the pyarrow engine for to_parquet, we basically convert to an Arrow table first. It's likely at this conversion that the object dtype for an all-None column gets treated as a nullable float column, since there's no other information given to pyarrow about the types besides the dtype and the actual values. Perhaps we should always use pyarrow and explicitly convert to the table ourselves, giving ourselves the chance to adjust dtypes.

If we do end up with an intermediate pyarrow table, we could also use this to determine if there are any array or struct columns to deal with. If not, we could serialize to CSV, instead of parquet. Anecdotally, CSV is faster to serialize and for BigQuery to parse and load, so there could be performance benefits.

If it turns out to be difficult to actually change the dtype during the conversion to a pyarrow table, then using CSV would make the resulting dtype matter less, since null values serialize the same either way. That means this bug would only surface in the rarer case where structs / arrays are used in the same DataFrame as a completely null column.
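
A sketch of that decision point, assuming an explicit intermediate Arrow table (illustrative only, not the library's code): build the Arrow table once, then check whether any column became a struct or list type before choosing CSV or Parquet.

import pyarrow as pa

def has_nested_columns(df):
    # Check whether any column would become a struct or list type in Arrow.
    arrow_table = pa.Table.from_pandas(df, preserve_index=False)
    return any(
        pa.types.is_struct(field.type) or pa.types.is_list(field.type)
        for field in arrow_table.schema
    )

# if has_nested_columns(df): serialize to Parquet; otherwise CSV could be used.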

@tswast (Contributor) commented May 17, 2019

I've investigated this approach (manually converting to Arrow and specifying needed column types via BigQuery column type). I think it will work. I've come up with the Arrow types we'll need to use for a given column type.

pandas -> Arrow -> Parquet -> BigQuery

We need to be able to use BQ schema to override behavior of the pandas -> Arrow + Parquet conversion, especially when the pandas type is ambiguous (object dtype containing all None values, for example).

What happens if the number of columns (in LoadJobConfig and DataFrame) doesn't match? Error! But don't forget about indexes; those could be the difference between a schema matching and not. Maybe don't support writing indexes if a schema is supplied?

Can we tell what the desired schema is if they don't provide it? Not always. If it's an append job, maybe we can make a GET call to compare schemas if we get ambiguity? Let's do that as a follow-up if explicitly passing in a schema doesn't work.

pandas -> Arrow type conversions:

https://arrow.apache.org/docs/python/pandas.html#pandas-arrow-conversion

Arrow -> Parquet type conversions:

https://arrow.apache.org/docs/python/parquet.html#data-type-handling

Parquet -> BigQuery type conversions:

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#parquet_conversions

How should we change the Arrow types if a BigQuery schema is available in the LoadJobConfig?

https://cloud.google.com/bigquery/data-types#legacy_sql_data_types

BigQuery Legacy SQL Data Type → Arrow Array Data Type

STRING → pyarrow.string()
BYTES → pyarrow.binary()
INTEGER → pyarrow.int64()
FLOAT → pyarrow.float64()
NUMERIC → pyarrow.decimal128(38, 9)
BOOLEAN → pyarrow.bool_()
TIMESTAMP → pyarrow.timestamp("us", tz="UTC")
DATE → pyarrow.date32()
TIME → Parquet format for the BigQuery loader not documented. Not supported by the BigQuery Parquet loader; pyarrow.time64("us") loads as INTEGER in BigQuery.
DATETIME → Parquet format for the BigQuery loader not documented. Not supported by the BigQuery Parquet loader; pyarrow.timestamp("us", tz=None) loads as TIMESTAMP.
GEOGRAPHY → Parquet format for the BigQuery loader not documented. Not clear which pyarrow type we'd use for this, either.
RECORD (STRUCT) → Leave type unspecified for now; default to pyarrow inference. Need recursive type mapping for this feature.
MODE=REPEATED (ARRAY) → Leave type unspecified for now; default to pyarrow inference. Need more complex type mapping for this feature.
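
As a rough illustration of how the mapping above might be applied (a sketch assuming google.cloud.bigquery.SchemaField inputs and only the scalar types listed; not the code that eventually landed in #8105):

import pyarrow as pa

BQ_TO_ARROW_SCALARS = {
    'STRING': pa.string(),
    'BYTES': pa.binary(),
    'INTEGER': pa.int64(),
    'FLOAT': pa.float64(),
    'NUMERIC': pa.decimal128(38, 9),
    'BOOLEAN': pa.bool_(),
    'TIMESTAMP': pa.timestamp('us', tz='UTC'),
    'DATE': pa.date32(),
}

def bq_to_arrow_schema(bq_schema):
    # bq_schema: a list of google.cloud.bigquery.SchemaField with scalar types only.
    fields = []
    for field in bq_schema:
        arrow_type = BQ_TO_ARROW_SCALARS.get(field.field_type)
        if arrow_type is None:
            return None  # unknown or complex type: fall back to pyarrow inference
        fields.append(pa.field(field.name, arrow_type, nullable=field.mode != 'REQUIRED'))
    return pa.schema(fields)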

I've manually constructed a parquet file with both non-null values and all nulls and confirmed I'm able to append a file with null columns to an existing table.

What about pandas indexes?

For now: Leave them off if schema is supplied.

How could we use the schema to match DataFrame indexes + columns? Maybe assume the indexes are the first "columns".
