BigQuery: Field <field> has changed mode from REQUIRED to NULLABLE #8093

Closed
timocb opened this issue May 22, 2019 · 8 comments · Fixed by #8230
Labels
api: bigquery Issues related to the BigQuery API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments


timocb commented May 22, 2019

I am encountering the following problem when uploading a pandas DataFrame to a partitioned table:

Environment details

API: BigQuery
OS: macOS High Sierra 10.13.6
Python: 3.5.7
Packages:

google-api-core==1.11.0
google-api-python-client==1.7.8
google-auth==1.6.3
google-auth-httplib2==0.0.3
google-cloud==0.34.0
google-cloud-bigquery==1.12.1
google-cloud-core==1.0.0
google-cloud-dataproc==0.3.1
google-cloud-datastore==1.8.0
google-cloud-storage==1.16.0
google-resumable-media==0.3.2
googleapis-common-protos==1.5.10
parquet==1.2

Steps to reproduce

Create a table on BigQuery with the following fields (matching the schema used in the code below):

  • foo, FLOAT, required
  • bar, INTEGER, required

Reproducible code example (includes creating table)

import pandas as pd
from google.cloud import bigquery


PROJECT = "my-project"
DATASET = "my_dataset"
TABLE = "my_table"


# My table schema
schema = [
    bigquery.SchemaField("foo", "FLOAT", mode="REQUIRED"),
    bigquery.SchemaField("bar", "INTEGER", mode="REQUIRED"),
]


# Set everything up
client = bigquery.Client(PROJECT)
dataset_ref = client.dataset(DATASET)
table_ref = dataset_ref.table(TABLE)


# Delete the table if exists
print("Deleting table if exists...")
client.delete_table(table_ref, not_found_ok=True)


# Create the table
print("Creating table...")
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
table = client.create_table(table, exists_ok=True)

print("Table schema:")
print(table.schema)

print("Table partitioning:")
print(table.time_partitioning)

# Upload data to partition
table_partition = TABLE + "$20190522"
table_ref = dataset_ref.table(table_partition)

df = pd.DataFrame({"foo": [1.0, 2.0, 3.0], "bar": [2, 3, 4]})
client.load_table_from_dataframe(df, table_ref).result()

Output:

Deleting table if exists...
Creating table...
Table schema:
[SchemaField('foo', 'FLOAT', 'REQUIRED', None, ()), SchemaField('bar', 'INTEGER', 'REQUIRED', None, ())]
Table partitioning:
TimePartitioning(type=DAY)
Traceback (most recent call last):
  File "<my-project>/bigquery_failure.py", line 49, in <module>
    client.load_table_from_dataframe(df, table_ref).result()
  File "<my-env>/lib/python3.5/site-packages/google/cloud/bigquery/job.py", line 732, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "<my-env>/lib/python3.5/site-packages/google/api_core/future/polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.BadRequest:
400 Provided Schema does not match Table my-project:my_dataset.my_table$20190522.
Field bar has changed mode from REQUIRED to NULLABLE

Process finished with exit code 1
@tseaver tseaver added api: bigquery Issues related to the BigQuery API. type: question Request for information or clarification. Not an issue. labels May 22, 2019

tseaver commented May 22, 2019

@tswast It seems to me that Client.load_table_from_dataframe is generating a schema with NULLABLE mode, which isn't compatible with the original table's schema, presumably in the process of calling Client.load_table_from_file with the generated Parquet file.


tseaver commented May 22, 2019

Hmm, looks like this one is related to #7370.


tswast commented May 22, 2019

I think BigQuery is probably auto-detecting the column as nullable because the data is loaded as a Parquet file. I don't think Parquet has the option of required types.

@timocb Does this error still occur when you supply a schema manually to the load job? e.g.

job_config = bigquery.LoadJobConfig(schema=schema)
load_job = client.load_table_from_dataframe(
    df, table_ref, job_config=job_config
)
load_job.result()


timocb commented May 23, 2019

@tswast Using your suggestion of passing the schema using the job_config, I get the following error:

google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: 
Provided schema is not compatible with the file 'prod-scotty-e26a7c4b-827d-4d3e-bb1f-002c27becd42'.
Field 'bar' is specified as REQUIRED in provided schema which does not match NULLABLE as specified in the file.

It seems @tseaver is correct: the Parquet file specifies the fields as NULLABLE, but the schema we provide to the job specifies them as REQUIRED.
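(Editor's note: until the Parquet path preserves REQUIRED modes, one possible workaround is to route the load through CSV with an explicit schema, since CSV loads take their modes entirely from the provided schema. This is a hedged sketch, not an official recommendation; it assumes the `client`, `table_ref`, and `schema` objects from the repro above.)

```python
import io

import pandas as pd


def dataframe_to_csv_bytes(df):
    # Serialize without header or index so the columns map
    # positionally onto the destination table's schema.
    return io.BytesIO(df.to_csv(index=False, header=False).encode("utf-8"))


def load_with_explicit_schema(client, df, table_ref, schema):
    # Load via CSV instead of Parquet so BigQuery applies the
    # provided schema (including REQUIRED modes) rather than the
    # NULLABLE modes auto-detected from a Parquet file.
    from google.cloud import bigquery

    job_config = bigquery.LoadJobConfig(
        schema=schema, source_format=bigquery.SourceFormat.CSV
    )
    return client.load_table_from_file(
        dataframe_to_csv_bytes(df), table_ref, job_config=job_config
    )
```

Note this gives up Parquet's type fidelity (e.g. exact float representation), so it is only a stopgap.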


tswast commented May 23, 2019

@timocb Thanks for reporting. As far as I can tell, there's no way to mark a column as REQUIRED in a Parquet file, so I've raised this as a backend feature request at https://issuetracker.google.com/133415569; feel free to "star" it to watch for updates.


tswast commented May 23, 2019

It turns out Parquet does have the ability to mark columns as required, but there's an open issue in Arrow to support it when converting from pandas: https://issues.apache.org/jira/browse/ARROW-5169


timocb commented Jun 5, 2019

Hi @tswast, does #8105 fix this issue?


tswast commented Jun 5, 2019

@timocb #8105 gets us a step closer, but I need to follow up and populate the requiredness bit in the Parquet file based on the BigQuery schema.
