BigQuery: Field <field> has changed mode from REQUIRED to NULLABLE #8093

Closed
timocb opened this issue May 22, 2019 · 8 comments · Fixed by #8230
Labels
api: bigquery Issues related to the BigQuery API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments


timocb commented May 22, 2019

I am encountering the following problem when uploading a pandas DataFrame to a partitioned table:

Environment details

API: BigQuery
OS: macOS High Sierra 10.13.6
Python: 3.5.7
Packages:

google-api-core==1.11.0
google-api-python-client==1.7.8
google-auth==1.6.3
google-auth-httplib2==0.0.3
google-cloud==0.34.0
google-cloud-bigquery==1.12.1
google-cloud-core==1.0.0
google-cloud-dataproc==0.3.1
google-cloud-datastore==1.8.0
google-cloud-storage==1.16.0
google-resumable-media==0.3.2
googleapis-common-protos==1.5.10
parquet==1.2

Steps to reproduce

Create a table on BigQuery with the following fields (matching the schema used in the code below):

  • foo, FLOAT, required
  • bar, INTEGER, required

Reproducible code example (includes creating table)

import pandas as pd
from google.cloud import bigquery


PROJECT = "my-project"
DATASET = "my_dataset"
TABLE = "my_table"


# My table schema
schema = [
    bigquery.SchemaField("foo", "FLOAT", mode="REQUIRED"),
    bigquery.SchemaField("bar", "INTEGER", mode="REQUIRED"),
]


# Set everything up
client = bigquery.Client(PROJECT)
dataset_ref = client.dataset(DATASET)
table_ref = dataset_ref.table(TABLE)


# Delete the table if exists
print("Deleting table if exists...")
client.delete_table(table_ref, not_found_ok=True)


# Create the table
print("Creating table...")
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
table = client.create_table(table, exists_ok=True)

print("Table schema:")
print(table.schema)

print("Table partitioning:")
print(table.time_partitioning)

# Upload data to partition
table_partition = TABLE + "$20190522"
table_ref = dataset_ref.table(table_partition)

df = pd.DataFrame({"foo": [1.0, 2.0, 3.0], "bar": [2, 3, 4]})
client.load_table_from_dataframe(df, table_ref).result()

Output:

Deleting table if exists...
Creating table...
Table schema:
[SchemaField('foo', 'FLOAT', 'REQUIRED', None, ()), SchemaField('bar', 'INTEGER', 'REQUIRED', None, ())]
Table partitioning:
TimePartitioning(type=DAY)
Traceback (most recent call last):
  File "<my-project>/bigquery_failure.py", line 49, in <module>
    client.load_table_from_dataframe(df, table_ref).result()
  File "<my-env>/lib/python3.5/site-packages/google/cloud/bigquery/job.py", line 732, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "<my-env>/lib/python3.5/site-packages/google/api_core/future/polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.BadRequest:
400 Provided Schema does not match Table my-project:my_dataset.my_table$20190522.
Field bar has changed mode from REQUIRED to NULLABLE

Process finished with exit code 1
@tseaver tseaver added api: bigquery Issues related to the BigQuery API. type: question Request for information or clarification. Not an issue. labels May 22, 2019

tseaver commented May 22, 2019

@tswast It seems to me that Client.load_table_from_dataframe is generating a schema with NULLABLE mode, which isn't compatible with the original table's schema, presumably in the process of calling Client.load_table_from_file with the generated Parquet file.


tseaver commented May 22, 2019

Hmm, looks like this one is related to #7370.


tswast commented May 22, 2019

I think BigQuery is probably auto-detecting the column as nullable because the data is loaded as a Parquet file. I don't think Parquet has the option of required types.

@timocb Does this error still occur when you supply a schema manually to the load job? e.g.

job_config = bigquery.LoadJobConfig(schema=schema)
load_job = client.load_table_from_dataframe(
    df, table_ref, job_config=job_config
)
load_job.result()


timocb commented May 23, 2019

@tswast Using your suggestion of passing the schema using the job_config, I get the following error:

google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: 
Provided schema is not compatible with the file 'prod-scotty-e26a7c4b-827d-4d3e-bb1f-002c27becd42'.
Field 'bar' is specified as REQUIRED in provided schema which does not match NULLABLE as specified in the file.

It seems @tseaver is correct: the Parquet file specifies the fields as NULLABLE, but the schema we provide to the job specifies them as REQUIRED.
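(Editor's note: until the Parquet path preserves REQUIRED modes, one possible workaround is to route the load through CSV with an explicit schema, since CSV loads take their modes entirely from the provided schema. This is a hedged sketch, not an official recommendation; it assumes the `client`, `table_ref`, and `schema` objects from the repro above.)

```python
import io

import pandas as pd


def dataframe_to_csv_bytes(df):
    # Serialize without header or index so the columns map
    # positionally onto the destination table's schema.
    return io.BytesIO(df.to_csv(index=False, header=False).encode("utf-8"))


def load_with_explicit_schema(client, df, table_ref, schema):
    # Load via CSV instead of Parquet so BigQuery applies the
    # provided schema (including REQUIRED modes) rather than the
    # NULLABLE modes auto-detected from a Parquet file.
    from google.cloud import bigquery

    job_config = bigquery.LoadJobConfig(
        schema=schema, source_format=bigquery.SourceFormat.CSV
    )
    return client.load_table_from_file(
        dataframe_to_csv_bytes(df), table_ref, job_config=job_config
    )
```

Note this gives up Parquet's type fidelity (e.g. exact float representation), so it is only a stopgap.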


tswast commented May 23, 2019

@timocb Thanks for reporting. As far as I can tell, there's no way to mark a column as REQUIRED in a Parquet file, so I've raised this as a backend feature request at https://issuetracker.google.com/133415569; feel free to "star" it to watch for updates.


tswast commented May 23, 2019

It turns out Parquet does have the ability to mark columns as required, but there's an open issue in Arrow to support it when converting from pandas: https://issues.apache.org/jira/browse/ARROW-5169


timocb commented Jun 5, 2019

Hi @tswast, does #8105 fix this issue?


tswast commented Jun 5, 2019

@timocb #8105 gets us a step closer, but I need to follow up and populate the requiredness bit in the Parquet file based on the BigQuery schema.
