Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_table_from_dataframe does not error out when nan in a required column - Million dollar bug #1692

Closed
gbmarc1 opened this issue Oct 18, 2023 · 6 comments · Fixed by #1735
Assignees
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@gbmarc1
Copy link

gbmarc1 commented Oct 18, 2023

When loading a pandas dataframe into Bigquery that contains a nan in a required column, the upload succeeds but the resulting table is not representative of the dataframe. The values in the column containing the nan are unordered and the user is not aware of it.

Environment details

  • OS type and version:
  • Python version: Python 3.11.4
  • pip version: pip 23.2.1
  • google-cloud-bigquery version:
  • Name: google-cloud-bigquery
Version: 3.12.0
Summary: Google BigQuery API client library
Home-page: https://github.com/googleapis/python-bigquery
Author: Google LLC
Author-email: googleapis-packages@google.com
License: Apache 2.0
Requires: google-api-core, google-cloud-core, google-resumable-media, grpcio, grpcio, packaging, proto-plus, protobuf, python-dateutil, requests
Required-by: google-cloud-aiplatform, pandas-gbq

Steps to reproduce

  1. Run the code below

Code example

from google.cloud import bigquery 
import pandas as pd
import numpy as np

df = pd.DataFrame([["hello", "string"], ["hello2", np.nan], ["hello3", "valid"], ["hello4", "valid2"]], columns=["image_uri", "phash"])

client = bigquery.Client(project="project")
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("image_uri", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("phash", "STRING", mode="REQUIRED"),
    ]
)
job = client.load_table_from_dataframe(
    df, "foo.foo_bar", job_config=job_config
)
job.result()

df_read = pd.read_gbq("foo.foo_bar", project_id="project")

image
image

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Oct 18, 2023
@Gaurang033
Copy link
Contributor

seems like there is an issue with pyarrow library. Following code show the reproduction of the same bug.

    import pyarrow as pa
    import pyarrow.parquet
    import pandas as pd

    fields = [pa.field('id', pa.string(), nullable=False),
              pa.field('name', pa.string(), nullable=False)]
    array = [pa.array(['10', '11', '12', '13']),
             pa.array(['AAA', None, 'BBB', 'CCC'])]
    table = pa.Table.from_arrays(array, schema=pa.schema(fields))
    pyarrow.parquet.write_table(table, 'test_arrow.parquet', compression='SNAPPY', use_compliant_nested_type=True)
    df = pd.read_parquet("/Users/fki/Documents/git/Demo/bq_api/test_arrow.parquet", engine='pyarrow')
    print("\n\n\n")
    print(df)

Possible solution

until issue is fixed in Pyarrow. Disable the nullability for schema in pyarrow table. This will throw unhandled exception though.

_pandas_helper.py

def bq_to_arrow_field(bq_field, array_type=None):
    """Return the Arrow field, corresponding to a given BigQuery column.

    Returns:
        None: if the Arrow type cannot be determined.
    """
    arrow_type = bq_to_arrow_data_type(bq_field)
    if arrow_type is not None:
        if array_type is not None:
            arrow_type = array_type  # For GEOGRAPHY, at least initially
        # is_nullable = bq_field.mode.upper() == "NULLABLE"
        is_nullable = True

Exception if remove nullability

        if self._exception is not None:
            # pylint: disable=raising-bad-type
            # Pylint doesn't recognize that this is valid in this case.
>           raise self._exception
E           google.api_core.exceptions.BadRequest: 400 Required field phash cannot be null

@Gaurang033
Copy link
Contributor

other way to fix this is to verify if data has null when column has data type as required. created the pull request for that.

@Linchin Linchin added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p3 Desirable enhancement or fix. May not be included in next release. labels Oct 31, 2023
@Linchin
Copy link
Contributor

Linchin commented Nov 3, 2023

@Gaurang033 Thank you so much for your investigation, and proposing a solution! It looks like pyarrow did not check whether the arrays are consistent with schema when Table.from_arrays() is called, because table = pa.Table.from_arrays(array, schema=pa.schema(fields))(in your reproduction example) yields the following without an error:

   id  name
0  10   AAA
1  11  None
2  12   BBB
3  13   CCC

It feels like a major neligence on the pyarrow side. I wonder if it's a bug, or it's done by design.

At the same time, we have an umbrella issue #1646, which covers a bunch of issues related to the load_table_from_*() methods. We may need a more comprehensive look into them all, in case this patch would break other user paths.

@Gaurang033
Copy link
Contributor

@Linchin totally agree. Seems like a bug from pyarrow library as I can reproduce it without using bigquery. if you think we need this patch until pyarrow fix this, let's merge this. I fixed the failing test case. when we provided unknown data type arrow field was coming as null. and so method I added was throwing exception.

@chalmerlowe
Copy link
Collaborator

I added a similar comment on the PR #1698:

I worry about spending too much time and energy trying to create a work around for what we all agree is a problem in pyarrow. This feels like it creates greater complexity in our code, increased fragility, and a higher maintenance burden in the long run.

Am I missing something?

I am inclined to close this issue as out of scope for us.

@tswast tswast reopened this Nov 22, 2023
@tswast
Copy link
Contributor

tswast commented Nov 22, 2023

Given that arrow is intended for fast conversions, I think the correct fix for us is to stop using REQUIRED/NULLABLE from BigQuery to determine our local arrow schema. Instead, always set nullable=True locally.

gcf-merge-on-green bot pushed a commit that referenced this issue Nov 22, 2023
…values (#1735)

Even if the remote schema is REQUIRED

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [ ] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-bigquery/issues/new/choose) before writing your code!  That way we can discuss the change, evaluate designs, and agree on the general idea
- [ ] Ensure the tests and linter pass
- [ ] Code coverage does not decrease (if any source code was changed)
- [ ] Appropriate docs were updated (if necessary)

Fixes #1692 🦕
@tswast tswast self-assigned this Nov 22, 2023
kiraksi pushed a commit to kiraksi/python-bigquery that referenced this issue Nov 27, 2023
…values (googleapis#1735)

Even if the remote schema is REQUIRED

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [ ] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-bigquery/issues/new/choose) before writing your code!  That way we can discuss the change, evaluate designs, and agree on the general idea
- [ ] Ensure the tests and linter pass
- [ ] Code coverage does not decrease (if any source code was changed)
- [ ] Appropriate docs were updated (if necessary)

Fixes googleapis#1692 🦕
kiraksi added a commit that referenced this issue Nov 28, 2023
…nto pandas extra (#1726)

* feat: Introduce compatibility with native namespace packages

* Update copyright year

* removed pkg_resources from all test files and moved importlib into pandas extra

* feat: removed pkg_resources from all test files and moved importlib into pandas extra

* Adding no cover tag to test code

* reformatted with black

* undo revert

* perf: use the first page a results when `query(api_method="QUERY")` (#1723)

* perf: use the first page a results when `query(api_method="QUERY")`

* add tests

* respect max_results with cached page

* respect page_size, also avoid bqstorage if almost fully downloaded

* skip true test if bqstorage not installed

* coverage

* fix: ensure query job retry has longer deadline than API request deadline (#1734)

In cases where we can't disambiguate API failure from job failure,
this ensures we can still retry the job at least once.

* fix: `load_table_from_dataframe` now assumes there may be local null values (#1735)

Even if the remote schema is REQUIRED

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [ ] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-bigquery/issues/new/choose) before writing your code!  That way we can discuss the change, evaluate designs, and agree on the general idea
- [ ] Ensure the tests and linter pass
- [ ] Code coverage does not decrease (if any source code was changed)
- [ ] Appropriate docs were updated (if necessary)

Fixes #1692 🦕

* chore: standardize samples directory - delete unneeded dependencies (#1732)

* chore: standardize samples directory = delete unneeded dependencies

* Removed unused import for linter

* fix: move grpc, proto-plus and protobuf packages to extras (#1721)

* chore: move grpc, proto-plus and protobuff packages to extras

* formatted with black

* feat: add `job_timeout_ms` to job configuration classes (#1675)

* fix: adds new property and tests

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* updates docs to correct a sphinx failure

* Updates formatting

* Update tests/system/test_query.py

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* Update google/cloud/bigquery/job/base.py

* updates one test and uses int_or_none

* Update tests/system/test_query.py

testing something.

* Update tests/system/test_query.py

* testing coverage feature

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* minor edits

* tweaks to noxfile for testing purposes

* add new test to base as experiment

* adds a test, updates import statements

* add another test

* edit to tests

* formatting fixes

* update noxfile to correct debug code

* removes unneeded comments.

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>

---------

Co-authored-by: Chalmer Lowe <chalmerlowe@google.com>
Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Co-authored-by: Tim Swast <swast@google.com>

* remove unnecessary version checks

* undo bad commit, remove unneeded version checks

* Revert "undo bad commit, remove unneeded version checks"

This reverts commit 5c82dcf.

* Revert "remove unnecessary version checks"

This reverts commit 9331a7e.

* revert bad changes, remove pkg_resources from file

* after clarification, reimplement changes and ignore 3.12 tests

* reformatted with black

* removed minimum check

* updated pandas installed version check

---------

Co-authored-by: Anthonios Partheniou <partheniou@google.com>
Co-authored-by: Tim Swast <swast@google.com>
Co-authored-by: Chalmer Lowe <chalmerlowe@google.com>
Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
5 participants