-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
load_table_from_dataframe does not error out when nan
in a required column - Million dollar bug
#1692
Comments
seems like there is an issue with pyarrow library. Following code show the reproduction of the same bug. import pyarrow as pa
import pyarrow.parquet
import pandas as pd
fields = [pa.field('id', pa.string(), nullable=False),
pa.field('name', pa.string(), nullable=False)]
array = [pa.array(['10', '11', '12', '13']),
pa.array(['AAA', None, 'BBB', 'CCC'])]
table = pa.Table.from_arrays(array, schema=pa.schema(fields))
pyarrow.parquet.write_table(table, 'test_arrow.parquet', compression='SNAPPY', use_compliant_nested_type=True)
df = pd.read_parquet("/Users/fki/Documents/git/Demo/bq_api/test_arrow.parquet", engine='pyarrow')
print("\n\n\n")
print(df) Possible solutionuntil issue is fixed in Pyarrow. Disable the nullability for schema in pyarrow table. This will throw unhandled exception though. _pandas_helper.py def bq_to_arrow_field(bq_field, array_type=None):
"""Return the Arrow field, corresponding to a given BigQuery column.
Returns:
None: if the Arrow type cannot be determined.
"""
arrow_type = bq_to_arrow_data_type(bq_field)
if arrow_type is not None:
if array_type is not None:
arrow_type = array_type # For GEOGRAPHY, at least initially
# is_nullable = bq_field.mode.upper() == "NULLABLE"
is_nullable = True Exception if remove nullability
|
other way to fix this is to verify if data has null when column has data type as required. created the pull request for that. |
@Gaurang033 Thank you so much for your investigation, and proposing a solution! It looks like pyarrow did not check whether the arrays are consistent with schema when
It feels like a major neligence on the pyarrow side. I wonder if it's a bug, or it's done by design. At the same time, we have an umbrella issue #1646, which covers a bunch of issues related to the |
@Linchin totally agree. Seems like a bug from pyarrow library as I can reproduce it without using bigquery. if you think we need this patch until pyarrow fix this, let's merge this. I fixed the failing test case. when we provided unknown data type arrow field was coming as null. and so method I added was throwing exception. |
I added a similar comment on the PR #1698:
I am inclined to close this issue as out of scope for us. |
Given that arrow is intended for fast conversions, I think the correct fix for us is to stop using REQUIRED/NULLABLE from BigQuery to determine our local arrow schema. Instead, always set nullable=True locally. |
…values (#1735) Even if the remote schema is REQUIRED Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly: - [ ] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-bigquery/issues/new/choose) before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea - [ ] Ensure the tests and linter pass - [ ] Code coverage does not decrease (if any source code was changed) - [ ] Appropriate docs were updated (if necessary) Fixes #1692 🦕
…values (googleapis#1735) Even if the remote schema is REQUIRED Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly: - [ ] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-bigquery/issues/new/choose) before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea - [ ] Ensure the tests and linter pass - [ ] Code coverage does not decrease (if any source code was changed) - [ ] Appropriate docs were updated (if necessary) Fixes googleapis#1692 🦕
…nto pandas extra (#1726) * feat: Introduce compatibility with native namespace packages * Update copyright year * removed pkg_resources from all test files and moved importlib into pandas extra * feat: removed pkg_resources from all test files and moved importlib into pandas extra * Adding no cover tag to test code * reformatted with black * undo revert * perf: use the first page a results when `query(api_method="QUERY")` (#1723) * perf: use the first page a results when `query(api_method="QUERY")` * add tests * respect max_results with cached page * respect page_size, also avoid bqstorage if almost fully downloaded * skip true test if bqstorage not installed * coverage * fix: ensure query job retry has longer deadline than API request deadline (#1734) In cases where we can't disambiguate API failure from job failure, this ensures we can still retry the job at least once. * fix: `load_table_from_dataframe` now assumes there may be local null values (#1735) Even if the remote schema is REQUIRED Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly: - [ ] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-bigquery/issues/new/choose) before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea - [ ] Ensure the tests and linter pass - [ ] Code coverage does not decrease (if any source code was changed) - [ ] Appropriate docs were updated (if necessary) Fixes #1692 🦕 * chore: standardize samples directory - delete unneeded dependencies (#1732) * chore: standardize samples directory = delete unneeded dependencies * Removed unused import for linter * fix: move grpc, proto-plus and protobuf packages to extras (#1721) * chore: move grpc, proto-plus and protobuff packages to extras * formatted with black * feat: add `job_timeout_ms` to job configuration classes (#1675) * fix: adds new property and tests * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md * updates docs to correct a sphinx failure * Updates formatting * Update tests/system/test_query.py * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md * Update google/cloud/bigquery/job/base.py * updates one test and uses int_or_none * Update tests/system/test_query.py testing something. * Update tests/system/test_query.py * testing coverage feature * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md * minor edits * tweaks to noxfile for testing purposes * add new test to base as experiment * adds a test, updates import statements * add another test * edit to tests * formatting fixes * update noxfile to correct debug code * removes unneeded comments. --------- Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com> --------- Co-authored-by: Chalmer Lowe <chalmerlowe@google.com> Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com> Co-authored-by: Tim Swast <swast@google.com> * remove unnecessary version checks * undo bad commit, remove unneeded version checks * Revert "undo bad commit, remove unneeded version checks" This reverts commit 5c82dcf. * Revert "remove unnecessary version checks" This reverts commit 9331a7e. * revert bad changes, remove pkg_resources from file * after clarification, reimplement changes and ignore 3.12 tests * reformatted with black * removed minimum check * updated pandas installed version check --------- Co-authored-by: Anthonios Partheniou <partheniou@google.com> Co-authored-by: Tim Swast <swast@google.com> Co-authored-by: Chalmer Lowe <chalmerlowe@google.com> Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
When loading a pandas dataframe into Bigquery that contains a nan in a required column, the upload succeeds but the resulting table is not representative of the dataframe. The values in the column containing the
nan
are unordered and the user is not aware of it.Environment details
google-cloud-bigquery
version:Steps to reproduce
Code example
The text was updated successfully, but these errors were encountered: