Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: use the first page a results when query(api_method="QUERY") #1723

Merged
merged 8 commits into from
Nov 21, 2023

Conversation

tswast
Copy link
Contributor

@tswast tswast commented Nov 15, 2023

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

This is a restoration of part of #374

TODO:

  • Unit tests to verify that no extra jobs.getQueryResults or jobs.get calls happen when iterating over rows that are fully returned from jobs.query.
  • Ensure to_pandas and to_arrow work with this code path.

Closes #589 🦕

@product-auto-label product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery API. labels Nov 15, 2023
@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Nov 15, 2023
@tswast tswast marked this pull request as ready for review November 15, 2023 22:54
@tswast tswast requested review from a team as code owners November 15, 2023 22:54
@tswast tswast requested a review from farhan0102 November 15, 2023 22:54
@shollyman shollyman requested review from chalmerlowe and Linchin and removed request for farhan0102 November 15, 2023 22:56
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Nov 16, 2023
@Linchin
Copy link
Contributor

Linchin commented Nov 16, 2023

I'm not sure I understand. Could you explain why doing this will improve query performance, and only applies when query(api_method="QUERY")?

@tswast
Copy link
Contributor Author

tswast commented Nov 16, 2023

Could you explain why doing this will improve query performance, and only applies when query(api_method="QUERY")?

Great question @Linchin.

There are two APIs for issuing a query in BigQuery: jobs.insert and jobs.query. There are some key differences between these APIs.

  • jobs.insert: creates and returns a query job unless it's considered a "bad request" or insufficient permissions or similar. In general this API returns relatively quickly, but only includes job metadata, no query results.

    When running a query this way, we call jobs.getQueryResults with the page size set to 0 results. This is purely to find out when the job finishes. The reason for 0 results is that for large rows, getQueryResults can actually hang well past the expected 10 seconds, causing issues such as simple query hangs in 0.14.1, works in 0.13.3 python-bigquery-pandas#343.

    Once the query has finished, we call jobs.get to refresh the job metadata. This ensures we have a destination table, which can be used with tabledata.list or the BQ Storage Read API to fetch the results. Many years ago, it used to be much faster to use tabledata.list instead of jobs.getQueryResults to download data, but that changed in late 2019 when many optimizations were made to getQueryResults on the backend. As of perf: use jobs.getQueryResults to download result sets #363, we use either jobs.getQueryResults or the BQ Storage Read API to download results.

    In general, this path works best for large query results, which we expect from folks using pandas, for example.

    Edit: To clarify, after this PR, I think it'd be OK to use jobs.query even for pandas, but since the cached first page of results increases memory usage (increased memory usage in 2.4.0 #394), I don't think it makes sense to change the default. But definitely something to consider since users who want less memory usage could explicitly set api_method="INSERT".

  • jobs.query: starts the query and waits. If it finishes within about 10 seconds, it returns the results in that same API call. If the query has not finished, we use the job ID from the response to call jobs.getQueryResults. The subsequent steps are the same as with jobs.insert.

    The key difference here is that jobs.query can actually return results in the same request/response as the one to start the query, whereas jobs.insert always requires a subsequent API call to fetch the results.

This PR does a few things: (1) if the job has finished, don't make the unnecessary call to getQueryResults (2), actually use the rows returned from jobs.query in RowIterator, and (3) avoid using the BQ Storage Read API if we know only 1 or 2 more API requests are needed to download the rest of the data.

# This also requires updates to `to_dataframe` and the DB API connector
# so that they don't try to read from a destination table if all the
# results are present.
query_job._query_results = google.cloud.bigquery.query._QueryResults(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change mean we also load the query results when job is complete, when query(api_method="INSERT")?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I just double-checked that this method is only called from query_jobs_query, so it won't affect when api_method="INSERT"

Copy link
Contributor

@Linchin Linchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for helping me understand this PR :)

@tswast tswast merged commit 6290517 into main Nov 21, 2023
20 checks passed
@tswast tswast deleted the issue589-RowIterator._is_completely_cached branch November 21, 2023 15:13
kiraksi pushed a commit to kiraksi/python-bigquery that referenced this pull request Nov 27, 2023
…oogleapis#1723)

* perf: use the first page a results when `query(api_method="QUERY")`

* add tests

* respect max_results with cached page

* respect page_size, also avoid bqstorage if almost fully downloaded

* skip true test if bqstorage not installed

* coverage
kiraksi added a commit that referenced this pull request Nov 28, 2023
…nto pandas extra (#1726)

* feat: Introduce compatibility with native namespace packages

* Update copyright year

* removed pkg_resources from all test files and moved importlib into pandas extra

* feat: removed pkg_resources from all test files and moved importlib into pandas extra

* Adding no cover tag to test code

* reformatted with black

* undo revert

* perf: use the first page a results when `query(api_method="QUERY")` (#1723)

* perf: use the first page a results when `query(api_method="QUERY")`

* add tests

* respect max_results with cached page

* respect page_size, also avoid bqstorage if almost fully downloaded

* skip true test if bqstorage not installed

* coverage

* fix: ensure query job retry has longer deadline than API request deadline (#1734)

In cases where we can't disambiguate API failure from job failure,
this ensures we can still retry the job at least once.

* fix: `load_table_from_dataframe` now assumes there may be local null values (#1735)

Even if the remote schema is REQUIRED

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [ ] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-bigquery/issues/new/choose) before writing your code!  That way we can discuss the change, evaluate designs, and agree on the general idea
- [ ] Ensure the tests and linter pass
- [ ] Code coverage does not decrease (if any source code was changed)
- [ ] Appropriate docs were updated (if necessary)

Fixes #1692 🦕

* chore: standardize samples directory - delete unneeded dependencies (#1732)

* chore: standardize samples directory = delete unneeded dependencies

* Removed unused import for linter

* fix: move grpc, proto-plus and protobuf packages to extras (#1721)

* chore: move grpc, proto-plus and protobuff packages to extras

* formatted with black

* feat: add `job_timeout_ms` to job configuration classes (#1675)

* fix: adds new property and tests

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* updates docs to correct a sphinx failure

* Updates formatting

* Update tests/system/test_query.py

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* Update google/cloud/bigquery/job/base.py

* updates one test and uses int_or_none

* Update tests/system/test_query.py

testing something.

* Update tests/system/test_query.py

* testing coverage feature

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* minor edits

* tweaks to noxfile for testing purposes

* add new test to base as experiment

* adds a test, updates import statements

* add another test

* edit to tests

* formatting fixes

* update noxfile to correct debug code

* removes unneeded comments.

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>

---------

Co-authored-by: Chalmer Lowe <chalmerlowe@google.com>
Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Co-authored-by: Tim Swast <swast@google.com>

* remove unnecessary version checks

* undo bad commit, remove unneeded version checks

* Revert "undo bad commit, remove unneeded version checks"

This reverts commit 5c82dcf.

* Revert "remove unnecessary version checks"

This reverts commit 9331a7e.

* revert bad changes, remove pkg_resources from file

* after clarification, reimplement changes and ignore 3.12 tests

* reformatted with black

* removed minimum check

* updated pandas installed version check

---------

Co-authored-by: Anthonios Partheniou <partheniou@google.com>
Co-authored-by: Tim Swast <swast@google.com>
Co-authored-by: Chalmer Lowe <chalmerlowe@google.com>
Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide support for synchronous queries through the v2/projects/{projectId}/queries endpoint
2 participants