-
Notifications
You must be signed in to change notification settings - Fork 298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: use jobs.getQueryResults
to download result sets
#347
Conversation
Based on #341 |
google/cloud/bigquery/job.py
Outdated
@@ -2646,6 +2649,7 @@ def __init__(self, job_id, query, client, job_config=None): | |||
) | |||
|
|||
self._query_results = None | |||
self._get_query_results_kwargs = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a thread-local variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the cached query results might need to be thread-local too. Imagine if two threads called result
with different starting indexes and/or max results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll also need some logic like
to see if we can use the cached page if result
is called more than once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in latest commit.
Since `getQueryResults` was already used to wait for the job to finish, this avoids an additional call to `tabledata.list`. The first page of results are cached in-memory. Additional changes will come in the future to avoid calling the BQ Storage API when the cached results contain the full result set.
7364196
to
983c8d2
Compare
Also, move to thread-local variables for values that were intended to track parameters across methods.
startIndex is no longer passed to the iterator It is used in the initial (cached) call to getQueryResults
Iterator of row data | ||
:class:`~google.cloud.bigquery.table.Row`-s. | ||
""" | ||
row_iterator = RowIterator( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be sure to populate extra args with the field projection. We only need rows and page token.
Per our discussion, I'll be splitting this into 2 PRs:
I'll base them on the refactoring to split up the giant job module here: #361 |
Since
getQueryResults
was already used to wait for the job to finish,this avoids an additional call to
tabledata.list
. The first page ofresults are cached in-memory.
Additional changes will come in the future to avoid calling the BQ
Storage API when the cached results contain the full result set.
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Towards #362