Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide support for synchronous queries through the v2/projects/{projectId}/queries endpoint #589

Closed
kdeggelman opened this issue Apr 6, 2021 · 16 comments · Fixed by #967, #1723 or #1748
Assignees
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@kdeggelman
Copy link
Contributor

Is your feature request related to a problem? Please describe.
We are part of the BI Engine preview so we're trying to minimize the latency between our python application and BigQuery. The current approach of creating an asynchronous job and then waiting for it to complete adds significant latency and hurts the value proposition of BI Engine.
Describe the solution you'd like
We'd like a way to utilize the v2/projects/{projectId}/queries endpoint to execute a query and wait for the response.
Describe alternatives you've considered
We could call the REST API directly. The main drawbacks are 1) authentication with a service account 2) the existing python client does a nice job of creating Row objects with the results.
I've also used the following snippet:

bq_client = bigquery.Client(project=project_id)
synchronous_api_response = bq_client._connection.api_request(
    "POST", 
    f"/projects/{project_id}/queries",
    data={"query": query, "useLegacySql": False},
)
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Apr 6, 2021
@plamut plamut added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Apr 7, 2021
@plamut
Copy link
Contributor

plamut commented Apr 9, 2021

@kdeggelman I'm not sure it would be feasible to change how the said endpoint behaves on the backend. Even if POST-ing to it directly, it just creates a new job to run the actual query, and the code would still need to fetch query results using the job ID obtained from the first call's response.

As an alternative, have you maybe considered using the faster BigQuery Storage API? It's more performant than the REST API, allows streaming the rows over multiple streams, etc.

Could that be something that can help with your use case?

@kdeggelman
Copy link
Contributor Author

Thanks for the help @plamut!

and the code would still need to fetch query results using the job ID obtained from the first call's response.

The doc you linked to for jobs.query says "Runs a BigQuery SQL query synchronously and returns query results if the query completes within a specified timeout." which is the behavior that I've observed. I am able to grab the rows straight from the response body of that endpoint without hitting getQueryResults.

As an alternative, have you maybe considered using the faster BigQuery Storage API? It's more performant than the REST API, allows streaming the rows over multiple streams, etc.

Could that be something that can help with your use case?

The Storage API definitely looks promising, as I'm sure we'll run into cases where our queries go beyond the timeout of the jobs.query call, or return a large dataset that would benefit from the storage API.

That being said, I think it could still make sense to support the jobs.query method given its speed improvement over jobs.insert + jobs.getQueryResults and the simplicity over the Storage API.

@tswast
Copy link
Contributor

tswast commented Apr 13, 2021

This is a tricky one. jobs.query can be faster for some result sizes, but there are a great deal of result sizes where it is much much slower and causes problems like #403

@tswast
Copy link
Contributor

tswast commented Apr 13, 2021

We got partway through implementing this in #362 before reverting a few of those PRs.

The other problem we encountered is that to do this right, we'd have to cache that first page of results, which was unexpected and undesired for a large class of use cases: #394

That said, I think we could support this ask, but only if it's behind a feature flag. Perhaps we add a use_query_api flag to query that runs jobs.query and caches the first page of results if set?

@kdeggelman
Copy link
Contributor Author

@tswast thanks for providing all those links for context. It's clear that there is no one-size-fits-all answer to the problem of optimally retrieving query results. I think adding some sort of explicit flag to use jobs.query makes sense.

In the meantime, I'm pretty happy with my workaround:

from google.cloud.bigquery.query import _QueryResults

synchronous_api_response = bq_client._connection.api_request(
    "POST",
    f"/projects/{bq_client.project}/queries",
    data={"query": query, "useLegacySql": False},
)
return _QueryResults.from_api_repr(synchronous_api_response).rows

@plamut
Copy link
Contributor

plamut commented Apr 20, 2021

@tswast Could you elaborate on how you imagine this addition should look like from the user perspective?

Client.query() returns a QueryJob, but if taking a shortcut and using the jobs.query endpoint directly, the response is not a QueryJob object. The response only includes a job reference, but fetching the job itself to return it would kind of beat the purpose of making fewer API requests?

If Client.query() is changed to optionally return query results instead, that would seem convoluted - perhaps a new specialized method would be a cleaner approach?

Also asking because if the existing method is updated, we need to think about what to do with parameters such as job_config that are kind of irrelevant when sending a QueryRequest?

@tswast
Copy link
Contributor

tswast commented Apr 21, 2021

Could you elaborate on how you imagine this addition should look like from the user perspective?

@plamut I have a sketch of a plan in #362

Client.query() returns a QueryJob, but if taking a shortcut and using the jobs.query endpoint directly, the response is not a QueryJob object.

Very true! I'd still imagine returning a QueryJob object, but it contains very minimal information. For one thing, the query might not complete within the jobs.query() timeout, so we'd still need logic to support waiting for these jobs to finish. QueryJob already does this.

Whether the job is finished or not, we'd return a QueryJob that is basically just the job reference and a cached _query_results object. To avoid reloading the job information unnecessarily, we might also set the job state to DONE if the query is "completed".

Also asking because if the existing method is updated, we need to think about what to do with parameters such as job_config that are kind of irrelevant when sending a QueryRequest?

These are not irrelevant. Many (but not all, unfortunately) of the parameters in job_config are also supported by the jobs.query() API. In #362 I was thinking we'd autoselect jobs.query or jobs.insert based on how complex the job_config argument is. But now I think that's too "magical" / hard to debug. I think just a flag to call jobs.query is all we desire.

As you can see from https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#queryrequest the request object to jobs.query is kind of a flattened and smaller version of https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationquery A potential implementation is to use the "api representation" of job_config as a starting point and rearrange a few keys such as dryRun. A benefit of this is that as the backend folks add parameters to jobs.query to bring it into parity with jobs.insert, we won't have to update the client to support them.

@tswast
Copy link
Contributor

tswast commented Apr 21, 2021

Oh, and the other thing we'll require is the "first page" hack that I initially implemented in RowIterator. We might need some additional hooks there because we won't have the destination table ID available. This means that to_dataframe and to_arrow's BigQuery Storage API implementation can't be used when jobs.query is used.

@tswast
Copy link
Contributor

tswast commented Mar 25, 2022

This is 1/2 done on the v3 branch, with the new api_method="QUERY" parameter to the query() method.

That said, it's missing all the optimizations to make it useful. It calls jobs.query, but then it still calls the same set of functions as api_method="INSERT" to wait for the query to finish, populate job stats, and download results (to iterable and to dataframe/arrow).

@chalmerlowe
Copy link
Collaborator

This issue has been open for three years and while some effort has been made there appears to be a significant amount of work that needs to be done. Besides the initial request three years ago, there does not appear to be a large user base requesting this feature. Closing due to workload and other priorities.

@chalmerlowe chalmerlowe added the status: will not fix Invalid (untrue/unsound/erroneous), inconsistent with product, not on roadmap. label Sep 19, 2023
@tswast tswast reopened this Nov 14, 2023
@tswast tswast removed the status: will not fix Invalid (untrue/unsound/erroneous), inconsistent with product, not on roadmap. label Nov 14, 2023
@tswast tswast self-assigned this Nov 14, 2023
@tswast
Copy link
Contributor

tswast commented Nov 16, 2023

#1723 finished up the work planned for this project. With that change, when you run Client.query(..., api_method="QUERY") it does a call to jobs.query REST API and uses the results in RowIterator.

Note: it still does a call to the jobs.get API to get the job metadata (importantly, the destination table) so that we can fall back to fetching large results with the BQ Storage API. In #1722 I plan to introduce a new method which can avoid that jobs.get API call in some cases.

@nickmarx12345678
Copy link
Contributor

nickmarx12345678 commented Dec 3, 2023

@tswast thank you for working on this! I am reading your PR and wanted to clarify the usage, as we're working to reduce the latency between query creation time and serving of results from (typically) single page queries.

Our usage at the moment looks roughy like:

job = client.query(query=sql, job_config=job_config)
return [result for result in job.result(
                start_index=start_index, max_results=limit, retry=retry, timeout=timeout)
]

My understanding is we will need to update our query call to set api_method="QUERY", after which, if the job created by the query call completes in less time than timeout, we will pull back the first page immediately, instead of issuing a second getQueryResults request under the hood?

@tswast
Copy link
Contributor

tswast commented Dec 6, 2023

@nickmarx12345678 Yes, to avoid an extra call to getQueryResults, use api_method="QUERY". Note: setting a start index and max results might invalidate what we have cached depending on how many results were originally returned.

Also note: my in flight PR #1722 optimized this further by introducing a query_and_wait method that can avoid an unnecessary call to get the job metadata as well.

@nickmarx12345678
Copy link
Contributor

nickmarx12345678 commented Dec 6, 2023

Thanks! We have already started testing w/ prerelease version and can see the difference in traces. One observation I wanted to share:

It appears in the optimized case, the resultant row_iterator may have its total_rows field set to None, where previously it would be the aggregate number of all results. Still working through the code to understand if this is expected, but wanted to share in case it is not. (Noticed as we use this field to prepare pagination metadata)

e.g.

job = client.query(query=sql, job_config=job_config, api_method="QUERY")
result = job.result(start_index=start_index, max_results=limit, retry=retry, timeout=timeout)
rows = [row for rows in result]
print(result.total_rows)
# None

@tswast
Copy link
Contributor

tswast commented Dec 8, 2023

@nickmarx12345678 I noticed that issue too in https://github.com/googleapis/python-bigquery/pull/1722/files#diff-1172a63bba9d23e3a24e43736ef1275452ef2ef2d4a50d054c209bdd2792fdf5 and have a fix included in that PR. I can pull that fix out into a separate PR so it's not blocked if you prefer.

@tswast
Copy link
Contributor

tswast commented Dec 8, 2023

#1748 fix for total_rows is awaiting review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
5 participants