Query History list() does not pass filter_by correctly #99
Comments
That’s a good writeup. Will add an integration test for it.
Looked into it in more detail today; the fix is not exactly straightforward. Thank you for providing such a detailed explanation of the problem.
We are running into this issue as well. If this is still not fixed, is the best workaround to implement this logic ourselves, passing it as the body? (See databricks-sdk-py/databricks/sdk/service/sql.py, lines 3110 to 3118 at 9aa7ee5.)

Are there any concerns with this workaround?
In my testing, there is also an odd interaction with max_results. For example, if you provide a filter_by with max_results=1, then the first request (no page_token, yes filter_by) does indeed return one result. But it also includes a next_page_token, so the loop will want to continue, with each request returning a single result. @nfx, can we get clarification about what the REST API is actually doing? Is this odd interaction between filter_by and max_results the expected behavior? If so, it seems likely we need to implement max_results within the client SDK, if the server is not going to respect the combination.
Upon further testing, it seems filter_by works, but max_results only applies to a single request. E.g., if I apply a filter_by clause that filters down to 10 queries, and set max_results=1, I end up making 10 requests, each of which returns a single result. The current code does not compensate for the fact that the REST API may respect max_results for a particular request but still return a next_page_token, while callers of the Python API almost certainly expect max_results to affect the total number of results returned.

I also notice that results are given in reverse chronological order. E.g., if we have queries q1 through q5, started at times 1 through 5 respectively, and ask for a filter with start_time_ms=3 and end_time_ms=5, the results are given in the order [q5, q4, q3]. So, if we apply a max_results limit in Python here in the client, it may also produce an unexpected order. Continuing the example, with max_results=1 a user might expect q3 (the earliest query in the range), but if we limit here in Python the natural result would be q5 (the first result returned by the API). Matching the intuitive order would require getting ALL results from the server, reversing them, and then limiting, which would seem to defeat the resource-saving purpose of the parameter.
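The two client-side limiting strategies described above can be sketched as follows. This is purely illustrative: the list literal stands in for the API's reverse-chronological result stream.

```python
from itertools import islice

# Simulated API output for start_time_ms=3..5: reverse chronological order,
# as observed in testing.
api_results = ["q5", "q4", "q3"]

# Strategy 1: truncate the stream as-is. Cheap (stops fetching early),
# but yields the NEWEST query in the range.
natural = list(islice(iter(api_results), 1))

# Strategy 2: fetch everything, reverse, then truncate. Yields the EARLIEST
# query, but defeats the resource-saving purpose of max_results.
intuitive = list(reversed(api_results))[:1]

print(natural)    # ['q5']
print(intuitive)  # ['q3']
```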
Per the API and docstring:
So I suppose
For future reference: we will need to change the program used to generate these clients to handle this case. You can see which files are auto-generated in our SDK by looking at the first line of the file. If you want to address this change, the right place to do it is actually in the ApiClient implementation itself. You can inspect the request endpoint and modify it as needed there. Feel free to give that a shot!

FYI: internally, we're working on fixing the handling of parameters for GET and DELETE APIs across the entire API surface of Databricks. Most endpoints do handle parameters in a compliant manner, but this endpoint doesn't yet. I'll double-check on the progress of this. However, the issue with filter_by and next_page_token will still need separate handling. It may be possible to relax this constraint from the server side, as other APIs support filtering with tokens.
## Changes

The Query History list API `filter_by` query parameter is modeled by a dictionary, rather than a primitive type, defining the allowed filters including `query_start_time_range`, `statuses`, `user_ids`, and `warehouse_ids`. To be compatible with gRPC transcoding, query parameters modeled by message types, as opposed to primitives, need to be separated into one query parameter per nested field, where the key is the path to that query parameter. For example:

```
{'filter_by': {'user_ids': [123, 456]}}
```

becomes

```
filter_by.user_ids=123&filter_by.user_ids=456
```

For this to be compatible with the requests library we use today, we need to convert the first dictionary to

```
{'filter_by.user_ids': [123, 456]}
```

This resolves one of the problems from #99. The issue with the conflict between filter_by and next_page_token will be resolved by the backend service.

## Tests

Added an integration test that covers query history listing with a filter_by parameter.

- [x] `make test` run locally
- [x] `make fmt` applied
- [x] relevant integration tests applied
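The dictionary conversion described above can be sketched with a small helper that recursively flattens nested query parameters into dotted keys. This is an illustrative function, not the SDK's actual implementation:

```python
def flatten_query_params(params, prefix=''):
    """Flatten nested dicts into {'a.b': value} form for gRPC transcoding."""
    flat = {}
    for key, value in params.items():
        path = f'{prefix}.{key}' if prefix else key
        if isinstance(value, dict):
            # Recurse into message-typed parameters, extending the key path.
            flat.update(flatten_query_params(value, path))
        else:
            # Primitives and lists keep their value; requests will expand
            # a list into repeated query parameters with the same key.
            flat[path] = value
    return flat

print(flatten_query_params({'filter_by': {'user_ids': [123, 456]}}))
# {'filter_by.user_ids': [123, 456]}
```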
Actually, the core issue was that we need to preprocess the query parameters before handing them to the requests library. See #249 for how I handled this. For the other issue, we're still following up with the backend team.
## Changes

The Query History list API `filter_by` query parameter is modeled by a dictionary, rather than a primitive type, defining the allowed filters including `query_start_time_range`, `statuses`, `user_ids`, and `warehouse_ids`. To be compatible with gRPC transcoding, query parameters modeled by message types, as opposed to primitives, need to be separated into one query parameter per nested field, where the key is the path to that query parameter. For example:

```
ListQueryHistoryRequest{
	FilterBy: QueryFilter{
		UserIds: []int{123, 456},
	},
}
```

becomes

```
filter_by.user_ids=123&filter_by.user_ids=456
```

go-querystring already flattens this, but to a different form:

```
filter_by[user_ids]=123&filter_by[user_ids]=456
```

To fix this, we replace `[` with `.` and `]` with the empty string. This resolves one of the problems from databricks/databricks-sdk-py#99. The issue with the conflict between filter_by and next_page_token will be resolved by the backend service.

Separately, there is an underlying issue affecting the Go SDK specifically, as we delegate query parameter serialization to the go-querystring library. This library looks at the `url` struct tag to name the query parameters appropriately. However, schemas defined in `Components` in OpenAPI are initially cached with `IsQuery = false` for all fields. As we parse operations, we incrementally discover which entities are present in query parameters, and we need to update the underlying types recursively.

Lastly, this includes a bugfix to support the newest version of Delve.

## Tests

Added an integration test for ListQueryHistory.

- [x] `make test` passing
- [x] `make fmt` applied
- [x] relevant integration tests applied
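The bracket-to-dot rewrite described above is a plain string substitution on the serialized query string. A sketch of the transformation (in Python for illustration; the actual fix lives in the Go SDK):

```python
def fix_query_string(qs):
    # go-querystring emits filter_by[user_ids]=...; gRPC transcoding
    # expects filter_by.user_ids=..., so rewrite the brackets.
    return qs.replace('[', '.').replace(']', '')

qs = 'filter_by[user_ids]=123&filter_by[user_ids]=456'
print(fix_query_string(qs))
# filter_by.user_ids=123&filter_by.user_ids=456
```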
## Changes

The Query History list API `filter_by` query parameter is modeled by a dictionary, rather than a primitive type, defining the allowed filters including `query_start_time_range`, `statuses`, `user_ids`, and `warehouse_ids`. To be compatible with gRPC transcoding, query parameters modeled by message types, as opposed to primitives, need to be separated into one query parameter per nested field, where the key is the path to that query parameter. For example:

```
new ListQueryHistoryRequest()
    .setFilterBy(new QueryFilter().setUserIds(Arrays.asList(123L, 456L)))
```

becomes

```
filter_by.user_ids=123&filter_by.user_ids=456
```

To support this, we need to recursively compute the path of each field in the request object that is annotated with the `QueryParam` annotation, then serialize the value according to Jackson. As part of this, I've also generalized the `Request` class to support repeated query parameter values for lists (see the added integration test). This resolves one of the problems from databricks/databricks-sdk-py#99. The issue with the conflict between filter_by and next_page_token will be resolved by the backend service.

## Tests

Added an integration test for the SQL Query History API.
The backend team mentioned that their fix was released. As for max_results, we are planning to remove this field from the SDK entirely. Instead, you'll be able to iterate through lists just like with any generator, and you can stop iterating when you've consumed the resources you need. Under the hood, the SDK will handle pagination for you. See #279 for this change.
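If max_results is removed in favor of plain generator iteration, limiting the total number of results looks like slicing any iterator. A sketch with a stand-in generator (in the real SDK this would be the lazily-paginating w.query_history.list()):

```python
from itertools import islice

def list_queries():
    # Stand-in for w.query_history.list(); the real SDK would fetch pages
    # under the hood and yield results lazily as they arrive.
    for i in range(1000):
        yield f'query-{i}'

# Consume only the first three results; because the source is a generator,
# pagination stops as soon as we stop iterating.
first_three = list(islice(list_queries(), 3))
print(first_three)
# ['query-0', 'query-1', 'query-2']
```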
Is this resolved in version 0.6.0? When I run code similar to the integration test:

```python
from datetime import datetime

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import QueryFilter, TimeRange

w = WorkspaceClient()  # assumes an authenticated workspace client


def date_to_ms(date):
    return int(datetime.strptime(date, '%Y-%m-%d').timestamp() * 1000)


filter = QueryFilter(
    query_start_time_range=TimeRange(
        start_time_ms=date_to_ms("2023-08-20"), end_time_ms=date_to_ms("2023-08-21")
    )
)
queries = w.query_history.list(filter_by=filter)
for q in queries:
    print(q)
```

I get the same error that was initially reported:
Although it's not documented, the query history list endpoint seems to handle `filter_by` properly only if it is passed in the request body rather than as a query param. Ran using databricks-sdk v0.1.5, requests v2.28.2:

but if I call:

the API works as expected.

Second, when using pagination with the query history endpoint, it doesn't seem to allow specifying `page_token` and `filter_by` at the same time. The current implementation doesn't remove `filter_by` on subsequent calls and thus doesn't paginate correctly when `filter_by` is passed in. Here, I patch the `do` call to pass `query` params as `body`, but this time we fail when we attempt to get the second page.

To summarize (sorry for the long post!), if my assumptions about the query history API are correct, then:

- the `filter_by` param should be removed when querying with `page_token`
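The summarized fix — send filter_by only on the first request and drop it once a page_token is in play — can be sketched with a hypothetical fetch callable standing in for the raw API request (all names here are made up for illustration):

```python
def paginate(fetch, filter_by=None):
    """Yield results page by page, sending filter_by only on the first call.

    `fetch` is a hypothetical callable standing in for the raw REST request;
    it returns a dict with 'res' (results) and optionally 'next_page_token'.
    """
    query = {}
    if filter_by is not None:
        query['filter_by'] = filter_by
    while True:
        page = fetch(**query)
        yield from page.get('res', [])
        token = page.get('next_page_token')
        if not token:
            return
        # filter_by and page_token are mutually exclusive on this endpoint,
        # so subsequent requests carry only the token.
        query = {'page_token': token}

# Demo with a fake two-page response; a real fetch would hit the API.
_pages = {None: {'res': ['q5', 'q4'], 'next_page_token': 't1'},
          't1': {'res': ['q3']}}

def _fake_fetch(filter_by=None, page_token=None):
    # Mimic the server rejecting filter_by together with page_token.
    assert not (filter_by and page_token)
    return _pages[page_token]

results = list(paginate(_fake_fetch, filter_by={'user_ids': [123]}))
print(results)
# ['q5', 'q4', 'q3']
```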