Query History list() does not pass filter_by correctly #99

Closed
asikowitz opened this issue May 8, 2023 · 10 comments
Labels
api client (issues related to API client) · codegen (issues related to generated code) · OpenAPI (issues related to metadata across all SDKs)

@asikowitz

Although it's not documented, the query history list endpoint seems to handle filter_by properly only if it is passed in the request body rather than as a query param:

Ran using databricks-sdk v0.1.5, requests v2.28.2:

In [2]: from databricks.sdk import WorkspaceClient
In [3]: client = WorkspaceClient(host=..., token=...)
In [4]: from databricks.sdk.service.sql import QueryFilter
In [6]: filter_by = QueryFilter.from_dict(
   ...:             {
   ...:                 "query_start_time_range": {
   ...:                     "start_time_ms": 0,
   ...:                     "end_time_ms": int(time.time() * 1000),
   ...:                 }
   ...:             }
   ...:         )

In [7]: filter_by
Out[7]: QueryFilter(query_start_time_range=TimeRange(end_time_ms=1683569801679, start_time_ms=0), statuses=None, user_ids=None, warehouse_ids=None)

In [9]: next(client.query_history.list(filter_by=filter_by))
---------------------------------------------------------------------------
DatabricksError                           Traceback (most recent call last)
Cell In[9], line 1
----> 1 next(client.query_history.list(filter_by=filter_by))

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/service/sql.py:2824, in QueryHistoryAPI.list(self, filter_by, include_metrics, max_results, page_token, **kwargs)
   2821 if page_token: query['page_token'] = request.page_token
   2823 while True:
-> 2824     json = self._api.do('GET', '/api/2.0/sql/history/queries', query=query)
   2825     if 'res' not in json or not json['res']:
   2826         return

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/core.py:753, in ApiClient.do(self, method, path, query, body)
    749 if not response.ok:
    750     # TODO: experiment with traceback pruning for better readability
    751     # See https://stackoverflow.com/a/58821552/277035
    752     payload = response.json()
--> 753     raise self._make_nicer_error(status_code=response.status_code, **payload) from None
    754 if not len(response.content):
    755     return {}

DatabricksError: Could not parse request object: Expected 'START_OBJECT' not 'VALUE_STRING'
 at [Source: (ByteArrayInputStream); line: 1, column: 15]
 at [Source: java.io.ByteArrayInputStream@794544b2; line: 1, column: 15]

but if I call:

In [53]: "res" in client.query_history._api.do('GET', '/api/2.0/sql/history/queries', body={"filter_by": filter_by.as_dict()})
Out[53]: True

the API works as expected.

Second, when using pagination with the query history endpoint, it doesn't seem to allow specifying page_token and filter_by at the same time:

In [55]: client.query_history._api.do('GET', '/api/2.0/sql/history/queries', body={"filter_by": filter_by.as_dict(), "page_token": "abc"})
---------------------------------------------------------------------------
DatabricksError                           Traceback (most recent call last)
Cell In[55], line 1
----> 1 client.query_history._api.do('GET', '/api/2.0/sql/history/queries', body={"filter_by": filter_by.as_dict(), "page_token": "abc"})

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/core.py:753, in ApiClient.do(self, method, path, query, body)
    749 if not response.ok:
    750     # TODO: experiment with traceback pruning for better readability
    751     # See https://stackoverflow.com/a/58821552/277035
    752     payload = response.json()
--> 753     raise self._make_nicer_error(status_code=response.status_code, **payload) from None
    754 if not len(response.content):
    755     return {}

DatabricksError: You can provide only one of 'page_token' or 'filter_by'

The current implementation doesn't remove filter_by on subsequent calls and thus doesn't paginate correctly when filter_by is passed in. Here, I patch the `do` call to pass the query params as the body; this time the call fails when fetching the second page:

In [56]: original_do = client.api_client.do

In [57]: from unittest.mock import patch

In [62]: def patch_do(method, path, query = None, body = None):
    ...:     print(method,path,query,body)
    ...:     res = original_do(method, path, query = None, body=query)
    ...:     print("RES", res.keys())
    ...:     return res
    ...: 

In [64]: with patch.object(ApiClient, "do") as mock_do:
    ...:     mock_do.side_effect = patch_do
    ...:     it = client.query_history.list(filter_by=filter_by, max_results=1)
    ...:     next(it)
    ...:     next(it)
    ...: 
GET /api/2.0/sql/history/queries {'filter_by': {'query_start_time_range': {'end_time_ms': 1683569801679}}, 'max_results': 1} None
RES dict_keys(['next_page_token', 'has_next_page', 'res'])
GET /api/2.0/sql/history/queries {'filter_by': {'query_start_time_range': {'end_time_ms': 1683569801679}}, 'max_results': 1, 'page_token': 'CkwKJDAxZWRlZGFjLWJjNDktMTgyOS1hM2UwLTYwNDYwZTVkNjU4MBCD2ZHe/zAY09WStKW5tAUiEDcyNTJjYmE1NTlmNDhkZjQo+JgQEgkSBxDP89Hk/zAYAQ=='} None
---------------------------------------------------------------------------
DatabricksError                           Traceback (most recent call last)
Cell In[64], line 5
      3 it = client.query_history.list(filter_by=filter_by, max_results=1)
      4 next(it)
----> 5 next(it)

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/service/sql.py:2824, in QueryHistoryAPI.list(self, filter_by, include_metrics, max_results, page_token, **kwargs)
   2821 if page_token: query['page_token'] = request.page_token
   2823 while True:
-> 2824     json = self._api.do('GET', '/api/2.0/sql/history/queries', query=query)
   2825     if 'res' not in json or not json['res']:
   2826         return

File ~/.pyenv/versions/3.10.9/lib/python3.10/unittest/mock.py:1114, in CallableMixin.__call__(self, *args, **kwargs)
   1112 self._mock_check_sig(*args, **kwargs)
   1113 self._increment_mock_call(*args, **kwargs)
-> 1114 return self._mock_call(*args, **kwargs)

File ~/.pyenv/versions/3.10.9/lib/python3.10/unittest/mock.py:1118, in CallableMixin._mock_call(self, *args, **kwargs)
   1117 def _mock_call(self, /, *args, **kwargs):
-> 1118     return self._execute_mock_call(*args, **kwargs)

File ~/.pyenv/versions/3.10.9/lib/python3.10/unittest/mock.py:1179, in CallableMixin._execute_mock_call(self, *args, **kwargs)
   1177         raise result
   1178 else:
-> 1179     result = effect(*args, **kwargs)
   1181 if result is not DEFAULT:
   1182     return result

Cell In[62], line 3, in patch_do(method, path, query, body)
      1 def patch_do(method, path, query = None, body = None):
      2     print(method,path,query,body)
----> 3     res = original_do(method, path, query = None, body=query)
      4     print("RES", res.keys())
      5     return res

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/core.py:753, in ApiClient.do(self, method, path, query, body)
    749 if not response.ok:
    750     # TODO: experiment with traceback pruning for better readability
    751     # See https://stackoverflow.com/a/58821552/277035
    752     payload = response.json()
--> 753     raise self._make_nicer_error(status_code=response.status_code, **payload) from None
    754 if not len(response.content):
    755     return {}

DatabricksError: You can provide only one of 'page_token' or 'filter_by'

To summarize (sorry for the long post!), if my assumptions about the query history API are correct, then:

  1. All params should be passed in the body, even though it's a GET request
  2. The filter_by param should be removed when querying with page_token
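The two points above can be sketched client-side. This is an illustrative workaround under those assumptions, not the SDK's actual API: `build_payload` is a hypothetical helper that assembles the request body for each page, dropping filter_by once a page_token is in play.

```python
def build_payload(filter_by=None, max_results=None, page_token=None):
    """Assemble the request body for one /api/2.0/sql/history/queries call.

    Hypothetical helper: the API rejects filter_by together with page_token,
    so the filter is only sent on the first page.
    """
    body = {}
    if page_token:
        body['page_token'] = page_token
    elif filter_by is not None:
        body['filter_by'] = filter_by
    if max_results is not None:
        body['max_results'] = max_results
    return body
```

A paging loop would then call `self._api.do('GET', '/api/2.0/sql/history/queries', body=build_payload(...))` on each iteration.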

nfx commented May 8, 2023

That’s a good writeup. Will add an integration test for it.


nfx commented May 10, 2023

Looked into it in more detail today; the fix is not exactly straightforward. Thank you for providing such a detailed explanation of the problem.

@nfx nfx added codegen issues related to generated code api client issues related to API client labels May 10, 2023
@nfx nfx added the OpenAPI issues related to metadata across all SDKs label May 22, 2023
@lukehsiao

We are running into this issue as well.

If this is still not fixed, is the best workaround to implement this loop ourselves, passing the parameters in the body?

while True:
    json = self._api.do('GET', '/api/2.0/sql/history/queries', query=query)
    if 'res' not in json or not json['res']:
        return
    for v in json['res']:
        yield QueryInfo.from_dict(v)
    if 'next_page_token' not in json or not json['next_page_token']:
        return
    query['page_token'] = json['next_page_token']

> looked into it in more detail today - the fix is not exactly straightforward. thank you for providing such detailed explanation of the problem.

Are there any concerns with this workaround?

@lukehsiao

> To summarize (sorry for the long post!), if my assumptions about the query history API are correct, then:
>
>   1. All params should be passed in the body, even though it's a GET request
>   2. The filter_by param should be removed when querying with page_token

In my testing, there is also odd interaction with max_results.

For example, if you provide a filter_by with max_results=1, then the first request (no page_token, yes filter_by) does indeed return 1 result.

But, it also includes a next_page_token, so the loop will want to continue, each one returning a single result.

@nfx, can we get clarification about what the REST API is actually doing? Is this odd interaction with filter_by and max_results the expected behavior?

If so, it seems likely we need to implement max_results within the client SDK, if the server is not going to respect the combination.

@lukehsiao

Upon further testing, it seems filter_by works, but max_results only applies to a single request.

E.g., if I apply a filter_by clause that matches 10 queries and set max_results=1, I end up making 10 API requests, each returning a single result.

It seems the current code does not account for the fact that the REST API may respect max_results for a particular request but still return a next_page_token, while callers of the Python API almost certainly expect max_results to limit the total number of results returned.

I also notice that results are given in reverse chronological order. E.g., if we have the following queries

t=1, q1
t=2, q2
t=3, q3
t=4, q4
t=5, q5

And ask for a filter with start_time_ms=3 and end_time_ms=5, the results are given in the order [q5, q4, q3]. So, if we apply a max_results limit in Python here in the client, it may also be an unexpected order.

Continuing the example, if we say max_results=1, a user might expect q3 (the earliest query in the range), but if we limit here in Python the natural result would be q5 (the first result returned by the API). Matching the intuitive order would require fetching ALL results from the server, reversing them, and then limiting, which would defeat the resource-saving purpose of the parameter.
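The trade-off can be shown with plain Python on a stand-in result list (no API involved; the values mirror the q5/q4/q3 example above):

```python
from itertools import islice

# The server returns matches newest-first: q5, q4, q3 for the range 3..5.
api_results = ['q5', 'q4', 'q3']  # stand-in for the API's result stream

# Limiting lazily in the client keeps the server's order and yields the newest:
newest_first = list(islice(iter(api_results), 1))  # ['q5']

# Returning the earliest query instead requires consuming every result first:
earliest_first = list(reversed(api_results))[:1]   # ['q3']
```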

@lukehsiao

Per the API and docstring:

> Limit the number of results returned in one page.

So I suppose max_results is actually behaving as intended here, and I had an incorrect understanding.


mgyucht commented Jul 21, 2023

For future reference:

We will need to change the program used to generate these clients to handle this case. You can see which files are auto-generated in our SDK by looking at the first line of the file. If you want to address this change, the right place to do it is actually in the ApiClient implementation itself. You can inspect the request endpoint and modify it as needed there. Feel free to give that a shot!

FYI: internally, we're working on fixing the handling of parameters for GET and DELETE APIs across the entire API surface of Databricks. Most endpoints do handle parameters in a compliant manner, but this endpoint doesn't yet. I'll double check on the progress of this.

However, the issue with filter_by and next_page_token will still need separate handling. It may be possible to relax this constraint from the server side, as other APIs support filtering with tokens.

mgyucht added a commit that referenced this issue Jul 25, 2023
## Changes
The Query History list API filter_by query parameter is modeled by a
dictionary, rather than a primitive type, defining the allowed filters
including `query_start_time_range`, `statuses`, `user_ids`, and
`warehouse_ids`. To be compatible with gRPC transcoding, query
parameters modeled by message types as opposed to primitives need to be
separated into one query parameter per nested field, where the key is
the path to that query parameter. For example:

```
{'filter_by': {'user_ids': [123, 456]}}
```
becomes
```
filter_by.user_ids=123&filter_by.user_ids=456
```
For this to be compatible with the requests library we use today, we
need to convert the first dictionary to
```
{'filter_by.user_ids': [123, 456]}
```
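The conversion described above can be sketched as a small recursive helper (an illustration of the idea, not the SDK's actual ApiClient code):

```python
def flatten_query_params(params, prefix=''):
    """Flatten message-typed query params into dotted keys, e.g.
    {'filter_by': {'user_ids': [123, 456]}} -> {'filter_by.user_ids': [123, 456]}.
    """
    flat = {}
    for key, value in params.items():
        path = f'{prefix}.{key}' if prefix else key
        if isinstance(value, dict):
            # Recurse into nested messages, extending the dotted path.
            flat.update(flatten_query_params(value, path))
        else:
            flat[path] = value
    return flat
```

The requests library then serializes list values as repeated query parameters (`filter_by.user_ids=123&filter_by.user_ids=456`).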

This resolves one of the problems from #99. The issue with the conflict
between filter_by and next_page_token will be resolved by the backend
service.

## Tests
Added an integration test that covers query history listing with a
filter_by parameter.

- [x] `make test` run locally
- [x] `make fmt` applied
- [x] relevant integration tests applied

mgyucht commented Jul 25, 2023

Actually, the core issue was that we need to preprocess the query parameters before handing them to the requests library. See #249 for how I handled this. For the other issue, we're still following up with the backend team.

mgyucht added a commit to databricks/databricks-sdk-go that referenced this issue Jul 27, 2023
## Changes
The Query History list API filter_by query parameter is modeled by a
dictionary, rather than a primitive type, defining the allowed filters
including query_start_time_range, statuses, user_ids, and warehouse_ids.
To be compatible with gRPC transcoding, query parameters modeled by
message types as opposed to primitives need to be separated into one
query parameter per nested field, where the key is the path to that
query parameter. For example:

```
ListQueryHistoryRequest{
  FilterBy: QueryFilter{
    UserIds: []int{123, 456}
  }
}
```
becomes
```
filter_by.user_ids=123&filter_by.user_ids=456
```
go-querystring already flattens this but to a different form:
```
filter_by[user_ids]=123&filter_by[user_ids]=456
```
To fix this, we replace `[` with `.` and `]` with empty string.

This resolves one of the problems from
databricks/databricks-sdk-py#99. The issue
with the conflict between filter_by and next_page_token will be resolved
by the backend service.

Separately, there is an underlying issue affecting the Go SDK
specifically, as we delegate query parameter serialization to the
go-querystring library. This library looks at the `url` struct tag to
name the query parameters appropriately. However, schemas defined in
`Components` in OpenAPI are initially cached with IsQuery for all fields
= false. As we parse operations, we incrementally discover which
entities are present in query parameters, and we need to update the
underlying types recursively.

Lastly, this includes a bugfix to support the newest version of Delve.

## Tests
Added an integration test for ListQueryHistory.

- [x] `make test` passing
- [x] `make fmt` applied
- [x] relevant integration tests applied
mgyucht added a commit to databricks/databricks-sdk-java that referenced this issue Jul 27, 2023
## Changes
The Query History list API filter_by query parameter is modeled by a
dictionary, rather than a primitive type, defining the allowed filters
including query_start_time_range, statuses, user_ids, and warehouse_ids.
To be compatible with gRPC transcoding, query parameters modeled by
message types as opposed to primitives need to be separated into one
query parameter per nested field, where the key is the path to that
query parameter. For example:
```
new ListQueryHistoryRequest()
  .setFilterBy(new QueryFilter().setUserIds(Arrays.asList(123L, 456L)))
```
becomes
```
filter_by.user_ids=123&filter_by.user_ids=456
```
For this to be compatible with the requests library we use today, we
need to recursively compute the path of each field in the request object
that is annotated with the `QueryParam` annotation, then serialize the
value according to Jackson.

As part of this, I've also generalized the `Request` class to support
repeated query parameter values for lists (see the added integration
test).

This resolves one of the problems from
databricks/databricks-sdk-py#99. The issue
with the conflict between filter_by and next_page_token will be resolved
by the backend service.

## Tests
Added an integration test for SQL Query History API.

mgyucht commented Aug 14, 2023

The backend team mentioned that their fix was released.

As for max_results, we are planning to remove this field from the SDK entirely. Instead, you'll be able to iterate through lists just like with any generator, and you can stop iterating when you've consumed the resources you need. Under the hood, the SDK will handle pagination for you. See #279 for this change.
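In that model, stopping early is just ordinary generator consumption; a sketch with a stand-in for the SDK's paginated iterator:

```python
from itertools import islice

def paginated_results():
    """Stand-in for an SDK list() generator that fetches pages lazily."""
    for page in ([1, 2], [3, 4], [5, 6]):  # each sublist represents one API page
        yield from page

# Take only what you need; the generator is never advanced past the third item.
first_three = list(islice(paginated_results(), 3))  # [1, 2, 3]
```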

@mgyucht mgyucht closed this as completed Aug 14, 2023
@chris-koester-db

Is this resolved in version 0.6.0? When I run code similar to the integration test:

from datetime import datetime

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import QueryFilter, TimeRange

w = WorkspaceClient()

def date_to_ms(date):
    return int(datetime.strptime(date, '%Y-%m-%d').timestamp() * 1000)

filter = QueryFilter(
    query_start_time_range=TimeRange(
        start_time_ms=date_to_ms("2023-08-20"), end_time_ms=date_to_ms("2023-08-21")
    )
)
queries = w.query_history.list(filter_by=filter)
for q in queries:
    print(q)

I get the same error that was initially reported:

DatabricksError: Could not parse request object: Expected 'START_OBJECT' not 'VALUE_STRING'
 at [Source: (ByteArrayInputStream); line: 1, column: 15]
 at [Source: java.io.ByteArrayInputStream@267ff0fe; line: 1, column: 15]
