Query History list() does not pass filter_by correctly #99

Closed
asikowitz opened this issue May 8, 2023 · 10 comments
Labels
api client (issues related to API client) · codegen (issues related to generated code) · OpenAPI (issues related to metadata across all SDKs)

@asikowitz

Although it's not documented, the query history list endpoint seems to handle filter_by properly only if it is passed in the request body rather than as a query param:

Ran using databricks-sdk v0.1.5, requests v2.28.2:

In [2]: from databricks.sdk import WorkspaceClient
In [3]: client = WorkspaceClient(host=..., token=...)
In [4]: from databricks.sdk.service.sql import QueryFilter
In [6]: filter_by = QueryFilter.from_dict(
   ...:             {
   ...:                 "query_start_time_range": {
   ...:                     "start_time_ms": 0,
   ...:                     "end_time_ms": int(time.time() * 1000),
   ...:                 }
   ...:             }
   ...:         )

In [7]: filter_by
Out[7]: QueryFilter(query_start_time_range=TimeRange(end_time_ms=1683569801679, start_time_ms=0), statuses=None, user_ids=None, warehouse_ids=None)

In [9]: next(client.query_history.list(filter_by=filter_by))
---------------------------------------------------------------------------
DatabricksError                           Traceback (most recent call last)
Cell In[9], line 1
----> 1 next(client.query_history.list(filter_by=filter_by))

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/service/sql.py:2824, in QueryHistoryAPI.list(self, filter_by, include_metrics, max_results, page_token, **kwargs)
   2821 if page_token: query['page_token'] = request.page_token
   2823 while True:
-> 2824     json = self._api.do('GET', '/api/2.0/sql/history/queries', query=query)
   2825     if 'res' not in json or not json['res']:
   2826         return

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/core.py:753, in ApiClient.do(self, method, path, query, body)
    749 if not response.ok:
    750     # TODO: experiment with traceback pruning for better readability
    751     # See https://stackoverflow.com/a/58821552/277035
    752     payload = response.json()
--> 753     raise self._make_nicer_error(status_code=response.status_code, **payload) from None
    754 if not len(response.content):
    755     return {}

DatabricksError: Could not parse request object: Expected 'START_OBJECT' not 'VALUE_STRING'
 at [Source: (ByteArrayInputStream); line: 1, column: 15]
 at [Source: java.io.ByteArrayInputStream@794544b2; line: 1, column: 15]

but if I call:

In [53]: "res" in client.query_history._api.do('GET', '/api/2.0/sql/history/queries', body={"filter_by": filter_by.as_dict()})
Out[53]: True

the API works as expected.

Second, when using pagination with the query history endpoint, it doesn't seem to allow specifying page_token and filter_by at the same time:

In [55]: client.query_history._api.do('GET', '/api/2.0/sql/history/queries', body={"filter_by": filter_by.as_dict(), "page_token": "abc"})
---------------------------------------------------------------------------
DatabricksError                           Traceback (most recent call last)
Cell In[55], line 1
----> 1 client.query_history._api.do('GET', '/api/2.0/sql/history/queries', body={"filter_by": filter_by.as_dict(), "page_token": "abc"})

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/core.py:753, in ApiClient.do(self, method, path, query, body)
    749 if not response.ok:
    750     # TODO: experiment with traceback pruning for better readability
    751     # See https://stackoverflow.com/a/58821552/277035
    752     payload = response.json()
--> 753     raise self._make_nicer_error(status_code=response.status_code, **payload) from None
    754 if not len(response.content):
    755     return {}

DatabricksError: You can provide only one of 'page_token' or 'filter_by'

The current implementation doesn't remove filter_by on subsequent calls and thus doesn't paginate correctly when filter_by is passed in. Here, I patch the `do` call to pass the query params as the body; this time the call fails when fetching the second page:

In [56]: original_do = client.api_client.do

In [57]: from unittest.mock import patch

In [62]: def patch_do(method, path, query = None, body = None):
    ...:     print(method,path,query,body)
    ...:     res = original_do(method, path, query = None, body=query)
    ...:     print("RES", res.keys())
    ...:     return res
    ...: 

In [64]: with patch.object(ApiClient, "do") as mock_do:
    ...:     mock_do.side_effect = patch_do
    ...:     it = client.query_history.list(filter_by=filter_by, max_results=1)
    ...:     next(it)
    ...:     next(it)
    ...: 
GET /api/2.0/sql/history/queries {'filter_by': {'query_start_time_range': {'end_time_ms': 1683569801679}}, 'max_results': 1} None
RES dict_keys(['next_page_token', 'has_next_page', 'res'])
GET /api/2.0/sql/history/queries {'filter_by': {'query_start_time_range': {'end_time_ms': 1683569801679}}, 'max_results': 1, 'page_token': 'CkwKJDAxZWRlZGFjLWJjNDktMTgyOS1hM2UwLTYwNDYwZTVkNjU4MBCD2ZHe/zAY09WStKW5tAUiEDcyNTJjYmE1NTlmNDhkZjQo+JgQEgkSBxDP89Hk/zAYAQ=='} None
---------------------------------------------------------------------------
DatabricksError                           Traceback (most recent call last)
Cell In[64], line 5
      3 it = client.query_history.list(filter_by=filter_by, max_results=1)
      4 next(it)
----> 5 next(it)

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/service/sql.py:2824, in QueryHistoryAPI.list(self, filter_by, include_metrics, max_results, page_token, **kwargs)
   2821 if page_token: query['page_token'] = request.page_token
   2823 while True:
-> 2824     json = self._api.do('GET', '/api/2.0/sql/history/queries', query=query)
   2825     if 'res' not in json or not json['res']:
   2826         return

File ~/.pyenv/versions/3.10.9/lib/python3.10/unittest/mock.py:1114, in CallableMixin.__call__(self, *args, **kwargs)
   1112 self._mock_check_sig(*args, **kwargs)
   1113 self._increment_mock_call(*args, **kwargs)
-> 1114 return self._mock_call(*args, **kwargs)

File ~/.pyenv/versions/3.10.9/lib/python3.10/unittest/mock.py:1118, in CallableMixin._mock_call(self, *args, **kwargs)
   1117 def _mock_call(self, /, *args, **kwargs):
-> 1118     return self._execute_mock_call(*args, **kwargs)

File ~/.pyenv/versions/3.10.9/lib/python3.10/unittest/mock.py:1179, in CallableMixin._execute_mock_call(self, *args, **kwargs)
   1177         raise result
   1178 else:
-> 1179     result = effect(*args, **kwargs)
   1181 if result is not DEFAULT:
   1182     return result

Cell In[62], line 3, in patch_do(method, path, query, body)
      1 def patch_do(method, path, query = None, body = None):
      2     print(method,path,query,body)
----> 3     res = original_do(method, path, query = None, body=query)
      4     print("RES", res.keys())
      5     return res

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/core.py:753, in ApiClient.do(self, method, path, query, body)
    749 if not response.ok:
    750     # TODO: experiment with traceback pruning for better readability
    751     # See https://stackoverflow.com/a/58821552/277035
    752     payload = response.json()
--> 753     raise self._make_nicer_error(status_code=response.status_code, **payload) from None
    754 if not len(response.content):
    755     return {}

DatabricksError: You can provide only one of 'page_token' or 'filter_by'

To summarize (sorry for the long post!), if my assumptions about the query history API are correct, then:

  1. All params should be passed in the body, even though it's a GET request
  2. The filter_by param should be removed when querying with page_token
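The two points above can be sketched client-side. This is an illustrative workaround under those assumptions, not the SDK's actual API: `build_payload` is a hypothetical helper that assembles the request body for each page, dropping filter_by once a page_token is in play.

```python
def build_payload(filter_by=None, max_results=None, page_token=None):
    """Assemble the request body for one /api/2.0/sql/history/queries call.

    Hypothetical helper: the API rejects filter_by together with page_token,
    so the filter is only sent on the first page.
    """
    body = {}
    if page_token:
        body['page_token'] = page_token
    elif filter_by is not None:
        body['filter_by'] = filter_by
    if max_results is not None:
        body['max_results'] = max_results
    return body
```

A paging loop would then call `self._api.do('GET', '/api/2.0/sql/history/queries', body=build_payload(...))` on each iteration.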

nfx commented May 8, 2023

That’s a good writeup. Will add an integration test for it.


nfx commented May 10, 2023

Looked into it in more detail today; the fix is not exactly straightforward. Thank you for providing such a detailed explanation of the problem.

@nfx nfx added codegen issues related to generated code api client issues related to API client labels May 10, 2023
@nfx nfx added the OpenAPI issues related to metadata across all SDKs label May 22, 2023
@lukehsiao

We are running into this issue as well.

If this is still not fixed, is the best workaround to implement this loop ourselves, passing the parameters in the body?

while True:
    json = self._api.do('GET', '/api/2.0/sql/history/queries', query=query)
    if 'res' not in json or not json['res']:
        return
    for v in json['res']:
        yield QueryInfo.from_dict(v)
    if 'next_page_token' not in json or not json['next_page_token']:
        return
    query['page_token'] = json['next_page_token']

> looked into it in more detail today - the fix is not exactly straightforward. thank you for providing such detailed explanation of the problem.

Are there any concerns with this workaround?

@lukehsiao

> To summarize (sorry for the long post!), if my assumptions about the query history API are correct, then:
>
>   1. All params should be passed in the body, even though it's a GET request
>   2. The filter_by param should be removed when querying with page_token

In my testing, there is also odd interaction with max_results.

For example, if you provide a filter_by with max_results=1, then the first request (no page_token, yes filter_by) does indeed return 1 result.

But, it also includes a next_page_token, so the loop will want to continue, each one returning a single result.

@nfx, can we get clarification about what the REST API is actually doing? Is this odd interaction with filter_by and max_results the expected behavior?

If so, it seems likely we need to implement max_results within the client SDK, if the server is not going to respect the combination.

@lukehsiao

Upon further testing, it seems filter_by works, but max_results only applies to a single request.

E.g., if I apply a filter_by clause that matches 10 queries and set max_results=1, I end up making 10 API requests, each returning a single result.

It seems the current code does not account for the fact that the REST API may respect max_results for a particular request but still return a next_page_token, while callers of the Python API almost certainly expect max_results to limit the total number of results returned.

I also notice that results are given in reverse chronological order. E.g., if we have the following queries

t=1, q1
t=2, q2
t=3, q3
t=4, q4
t=5, q5

And ask for a filter with start_time_ms=3 and end_time_ms=5, the results are given in the order [q5, q4, q3]. So, if we apply a max_results limit in Python here in the client, it may also be an unexpected order.

Continuing the example, if we say max_results=1, a user might expect q3 (the earliest query in the range), but if we limit here in Python the natural result would be q5 (the first result returned by the API). Matching the intuitive order would require fetching ALL results from the server, reversing them, and then limiting, which would defeat the resource-saving purpose of the parameter.
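The trade-off can be shown with plain Python on a stand-in result list (no API involved; the values mirror the q5/q4/q3 example above):

```python
from itertools import islice

# The server returns matches newest-first: q5, q4, q3 for the range 3..5.
api_results = ['q5', 'q4', 'q3']  # stand-in for the API's result stream

# Limiting lazily in the client keeps the server's order and yields the newest:
newest_first = list(islice(iter(api_results), 1))  # ['q5']

# Returning the earliest query instead requires consuming every result first:
earliest_first = list(reversed(api_results))[:1]   # ['q3']
```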

@lukehsiao

Per the API and docstring:

> Limit the number of results returned in one page.

So I suppose max_results is actually behaving as intended here, and I had an incorrect understanding.


mgyucht commented Jul 21, 2023

For future reference:

We will need to change the program used to generate these clients to handle this case. You can see which files are auto-generated in our SDK by looking at the first line of the file. If you want to address this change, the right place to do it is actually in the ApiClient implementation itself. You can inspect the request endpoint and modify it as needed there. Feel free to give that a shot!

FYI: internally, we're working on fixing the handling of parameters for GET and DELETE APIs across the entire API surface of Databricks. Most endpoints do handle parameters in a compliant manner, but this endpoint doesn't yet. I'll double check on the progress of this.

However, the issue with filter_by and next_page_token will still need separate handling. It may be possible to relax this constraint from the server side, as other APIs support filtering with tokens.

mgyucht added a commit that referenced this issue Jul 25, 2023
## Changes
The Query History list API filter_by query parameter is modeled by a
dictionary, rather than a primitive type, defining the allowed filters
including `query_start_time_range`, `statuses`, `user_ids`, and
`warehouse_ids`. To be compatible with gRPC transcoding, query
parameters modeled by message types as opposed to primitives need to be
separated into one query parameter per nested field, where the key is
the path to that query parameter. For example:

```
{'filter_by': {'user_ids': [123, 456]}}
```
becomes
```
filter_by.user_ids=123&filter_by.user_ids=456
```
For this to be compatible with the requests library we use today, we
need to convert the first dictionary to
```
{'filter_by.user_ids': [123, 456]}
```
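The conversion described above can be sketched as a small recursive helper (an illustration of the idea, not the SDK's actual ApiClient code):

```python
def flatten_query_params(params, prefix=''):
    """Flatten message-typed query params into dotted keys, e.g.
    {'filter_by': {'user_ids': [123, 456]}} -> {'filter_by.user_ids': [123, 456]}.
    """
    flat = {}
    for key, value in params.items():
        path = f'{prefix}.{key}' if prefix else key
        if isinstance(value, dict):
            # Recurse into nested messages, extending the dotted path.
            flat.update(flatten_query_params(value, path))
        else:
            flat[path] = value
    return flat
```

The requests library then serializes list values as repeated query parameters (`filter_by.user_ids=123&filter_by.user_ids=456`).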

This resolves one of the problems from #99. The issue with the conflict
between filter_by and next_page_token will be resolved by the backend
service.

## Tests
Added an integration test that covers query history listing with a
filter_by parameter.

- [x] `make test` run locally
- [x] `make fmt` applied
- [x] relevant integration tests applied

mgyucht commented Jul 25, 2023

Actually, the core issue was that we need to preprocess the query parameters before handing them to the requests library. See #249 for how I handled this. For the other issue, we're still following up with the backend team.

mgyucht added a commit to databricks/databricks-sdk-go that referenced this issue Jul 27, 2023
## Changes
The Query History list API filter_by query parameter is modeled by a
dictionary, rather than a primitive type, defining the allowed filters
including query_start_time_range, statuses, user_ids, and warehouse_ids.
To be compatible with gRPC transcoding, query parameters modeled by
message types as opposed to primitives need to be separated into one
query parameter per nested field, where the key is the path to that
query parameter. For example:

```
ListQueryHistoryRequest{
  FilterBy: QueryFilter{
    UserIds: []int{123, 456}
  }
}
```
becomes
```
filter_by.user_ids=123&filter_by.user_ids=456
```
go-querystring already flattens this but to a different form:
```
filter_by[user_ids]=123&filter_by[user_ids]=456
```
To fix this, we replace `[` with `.` and `]` with empty string.

This resolves one of the problems from
databricks/databricks-sdk-py#99. The issue
with the conflict between filter_by and next_page_token will be resolved
by the backend service.

Separately, there is an underlying issue affecting the Go SDK
specifically, as we delegate query parameter serialization to the
go-querystring library. This library looks at the `url` struct tag to
name the query parameters appropriately. However, schemas defined in
`Components` in OpenAPI are initially cached with IsQuery for all fields
= false. As we parse operations, we incrementally discover which
entities are present in query parameters, and we need to update the
underlying types recursively.

Lastly, this includes a bugfix to support the newest version of Delve.

## Tests
Added an integration test for ListQueryHistory.

- [x] `make test` passing
- [x] `make fmt` applied
- [x] relevant integration tests applied
mgyucht added a commit to databricks/databricks-sdk-java that referenced this issue Jul 27, 2023
## Changes
The Query History list API filter_by query parameter is modeled by a
dictionary, rather than a primitive type, defining the allowed filters
including query_start_time_range, statuses, user_ids, and warehouse_ids.
To be compatible with gRPC transcoding, query parameters modeled by
message types as opposed to primitives need to be separated into one
query parameter per nested field, where the key is the path to that
query parameter. For example:
```
new ListQueryHistoryRequest()
  .setFilterBy(new QueryFilter().setUserIds(Arrays.asList(123L, 456L)))
```
becomes
```
filter_by.user_ids=123&filter_by.user_ids=456
```
For this to be compatible with the requests library we use today, we
need to recursively compute the path of each field in the request object
that is annotated with the `QueryParam` annotation, then serialize the
value according to Jackson.

As part of this, I've also generalized the `Request` class to support
repeated query parameter values for lists (see the added integration
test).

This resolves one of the problems from
databricks/databricks-sdk-py#99. The issue
with the conflict between filter_by and next_page_token will be resolved
by the backend service.

## Tests
Added an integration test for SQL Query History API.

mgyucht commented Aug 14, 2023

The backend team mentioned that their fix was released.

As for max_results, we are planning to remove this field from the SDK entirely. Instead, you'll be able to iterate through lists just like with any generator, and you can stop iterating when you've consumed the resources you need. Under the hood, the SDK will handle pagination for you. See #279 for this change.
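In that model, stopping early is just ordinary generator consumption; a sketch with a stand-in for the SDK's paginated iterator:

```python
from itertools import islice

def paginated_results():
    """Stand-in for an SDK list() generator that fetches pages lazily."""
    for page in ([1, 2], [3, 4], [5, 6]):  # each sublist represents one API page
        yield from page

# Take only what you need; the generator is never advanced past the third item.
first_three = list(islice(paginated_results(), 3))  # [1, 2, 3]
```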

@mgyucht mgyucht closed this as completed Aug 14, 2023
@chris-koester-db

Is this resolved in version 0.6.0? When I run code similar to the integration test:

from datetime import datetime

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import QueryFilter, TimeRange

w = WorkspaceClient()

def date_to_ms(date):
    return int(datetime.strptime(date, '%Y-%m-%d').timestamp() * 1000)

filter = QueryFilter(
    query_start_time_range=TimeRange(
        start_time_ms=date_to_ms("2023-08-20"), end_time_ms=date_to_ms("2023-08-21")
    )
)
queries = w.query_history.list(filter_by=filter)
for q in queries:
    print(q)

I get the same error that was initially reported:

DatabricksError: Could not parse request object: Expected 'START_OBJECT' not 'VALUE_STRING'
 at [Source: (ByteArrayInputStream); line: 1, column: 15]
 at [Source: java.io.ByteArrayInputStream@267ff0fe; line: 1, column: 15]
