Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP API for query #2241

Closed
Tracked by #1359
youngsofun opened this issue Oct 14, 2021 · 5 comments · Fixed by #2379
Closed
Tracked by #1359

HTTP API for query #2241

youngsofun opened this issue Oct 14, 2021 · 5 comments · Fixed by #2379
Labels
C-feature Category: feature

Comments

@youngsofun
Copy link
Member

youngsofun commented Oct 14, 2021

Summary

Support query with sql over http API like clickhouse HTTP Interface.

curl 'http://localhost:8123/?query=SELECT%201'

echo 'SELECT 1' | curl 'http://localhost:8123/?query=' --data-binary @-

echo -ne '10\n11\n12\n' | curl 'http://localhost:8123/?query=INSERT%20INTO%20t%20FORMAT%20TabSeparated' --data-binary @

why we need it

more deal coming soon,after discussing face to face

what is NOT discussed here

  • Api for management: get sys info and change settings.
  • rest query api without sql:
    • /warehouse/database/table/row can map to Resource hierarchy, especially for simple OLTP queries.
    • some standards, e.g. Odata, postRest
    • Why not OData?
  • data import/load api (except for "insert into")

the first 2 cases need many rest endpoints, it can take full advantage of rest api. we can discuss it in another issue if needed.

query api need a few endspoints and is more complicated.

a brief summary of clickhouse http API

clickhouse HTTP Interface.

  • 2 kind of endpoints

    • sql endpoint : send sql string in query parameters and body
    • Predefined endpoints : define in config file, including endpoint and corresponding handler.
  • http body is raw data (or sql with data in request), with specified format

    • req
      • behavior like clickhouse-client
        • $ cat xxx.csv | clickhouse-client --query="INSERT INTO stock FORMAT CSVWithNames";
    • resp
      • meta/summary/Progress in header
  • sync response

  • begin receiving data when buffer_size is reached

    • unless wait_end_of_query is set
  • session

    • send the same session id in each request
    • End session after idle for some time.
  • other features

    • auth
    • timeout
    • compression

The problem with clickhouse http API

clickhouse HTTP apiis special since it sends raw data in the body instead of embeding in json. This is efficient, straight forward, and works well in most queries, but leads to some problems.

https://github.com/ClickHouse/ClickHouse/labels/comp-http

if we want to get result early(!wait_end_of_query), error may occur after the response code 200 is sent. currently err message is appended to data. solutions in issues:

  1. http trailer a standalone trailer part after
    1. most of http clients don't have a good support of trailers
  2. SSE (server-sent events) seems most promising, limitions:
    1. not support POST
    2. Text only
  3. multipart/form-data
  4. web socket

from the discussions in issues they are swing to grpc instead of solving these problems.

possible metadatas

  • req
    • session parameters
    • warehouse/database/schema/role
    • timeout
    • required resource
  • resp
    • Error message
    • progress, summary,estimate time
    • ”1 rows affect“
    • Warning
    • resource usage

http api of other databases

Most db only use http api for management, only a few support query with sql via http.

very different features from clickhouse:

  • body is json and data as its field, both in req and resp

    • pros:
      • more restful, flexible and expressiveness
      • seperate error/stats/result, right response code with
    • cons:
      • overhead to embed/extract data
      • client code need to know to interpret it.
        • most apis response with the schema
  • async and pagination (both return a handler for further fetch)

    • Async if long time no result

      • Big shuffle
      • insert into ...select ..
    • pros

      • async enable 1. poll for statuse/progress and 2. cancel
    • cons:

      • more round trip for large result.
      • result set may be buffered in initial server? need keep-alive
  • multi statements in a request (snowflake only) i.e. batching

  • example scenarios

  • pros

    • Play a row of session in some scenario?
      • More clear when used to set up context
        • session variables
        • many session setting/parameters
        • Change the database, schema, warehouse, or role for statements in a request
    • save round trip overhead
  • cons:

    • Need to fetch result of each statement async with corresponding handler
      • or use sth like SSE
      • clickhouse support it in Predefined endpoints, which just concatenate the results
  • unique requestID to ensure idempotent behavior for queries

features for discussion

we can start with a MVP, but need to make these decisions soon or later.

Some already discussed above.

  1. choose how to report error after response header is sent.
    1. Except for solutions in clickhouse issues, is pagination worth considering?
      1. It can be understood as using batching result instead of streaming chucks, each batch is either a good data or error.
  2. do we need Predefined http interface?
  3. Async for Long running query for ELT
  4. batching/multi-query
  5. Support api for delete session to release resources quickly.
  6. unique requestID
  7. Put sql in a json?
    1. except for "insert into ", we don`t need to send data.
      1. Handle large inserts with dedicated endpoint
    2. seems better place then http query params and headers
    3. but need POST, which SSE does not support, if we choose it.
@youngsofun youngsofun added the C-feature Category: feature label Oct 14, 2021
@flaneur2020
Copy link
Member

nice RFC!

a few small things:

  1. could we cancel a running query?
  • on production systems we sometimes runs into a situation the system is heavy overloaded, we can choose kill the heavy queries to save our cluster on emergency.
  • on adhoc querying, there might some queries in trial & error, the data scientist might want choose quiting the running query & start a new one repeatly. we could auto cancel the existed running query on starting the new query.
  1. could we streaming the results?
  • presto has a selling point about this: it can returns the query result as soon as possible, sometimes the data scientist could already get what he want even on the queries are not completed, maybe some guy writed a query to get all the uv every day in this month to explore some outliers, luckly she discovered the outliers in yesterday, so she could stop the query to calculate the eariler days as she already got what she wants.

Regards!

@sundy-li
Copy link
Member

sundy-li commented Oct 14, 2021

could we cancel a running query

One common way to cancel a http query is to send another kill query command with the previous query id.

@flaneur2020
Copy link
Member

flaneur2020 commented Oct 14, 2021

  1. Put sql in a json?
    except for "insert into ", we don`t need to send data.
    Handle large inserts with dedicated endpoint
    seems better place then http query params and headers
    but need POST, which SSE does not support, if we choose it.

+1, it would be more restful: creating a query by POST verb created a Query resource like POST /queries or PUT /queries/:UUID, then we can list all the running queries by GET /queries, and cancel this query by running DELETE /queries/23.

@youngsofun
Copy link
Member Author

youngsofun commented Oct 14, 2021

could we streaming the results?

clickhouse support it by default (if wait_end_of_query is not set)by return chunked body,for user it is like reading a socket file.
but has errorhanding problem.

pagination is another way

@youngsofun
Copy link
Member Author

return raw data as a whole is convenient when use the data to draw graph, no/less additional processing in middle.

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-feature Category: feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants