
How is the Query Frontend supposed to be configured? #3430

Closed · jakubgs opened this issue Oct 30, 2020 · 35 comments

@jakubgs (Contributor) commented Oct 30, 2020

Description

I'm running a 3-node Cortex 1.4.0 cluster with -target=all and I'm seeing pretty bad query performance in Grafana.
I figured my issue is that I'm not using the Query Frontend to parallelize queries, but the documentation is quite confusing.

You can find a config of one of my nodes here.

Details

Based on the docs:

The query frontend is an optional service providing the querier’s API endpoints and can be used to accelerate the read path.

But if we check -modules we see that frontend is not optional, but rather included in the all target:

 > cortex -modules | grep frontend
query-frontend *

Which means I'm already running a query-frontend service on each node:

 > curl -s 'http://localhost:9092/services' | grep -A1 query-frontend
					<td>query-frontend</td>
					<td>Running</td>

But my query performance is very bad, so I thought that maybe I'm using the wrong endpoint.
But when I checked the codebase I could not identify any special path prefix for the query-frontend:

cortex/pkg/api/api.go

Lines 414 to 420 in 23554ce

// RegisterQueryFrontend registers the Prometheus routes supported by the
// Cortex querier service. Currently this can not be registered simultaneously
// with the Querier.
func (a *API) RegisterQueryFrontend(f *frontend.Frontend) {
	frontend.RegisterFrontendServer(a.server.GRPC, f)
	a.registerQueryAPI(f.Handler())
}

It seems to me like query-frontend is already available under the PrometheusHTTPPrefix path which is /prometheus.

I found this comment: #2921 (comment)

If you're running the query-frontend in front of a Cortex cluster, the suggested way is not using the downstream URL but configuring the querier worker to connect to the query-frontend (and here we do support SRV records).

Which suggests that my configuration should have the querier talk to the query-frontend. But how is that supposed to work if I have multiple query-frontends, one for each Cortex instance? Should each Cortex instance's querier have its own query-frontend configured as frontend_worker.frontend_address?

Another thing is, why is the flag called -querier.frontend-address but the config option is frontend_worker.frontend_address?

Or should I run a separate -target=query-frontend instance of Cortex on a separate host (probably the same as my Grafana) and have the querier services connect to that single query-frontend?

@pracucci (Contributor)

Based on the docs:

The query frontend is an optional service providing the querier’s API endpoints and can be used to accelerate the read path.

But if we check -modules we see that frontend is not optional, but rather included in the all target:

Correct. It's optional (not strictly required) but suggested, so we've enabled it in the all target (which is the one used when running Cortex in single-binary mode).

It seems to me like query-frontend is already available under the PrometheusHTTPPrefix path which is /prometheus.

Correct. It's already enabled.

You configured the query-frontend address like this:

frontend_worker:
  frontend_address: '10.1.31.155:9095'

Keep in mind that if you're running multiple Cortex replicas, all queriers will connect to a single query-frontend (10.1.31.155:9095). Ideally, the frontend_address should be set to <name>:9095, where <name> is a DNS record resolving to all Cortex replica IPs (if running in single-binary mode) or all query-frontend IPs (if running in microservices mode).
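
For illustration, a sketch of that DNS-based setup, assuming a record named cortex.example.internal (a placeholder, not from this thread) that resolves to all replica IPs:

frontend_worker:
  # DNS name resolving to every Cortex replica (single-binary mode) or to
  # every query-frontend (microservices mode); 9095 is the gRPC port used here.
  frontend_address: 'cortex.example.internal:9095'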

Another thing is, why is the flag called -querier.frontend-address but the config option is frontend_worker.frontend_address?

Tech debt, but since Cortex 1.0 we can't introduce config breaking changes, so we can't fix it until Cortex 2.0.

Or should I run a separate -target=query-frontend instance of Cortex on a separate host(probably same as my Grafana) and have the querier services connect to that single query-frontend?

See above

But my query performance is very bad, so I thought that maybe I'm using the wrong endpoint.

Query-frontend is not black magic. It does two things:

  1. Results caching (make sure you enable it, see results_cache in https://cortexmetrics.io/docs/configuration/configuration-file/#query_range_config)
  2. Splitting queries whose time range covers multiple days into N 1-day queries, but you need to enable split_queries_by_interval: 24h (see https://cortexmetrics.io/docs/configuration/configuration-file/#query_range_config). A sketch of both settings follows below.
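
For illustration, a hedged sketch of those two settings in the query_range section; the in-memory FIFO cache is just a placeholder (memcached is more common in production), and the exact fields should be checked against the linked configuration reference for your Cortex version:

query_range:
  # split multi-day queries into one sub-query per day
  split_queries_by_interval: 24h
  align_queries_with_step: true
  # cache query results and back the cache with an in-process FIFO cache
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_items: 1024
        validity: 24h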

@jakubgs (Contributor, Author) commented Oct 30, 2020

Thanks for your explanation. I'll try your suggestion with the multiple frontend addresses and see what happens.

One note though: I did find it weird that split_queries_by_interval accepts ?h but doesn't accept other formats like 1w.

@jakubgs (Contributor, Author) commented Nov 4, 2020

I have a process running with -target=query-frontend with the following config:

---
auth_enabled: false

# ---------------------- MemberList -----------------------
memberlist:
  node_name: 'cortex-query-frontend'

# ---------------------- Server ---------------------------
server:
  http_listen_address: '0.0.0.0'
  http_listen_port: 9092
  grpc_listen_address: '10.1.31.152'
  grpc_listen_port: 9095
  log_level: 'debug'

# ---------------------- Query Frontend -------------------
frontend:
  max_outstanding_per_tenant: 200

But when I try to access the /prometheus API it times out:

 > curl -sv http://10.1.31.152:9092/prometheus/api/v1/labels --max-time 10
*   Trying 10.1.31.152:9092...
* TCP_NODELAY set
* Connected to 10.1.31.152 (10.1.31.152) port 9092 (#0)
> GET /prometheus/api/v1/labels HTTP/1.1
> Host: 10.1.31.152:9092
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Operation timed out after 10001 milliseconds with 0 bytes received
* Closing connection 0

And the logs show it fails with 499:

caller=logging.go:66 traceID=2790c310606cc5fc msg="GET /prometheus/api/v1/labels (499) 10.001303726s"

My 3 Cortex cluster nodes have this in their config:

frontend_worker:
  frontend_address: '10.1.31.152:9095'

Based on #2921 (comment), my understanding is that the query-frontend config does not need to have downstream_url set. Correct?

If you're running the query-frontend in front of a Cortex cluster, the suggested way is not using the downstream URL but configuring the querier worker to connect to the query-frontend

But it doesn't seem to be connecting.

@pstibrany (Contributor)

Query frontend can be used with downstream_url, or with queriers connecting to it. In the second case, queriers need to be told where to find the query frontend by using the -querier.frontend-address option.

(Btw, query-frontend doesn't use memberlist for anything).
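
For illustration, a sketch of the two modes (only one should be used at a time; the downstream address is a placeholder built from the HTTP port used in this thread):

# Mode 1: the query-frontend proxies requests to a Prometheus-compatible API.
frontend:
  downstream_url: 'http://10.1.31.155:9092/prometheus'

# Mode 2 (the one discussed here): queriers pull work from the query-frontend
# over gRPC; this is the frontend_worker section (flag: -querier.frontend-address).
frontend_worker:
  frontend_address: '10.1.31.152:9095'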

@pstibrany (Contributor)

Query frontend can be used with downstream_url, or with queriers connecting to it. In the second case, queriers need to be told where to find query frontend by using -querier.frontend-address option.

Which is exactly this:

frontend_worker:
  frontend_address: '10.1.31.152:9095'

What do queriers say? They should be connecting to this, although it is supposed to work with hostname.

@jakubgs (Contributor, Author) commented Nov 4, 2020

What do queriers say?

When I make the /prometheus/api/v1/labels request to the query-frontend, nothing shows up in the Cortex logs. They are on INFO level.

@jakubgs (Contributor, Author) commented Nov 4, 2020

Btw, I see there's something in the config called query_scheduler. Do I need that?

@pstibrany (Contributor)

Btw, I see there's something in the config called query_scheduler. Do I need that?

No. Query scheduler is a new component, but you don't need to use it.

When I make the /prometheus/api/v1/labels request to the query-frontend, nothing shows up in the Cortex logs. They are on INFO level.

Try using -log.level=debug and see if they connect to the query frontend.

@jakubgs (Contributor, Author) commented Nov 4, 2020

All I'm seeing is:

level=debug ts=2020-11-04T15:17:41.393096272Z caller=module_service.go:48 msg="module waiting for initialization" module=querier waiting_for=store
level=debug ts=2020-11-04T15:17:41.393238713Z caller=module_service.go:48 msg="module waiting for initialization" module=querier waiting_for=ring
level=debug ts=2020-11-04T15:17:41.393352796Z caller=module_service.go:48 msg="module waiting for initialization" module=querier waiting_for=memberlist-kv
level=info ts=2020-11-04T15:17:41.393458804Z caller=module_service.go:58 msg=initialising module=querier

@jakubgs (Contributor, Author) commented Nov 4, 2020

It's hard to find anything in logs when every time I restart cortex I get tons of:

sample timestamp out of order; last timestamp: 1604503300, incoming timestamp: 1604503270

I reported this but no response so far: #3411

@pstibrany (Contributor)

The message you're looking for looks like this:

level=debug ts=2020-11-04T15:29:18.645382807Z caller=worker.go:132 msg="adding connection" addr=192.168.208.18:9007

(with your IP)

@jakubgs (Contributor, Author) commented Nov 4, 2020

Ah, thanks, found one:

 > sudo journalctl -a -u cortex | grep 'adding connection'
level=debug ts=2020-11-04T15:17:41.396203168Z caller=worker.go:120 msg="adding connection" addr=10.1.33.0:9095

But that's from before I changed the config to connect to my separate query-frontend service. No entry for current one.

@jakubgs (Contributor, Author) commented Nov 5, 2020

How could I query the GRPC port of the query-frontend to check if it's properly listening?

Is there some kind of GRPC ping command I could send using curl or something to verify it's actually listening?

@pstibrany (Contributor) commented Nov 5, 2020

There is https://github.com/fullstorydev/grpcurl that could possibly be used. You need to give it the proto files from Cortex (https://github.com/cortexproject/cortex/blob/587883140307c5f47455223e8dbcf4e265b7ba5e/pkg/frontend/v1/frontendv1pb/frontend.proto). The querier calls the /frontend.Frontend/Process method on the query-frontend to receive requests.

(First "request" that frontend sends is just "GET_ID" -- asking querier to identify itself. After each "request", querier is expected to send a single response).

@jakubgs (Contributor, Author) commented Nov 5, 2020

Thanks, I'll try it out.

@jakubgs (Contributor, Author) commented Nov 5, 2020

It seems to get stuck:

 > ./grpcurl -v -plaintext -import-path ~/go/src -proto github.com/cortexproject/cortex/pkg/frontend/v1/frontendv1pb/frontend.proto 10.1.31.152:9095 frontend.Frontend/Process

Resolved method descriptor:
// After calling this method, client enters a loop, in which it waits for
// a "FrontendToClient" message and replies with single "ClientToFrontend" message.
rpc Process ( stream .frontend.ClientToFrontend ) returns ( stream .frontend.FrontendToClient );

Request metadata to send:
(empty)

Does this look correct?

@pstibrany (Contributor)

Don't you get more than that? When I try it, I see that frontend sends back this message:

{
  "httpRequest": {
    "method": "GET",
    "url": "/invalid_request_sent_by_frontend"
  },
  "type": "GET_ID"
}

Which is how it asks querier to identify itself.

Full interaction:

$ grpcurl -v -plaintext -import-path ./vendor/ -import-path . -proto pkg/frontend/v1/frontendv1pb/frontend.proto localhost:9007 frontend.Frontend/Process

Resolved method descriptor:
// After calling this method, client enters a loop, in which it waits for
// a "FrontendToClient" message and replies with single "ClientToFrontend" message.
rpc Process ( stream .frontend.ClientToFrontend ) returns ( stream .frontend.FrontendToClient );

Request metadata to send:
(empty)

Response headers received:
content-type: application/grpc

Response contents:
{
  "httpRequest": {
    "method": "GET",
    "url": "/invalid_request_sent_by_frontend"
  },
  "type": "GET_ID"
}

Response trailers received:
(empty)
Sent 0 requests and received 1 response
ERROR:
  Code: Unknown
  Message: EOF

@jakubgs (Contributor, Author) commented Nov 6, 2020

No, it just gets stuck after the first (empty). So I guess the issue is with my query-frontend config and how it's not handling gRPC requests. When you look at my config in #3430 (comment), am I missing anything?

Should I for example have a storage config? Or maybe configs?

@jakubgs (Contributor, Author) commented Nov 6, 2020

When I make the request the logs show:

caller=grpc_logging.go:53 method=/frontend.Frontend/Process duration=8m48.122423207s err="context canceled" msg="gRPC\n"

What does context cancelled mean?

@pstibrany (Contributor)

What does context cancelled mean?

It means that the client (grpcurl, or the querier) has disconnected.

@pstibrany (Contributor)

No, it just gets stuck after the first (empty). So I guess the issue is with my query-frontend config and how it's not handling GRPC requests. When you look at my config in #3430 (comment) am I missing anything?

Should I for example have a storage config? Or maybe configs?

The query-frontend doesn't need storage or configs. It can be configured with caching or query-splitting (see the query_range section of the YAML), but that's it.
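
Putting that together with the config posted earlier in this thread, a minimal standalone query-frontend config could look roughly like this (the query_range part is optional; see the caching sketch earlier in the thread):

auth_enabled: false

server:
  http_listen_port: 9092
  grpc_listen_address: '10.1.31.152'
  grpc_listen_port: 9095

frontend:
  max_outstanding_per_tenant: 200

# optional read-path tuning
query_range:
  split_queries_by_interval: 24h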

@jakubgs (Contributor, Author) commented Nov 6, 2020

Okay, then my minimal config should work. How can I diagnose why it's not accepting the GRPC commands?
I'm at a loss on how I can debug this.

@jakubgs (Contributor, Author) commented Nov 15, 2020

I upgraded to 1.5.0 binary release and now it works:

 > curl -si 10.1.31.152:9092/prometheus/api/v1/labels
HTTP/1.1 200 OK
Content-Type: application/json
Results-Cache-Gen-Number: 
Date: Sun, 15 Nov 2020 09:26:51 GMT
Content-Length: 965

{"status":"success","data":["__name__","alertmanager","alertname","alertstate","application","area","branch","breaker","build_date","build_hash","cache","call","chart","client","cluster","cluster_name","cluster_uuid","code","color","component","config","consumer","container","datacenter","db","device","dialer_name","dimension","dir","endpoint","engine","es_client_node","es_data_node","es_ingest_node","es_master_node","event","family","fleet","gc","gitCommit","goversion","group","handler","host","index","ingester","instance","interval","job","key","keyspace","kv_name","le","level","listener_name","lucene_version","member","method","mount","name","op","operation","outcome","path","platform","pool","quantile","reason","remote_name","remote_peer","result","revision","role","route","rule_group","sampled","scrape_job","server_go_version","server_version","sha256","slice","source","state","status_code","table","type","type_name","url","user","version","ws"]}

And I changed nothing except the version of the binary. So as far as I can tell query-frontend is broken in 1.4.0.

@jakubgs (Contributor, Author) commented Nov 15, 2020

Except it's much slower than just calling my old Prometheus instance, and it fails constantly with:

expanding series: gocql: no response received from cassandra within timeout period

I will fiddle with some query settings tomorrow. I assume it needs more configuration.

@pstibrany (Contributor)

So as far as I can tell query-frontend is broken in 1.4.0.

I can assure you that we have used 1.4.0 in our production without problems. (We also update production to the latest master every week.) There may be a bug that you're hitting, and it would be good to understand what went wrong to make sure that the bug is fixed.

@jakubgs (Contributor, Author) commented Nov 15, 2020

And I agree. So if you can give me some way to debug why it wasn't working before, I'm happy to roll back to 1.4.0 for the query-frontend and try to find out what's happening.

@pstibrany (Contributor)

It should be possible to get more details about gRPC calls by using environment variables:

GRPC_GO_LOG_VERBOSITY_LEVEL=99
GRPC_GO_LOG_SEVERITY_LEVEL=info

That could reveal some more details. I would try to set these both on the server side (query-frontend) and the client side (querier).
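
A sketch of one way to apply those, assuming Cortex runs under systemd (as the journalctl commands earlier suggest) with a unit named cortex; the drop-in path and file name are hypothetical:

 > sudo mkdir -p /etc/systemd/system/cortex.service.d
 > printf '[Service]\nEnvironment=GRPC_GO_LOG_VERBOSITY_LEVEL=99\nEnvironment=GRPC_GO_LOG_SEVERITY_LEVEL=info\n' | sudo tee /etc/systemd/system/cortex.service.d/grpc-debug.conf
 > sudo systemctl daemon-reload && sudo systemctl restart cortex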

@jakubgs (Contributor, Author) commented Nov 15, 2020

Thx. I'll take a look tomorrow.

@jakubgs (Contributor, Author) commented Nov 16, 2020

Before I downgraded anything I verified that the GRPC port works:

 > ./grpcurl -v -plaintext -import-path ~/go/src -proto github.com/cortexproject/cortex/pkg/frontend/v1/frontendv1pb/frontend.proto 10.1.31.152:9095 frontend.Frontend/Process

Resolved method descriptor:
// After calling this method, client enters a loop, in which it waits for
// a "FrontendToClient" message and replies with single "ClientToFrontend" message.
rpc Process ( stream .frontend.ClientToFrontend ) returns ( stream .frontend.FrontendToClient );

Request metadata to send:
(empty)

Response headers received:
content-type: application/grpc

Response contents:
{
  "httpRequest": {
    "method": "GET",
    "url": "/invalid_request_sent_by_frontend"
  },
  "type": "GET_ID"
}

Response trailers received:
(empty)
Sent 0 requests and received 1 response
ERROR:
  Code: Unknown
  Message: EOF

Then I downgraded ONLY the query-frontend from 1.5.0 to 1.4.0 and it stopped working:

 > export GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info
 > ./grpcurl -v -max-time 5 -plaintext -import-path ~/go/src -proto github.com/cortexproject/cortex/pkg/frontend/v1/frontendv1pb/frontend.proto 10.1.31.152:9095 frontend.Frontend/Process

INFO: 2020/11/16 10:41:26 parsed scheme: ""
INFO: 2020/11/16 10:41:26 scheme "" not registered, fallback to default scheme
INFO: 2020/11/16 10:41:26 ccResolverWrapper: sending update to cc: {[{10.1.31.152:9095  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/11/16 10:41:26 ClientConn switching balancer to "pick_first"
INFO: 2020/11/16 10:41:26 Channel switches to new LB policy "pick_first"
INFO: 2020/11/16 10:41:26 Subchannel Connectivity change to CONNECTING
INFO: 2020/11/16 10:41:26 Subchannel picks a new address "10.1.31.152:9095" to connect
INFO: 2020/11/16 10:41:26 pickfirstBalancer: UpdateSubConnState: 0xc0004588e0, {CONNECTING <nil>}
INFO: 2020/11/16 10:41:26 Channel Connectivity change to CONNECTING
INFO: 2020/11/16 10:41:26 Subchannel Connectivity change to READY
INFO: 2020/11/16 10:41:26 pickfirstBalancer: UpdateSubConnState: 0xc0004588e0, {READY <nil>}
INFO: 2020/11/16 10:41:26 Channel Connectivity change to READY

Resolved method descriptor:
// After calling this method, client enters a loop, in which it waits for
// a "FrontendToClient" message and replies with single "ClientToFrontend" message.
rpc Process ( stream .frontend.ClientToFrontend ) returns ( stream .frontend.FrontendToClient );

Request metadata to send:
(empty)

Response trailers received:
(empty)
Sent 0 requests and received 0 responses
ERROR:
  Code: DeadlineExceeded
  Message: context deadline exceeded
INFO: 2020/11/16 10:41:31 Channel Connectivity change to SHUTDOWN
INFO: 2020/11/16 10:41:31 Subchannel Connectivity change to SHUTDOWN

And it started failing.

@jakubgs (Contributor, Author) commented Nov 16, 2020

With debug-level logging the query-frontend shows nothing except for:

caller=logging.go:66 traceID=33c759cf7030775a msg="GET / (200) 369.02µs"

That's all that shows up when the connection times out.

@jakubgs (Contributor, Author) commented Jan 4, 2021

I've upgraded to 1.6.0 and the Query Frontend has stopped working again. But the gRPC test with grpcurl works fine:

 > ./grpcurl -v -max-time 5 -plaintext -import-path ~/go/src -proto github.com/cortexproject/cortex/pkg/frontend/v1/frontendv1pb/frontend.proto 10.1.31.152:9095 frontend.Frontend/Process

Resolved method descriptor:
// After calling this method, client enters a loop, in which it waits for
// a "FrontendToClient" message and replies with single "ClientToFrontend" message.
rpc Process ( stream .frontend.ClientToFrontend ) returns ( stream .frontend.FrontendToClient );

Request metadata to send:
(empty)

Response headers received:
content-type: application/grpc

Response contents:
{
  "httpRequest": {
    "method": "GET",
    "url": "/invalid_request_sent_by_frontend"
  },
  "type": "GET_ID"
}

Response trailers received:
(empty)
Sent 0 requests and received 1 response
ERROR:
  Code: Unknown
  Message: EOF

But when I query for anything it just gets stuck and times out:

 > curl -sv --max-time 60 'http://localhost:9092/prometheus/api/v1/labels'
*   Trying 127.0.0.1:9092...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9092 (#0)
> GET /prometheus/api/v1/labels HTTP/1.1
> Host: localhost:9092
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Operation timed out after 60000 milliseconds with 0 bytes received
* Closing connection 0

What's interesting is that when I upgraded my Cortex cluster I saw it trying to connect to the query frontend at 127.0.0.1:9095, even though I clearly specified a different address with frontend_worker.frontend_address.

level=error ts=2021-01-04T12:41:24.988285233Z caller=frontend_processor.go:55
msg="error contacting frontend" address=127.0.0.1:9095
err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:9095: connect: connection refused\"

Maybe it's ignoring the config?

@jakubgs (Contributor, Author) commented Jan 4, 2021

As far as I can tell the value of frontend_worker.frontend_address is in effect:

 > curl -s http://10.1.33.0:9092/config | grep -A5 frontend_worker
frontend_worker:
  frontend_address: 10.1.31.152:9095
  scheduler_address: ""
  dns_lookup_duration: 10s
  parallelism: 10
  match_max_concurrent: true

But when I check existing connections I see:

 > sudo netstat -pnt | grep 10.1.31.152:9095

 > sudo netstat -pnt | grep 127.0.0.1:9095  
tcp        0      0 127.0.0.1:48792         127.0.0.1:9095          ESTABLISHED 1510541/cortex      
tcp6       0      0 127.0.0.1:9095          127.0.0.1:48792         ESTABLISHED 1510541/cortex

So as far as I can tell Cortex is ignoring my config and connecting to 127.0.0.1:9095. Why?

@jakubgs (Contributor, Author) commented Jan 5, 2021

I have downgraded only the cluster to 1.5.0 and querying works now:

 > curl -s --max-time 60 'http://localhost:9092/prometheus/api/v1/labels' 
{"status":"success","data":["__name__","alertmanager","alertname","alertstate","application","area","branch","breaker","build_date","build_hash","cache","call","chart","client","cluster","cluster_name","cluster_uuid","code","color","component","config","consumer","container","datacenter","db","device","dialer_name","dimension","dir","endpoint","engine","es_client_node","es_data_node","es_ingest_node","es_master_node","event","family","fleet","gc","gitCommit","goversion","group","handler","host","index","ingester","instance","interval","job","key","keyspace","kind","kv_name","le","level","listener_name","lucene_version","member","method","mount","name","op","operation","outcome","path","platform","pool","pubkey","quantile","reason","remote_name","remote_peer","response","result","revision","role","route","rule_group","sampled","scrape_job","server_go_version","server_version","sha256","slice","source","state","status_code","table","type","type_name","url","user","version","ws"]}

And I can see the connections to my query frontend:

 > sudo netstat -pnt | grep 127.0.0.1:9095

 > sudo netstat -pnt | grep 10.1.31.152:9095
tcp        0      0 10.1.31.155:44850       10.1.31.152:9095        ESTABLISHED 519751/cortex       

So as far as I can tell frontend_worker.frontend_address is being ignored in 1.6.0.

@jakubgs (Contributor, Author) commented Jan 5, 2021

I opened a separate issue: #3644

stale bot commented Apr 5, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Apr 5, 2021
stale bot closed this as completed on Apr 22, 2021
wilfriedroset added a commit to wilfriedroset/remote-storage-wars that referenced this issue Dec 19, 2021
As explained in the issue on GitHub, when using a full Cortex infra, queriers should register themselves with the query-frontend.

See: cortexproject/cortex#3430

Signed-off-by: Wilfried Roset <wilfriedroset@users.noreply.github.com>