Promxy returns no rows when using long range #537

sc0rp10 · 2023-02-18T22:23:58Z

Hi, I'm trying to use promxy as a frontend for two VictoriaMetrics instances with a very simple config:

promxy:
    server_groups:
        -
            consul_sd_configs:
                -
                    services:
                        - victoriametrics

CLI arguments are

--query.max-samples=150000000
--access-log-destination=none
--log-level=debug

Everything works like a charm except for some strange behavior when I use a long time range. For example, the query up == 1 works fine with any time range within my data retention. But when I query node_filesystem_size_bytes{hostname="foobar"}, I get no results for time ranges >= 43 days.
The response is 200 OK {"status":"success","data":{"resultType":"matrix","result":[]}} and the debug logs show nothing special.

Could anybody point out what's wrong?

Thanks!

jacksontj · 2023-03-17T20:15:44Z

I did some poking around on this and was unable to reproduce the issue. Could you provide some more information to reproduce it? Ideally an example pointing at http://demo.robustperception.io:9090 (or some other public Prometheus API), or alternatively a tcpdump (or tracelog) of the issue occurring.

D13410N3 · 2023-04-02T14:41:03Z

@jacksontj I've collected a tcpdump for you:
http://ux.ci/promxy-2.pcap

In this case promxy is trying to collect data with query up{job="victoriametrics"} just for "now - 7 days" from two victoriametrics instances.

No issue occurs when the request is made directly to one of the instances.

jacksontj · 2023-04-03T04:01:16Z

Thanks for the pcap, that definitely clears things up quite a bit!

Detailed Explanation

In this pcap there are 4 entities:

requestor (10.7.168.245)
Promxy (10.192.4.164)
VM A (10.192.4.92)
VM B (10.192.4.60)

In the pcap we can see the client send the following query to promxy:

GET /api/v1/query_range?query=up%7Bjob%3D%22victoriametrics%22%7D&start=1679827467.432&end=1680432267.432&step=2419 HTTP/1.1\r\n
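For readers who don't want to decode the request by eye, here is a small sketch that unpacks the query string above using only the Python standard library. The variable names are mine, not anything from promxy or the pcap.

```python
from urllib.parse import urlsplit, parse_qs

# Path and query string copied from the request line captured in the pcap.
request_target = (
    "/api/v1/query_range?query=up%7Bjob%3D%22victoriametrics%22%7D"
    "&start=1679827467.432&end=1680432267.432&step=2419"
)

# parse_qs percent-decodes values and returns a list per key; take the first.
params = {k: v[0] for k, v in parse_qs(urlsplit(request_target).query).items()}

query = params["query"]
start = float(params["start"])
end = float(params["end"])
step = int(params["step"])

# Length of the requested window in days.
span_days = (end - start) / 86400

print(query)
print(f"window: {span_days:.3f} days, step: {step}s")
```

So the client asked for the last 7 days of up{job="victoriametrics"} at a step of 2419 seconds.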

All good so far. When we look at the queries that promxy sends to the VM downstreams we see similar data:

# downstream A (10.192.4.92)
	HTML Form URL Encoded: application/x-www-form-urlencoded
		Form item: "end" = "1680432267.432"
		Form item: "query" = "up{job="victoriametrics"}"
		Form item: "start" = "1679827467.432"
		Form item: "step" = "2419"

# downstream B (10.192.4.60)
	HTML Form URL Encoded: application/x-www-form-urlencoded
		Form item: "end" = "1680432267.432"
		Form item: "query" = "up{job="victoriametrics"}"
		Form item: "start" = "1679827467.432"
		Form item: "step" = "2419"

At this point things still look good, the query was effectively just passed down to the downstream VM boxes -- which is what we expect. Next if we check the response from either (looking at "A" here):

{
	"status": "success",
	"data": {
		"resultType": "matrix",
		"result": [{
			"metric": {
				"__name__": "up",
				"env": "prod",
				"hostname": "vmetrics-2.node.eu.consul",
				"instance": "10.192.4.60:8428",
				"ip": "10.192.4.60",
				"job": "victoriametrics"
			},
			"values": [
				[1679826170, "1"],
				[1679828589, "1"],
				[1679831008, "1"],
				[1679833427, "1"],
				[1679835846, "1"],
				[1679838265, "1"],
				[1679840684, "1"],
...

Which at first glance seems reasonable, but upon further inspection we note that something is off with the times:

Who                 Start (Unix s)   End (Unix s)     Duration (s)
Downstream Request  1679827467.432   1680432267.432   604800
A Response          1679826170       1680430920       604750
B Response          1679826170       1680430920       604750

To put that in more concrete terms: the downstream VictoriaMetrics boxes are returning data for a time range close to what was requested, but not exactly what was requested. This is (unfortunately) an already known behavior of VictoriaMetrics that was captured in #202. The short version is that VM has some internal caching, and given the way the Prometheus API contract, iterators, PromQL engine, etc. work, the returned times aren't "close enough": in this case "A"'s response starts 1297.432s (~22 minutes) before the requested start time. Shorter time ranges still work because their "incorrect" times are "close enough". Thankfully, since this is a known issue, it can be configured around by adding nocache (https://github.com/jacksontj/promxy/blob/master/cmd/promxy/config.yaml#L54-L58).
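The numbers in the pcap are consistent with the start and end being rounded down to a multiple of the step. The sketch below checks that arithmetic. It is an assumption based on the captured values, not a copy of VictoriaMetrics' actual adjustment logic, and align_down is a hypothetical helper name.

```python
import math

def align_down(ts: float, step: int) -> int:
    """Round a Unix timestamp down to the nearest multiple of step (seconds)."""
    return math.floor(ts / step) * step

# Values taken from the request promxy sent downstream.
requested_start = 1679827467.432
requested_end = 1680432267.432
step = 2419

aligned_start = align_down(requested_start, step)
aligned_end = align_down(requested_end, step)

# How far the first returned point sits before the requested start.
start_offset = requested_start - aligned_start

print(aligned_start, aligned_end, aligned_end - aligned_start)
print(f"start offset: {start_offset:.3f}s")
```

Under that assumption the aligned start and end come out as 1679826170 and 1680430920, matching the first and last timestamps in the responses from A and B, and the offset matches the 1297.432s figure above.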

TLDR

This appears to be some VictoriaMetrics caching behavior causing issues -- which can be configured around by adding nocache (https://github.com/jacksontj/promxy/blob/master/cmd/promxy/config.yaml#L54-L58)

D13410N3 · 2023-04-03T14:00:58Z

Thanks for your detailed response!
I'll try this solution

D13410N3 · 2023-04-04T11:14:09Z

@jacksontj thanks, your solution is working.

But, unfortunately, it causes high CPU utilization on VM instances, even though it's expected behaviour

Is there any way to configure Promxy to make parallel requests to each instance?
In my case there are two identical VM instances (they store the same metrics). E.g. I'm requesting up{foo="bar"} with range now - 100 days: can Promxy split it to two different requests? In this case the first VM instance would receive a request for the range [now-100 days : now-50 days] and the second one for [now-50 days : now], and then Promxy would join them into one result.

Can it be realised at this moment?

jacksontj · 2023-04-04T21:52:59Z

can Promxy split it to two different requests?

Sorta, this can be achieved with relative or absolute time filters -- basically creating 2 server groups, one for "recent" and one for "old". But this wouldn't be HA -- it would just "shard" the query across those 2 nodes for a performance improvement at the expense of redundancy (think RAID 0 vs RAID 1 -- if that's a helpful analogy).
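To make the "shard by time" idea concrete, here is a sketch of just the range arithmetic: cutting one window into contiguous sub-windows, one per server group. split_range is a hypothetical helper for illustration only. It is not a promxy function and does not show how the time filters themselves are configured.

```python
from typing import List, Tuple

def split_range(start: float, end: float, parts: int) -> List[Tuple[float, float]]:
    """Split [start, end] into `parts` contiguous, equally sized sub-ranges."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    width = (end - start) / parts
    # Use the exact end as the last boundary to avoid float drift.
    bounds = [start + i * width for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

DAY = 86400
now = 1680432267  # example "now", taken from the end timestamp in the pcap

# 100 days split across two nodes: the "old" half and the "recent" half.
old, recent = split_range(now - 100 * DAY, now, 2)
print("old:   ", old)
print("recent:", recent)
```

Each sub-range would be served by exactly one node, which is why this gains performance but loses redundancy.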

Can it be realised at this moment?

The only other hacky solution I can think of is to mess with the LookbackDelta (I think they changed the name, but that option) to have promql honor those incorrect timestamps from VM. The only other idea I have is we could hack in the same time adjustment into promxy (basically implement the logic here) but I'm not so sure about that as we'd start breaking the API contract ourselves... have to think about that some.
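As a rough illustration of why the lookback delta matters here: the sketch below assumes the Prometheus default lookback of 5 minutes and a simplified visibility rule (a sample counts at an evaluation time only if it is at most one lookback window older). The sample timestamps are from the "A" response above. The function name and the exact window boundaries are my assumptions, not promxy code.

```python
LOOKBACK_DELTA = 300.0  # seconds; Prometheus default, assumed to apply here

def visible(sample_ts: float, eval_ts: float, lookback: float = LOOKBACK_DELTA) -> bool:
    """Simplified rule: the sample must fall within `lookback` before eval_ts."""
    return eval_ts - lookback <= sample_ts <= eval_ts

requested_start = 1679827467.432
step = 2419

# First timestamps returned by VM (aligned to multiples of the step).
sample_times = [1679826170 + i * step for i in range(7)]
# Timestamps the query is evaluated at (requested start plus k * step).
eval_times = [requested_start + i * step for i in range(6)]

# For each evaluation time, is any returned sample inside the lookback window?
hits = [any(visible(s, t) for s in sample_times) for t in eval_times]
print(hits)
```

Every evaluation time misses, because the nearest earlier sample is about 1297s old, far beyond 300s. The offset is always smaller than the step, so queries with a step under the lookback window would still see their samples, which would fit with shorter ranges working.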

it causes high CPU utilization on VM instances

Out of curiosity, do you have some data on how much of an increase it was (how much QPS, and what the CPU util was before and after)? In general this is unfortunate, as there is seemingly no way to enable caching but still honor the API contract. (Remember the issue here was that the VM response didn't adhere to the start/end defined in the API call.) If the caller is something like grafana you could consider using trickster to cache at the API layer -- that does add complexity but may be a reasonable approach? I do have an issue to add caching to promxy but that is a relatively large lift and hasn't been a major priority for most.

jacksontj · 2023-04-16T05:18:32Z

After some consideration I don't think it's a good plan to implement the same adjust logic within promxy (as it does break the Prometheus API contract). That being said, making a middleware/proxy to do the VMAdjust for the query_range method should be pretty trivial to do (I hacked it up and it adds ~200k lines of dependencies, so just a bit much to add to this project -- which already has so many dependencies). IMO this sort of timestamp adjusting would make sense as a VM middleware proxy -- since it's their custom logic (which is non-standard and technically a violation of the API contract).

sc0rp10 changed the title ~~Promxy returns to rows when using long range~~ Promxy returns no rows when using long range Apr 2, 2023

jacksontj closed this as not planned Apr 16, 2023

joanmp-ndtx mentioned this issue Sep 9, 2024

Mismatch between Promxy and Trickster times returning empty responses #677

Closed
