Mismatch between Promxy and Trickster times returning empty responses #677

joanmp-ndtx · 2024-09-09T21:41:02Z

Hi!

We are exploring a new setup for promxy and trickster, but we are facing time shifting issues and empty responses. Maybe with some similarity to #537

With our current setup (centralized cache):

trickster   ---> promxy ---> Prometheus A (ignore_error=false)
                        ---> Prometheus B (ignore_error=true)
                        ---> Prometheus C (ignore_error=true)

Assuming that we are requesting data ingested only in B, but B is returning a punctual error. A and C are returning a valid empty response.
Promxy, as configured, returns an empty response ignoring the B error.
But then trickster caches the empty response, returning it until the cache expires.

We can assume a punctual no data due to the punctual error, but not the long-term cached no data.

New explored setup (cache per servergroup):

promxy ---> trickster A ---> Prom A
       ---> trickster B ---> Prom B
       ---> trickster C ---> Prom C

With this new setup, errors on B are not cached by trickster but we've realised that some queries are returning unexpected empty responses:

{
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": []
    }
}

After some investigation, we are pointing to the time mismatch between Trickster and Promxy:

Trickster aligns (truncates) the timerange to the step in order to increase the cacheability. here
For example, a query range with the following params to trickster:

- step: 5m
- start: t1
- end: t21

Truncates the times and returns samples on the following timestamps:

t0,t5,t10,t15,t20

Given this trickster response, promxy "fixes" the samples to the requested times:

trickster responses: t0, t5, t10, t15, t20
promxy responses:    t1, t6, t11, t16, t21

So, in our responses:

when the time diff is <=5m: we are returning wrong values: value of t0 in t1, value of t5 in t6, ...
when the time diff is >5m: Promxy discards the samples from trickster and returns an empty response.

At this point, truncating the time range (as trickster does) before promxy, could fix our value shifting and empty responses.

But we've realized that promxy can adapts the start and end time in some cases, for example, when the query includes an offset.

promxy/pkg/proxystorage/proxy.go

Lines 388 to 395 in 7c5eda6

    
           // Function to recursivelt remove offset. This is needed as we're using 
        
           // the node API to String() the query to downstreams. Promql's iterators require 
        
           // that the time be the absolute time, whereas the API returns them based on the 
        
           // range you ask for (with the offset being implicit) 
        
           removeOffsetFn := func() error { 
        
           	_, err := parser.Walk(ctx, &OffsetRemover{}, s, node, nil, nil) 
        
           	return err 
        
           }

So we are exposed again to the shifted values and empty responses...

and here some questions...

Why is promxy substracting the offset to the timestamp?
Is it needed internally by prom or is a promxy optimization?
We are not able to found where this offset is "restored" (added again).
Apart of the offset, are there more cases where the time range is modified by promxy? "@" modifier?

The text was updated successfully, but these errors were encountered:

jacksontj · 2024-09-10T04:12:31Z

First off, thanks for the question! And a nicely presented one at that! I'll go ahead and jump into it:

But then trickster caches the empty response, returning it until the cache expires.

To me this seems like "expected behavior" since promxy is configured to ignore the error; so it is "as good" as if it returned nothing.

But we've realized that promxy can adapts the start and end time in some cases, for example, when the query includes an offset.

So AFAIK promxy doesn't mess with the "step" of a query. The example you posted there is for handling the case where someone does a query like foo offset 12h -- and in that case promxy needs to make the query downstream without the offset (so that the timestamps align for promql math). The only other ones that come to top-of-mind are the absolute/relative time_range -- but those just truncate the range (no re-alignment).

At this point, truncating the time range (as trickster does) before promxy, could fix our value shifting and empty responses.

That is my expected behavior as well. When I've run trickster + promxy I have always run it as trickster -> promxy and haven't had issues. TBF I have never used ignore_error in a production environment -- so I haven't had to deal with those issues.

Why is promxy substracting the offset to the timestamp?

I answered this above, so I'll consider this answered :)

Is it needed internally by prom or is a promxy optimization?

To clarify a bit; this is a promxy thing. Promql has some internal logic for altering the timestamp in the offset -- but when promxy does nodereplacer it doesn't have that same logic -- so we have to adjust the offset to get a timeseries back at the correct time.

We are not able to found where this offset is "restored" (added again).

The offset is restored actually a couple lines above. TLDR we effectively extract the offset (assuming it matches within the subtree), remove it from the subtree, then "add" it to the range start/end.

Apart of the offset, are there more cases where the time range is modified by promxy? "@" modifier?

I covered this a bit earlier; but I don't think there are times where promxy re-aligns the step like that. As a matter of fact there was a bug earlier where that was happening when it shouldn't (#663).

So overall for your use-case putting something in front of promxy to handle the "extent alignment" fixes the "shifting" and your approach of caching pre-prometheus API seems to address that side.

This also reminds me of an old issue I created for adding cache-control headers in responses (trickstercache/trickster#234) -- the idea here was that when we have a partial response we can return a value (since we merged what we have) and give context to the cache on how long to hold it. Upstream trickster did add the functionality but I never got around to implementing this entirely in promxy (some initial hacks of it from 5 years ago -- https://github.com/jacksontj/promxy/tree/etag). So if this seems like a better approach we could look into finally implementing the etag support (a bit more complicated with ignore_error nowadays).

joanmp-ndtx · 2024-09-11T09:00:19Z

@jacksontj thanks for your time and detailed answers!

As per your comments, our approach will be returning to the trickster -> promxy architecture, with all server groups as mandatory.

(Our "optionals" server groups had "non-critical" metrics (eg: non-productive env's), but now seems reasonable configuring them as mandatory)

jacksontj · 2024-09-14T05:05:28Z

I'm glad that we were able to sort that out! Metrics systems seem simple; but can easily get quite complex :)

I'll go ahead and close this out for now; but if you have any further questions feel free to re-open or create a new issue; happy promxy-ing :)

jacksontj added the question label Sep 10, 2024

jacksontj closed this as completed Sep 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mismatch between Promxy and Trickster times returning empty responses #677

Mismatch between Promxy and Trickster times returning empty responses #677

joanmp-ndtx commented Sep 9, 2024

jacksontj commented Sep 10, 2024

joanmp-ndtx commented Sep 11, 2024

jacksontj commented Sep 14, 2024

Mismatch between Promxy and Trickster times returning empty responses #677

Mismatch between Promxy and Trickster times returning empty responses #677

Comments

joanmp-ndtx commented Sep 9, 2024

jacksontj commented Sep 10, 2024

joanmp-ndtx commented Sep 11, 2024

jacksontj commented Sep 14, 2024