query/querier: fix sum() inflated values problem #1278
Add a test for a typical setup of one Sidecar connected + one or more Thanos Store nodes. Testing how the whole thing really works.
Nice, looking good so far.
It is to be expected that Prometheus code will select the latest value in any time window because otherwise the implicit conversion between raw and pre-aggregated would not work.
This is not needed.
Nice find. lgtm 👍
LGTM
Where is my approve button? I want to hit it. :-) Thanks!
I think we've reached sufficient consensus that this is correct. @bwplotka feel free to still review, but I'll go ahead and merge :)
Thanks! I think this fix makes sense, but I am worried there is more to it. E.g. I am not sure whether `sum_` shouldn't be handled the same way here. I need to dive into the overall caller logic as well to tell.
@@ -157,7 +157,8 @@ func aggrsFromFunc(f string) ([]storepb.Aggr, resAggr) {
	if f == "count" || strings.HasPrefix(f, "count_") {
		return []storepb.Aggr{storepb.Aggr_COUNT}, resAggrCount
	}
	if f == "sum" || strings.HasPrefix(f, "sum_") {
	// f == "sum" falls through here since we want the actual samples
Missing trailing period.
Also I don't understand this comment itself - it makes sense after reading this PR, but it's otherwise not clear. I would add more explanation here. (:
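To make the fall-through that the code comment describes concrete, here is a minimal, self-contained sketch. It is not the Thanos implementation: the `aggr` type, its constants, and `aggrsForFunc` are hypothetical stand-ins for `storepb.Aggr` and `aggrsFromFunc`, and `aggrFallthrough` is only a placeholder, since the diff does not show what the fall-through path actually returns. The sketch also assumes the fixed condition matches only the `sum_` prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// aggr is a hypothetical stand-in for storepb.Aggr.
type aggr string

const (
	aggrCount       aggr = "COUNT"
	aggrSum         aggr = "SUM"
	aggrFallthrough aggr = "FALLTHROUGH" // placeholder: the real default is not shown in the diff
)

// aggrsForFunc sketches the selection logic under the assumptions above.
func aggrsForFunc(f string) []aggr {
	if f == "count" || strings.HasPrefix(f, "count_") {
		return []aggr{aggrCount}
	}
	// Only sum_* functions (e.g. sum_over_time) request the pre-aggregated sum.
	// Plain "sum" is deliberately not matched, so it falls through and does not
	// add up every sample inside each downsampling window.
	if strings.HasPrefix(f, "sum_") {
		return []aggr{aggrSum}
	}
	return []aggr{aggrFallthrough}
}

func main() {
	for _, f := range []string{"count", "sum", "sum_over_time"} {
		fmt.Println(f, aggrsForFunc(f))
	}
}
```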
	return time.Unix(int64(s), int64(ns*float64(time.Second)))
}

st := ptm("0")
I would be clear in variable names here.
Fixes the problem with `sum` and inflated values as outlined in #922.

The problem was the following: `sum` selects the last value of each time series in each window and adds the different dimensions up. However, our code asks for the downsampled sum value, which equals all samples in a 5m or 1h window added together, and adding those up results in an inflated value. The practical result was as if we had applied `sum_over_time(...[5m])` on top. So the fix is to ask for the last sample in each window in the case of `sum` instead of the aggregated value.

Also adds an E2E test for a typical setup of one Sidecar connected + one or more Thanos Store nodes. It tests and shows how the whole thing with `sum` really works.

Testing: wrote a query like `sum(kafka_log_log_value{topic="iam"})` with identical Sidecar/Store nodes and selected `Max 5m/1h downsampling`. With this fix the values look sane; without it they do not (the value jumps up very high once the downsampled data comes into play).