Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query/querier: fix sum() inflated values problem #1278

Merged
merged 7 commits into from
Jun 27, 2019

Conversation

GiedriusS
Copy link
Member

@GiedriusS GiedriusS commented Jun 25, 2019

Fixes the problem with sum and inflated values as outlined here: #922.

The problem was the following: sum selects the last value of each time series in each window and adds the different dimensions up however our code asks for the downsampled sum value which is equal to all samples added up in either a 5m or 1h window, and adding those up, obviously, results in an inflated value. The practical result was that as if we applied sum_over_time(...[5m]) on top. So the fix is to ask for the last sample in each window in the case of sum instead of the aggregated value.

Also adds an E2E test for a typical setup of one Sidecar connected + one or more
Thanos Store nodes. Tests and shows how the whole thing with sum really works.

Testing: wrote a query like sum(kafka_log_log_value{topic="iam"}) with identical Sidecar/Store nodes and selected Max 5m/1h downsampling. With this fix the values look sane, without - not (it jumps up very high after the downsampled data comes into play).

Screenshot from 2019-06-26 10-43-51
Screenshot from 2019-06-26 10-44-01

Add a test for a typical setup of one Sidecar connected + one or more
Thanos Store nodes. Testing how the whole thing really works.
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, looking good so far.

Giedrius Statkevičius added 3 commits June 25, 2019 16:15
It is to be expected that Prometheus code will select the latest value
in any time window because otherwise the implicit conversion between raw
and pre-aggregated would not work.
@GiedriusS GiedriusS changed the title query/querier_test: add minimal test for typical setup query/querier: fix sum() inflated values problem Jun 26, 2019
This is not needed.
@GiedriusS GiedriusS marked this pull request as ready for review June 26, 2019 07:40
@GiedriusS
Copy link
Member Author

GiedriusS commented Jun 26, 2019

cc @jjneely. It's funny that we've arrived that the same conclusion even though I haven't looked at the comment that you've made in #922.

@brancz
Copy link
Member

brancz commented Jun 26, 2019

Nice find. lgtm 👍

Copy link
Member

@povilasv povilasv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jjneely
Copy link
Contributor

jjneely commented Jun 27, 2019

Where is my approve button? I want to hit it. :-) Thanks!

@brancz
Copy link
Member

brancz commented Jun 27, 2019

I think we've reached sufficient consensus that this is correct. @bwplotka feel free to still review, but I'll go ahead and merge :)

@brancz brancz merged commit 19a51b1 into thanos-io:master Jun 27, 2019
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think this fix makes sense but I am worried there is more to it. E.g I am not sure if sum_ shouldn't be the same here. I need to dive into overall caller logic as well to tell.

@@ -157,7 +157,8 @@ func aggrsFromFunc(f string) ([]storepb.Aggr, resAggr) {
if f == "count" || strings.HasPrefix(f, "count_") {
return []storepb.Aggr{storepb.Aggr_COUNT}, resAggrCount
}
if f == "sum" || strings.HasPrefix(f, "sum_") {
// f == "sum" falls through here since we want the actual samples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing trailing period.

Also I don't understand this comment itself - it makes sense after reading this PR, but it's otherwise not clear. I would add more explanation here. (:

return time.Unix(int64(s), int64(ns*float64(time.Second)))
}

st := ptm("0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be clear in variable names here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants