'sum by' inconsistent when grouped by multiple labels (Loki as Prometheus data source) #2334

Closed
Kayakflo opened this issue Jul 10, 2020 · 7 comments · Fixed by #2346
@Kayakflo

Kayakflo commented Jul 10, 2020

Environment:

OS: Ubuntu 18.04
Docker: 5:19.03.12
Loki: 1.4.1
Promtail: 1.4.1
Grafana: 6.7.3

Describe the bug

Disclaimer: Loki is my first contact with LogQL and Prometheus functions.
This might not be a bug after all, but I have not found any online resource that helped me understand the observed behavior.


I am using Loki as a Prometheus data source in Grafana to display the rate of certain messages and add an alert to it.
The following behavior was visible in multiple different queries, so here is just one example query in which I want to monitor the rate of messages that include the label rh_unknown="true":

sum by (rh_customer, rh_stage, severity) (rate({app="my-app",rh_unknown="true",rh_customer=~"customerA|customerB",severity=~"$Severity", rh_stage=~"$Stage"}[15s])) * 15

For this example you can assume that all variables were set to "All".
The query above works fine, you can find its output in the attached file 'customer-stage-severity.json'.
The data displayed matches the actual log lines.
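
For readers unfamiliar with the `* 15` factor: in LogQL, `rate(...[15s])` yields log entries per second averaged over the 15 s range, so multiplying by 15 turns it back into an entry count per window. A minimal arithmetic sketch (the function names and numbers are illustrative only, not taken from the attached data):

```python
def per_second_rate(entries: int, window_seconds: int) -> float:
    """Average number of entries per second over the window."""
    return entries / window_seconds


def entries_in_window(rate: float, window_seconds: int) -> float:
    """Undo the per-second normalization, as `* 15` does in the query."""
    return rate * window_seconds


# Hypothetical example: 30 matching log lines inside one 15 s window.
rate = per_second_rate(30, 15)          # 2.0 entries per second
count = entries_in_window(rate, 15)     # 30.0 entries per window
```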

If I change the order inside sum by to the following, the displayed number of messages per group changes every time I run the query:
sum by (rh_stage, severity, rh_customer) (rate({app="my-app",rh_unknown="true",rh_customer=~"customerA|customerB",severity=~"$Severity", rh_stage=~"$Stage"}[15s])) * 15

It looks as if Loki (or Grafana?) no longer knows for sure which severity each message has, which leads to a different distribution every time. The more severity types are present, the more frequently the variation occurs.
You can find the output in the attached files stage-severity-customer-1 and stage-severity-customer-2 which is the same query run twice in a row.

Things look even more concerning if I change the order to this:
sum by (severity, rh_stage, rh_customer) (rate({app="my-app",rh_unknown="true",rh_customer=~"customerA|customerB",severity=~"$Severity", rh_stage=~"$Stage"}[15s])) * 15

This time, data is switching not only between severities within the same customer but also between customers.
In that sense it looks like totally randomized data on each query, which is supported by the color changes in Grafana.
The more customers are selected, the more frequently the variation occurs.
You can find the output in the attached files severity-stage-customer-1 and severity-stage-customer-2 which is the same query run twice in a row.

The total sum of all counters remains steady and is correct, so no data is added or removed between queries.
Also the distribution over time remains steady and is correct.
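
One way to check this invariant on saved responses is to sum every sample value across all series. The sketch below assumes the responses use the Prometheus-style matrix layout (`data.result[].values` holding `[timestamp, "value"]` pairs). The two sample responses are made up for illustration and are not the attached files.

```python
from typing import Any, Dict


def total_of_matrix(response: Dict[str, Any]) -> float:
    """Sum every sample value across all series of a matrix response."""
    total = 0.0
    for series in response["data"]["result"]:
        for _timestamp, value in series["values"]:
            total += float(value)  # sample values are encoded as strings
    return total


# Hypothetical runs: the same samples, attributed to different groups.
run_1 = {"data": {"resultType": "matrix", "result": [
    {"metric": {"rh_customer": "customerA", "severity": "error"},
     "values": [[1594300000, "3"], [1594300015, "1"]]},
    {"metric": {"rh_customer": "customerB", "severity": "warn"},
     "values": [[1594300000, "2"]]},
]}}
run_2 = {"data": {"resultType": "matrix", "result": [
    {"metric": {"rh_customer": "customerA", "severity": "error"},
     "values": [[1594300000, "2"]]},
    {"metric": {"rh_customer": "customerB", "severity": "warn"},
     "values": [[1594300000, "3"], [1594300015, "1"]]},
]}}

same_total = total_of_matrix(run_1) == total_of_matrix(run_2)
```

Equal totals with different per-group values match the symptom described here: nothing is lost, but samples end up in the wrong group.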

To Reproduce

  • Add Loki as a Prometheus datasource in Grafana
  • Add a query following the schema shown above ('best' results with at least 2 options per variable selected)
  • Change order of labels in sum by
  • Watch colors, numbers and distributions switch (more often with bigger dataset / more variables) when running the same query multiple times
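
To compare two runs without relying on colors, the saved responses can be diffed per label set. This is a rough sketch assuming the Prometheus-style matrix layout; `series_by_labels` and `differing_groups` are hypothetical helper names, and the sample data is invented.

```python
from typing import Any, Dict, List, Tuple

LabelKey = Tuple[Tuple[str, str], ...]


def series_by_labels(response: Dict[str, Any]) -> Dict[LabelKey, List[Tuple[int, float]]]:
    """Index a matrix response by its label set, ignoring label and series order."""
    indexed = {}
    for series in response["data"]["result"]:
        key = tuple(sorted(series["metric"].items()))
        indexed[key] = [(ts, float(value)) for ts, value in series["values"]]
    return indexed


def differing_groups(first: Dict[str, Any], second: Dict[str, Any]) -> List[LabelKey]:
    """Return the label sets whose samples differ between the two responses."""
    a, b = series_by_labels(first), series_by_labels(second)
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))


# Hypothetical pair of runs in which two groups swapped their values.
first = {"data": {"result": [
    {"metric": {"severity": "error", "rh_customer": "customerA"}, "values": [[1594300000, "3"]]},
    {"metric": {"severity": "warn", "rh_customer": "customerB"}, "values": [[1594300000, "2"]]},
]}}
second = {"data": {"result": [
    {"metric": {"rh_customer": "customerB", "severity": "warn"}, "values": [[1594300000, "3"]]},
    {"metric": {"rh_customer": "customerA", "severity": "error"}, "values": [[1594300000, "2"]]},
]}}

changed = differing_groups(first, second)
```

An empty list means both runs agree for every group; here both groups are reported.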

Expected behavior

Unless I have just not found the right documentation, the order of labels in a sum by aggregation should have no influence on the results, which should stay the same after reordering.
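
The expectation can be illustrated with a small order-independent aggregation. This is only a sketch of the expected semantics, not Loki's implementation; `group_key` and `sum_by` are hypothetical names, and the samples are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

GroupKey = Tuple[Tuple[str, str], ...]


def group_key(labels: Dict[str, str], by: Iterable[str]) -> GroupKey:
    """Build a grouping key that does not depend on the order of `by`."""
    # Sorting the grouping label names makes every ordering yield the same key.
    return tuple((name, labels.get(name, "")) for name in sorted(by))


def sum_by(samples: List[Tuple[Dict[str, str], float]], by: List[str]) -> Dict[GroupKey, float]:
    """Sum sample values per group of the selected labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[group_key(labels, by)] += value
    return dict(totals)


samples = [
    ({"rh_customer": "customerA", "rh_stage": "prod", "severity": "error"}, 2.0),
    ({"rh_customer": "customerA", "rh_stage": "prod", "severity": "error"}, 1.0),
    ({"rh_customer": "customerB", "rh_stage": "prod", "severity": "warn"}, 4.0),
]

working_order = sum_by(samples, ["rh_customer", "rh_stage", "severity"])
other_order = sum_by(samples, ["severity", "rh_stage", "rh_customer"])
```

Both orderings produce identical groups and sums, which is the behavior expected from the queries above.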

Screenshots, Promtail config, or terminal output

I have attached named responses in a ZIP archive.
Can provide screenshots if needed.
20200709 - Loki Findings.zip

@cyriltovena
Contributor

Hello @Kayakflo !

Let's start with updating your Loki instance to latest; we have fixed a couple of those bugs recently.

If you still see the same behavior, I'll dig into it more.

Thanks !

@Kayakflo
Author

Hi @cyriltovena ,

Thank you very much for your fast response!
I have updated Loki to 1.5.0 and went through the scenario again.
Unfortunately the issue still persists.

Please let me know if I can provide more information here!

@cyriltovena
Contributor

Well, it's on my list now. If you get a chance, try latest too.

@Kayakflo
Author

Sorry for misreading your first response.
Tried latest as well, issue was still present.

Thank you for noting it down!

@cyriltovena
Contributor

cyriltovena commented Jul 13, 2020

Sorry about that, I guess I always ordered all my labels correctly and never stumbled upon this one.

@slim-bean
Collaborator

Thanks again @Kayakflo for the detailed issue! This was a great bug to find and fix!

@Kayakflo
Author

Awesome, thanks a lot for your help and the quick responses!
I just redeployed the latest version and can confirm the fix on our side as well.
