
Weekly Emails - generating reports is sometimes broken #2472

Closed
fbarl opened this issue Jan 17, 2019 · 14 comments
Assignees: foot
Labels: bug (broken end user functionality; not working as the developers intended it), component/users, stale (Bulk closing old, stale issues)

Comments


fbarl commented Jan 17, 2019

I just tried to generate a report preview in https://frontend.dev.weave.works/admin/users/weeklyreports for our Weave Cloud (Dev) instance and got {"errors":[{"message":"An internal server error occurred"}]} in the browser.

A closer inspection of the users service logs shows:

2019-01-17T15:00:34.030603718Z time="2019-01-17T15:00:34Z" level=error msg="POST /admin/users/weeklyreports/preview: execution: multiple matches for labels: grouping labels must ensure unique matches"

The error seems to occur with Prometheus queries and points at this line of code: https://github.com/prometheus/prometheus/blob/a1f34bec2e6584a2fee9aec901f3157e3e12cbaa/promql/engine.go#L1498

It's probably related to:

func buildWorkloadsResourceConsumptionQuery(resourceQuery string) string {

The scope of the issue is unclear.
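
For reference, the kind of shape that triggers this error is a many-to-one match whose "one" side isn't unique for the matching labels. A purely illustrative example (not the actual query built by buildWorkloadsResourceConsumptionQuery, which isn't shown here):

# If more than one job scrapes node_cpu for the same instance, the right-hand
# side returns two series per instance and the engine raises
# "multiple matches for labels: grouping labels must ensure unique matches".
sum by (namespace, pod_name, instance) (rate(container_cpu_usage_seconds_total{image!=''}[1m]))
  / on (instance) group_left
  count by (instance, job) (node_cpu{mode='idle'})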

fbarl added the bug and component/users labels Jan 17, 2019

ngehani commented Jan 17, 2019

Oy! Who is working on deciphering this, @guyfedwards @foot, while @fbarl is on vacation next week?


fbarl commented Jan 17, 2019

FYI, running the same queries as the weekly reporter in notebooks on our Weave Cloud (Dev) instance does result in the same errors for the period of last week: https://frontend.dev.weave.works/proud-wind-05/monitor/notebook/931f18f1-5516-4f40-bdd9-a03aa3f24f60?timestamp=2019-01-14T00:00:00Z


The same query passes if we shift the window 3 days later (https://frontend.dev.weave.works/proud-wind-05/monitor/notebook/931f18f1-5516-4f40-bdd9-a03aa3f24f60?timestamp=2019-01-17T00:00:00Z), so I wonder if some sort of outage or corrupted data is to blame.

In any case, we should probably edit the queries to make them more robust (after we pin down the exact issue).


foot commented Jan 18, 2019

Yep, I can have a look on Monday!

foot closed this as completed Jan 21, 2019

foot commented Jan 21, 2019

Didn't mean to close this...

foot reopened this Jan 21, 2019

foot commented Jan 21, 2019

This seems to be the worst point, where you cannot get a table for the first query: https://frontend.dev.weave.works/proud-wind-05/monitor/notebook/39882902-c2f5-4030-af6c-92aeda4f7e1d?timestamp=2019-01-07T18:00:00Z


foot commented Jan 22, 2019

I'm not getting very far w/ this. Comparing the sum by (namespace, pod_name) ... side of the join when it's good and when it's bad looks incredibly similar, so I'm really not sure what the error message means.


@dlespiau any ideas about making this query more robust?

We could still roll this out. Some users might not get an error report one week...
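
For what it's worth, the error is about the "one" side of a many-to-one group_left match (the right-hand side when group_left is used) having more than one series per combination of matching labels, so the side worth diffing is the one opposite the sum by (namespace, pod_name). A hedged duplicate check, guessing that overlapping cadvisor series are the culprit:

# Pods whose name shows up on more than one instance during the bad window;
# duplicates like this on the "one" side of a join trigger the engine error.
count by (namespace, pod_name) (
  count by (namespace, pod_name, instance) (container_cpu_usage_seconds_total{image!=''})
) > 1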

foot self-assigned this Jan 23, 2019

foot commented Jan 30, 2019

Some more poking around here: https://frontend.dev.weave.works/proud-wind-05/monitor/notebook/ddd09f7e-17e4-4ca2-8017-043d3f463353?range=15m&timestamp=2019-01-07T17:42:49Z

I can make it work by excluding a particular container (dbshell.*), but I haven't figured out what it is about that vector that clashes with the other one...
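
Presumably the exclusion is along these lines (a guess at the matcher, assuming the container_name label cadvisor exposed at the time; the notebook's exact query isn't reproduced here):

sum by (namespace, pod_name) (
  rate(container_cpu_usage_seconds_total{image!='', container_name!~'dbshell.*'}[1m])
)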


foot commented Jan 30, 2019

@bboreham any thoughts on where the Error: multiple matches for labels: grouping labels must ensure unique matches message might be coming from in the above notebook?

My next step would be to try and dump out that time block into a local prom instance that I could perhaps add additional debugging code to. I will read up on exporting in a bit...


foot commented Jan 30, 2019

Alrighty, updated the notebook again w/ another variation that works, down at the very bottom:

sum by (namespace, pod_name, job) (rate(container_cpu_usage_seconds_total{image!=''}[1m])) / ignoring(namespace, pod_name, job) group_left count(node_cpu{mode='idle'})

Will job always be cadvisor?
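
One way to sidestep the job question entirely would be to collapse the capacity side to a scalar so that no vector matching happens at all; a sketch, not tested against this data:

sum by (namespace, pod_name) (rate(container_cpu_usage_seconds_total{image!=''}[1m]))
  / scalar(count(node_cpu{mode='idle'}))

With no label matching there is nothing for the uniqueness check to trip on, though an empty node_cpu selection then yields NaN values rather than an explicit error.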


foot commented Feb 25, 2019

Opened an issue in cortexproject/cortex#1245

ozamosi added the stale label Nov 4, 2021
ozamosi closed this as completed Nov 4, 2021