Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: change query for uptime stat panel #840

Merged
merged 4 commits into from
Aug 23, 2024
Merged

fix: change query for uptime stat panel #840

merged 4 commits into from
Aug 23, 2024

Conversation

VikaCep
Copy link
Contributor

@VikaCep VikaCep commented Jun 27, 2024

What this PR does / why we need it:

This PR fixes the query that is used to calculate the uptime stat panel for each check. As described in this escalation, this panel was displaying a value of 100% even when the probes had 100% errors reaching the target.

Which issue(s) this PR fixes:

Part of https://github.com/grafana/support-escalations/issues/11197. This solution addresses the first problem described here https://github.com/grafana/support-escalations/issues/11197#issuecomment-2195657543

@VikaCep VikaCep self-assigned this Jun 27, 2024
@VikaCep
Copy link
Contributor Author

VikaCep commented Jun 27, 2024

@mem I changed the query according to your comment's solution.

@VikaCep VikaCep requested a review from mem June 27, 2024 21:23
@VikaCep VikaCep marked this pull request as ready for review June 27, 2024 21:24
@VikaCep VikaCep requested a review from a team as a code owner June 27, 2024 21:24
@VikaCep VikaCep requested a review from ckbedwell June 27, 2024 21:24
sum_over_time(
# the inner query is going to produce a non-zero value if there was at least one successful check during the 5 minute window
# so make it a 1 if there was at least one success and a 0 otherwise
ceil(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this side of the query is missing a sum_over_time.

The idea is to compute the number of times at least one probe reported success and add that over the entire range (sum_over_range), and that number should be divided by the number of observations made (count_over_time).

@mem
Copy link
Contributor

mem commented Jun 27, 2024

I'm sorry, I just realized this is the query we had to fix for a different reason.

I think the query itself is correct. We had to add a transformation that collects the data for the entire range because Mimir has a limit on the amount of data that it's willing to retrieve.

I was confused because the "explore" link in the panel drops that transformation.

Let me take another looks.

@ckbedwell
Copy link
Contributor

@VikaCep
Copy link
Contributor Author

VikaCep commented Aug 23, 2024

I added a new uptime query version from @mem's proposal:

floor(
      # Report a 1 if there's a location where most observations were successful and 0 if most observations failed for all probes.
      max by (instance, job) (
        round(
          # the number of successes for each probe
          (increase(probe_all_success_sum{instance="$instance", job="$job"}[$__rate_interval]))
          /
          # the total number of times we checked for each probe
          ((increase(probe_all_success_count{instance="$instance", job="$job"}[$__rate_interval])))
        )
      )
    )

It is under a feature flag named uptime-query-v2. By default, we're using the original query.

@VikaCep VikaCep requested review from mem and peterschretlen August 23, 2024 17:45
Copy link
Contributor

@peterschretlen peterschretlen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🎉 , thanks @VikaCep!

@VikaCep VikaCep merged commit 90b81dd into main Aug 23, 2024
5 checks passed
@VikaCep VikaCep deleted the fix-uptime-query branch August 23, 2024 18:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants