
Grafana agent 0.16.0 reporting incorrect metric values #675

Closed
charlie-pisuraj opened this issue Jun 21, 2021 · 4 comments · Fixed by #676
Labels: frozen-due-to-age (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.)

Comments


charlie-pisuraj commented Jun 21, 2021

Our setup:
We have a blue-green Grafana Agent cluster doing remote writes to Cortex.

We upgraded our Grafana Agent deployment from 0.13.1 to 0.16.0. We deployed 0.16.0 to our blue cluster, which consisted of brand-new machines with no WALs from the previous installation. During the blue cluster deployment, our Cortex cluster was still accepting metrics only from our green cluster, so the metrics were still correct at that point. Once all of our blue cluster machines were running, we triggered the upgrade on our green cluster; at this point Cortex started accepting writes from our 0.16.0 agents, which were reporting incorrect values (e.g. we expected 64 for certain values but were getting 7 billion). All metrics being reported were significantly incorrect. We immediately reverted to 0.13.1, which again reported the correct values.

Attached is one of our dashboards showing the number of agents we expect in one of our clusters. Please note I've converted the axis to log10, as a linear scale makes the original value look like it's 0.

[Screenshot: dashboard panel, axis converted to log10, captured 2021-06-21 at 2:06:52 PM]

rfratto (Member) commented Jun 21, 2021

Hey there, do you have any logs from when you upgraded to 0.16.0?

I've been able to find a bug in the WAL replay code that can cause this situation, but it does depend on the WAL being replayed with at least one checkpoint or one segment. This doesn't necessarily require a process restart, but may happen if the config is reloaded through /-/reload or if a scraper crashes.

If you have logs, look for "WAL checkpoint loaded" and "WAL segment loaded". If both appear, or either appears more than once, that is likely the trigger for the issue you saw.

For details on the bug: we recently replaced the series ID tracker to use an atomic variable instead of a mutex-protected uint64 in #660. As part of this change, we had to change how the ID is initialized when replaying the WAL. The idea is to find the highest ID across the entire WAL and initialize the series ID to that, so the next new series is given that ID + 1. An overlap of IDs will cause incorrect metrics to be pushed, as the ID is used for looking up labels for a sample in remote_write.
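For illustration, here is a minimal Go sketch of that allocation scheme (not the agent's actual code; the type and function names are hypothetical): the counter is seeded with the highest ID found during replay, and each new series then gets the next ID atomically.

```go
// Minimal sketch of the idea described above, not the agent's actual code.
// A mutex-protected counter is replaced by an atomic one; WAL replay must
// seed it with the highest ID already on disk so new series never collide
// with replayed ones.
package main

import (
	"fmt"
	"sync/atomic"
)

type seriesIDTracker struct {
	lastID uint64 // highest series ID handed out so far
}

// nextID atomically allocates a new series ID (previous highest + 1).
func (t *seriesIDTracker) nextID() uint64 {
	return atomic.AddUint64(&t.lastID, 1)
}

// seedFromReplay raises lastID to at least id; called while replaying the WAL.
func (t *seriesIDTracker) seedFromReplay(id uint64) {
	for {
		cur := atomic.LoadUint64(&t.lastID)
		if id <= cur || atomic.CompareAndSwapUint64(&t.lastID, cur, id) {
			return
		}
	}
}

func main() {
	var t seriesIDTracker
	t.seedFromReplay(41)    // highest ID seen anywhere in the WAL
	fmt.Println(t.nextID()) // 42: no collision with replayed series
}
```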

However, #660 incorrectly assumed that the series ID initialization was happening for the entire WAL, when it's actually happening per segment (and checkpoint). This means that if a replayed segment contains no new series records, the ID will be initialized to 0 regardless of the previous non-zero value. We'll write a test to check for this scenario as part of our bug fix.
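And a minimal, hypothetical Go sketch of how the per-segment initialization goes wrong, assuming a trivially simplified segment structure rather than the real WAL format:

```go
// Hypothetical illustration of the replay bug described above (not the real
// replay loop). Re-initializing the tracker from each segment in isolation
// means a segment with no series records resets it to 0.
package main

import "fmt"

type segment struct{ seriesIDs []uint64 }

// maxID returns the highest series ID recorded in a single segment.
func maxID(s segment) uint64 {
	var max uint64
	for _, id := range s.seriesIDs {
		if id > max {
			max = id
		}
	}
	return max
}

func main() {
	segments := []segment{
		{seriesIDs: []uint64{1, 2, 3}},
		{seriesIDs: nil}, // segment containing only samples, no new series
	}

	// Buggy: re-initialize from each segment in isolation.
	var buggy uint64
	for _, s := range segments {
		buggy = maxID(s) // last segment has no series -> resets to 0
	}

	// Fixed: keep the maximum across the entire WAL.
	var fixed uint64
	for _, s := range segments {
		if m := maxID(s); m > fixed {
			fixed = m
		}
	}

	fmt.Println(buggy, fixed) // 0 3: the buggy value collides with existing IDs
}
```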

If your WAL didn't replay, more investigative work is needed to figure out what might've happened here :) (Though what I found is definitely a bug and will be fixed in #676.)

charlie-pisuraj (Author) commented:

Hey @rfratto, as mentioned above, the 0.16.0 installs were on fresh machines, so there are no WALs to replay.

rfratto (Member) commented Jun 21, 2021

Hey @charlie-pisuraj! Understood, but as I mentioned, there are circumstances where WALs can be replayed even on a fresh machine:

This doesn't necessarily require a process restart, but may happen if the config is reloaded through /-/reload or if a scraper crashes.

(This can also happen if the process itself crashes or if any metrics-related component fails at runtime)

charlie-pisuraj (Author) commented:

Looking through our setup code, we do call the /-/reload endpoint, so I'll start looking for logs.

github-actions bot added the frozen-due-to-age label on Feb 23, 2024
github-actions bot locked as resolved and limited conversation to collaborators on Feb 23, 2024