Grafana agent 0.16.0 reporting incorrect metric values #675
Comments
Hey there, do you have any logs from when you upgraded to 0.16.0? I've been able to find a bug in the WAL replay code that can cause this situation, but it does depend on the WAL being replayed with at least one checkpoint or one segment. This doesn't necessarily require a process restart; it may also happen if the config is reloaded through /-/reload or if a scraper crashes. If you have logs, look for

For details on the bug: we recently replaced the series ID tracker to use an atomic variable instead of a mutex-protected uint64 in #660. As part of this change, we had to change how the ID is initialized when replaying the WAL. The idea is to find the highest ID across the entire WAL and initialize the series ID to that, so the next new series is given that ID + 1. An overlap of IDs will cause incorrect metrics to be pushed, as the ID is used for looking up labels for a sample in remote_write.

However, #660 incorrectly assumed that the series ID initialization was happening for the entire WAL, when it's actually happening per segment (and checkpoint). This means that if a segment with no new series records in it is replayed, the ID will be initialized to 0 regardless of the previous non-zero value. We'll write a test to check for this scenario as part of our bug fix.

If your WAL didn't replay, more investigative work is needed to figure out what might've happened here :) (Though obviously what I found is definitely a bug and will be fixed in #676)
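To make the failure mode concrete, here is a minimal Python sketch of the logic described above. It is a simplified model and not the agent's actual Go code: each WAL segment is represented only by the list of series IDs created in it, and the function names, segment contents, and metric names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical model: a segment is just the list of series IDs it introduces.
Segment = List[int]


def last_id_per_segment(segments: List[Segment]) -> int:
    """Buggy variant: the tracker is overwritten after every segment."""
    last_id = 0
    for seg in segments:
        # A segment with no series records yields 0 and discards the earlier value.
        last_id = max(seg, default=0)
    return last_id


def last_id_whole_wal(segments: List[Segment]) -> int:
    """Intended variant: keep the highest ID seen across the entire WAL."""
    last_id = 0
    for seg in segments:
        last_id = max(last_id, max(seg, default=0))
    return last_id


# Assumed WAL: the first segment creates series 1..3, the second holds only samples.
segments: List[Segment] = [[1, 2, 3], []]

# Assumed label table, keyed by series ID (metric names are made up).
labels_by_id: Dict[int, str] = {1: "agent_count", 2: "cpu_seconds_total", 3: "memory_bytes"}

buggy_next = last_id_per_segment(segments) + 1
fixed_next = last_id_whole_wal(segments) + 1

# With the buggy initialization, the next new series reuses an ID that is already taken.
buggy_collides = buggy_next in labels_by_id
fixed_collides = fixed_next in labels_by_id

print(f"buggy next ID: {buggy_next}, collides: {buggy_collides}")
print(f"fixed next ID: {fixed_next}, collides: {fixed_collides}")
```

In this model the trailing segment without series records resets the tracker, so the next new series is handed an ID that already maps to a different set of labels, which is the overlap described above.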
Hey @rfratto, as mentioned above, the 0.16.0 installs were on fresh machines, so there are no WALs to replay.
Hey @charlie-pisuraj! Understood, but like I mentioned, there are circumstances where WALs can be replayed even on a fresh machine:
(This can also happen if the process itself crashes or if any metrics-related component fails at runtime)
Looking through our setup code, we do call the /-/reload endpoint, so I'll start looking for logs.
Our setup:
We have a blue-green Grafana Agent cluster doing remote writes to Cortex.
We upgraded our Grafana Agent deployment to 0.16.0 from 0.13.1. We deployed 0.16.0 to our blue cluster, which consisted of brand new machines with no WALs from the previous installation. During the blue cluster deployment, our Cortex cluster was accepting metrics entirely from our green cluster, so the metrics were still correct at this point. Once all of our blue cluster machines were running, we triggered the upgrade on our green cluster; at this point Cortex started accepting writes from our 0.16.0 GAs, which were reporting incorrect values (e.g. we expect 64 for certain values but are getting 7 billion). All metrics being reported were significantly incorrect. At this point we immediately reverted to 0.13.1, which again reported the correct values.
Attached is one of our dashboards showing the number of agents we expect in one of our clusters. Please note I've converted the axis to Log10, as a linear scale makes the original value look like it's 0.