
Grafana agent 0.16.0 reporting incorrect metric values #675

Closed
charlie-pisuraj opened this issue Jun 21, 2021 · 4 comments · Fixed by #676
Labels: frozen-due-to-age (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.)

Comments


charlie-pisuraj commented Jun 21, 2021

Our setup:
We have a blue-green Grafana Agent cluster doing remote writes to Cortex.

We upgraded our Grafana Agent deployment from 0.13.1 to 0.16.0. We deployed 0.16.0 to our blue cluster, which consisted of brand-new machines with no WALs from the previous installation. During the blue cluster deployment, our Cortex cluster was still accepting metrics only from our green cluster, so the metrics were still correct at that point. Once all of our blue cluster machines were running, we triggered the upgrade on our green cluster; at this point Cortex started accepting writes from our 0.16.0 agents, which were reporting incorrect values (e.g. we expected 64 for certain values but were getting 7 billion). All metrics being reported were significantly incorrect. We immediately reverted to 0.13.1, which again reported the correct values.

Attached is one of our dashboards showing the number of agents we expect in one of our clusters. Please note I've converted the axis to log10, as a linear scale makes the original value look like it's 0.

[Screenshot: dashboard panel, axis converted to log10, captured 2021-06-21 at 2:06:52 PM]

rfratto (Member) commented Jun 21, 2021

Hey there, do you have any logs from when you upgraded to 0.16.0?

I've been able to find a bug in the WAL replay code that can cause this situation, but it does depend on the WAL being replayed with at least one checkpoint or one segment. This doesn't necessarily require a process restart, but may happen if the config is reloaded through /-/reload or if a scraper crashes.

If you have logs, look for "WAL checkpoint loaded" and "WAL segment loaded". If both appear, or either appears more than once, that is likely the trigger for the issue you saw.

For details on the bug: we recently replaced the series ID tracker to use an atomic variable instead of a mutex-protected uint64 in #660. As part of this change, we had to change how the ID is initialized when replaying the WAL. The idea is to find the highest ID across the entire WAL and initialize the series ID to that, so the next new series is given that ID + 1. An overlap of IDs will cause incorrect metrics to be pushed, as the ID is used for looking up labels for a sample in remote_write.
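For illustration, here is a minimal Go sketch of that allocation scheme (not the agent's actual code; the type and function names are hypothetical): the counter is seeded with the highest ID found during replay, and each new series then gets the next ID atomically.

```go
// Minimal sketch of the idea described above, not the agent's actual code.
// A mutex-protected counter is replaced by an atomic one; WAL replay must
// seed it with the highest ID already on disk so new series never collide
// with replayed ones.
package main

import (
	"fmt"
	"sync/atomic"
)

type seriesIDTracker struct {
	lastID uint64 // highest series ID handed out so far
}

// nextID atomically allocates a new series ID (previous highest + 1).
func (t *seriesIDTracker) nextID() uint64 {
	return atomic.AddUint64(&t.lastID, 1)
}

// seedFromReplay raises lastID to at least id; called while replaying the WAL.
func (t *seriesIDTracker) seedFromReplay(id uint64) {
	for {
		cur := atomic.LoadUint64(&t.lastID)
		if id <= cur || atomic.CompareAndSwapUint64(&t.lastID, cur, id) {
			return
		}
	}
}

func main() {
	var t seriesIDTracker
	t.seedFromReplay(41)    // highest ID seen anywhere in the WAL
	fmt.Println(t.nextID()) // 42: no collision with replayed series
}
```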

However, #660 incorrectly assumed that the series ID initialization was happening for the entire WAL, when it's actually happening per segment (and checkpoint). This means that if a replayed segment contains no new series records, the ID will be initialized to 0 regardless of the previous non-zero value. We'll write a test to check for this scenario as part of our bug fix.
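And a minimal, hypothetical Go sketch of how the per-segment initialization goes wrong, assuming a trivially simplified segment structure rather than the real WAL format:

```go
// Hypothetical illustration of the replay bug described above (not the real
// replay loop). Re-initializing the tracker from each segment in isolation
// means a segment with no series records resets it to 0.
package main

import "fmt"

type segment struct{ seriesIDs []uint64 }

// maxID returns the highest series ID recorded in a single segment.
func maxID(s segment) uint64 {
	var max uint64
	for _, id := range s.seriesIDs {
		if id > max {
			max = id
		}
	}
	return max
}

func main() {
	segments := []segment{
		{seriesIDs: []uint64{1, 2, 3}},
		{seriesIDs: nil}, // segment containing only samples, no new series
	}

	// Buggy: re-initialize from each segment in isolation.
	var buggy uint64
	for _, s := range segments {
		buggy = maxID(s) // last segment has no series -> resets to 0
	}

	// Fixed: keep the maximum across the entire WAL.
	var fixed uint64
	for _, s := range segments {
		if m := maxID(s); m > fixed {
			fixed = m
		}
	}

	fmt.Println(buggy, fixed) // 0 3: the buggy value collides with existing IDs
}
```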

If your WAL didn't replay, more investigative work is needed to figure out what might've happened here :) (Though what I found is definitely a bug and will be fixed in #676.)

charlie-pisuraj (Author) commented:

Hey @rfratto, as mentioned above, the 0.16.0 installs were on fresh machines, so there are no WALs to replay.

rfratto (Member) commented Jun 21, 2021

Hey @charlie-pisuraj! Understood, but as I mentioned, there are circumstances where WALs can be replayed even on a fresh machine:

This doesn't necessarily require a process restart, but may happen if the config is reloaded through /-/reload or if a scraper crashes.

(This can also happen if the process itself crashes or if any metrics-related component fails at runtime)

charlie-pisuraj (Author) commented:

Looking through our setup code, we do call the /-/reload endpoint, so I'll start looking for logs.

github-actions bot added the frozen-due-to-age label on Feb 23, 2024
github-actions bot locked as resolved and limited conversation to collaborators on Feb 23, 2024