
Data loss in the remote write when the remote endpoint has an outage #400

Closed
pracucci opened this issue Feb 9, 2021 · 0 comments · Fixed by #401

pracucci commented Feb 9, 2021

We had a relatively short (10m) Cortex outage on the write path, during which the server returned 5xx errors, and once the outage was resolved we noticed missing samples while the agent was catching up.

Suspected root cause

After some investigation and discussion with @rfratto, it looks like the cause is the aggressive checkpointing done by the agent.

The agent tries to truncate the WAL every wal_truncate_frequency period (defaults to 1 minute):
https://github.com/grafana/agent/blob/master/pkg/prom/instance/instance.go#L658
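For illustration, a minimal sketch of such a periodic truncation loop, assuming a hypothetical walStorage interface (truncateLoop and walStorage are illustrative names, not the agent's actual code):

```go
package sketch

import (
	"context"
	"log"
	"time"
)

// walStorage is a stand-in for the agent's WAL storage; only the Truncate
// method matters for this issue.
type walStorage interface {
	Truncate(mint int64) error
}

// truncateLoop truncates the WAL on every tick of wal_truncate_frequency.
// Each call cuts a new segment and may trigger a checkpoint (see below).
func truncateLoop(ctx context.Context, freq time.Duration, s walStorage) {
	ticker := time.NewTicker(freq) // wal_truncate_frequency, 1m by default
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			mint := time.Now().UnixNano() / int64(time.Millisecond) // truncation timestamp in ms
			if err := s.Truncate(mint); err != nil {
				log.Printf("error truncating WAL: %v", err)
			}
		}
	}
}
```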

Storage.Truncate() starts a new WAL segment and then creates a new checkpoint if the number of WAL segments is >= 3 (since a segment is created every 1m by default, there will be at least 3 segments every 2 minutes):
https://github.com/grafana/agent/blob/master/pkg/prom/wal/wal.go#L414

wal.Checkpoint() moves older segments into the checkpoint: the segments are removed from the wal/ directory and moved into the wal/checkpoint.xxxxx/ directory.
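To make the flow concrete, a simplified sketch of the Truncate → Checkpoint decision as described above (listSegments and checkpointUpTo are hypothetical stand-ins, not the actual helpers in pkg/prom/wal/wal.go):

```go
package sketch

import "fmt"

// listSegments is a stand-in that returns the lowest and highest segment
// numbers currently present in the wal/ directory.
func listSegments(walDir string) (first, last int, err error) {
	return 0, 3, nil // real code would scan walDir for numbered segment files
}

// checkpointUpTo is a stand-in for wal.Checkpoint(): it writes the records
// still needed into wal/checkpoint.xxxxx/ and deletes the covered segments
// from wal/.
func checkpointUpTo(walDir string, first, upTo int) error {
	fmt.Printf("moving segments %d..%d into %s/checkpoint.%08d/\n", first, upTo, walDir, upTo)
	return nil
}

// maybeCheckpoint mirrors the decision described above: only checkpoint once
// there are at least 3 segments. With a new segment cut every
// wal_truncate_frequency (1m by default), this happens roughly every 2m.
func maybeCheckpoint(walDir string) error {
	first, last, err := listSegments(walDir)
	if err != nil {
		return err
	}
	if last-first < 2 {
		return nil // fewer than 3 segments, nothing to do yet
	}
	// Older segments are checkpointed and removed from wal/: any samples they
	// contained are no longer visible to a reader that only tails wal/.
	return checkpointUpTo(walDir, first, last-1)
}
```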

The remote write replays samples only from the WAL segments, not from the checkpoint (readSegment() is called with tail=false for the checkpoint). So when the remote endpoint (e.g. Cortex) has an outage, all samples contained in segments which have been moved into the checkpoint are skipped (never remote written) once the endpoint recovers from the outage.
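A sketch of what the tail flag implies for a single segment read (record kinds and helpers are simplified stand-ins, not the actual Watcher code):

```go
package sketch

// walRecord is a simplified stand-in for a decoded WAL record.
type walRecord struct {
	kind string // "series" or "samples"
}

func storeSeriesRefs(r walRecord)       {} // stand-in: remember series ref -> labels
func enqueueForRemoteWrite(r walRecord) {} // stand-in: hand samples to the remote write queue

// readSegmentSketch mirrors the behaviour described above: when a segment is
// read as part of the checkpoint (tail=false), sample records are skipped, so
// anything checkpointed during the outage is never remote written.
func readSegmentSketch(records []walRecord, tail bool) {
	for _, rec := range records {
		switch rec.kind {
		case "series":
			storeSeriesRefs(rec) // always processed: label sets survive
		case "samples":
			if !tail {
				continue // checkpoint replay drops samples here
			}
			enqueueForRemoteWrite(rec)
		}
	}
}
```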

Reproduced in the local env

I set up a local Cortex cluster and simulated the outage we had in production, and verified that the agent effectively loses data once Cortex gets back online. To show it easily, I added a log line to the Cortex distributor printing the min/max timestamp of the samples within each remote write request, simulated the outage, and observed how the agent behaves:
https://gist.github.com/pracucci/b83c192d253e730b2cf59adeb0fc9e50

Once Cortex is back online, the TSDB wal.Watcher (which was stalled because all remote write shard queues were full due to the outage) resumes tailing the WAL and fails as soon as it tries to read from its current segment, which has already been moved into the checkpoint:

ts=2021-02-08T22:32:38.768506Z caller=dedupe.go:112 agent=prometheus instance=485e88be7f25c7cd3e3dadfc714dde9b component=remote level=error remote_name=485e88-a3ff21 url=http://distributor:8001/api/prom/push msg="error tailing WAL" err="open /tmp/485e88be7f25c7cd3e3dadfc714dde9b/wal/00000001: no such file or directory"

This causes the wal.Watcher to replay the WAL from scratch. It replays the checkpoint first (reading only series records and skipping samples) and then starts replaying the remaining segments. The samples which have been moved into the checkpoint are never replayed, and this is the data loss we observed.
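Putting it together, the restart sequence after the error above looks roughly like this (again a sketch with hypothetical helpers, not the actual Watcher.Run code):

```go
package sketch

import "fmt"

// lastCheckpoint, segments and readSegments are stand-ins for the watcher's
// internals; readSegments behaves like readSegmentSketch above.
func lastCheckpoint(walDir string) (dir string, index int, err error) {
	return walDir + "/checkpoint.00000002", 2, nil
}
func segments(walDir string) (first, last int, err error) { return 3, 4, nil }
func readSegments(dir string, tail bool) error            { return nil }

func segmentPath(walDir string, i int) string { return fmt.Sprintf("%s/%08d", walDir, i) }

// replayFromScratch mirrors what happens after the "no such file or
// directory" error: the watcher starts over, reads series (but not samples)
// from the newest checkpoint, then tails the segments still present in wal/.
// Samples that were moved into the checkpoint are never enqueued again.
func replayFromScratch(walDir string) error {
	cpDir, cpIndex, err := lastCheckpoint(walDir)
	if err != nil {
		return err
	}
	// Step 1: checkpoint replay with tail=false -- series only, samples skipped.
	if err := readSegments(cpDir, false); err != nil {
		return err
	}
	// Step 2: resume from the first segment after the checkpoint. Samples that
	// used to live in the checkpointed segments are gone from the remote
	// write's point of view: this is the observed data loss.
	_, last, err := segments(walDir)
	if err != nil {
		return err
	}
	for seg := cpIndex + 1; seg <= last; seg++ {
		if err := readSegments(segmentPath(walDir, seg), true); err != nil {
			return err
		}
	}
	return nil
}
```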
