
Data loss in the remote write when the remote endpoint has an outage #400

Closed
pracucci opened this issue Feb 9, 2021 · 0 comments · Fixed by #401

pracucci commented Feb 9, 2021

We had a relatively short (10m) Cortex outage on the write path, during which the server returned 5xx errors, and once the outage was resolved we noticed missing samples while the agent was catching up.

Suspected root cause

After some investigation and discussion with @rfratto, it looks like the cause is the aggressive checkpointing done by the agent.

The agent tries to truncate the WAL every wal_truncate_frequency period (defaults to 1 minute):
https://github.com/grafana/agent/blob/master/pkg/prom/instance/instance.go#L658
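For illustration, a minimal sketch of such a periodic truncation loop, assuming a hypothetical walStorage interface (truncateLoop and walStorage are illustrative names, not the agent's actual code):

```go
package sketch

import (
	"context"
	"log"
	"time"
)

// walStorage is a stand-in for the agent's WAL storage; only the Truncate
// method matters for this issue.
type walStorage interface {
	Truncate(mint int64) error
}

// truncateLoop truncates the WAL on every tick of wal_truncate_frequency.
// Each call cuts a new segment and may trigger a checkpoint (see below).
func truncateLoop(ctx context.Context, freq time.Duration, s walStorage) {
	ticker := time.NewTicker(freq) // wal_truncate_frequency, 1m by default
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			mint := time.Now().UnixNano() / int64(time.Millisecond) // truncation timestamp in ms
			if err := s.Truncate(mint); err != nil {
				log.Printf("error truncating WAL: %v", err)
			}
		}
	}
}
```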

Storage.Truncate() starts a new WAL segment and then creates a new checkpoint if the number of WAL segments is >= 3 (since a segment is created every 1m by default, there will be at least 3 segments every 2 minutes):
https://github.com/grafana/agent/blob/master/pkg/prom/wal/wal.go#L414

wal.Checkpoint() moves older segments into the checkpoint: the segments are removed from the wal/ directory and moved into the wal/checkpoint.xxxxx/ directory.
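To make the flow concrete, a simplified sketch of the Truncate → Checkpoint decision as described above (listSegments and checkpointUpTo are hypothetical stand-ins, not the actual helpers in pkg/prom/wal/wal.go):

```go
package sketch

import "fmt"

// listSegments is a stand-in that returns the lowest and highest segment
// numbers currently present in the wal/ directory.
func listSegments(walDir string) (first, last int, err error) {
	return 0, 3, nil // real code would scan walDir for numbered segment files
}

// checkpointUpTo is a stand-in for wal.Checkpoint(): it writes the records
// still needed into wal/checkpoint.xxxxx/ and deletes the covered segments
// from wal/.
func checkpointUpTo(walDir string, first, upTo int) error {
	fmt.Printf("moving segments %d..%d into %s/checkpoint.%08d/\n", first, upTo, walDir, upTo)
	return nil
}

// maybeCheckpoint mirrors the decision described above: only checkpoint once
// there are at least 3 segments. With a new segment cut every
// wal_truncate_frequency (1m by default), this happens roughly every 2m.
func maybeCheckpoint(walDir string) error {
	first, last, err := listSegments(walDir)
	if err != nil {
		return err
	}
	if last-first < 2 {
		return nil // fewer than 3 segments, nothing to do yet
	}
	// Older segments are checkpointed and removed from wal/: any samples they
	// contained are no longer visible to a reader that only tails wal/.
	return checkpointUpTo(walDir, first, last-1)
}
```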

The remote write replays samples only from the WAL segments, not from the checkpoint (readSegment() is called with tail=false for the checkpoint). So when the remote endpoint (e.g. Cortex) has an outage, all samples contained in segments which have been moved into the checkpoint are skipped (never remote written) once the endpoint recovers from the outage.
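A sketch of what the tail flag implies for a single segment read (record kinds and helpers are simplified stand-ins, not the actual Watcher code):

```go
package sketch

// walRecord is a simplified stand-in for a decoded WAL record.
type walRecord struct {
	kind string // "series" or "samples"
}

func storeSeriesRefs(r walRecord)       {} // stand-in: remember series ref -> labels
func enqueueForRemoteWrite(r walRecord) {} // stand-in: hand samples to the remote write queue

// readSegmentSketch mirrors the behaviour described above: when a segment is
// read as part of the checkpoint (tail=false), sample records are skipped, so
// anything checkpointed during the outage is never remote written.
func readSegmentSketch(records []walRecord, tail bool) {
	for _, rec := range records {
		switch rec.kind {
		case "series":
			storeSeriesRefs(rec) // always processed: label sets survive
		case "samples":
			if !tail {
				continue // checkpoint replay drops samples here
			}
			enqueueForRemoteWrite(rec)
		}
	}
}
```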

Reproduced in the local env

I set up a local Cortex cluster and simulated the outage we had in production, and verified that the agent effectively loses data once Cortex gets back online. To show it easily, I added a log line to the Cortex distributor printing the min/max timestamp of the samples within each remote write request, simulated the outage, and observed how the agent behaves:
https://gist.github.com/pracucci/b83c192d253e730b2cf59adeb0fc9e50

Once Cortex is back online, the TSDB wal.Watcher (which was stalled because all remote write shard queues were full due to the outage) resumes tailing the WAL and fails as soon as it tries to read from its current segment, which has already been moved into the checkpoint:

ts=2021-02-08T22:32:38.768506Z caller=dedupe.go:112 agent=prometheus instance=485e88be7f25c7cd3e3dadfc714dde9b component=remote level=error remote_name=485e88-a3ff21 url=http://distributor:8001/api/prom/push msg="error tailing WAL" err="open /tmp/485e88be7f25c7cd3e3dadfc714dde9b/wal/00000001: no such file or directory"

This causes the wal.Watcher to replay the WAL from scratch. It replays the checkpoint first (reading only series records and skipping samples) and then starts replaying the remaining segments. The samples which have been moved into the checkpoint are never replayed, and this is the data loss we observed.
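Putting it together, the restart sequence after the error above looks roughly like this (again a sketch with hypothetical helpers, not the actual Watcher.Run code):

```go
package sketch

import "fmt"

// lastCheckpoint, segments and readSegments are stand-ins for the watcher's
// internals; readSegments behaves like readSegmentSketch above.
func lastCheckpoint(walDir string) (dir string, index int, err error) {
	return walDir + "/checkpoint.00000002", 2, nil
}
func segments(walDir string) (first, last int, err error) { return 3, 4, nil }
func readSegments(dir string, tail bool) error            { return nil }

func segmentPath(walDir string, i int) string { return fmt.Sprintf("%s/%08d", walDir, i) }

// replayFromScratch mirrors what happens after the "no such file or
// directory" error: the watcher starts over, reads series (but not samples)
// from the newest checkpoint, then tails the segments still present in wal/.
// Samples that were moved into the checkpoint are never enqueued again.
func replayFromScratch(walDir string) error {
	cpDir, cpIndex, err := lastCheckpoint(walDir)
	if err != nil {
		return err
	}
	// Step 1: checkpoint replay with tail=false -- series only, samples skipped.
	if err := readSegments(cpDir, false); err != nil {
		return err
	}
	// Step 2: resume from the first segment after the checkpoint. Samples that
	// used to live in the checkpointed segments are gone from the remote
	// write's point of view: this is the observed data loss.
	_, last, err := segments(walDir)
	if err != nil {
		return err
	}
	for seg := cpIndex + 1; seg <= last; seg++ {
		if err := readSegments(segmentPath(walDir, seg), true); err != nil {
			return err
		}
	}
	return nil
}
```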
