
Reduce likelihood of data loss when remote endpoint has an outage #401

Merged (3 commits) Feb 17, 2021

Conversation

@rfratto (Member) commented Feb 9, 2021

PR Description

This PR increases the default WAL truncation frequency to 60 minutes to reduce the likelihood of samples being moved to a checkpoint while remote_write is failing.

An extra measure has been added to the truncate loop: a WAL truncation is skipped if the last remote_write timestamp hasn't changed since the previous truncation. This check applies after the min_wal_time and max_wal_time checks, so data older than max_wal_time will continue to be deleted.

Which issue(s) this PR fixes

Fixes #400 (not a complete fix, but a significant improvement).

Notes to the Reviewer

PR Checklist

  • CHANGELOG updated
  • Documentation added
  • Tests updated

@rfratto rfratto changed the title Reduce likelihood of data loss when remote_write has an outage Reduce likelihood of data loss when remote endpoint has an outage Feb 9, 2021

@pracucci left a comment


Thanks for working on an improvement. I agree it's not a proper fix, but it should definitely help.

@@ -44,7 +44,7 @@ var (
 var (
 	DefaultConfig = Config{
 		HostFilter:           false,
-		WALTruncateFrequency: 1 * time.Minute,
+		WALTruncateFrequency: 60 * time.Minute,

Some thoughts

Switching the frequency from 1m to 60m means that an agent with low/moderate traffic will have only a few segments over the last 60m.

Before this change a new segment was created every 1m; after this change a new segment is created once the previous one hits 128MB or after 60m, whichever comes first.

Storage.Truncate() creates a checkpoint if there are at least 3 segments, and the checkpoint contains the oldest third of the segments. Let's consider two opposite scenarios:

  • High volume (1 segment / minute): every 60m we create a checkpoint. The checkpoint contains the oldest 20m and the WAL segments contain the newest 40m. The longest outage we can tolerate is 40m.
  • Low volume (1 segment / hour): every 60m we create a checkpoint. Since we have 1 segment / hour, the checkpoint contains the oldest 1h and the WAL segments contain the newest 2h. The longest outage we can tolerate is 2h.

@rfratto @codesome is my analysis ☝️ correct? I'm wondering if:

  1. the agent should reduce the max segment size or make it configurable
  2. we should document that the longest outage without data loss is two-thirds of wal_truncate_frequency

@rfratto (Member, Author) commented Feb 17, 2021

I am going to merge this, even though it only increases remote_write outage tolerance to 39 minutes or 80 minutes, depending on how long it's been since the last checkpoint. A more permanent solution is being discussed in a design doc.

@rfratto rfratto merged commit d8d85f0 into grafana:master Feb 17, 2021
@rfratto rfratto deleted the reduce-dataloss-likelihood branch February 17, 2021 14:30
@mattdurham mentioned this pull request Sep 7, 2021
mattdurham pushed a commit that referenced this pull request Nov 11, 2021
* reduce likelihood of data loss when remote_write has an outage

* take out old config block

* s/some/some new
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Apr 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 21, 2024
Successfully merging this pull request may close these issues.

Data loss in the remote write when the remote endpoint has an outage