
Reduce likelihood of data loss when remote endpoint has an outage #401

Merged (3 commits) Feb 17, 2021

Conversation

@rfratto (Member) commented Feb 9, 2021

PR Description

This PR increases the default WAL truncation frequency to 60 minutes to reduce the likelihood of samples being moved to a checkpoint while remote_write is failing.

An extra measure has been added to the truncate loop: a WAL truncation is skipped if the last remote_write timestamp hasn't changed since the previous truncation. This check applies after the min_wal_time and max_wal_time checks, so data older than max_wal_time will continue to be deleted.

Which issue(s) this PR fixes

Fixes #400 (not a complete fix, but a significant improvement).

Notes to the Reviewer

PR Checklist

  • CHANGELOG updated
  • Documentation added
  • Tests updated

@rfratto rfratto changed the title Reduce likelihood of data loss when remote_write has an outage Reduce likelihood of data loss when remote endpoint has an outage Feb 9, 2021

@pracucci left a comment


Thanks for working on an improvement. I agree it's not a proper fix, but it should definitely help.

@@ -44,7 +44,7 @@ var (
 var (
 	DefaultConfig = Config{
 		HostFilter:           false,
-		WALTruncateFrequency: 1 * time.Minute,
+		WALTruncateFrequency: 60 * time.Minute,

Some thoughts

Switching the frequency from 1m to 60m means that an agent with low/moderate traffic will have only a few segments over the last 60m.

Before this change a new segment was created every 1m; after this change a new segment is created once the previous one hits 128MB or after 60m, whichever comes first.

Storage.Truncate() creates a checkpoint if there are at least 3 segments, and the checkpoint contains the oldest third of the segments. Let's consider two opposite scenarios:

  • High volume (1 segment / minute): every 60m we create a checkpoint. The checkpoint contains the oldest 20m and the WAL segments contain the newest 40m. The longest outage we can tolerate is 40m.
  • Low volume (1 segment / hour): every 60m we create a checkpoint. Since we have 1 segment / hour, the checkpoint contains the oldest 1h and the WAL segments contain the newest 2h. The longest outage we can tolerate is 2h.

@rfratto @codesome is my analysis ☝️ correct? I'm wondering if:

  1. the agent should reduce the max segment size or make it configurable
  2. we should document that the longest outage without data loss is two-thirds of wal_truncate_frequency

@rfratto (Member, Author) commented Feb 17, 2021

I am going to merge this, even though it only increases remote_write outage tolerance to 39 minutes or 80 minutes, depending on how long it's been since the last checkpoint. A more permanent solution is being discussed in a design doc.

@rfratto rfratto merged commit d8d85f0 into grafana:master Feb 17, 2021
@rfratto rfratto deleted the reduce-dataloss-likelihood branch February 17, 2021 14:30
@mattdurham mentioned this pull request Sep 7, 2021
mattdurham pushed a commit that referenced this pull request Nov 11, 2021
* reduce likelihood of data loss when remote_write has an outage

* take out old config block

* s/some/some new
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Apr 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 21, 2024
Successfully merging this pull request may close these issues.

Data loss in the remote write when the remote endpoint has an outage