Reduce likelihood of data loss when remote endpoint has an outage #401
Conversation
Thanks for working on an improvement. I agree it's not a proper fix, but it should definitely help.
```diff
@@ -44,7 +44,7 @@ var (
 var (
 	DefaultConfig = Config{
 		HostFilter:           false,
-		WALTruncateFrequency: 1 * time.Minute,
+		WALTruncateFrequency: 60 * time.Minute,
```
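For context, a setting like this typically drives a periodic truncation loop. Here's a minimal sketch of that pattern; the types and names below are illustrative stand-ins, not the agent's actual code:

```go
package sketch

import (
	"context"
	"time"
)

// Illustrative stand-ins for the agent's real types.
type Config struct{ WALTruncateFrequency time.Duration }

type Storage struct{}

// Truncate stands in for the agent's Storage.Truncate, which checkpoints
// and drops WAL data older than mint (milliseconds).
func (s *Storage) Truncate(mint int64) {}

// runTruncateLoop shows how a WALTruncateFrequency-style setting drives a
// periodic truncation: with the new default it fires hourly instead of
// every minute, so samples stay in replayable segments longer.
func runTruncateLoop(ctx context.Context, cfg Config, wal *Storage) {
	ticker := time.NewTicker(cfg.WALTruncateFrequency)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			wal.Truncate(time.Now().Add(-cfg.WALTruncateFrequency).UnixMilli())
		}
	}
}
```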
Some thoughts
Switching the frequency from 1m to 60m means that an agent with low/moderate traffic will have only a few segments over the last 60m. Before this change a segment was created every 1m; after this change a new segment is created once the previous one hits the 128MB size limit or after 60m.
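A minimal sketch of that rollover rule, assuming a Prometheus-style WAL (128MB is the Prometheus default segment size; the function and its names are illustrative):

```go
package sketch

const defaultSegmentSize = 128 * 1024 * 1024 // Prometheus WAL segment default (128MB)

// needNewSegment sketches the rollover rule: a fresh segment is started
// when the current one would exceed the size limit, or when a truncation
// tick (every wal_truncate_frequency) forces a rotation.
func needNewSegment(currentSize, recordSize int, truncating bool) bool {
	return truncating || currentSize+recordSize > defaultSegmentSize
}
```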
Storage.Truncate() creates a checkpoint if there are at least 3 segments, and the checkpoint will contain the lowest 1/3 of segments. Let's consider two opposite scenarios (a sketch of the cutoff rule follows the list):
- High volume (1 segment / minute): every 60m we create a checkpoint. The checkpoint contains the oldest 20m and the WAL segments contain the newest 40m. The longest outage we can tolerate is 40m.
- Low volume (1 segment / hour): every 60m we create a checkpoint. Since we have 1 segment / hour, the checkpoint contains the oldest 1h and the WAL segments contain the newest 2h. The longest outage we can tolerate is 2h.
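To make the arithmetic concrete, here's a rough sketch of the "lowest 1/3" cutoff described above (the function name is illustrative, not the agent's actual code):

```go
package main

import "fmt"

// checkpointRange sketches the "lowest 1/3" rule: given the first and
// last WAL segment numbers, it returns the inclusive range of segments
// to fold into a checkpoint, or ok=false with fewer than 3 segments.
func checkpointRange(first, last int) (cpFirst, cpLast int, ok bool) {
	if last-first < 2 { // need at least 3 segments
		return 0, 0, false
	}
	return first, first + (last-first)/3, true
}

func main() {
	// High volume: 60 one-minute segments -> segments 0..19 (the oldest
	// ~20m) are checkpointed, the newest ~40m stay replayable.
	fmt.Println(checkpointRange(0, 59)) // 0 19 true
	// Low volume: 3 one-hour segments -> only the oldest hour is
	// checkpointed, the newest ~2h stay replayable.
	fmt.Println(checkpointRange(0, 2)) // 0 0 true
}
```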
@rfratto @codesome is my analysis ☝️ correct? I'm wondering if:
- the agent should reduce the max segment size or make it configurable
- we should document that the longest outage without data loss is two thirds of wal_truncate_frequency
I am going to merge this, even though it only increases remote_write outage tolerance to 39 minutes or 80 minutes, depending on how long it's been since the last checkpoint. A more permanent solution is being discussed in a design doc.
* reduce likelihood of data loss when remote_write has an outage
* take out old config block
* s/some/some new
PR Description
This PR increases the default WAL truncation frequency (`wal_truncate_frequency`) to 60 minutes to reduce the likelihood of samples being moved to a checkpoint while remote_write is failing.
An extra measure has been added to the truncate loop: a WAL truncation will be skipped if the last remote_write timestamp hasn't changed. This applies after the `min_wal_time` and `max_wal_time` checks, so data older than `max_wal_time` should continue to be deleted.
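As an illustration, here's a minimal sketch of that skip check (the identifiers are assumptions, not the agent's actual code):

```go
package sketch

import "time"

// truncateLoop remembers what it saw on the previous tick.
type truncateLoop struct {
	lastSeenTS int64 // remote_write timestamp from the previous tick
}

// maybeTruncate clamps the newest timestamp confirmed by remote_write to
// the min/max WAL window, then skips truncation if remote_write has made
// no progress since the previous tick.
func (l *truncateLoop) maybeTruncate(now time.Time, remoteWriteTS int64,
	minWALTime, maxWALTime time.Duration, truncate func(mint int64)) {

	newestAllowed := now.Add(-minWALTime).UnixMilli()
	oldestAllowed := now.Add(-maxWALTime).UnixMilli()

	ts := remoteWriteTS
	if ts > newestAllowed {
		ts = newestAllowed // always keep at least min_wal_time of data
	}
	if ts < oldestAllowed {
		ts = oldestAllowed // never keep more than max_wal_time of data
	}

	// The extra measure: no remote_write progress means no truncation.
	// Since oldestAllowed advances with the clock, a WAL that exceeds
	// max_wal_time still gets truncated even when remote_write is stuck.
	if ts == l.lastSeenTS {
		return
	}
	l.lastSeenTS = ts
	truncate(ts)
}
```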
Which issue(s) this PR fixes
Fixes #400 (kind of, it's not perfect, but it's significantly better).
Notes to the Reviewer
PR Checklist