Reduce likelihood of data loss when remote endpoint has an outage #401

Merged 3 commits on Feb 17, 2021
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -13,6 +13,13 @@ can be found at [#317](https://github.com/grafana/agent/issues/317).
- [BUGFIX] Fixed a bug from v0.12.0 where the Loki installation script failed
because positions_directory was not set. (@rfratto)

- [BUGFIX] (#400) Reduce the likelihood of data loss during a remote_write-side
outage by increasing the default wal_truncate_frequency to 60m and preventing
the WAL from being truncated if the last truncation timestamp hasn't changed.
This change increases the size of the WAL on average, and users may configure
a lower wal_truncate_frequency to deliberately choose a smaller WAL over
write guarantees. (@rfratto)

# v0.12.0 (2021-02-05)

BREAKING CHANGES: This release has two breaking changes in the configuration
15 changes: 8 additions & 7 deletions docs/configuration-reference.md
@@ -372,13 +372,14 @@ host_filter_relabel_configs:
[ - <relabel_config> ... ]

# How frequently the WAL truncation process should run. Every iteration of
# truncation will checkpoint old series, create a new segment for new samples,
# and remove old samples that have been successfully sent via remote_write.
# If there are multiple remote_write endpoints, the endpoint with the
# earliest timestamp is used for the cutoff period, ensuring that no data
# gets truncated until all remote_write configurations have been able to
# send the data.
[wal_truncate_frequency: <duration> | default = "1m"]
# the truncation will checkpoint old series and remove old samples. If data
# has not been sent within this window, some of it may be lost.
#
# The size of the WAL will increase with less frequent truncations. Making
# truncations more frequent reduces the size of the WAL but increases the
# chances of data loss when remote_write is failing for longer than the
# specified frequency.
[wal_truncate_frequency: <duration> | default = "60m"]
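
As a rough illustration of that trade-off, here is a minimal Go sketch (illustrative only: the struct mirrors just the fields visible in the pkg/prom/instance/instance.go diff below, not the agent's full Config) that starts from the new 60m default and deliberately opts into more frequent truncation, trading write guarantees for a smaller WAL:

```go
package main

import (
	"fmt"
	"time"
)

// Config mirrors a handful of the instance options shown in the diff below;
// the real agent Config has many more fields and YAML tags.
type Config struct {
	HostFilter           bool
	WALTruncateFrequency time.Duration
	MinWALTime           time.Duration
	MaxWALTime           time.Duration
	RemoteFlushDeadline  time.Duration
}

// DefaultConfig reflects the new defaults introduced by this PR.
var DefaultConfig = Config{
	HostFilter:           false,
	WALTruncateFrequency: 60 * time.Minute,
	MinWALTime:           5 * time.Minute,
	MaxWALTime:           4 * time.Hour,
	RemoteFlushDeadline:  1 * time.Minute,
}

func main() {
	// Start from the defaults, then truncate more often to keep the WAL small,
	// accepting that less unsent data survives a remote_write outage.
	cfg := DefaultConfig
	cfg.WALTruncateFrequency = 15 * time.Minute

	fmt.Printf("truncating every %v (default %v)\n",
		cfg.WALTruncateFrequency, DefaultConfig.WALTruncateFrequency)
}
```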

# The minimum amount of time that series and samples should exist in the WAL
# before being considered for deletion. The consumed disk space of the WAL will
12 changes: 11 additions & 1 deletion pkg/prom/instance/instance.go
@@ -44,7 +44,7 @@ var (
var (
DefaultConfig = Config{
HostFilter: false,
WALTruncateFrequency: 1 * time.Minute,
WALTruncateFrequency: 60 * time.Minute,

Some thoughts

Switching the frequency from 1m to 60m means that an agent with low/moderate traffic will have few segments over the last 60m.

Before this change you were creating a segment every 1m, while after this change a new segment is created once the previous one hits 128MB or after 60m.

Storage.Truncate() creates a checkpoint if there are at least 3 segments, and the checkpoint will contain the oldest 1/3 of the segments. Let's consider two opposite scenarios:

  • High volume (1 segment / minute): every 60m we create a checkpoint. The checkpoint contains the oldest 20m and the WAL segments contain the newest 40m. The longest outage we can tolerate is 40m.
  • Low volume (1 segment / hour): every 60m we create a checkpoint. Since we have 1 segment / hour, the checkpoint contains the oldest 1h and the WAL segments contain the newest 2h. The longest outage we can tolerate is 2h.

@rfratto @codesome is my analysis ☝️ correct? I'm wondering if:

  1. the agent should reduce the max segment size or make it configurable
  2. we should document that the longest outage without data loss is two thirds of wal_truncate_frequency
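
A rough check of the arithmetic above, as a small Go sketch. It assumes a steady segment rate and that a checkpoint removes the oldest third of the segments once at least 3 exist, as described for Storage.Truncate():

```go
package main

import (
	"fmt"
	"time"
)

// tolerableOutage roughly estimates how long remote_write can be down before
// truncation starts deleting unsent data. Assumptions: segments are produced
// at a steady rate of one per segmentInterval, a checkpoint only happens once
// there are at least 3 segments, and it removes the oldest third of them.
func tolerableOutage(truncateFrequency, segmentInterval time.Duration) time.Duration {
	segments := int(truncateFrequency / segmentInterval)
	if segments < 3 {
		segments = 3 // no checkpoint until at least 3 segments accumulate
	}
	kept := segments - segments/3 // the newest ~2/3 of segments survive
	return time.Duration(kept) * segmentInterval
}

func main() {
	// High volume, 1 segment per minute: ~40m tolerated.
	fmt.Println(tolerableOutage(60*time.Minute, time.Minute))
	// Low volume, 1 segment per hour: ~2h tolerated.
	fmt.Println(tolerableOutage(60*time.Minute, time.Hour))
}
```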

MinWALTime: 5 * time.Minute,
MaxWALTime: 4 * time.Hour,
RemoteFlushDeadline: 1 * time.Minute,
@@ -627,6 +627,10 @@ func (i *Instance) newDiscoveryManager(ctx context.Context, cfg *Config) (*disco
}

func (i *Instance) truncateLoop(ctx context.Context, wal walStorage, cfg *Config) {
// Track the last timestamp we truncated for to prevent segments from getting
// deleted until at least some new data has been sent.
var lastTs int64 = math.MinInt64

for {
select {
case <-ctx.Done():
@@ -654,6 +658,12 @@ func (i *Instance) truncateLoop(ctx context.Context, wal walStorage, cfg *Config
ts = maxTS
}

if ts == lastTs {
level.Debug(i.logger).Log("msg", "not truncating the WAL, remote_write timestamp is unchanged", "ts", ts)
continue
}
lastTs = ts

level.Debug(i.logger).Log("msg", "truncating the WAL", "ts", ts)
err := wal.Truncate(ts)
if err != nil {
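
The guard generalizes to any periodic truncation loop. Below is a condensed, standalone Go sketch of the pattern added in this hunk; the truncator interface, the lowestSentTimestamp callback, and the plain log calls are stand-ins for the agent's walStorage, remote_write timestamp lookup, and go-kit logger:

```go
package walsketch

import (
	"context"
	"log"
	"math"
	"time"
)

// truncator is a stand-in for the subset of the WAL storage used here.
type truncator interface {
	Truncate(mint int64) error
}

// truncateLoop periodically truncates the WAL up to the timestamp returned by
// lowestSentTimestamp (e.g. the earliest timestamp successfully delivered by
// any remote_write queue). If that timestamp has not advanced since the last
// run, the truncation is skipped so no segments are deleted while remote_write
// is stalled.
func truncateLoop(ctx context.Context, wal truncator, frequency time.Duration, lowestSentTimestamp func() int64) {
	var lastTS int64 = math.MinInt64

	ticker := time.NewTicker(frequency)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			ts := lowestSentTimestamp()
			if ts == lastTS {
				log.Printf("not truncating the WAL, remote_write timestamp is unchanged: ts=%d", ts)
				continue
			}
			lastTS = ts

			log.Printf("truncating the WAL: ts=%d", ts)
			if err := wal.Truncate(ts); err != nil {
				// Truncation failure is non-fatal; the next tick will retry.
				log.Printf("could not truncate WAL: %v", err)
			}
		}
	}
}
```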