re-add the missing prometheus_tsdb_wal_corruptions_total #473

krasi-georgiev · 2018-12-12T13:53:18Z

closes #471

After the new WAL was implemented this metric went missing, so this PR adds it again.
It is also checked in a test to make sure it works as expected.

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

krasi-georgiev · 2018-12-12T13:55:32Z

@codesome mind having a quick look?

codesome · 2018-12-14T01:06:45Z

A bit occupied at KubeCon and AFK. Will take a look in a couple of days.

codesome · 2018-12-15T21:07:57Z

head.go

@@ -480,10 +486,10 @@ func (h *Head) Init(minValidTime int64) error {
 		return nil
 	}
 	level.Warn(h.logger).Log("msg", "encountered WAL error, attempting repair", "err", err)
+	h.metrics.walCorruptionsTotal.Inc()


I believe we detect WAL corruptions when we call loadWAL. So do we also need to increment it for this: https://github.com/prometheus/tsdb/blob/9e51d56e08958f22f55daf26795ee477def7797e/head.go#L471-L473

And also maybe a small test for that?

krasi-georgiev · Dec 17, 2018

When this happens Prometheus will exit, so why would we increment there?

codesome · Dec 18, 2018

If there is no way to recover this Inc() info after we return, then there is no need to add it here.

simonpasquier · Dec 18, 2018

Maybe add the metrics directly to wal.Repair() so we can know when there is a corruption and whether or not it has been repaired?

krasi-georgiev · Dec 18, 2018

I don't think it makes much difference where we place it, but it's not a bad idea.

How would we know if the corruption has been repaired?

codesome · 2018-12-18T10:24:42Z

LGTM 👍

krasi-georgiev · 2018-12-18T14:32:46Z

@dkalashnik how are you using the prometheus_tsdb_wal_corruptions_total metric?
asking so that I know if the current implementation would work for your use case.

dkalashnik · 2018-12-18T14:45:23Z

@krasi-georgiev We are using it as a part of the generic prometheus dashboard in grafana, so no specific case.

krasi-georgiev · 2018-12-18T14:53:55Z

@dkalashnik how is your alerting defined based on this metric?

dkalashnik · 2018-12-18T14:56:01Z

@krasi-georgiev We don't have alerts based on that metric, just a panel in dashboard

https://github.com/prometheus/prometheus/releases

Some of these changes seem to be interesting enough to update:

[ENHANCEMENT] Query performance improvements. prometheus-junkyard/tsdb#531
[BUGFIX] Scrape: catch errors when creating HTTP clients #5182. Adds new metrics: prometheus_target_scrape_pools_*
deprecating the flag storage.tsdb.retention -> use storage.tsdb.retention.time
[FEATURE] Add subqueries to PromQL.
[ENHANCEMENT] Kubernetes SD: Add service external IP and external name to the discovery metadata. #4940
[ENHANCEMENT] Add metric for number of rule groups loaded. #5090
[BUGFIX] Make sure the retention period does not overflow. #5112
[BUGFIX] Make sure the blocks do not get very large. #5112
[BUGFIX] Do not generate blocks with no samples. prometheus-junkyard/tsdb#374
[BUGFIX] Reintroduce metric for WAL corruptions. prometheus-junkyard/tsdb#473

Signed-off-by: Mikkel Oscar Lyderik Larsen <mikkel.larsen@zalando.de>

re-add the missing prometheus_tsdb_wal_corruptions_total

9e51d56

krasi-georgiev force-pushed the wal-corrution-metric branch from a3ac035 to 9e51d56 on December 13, 2018 13:34

codesome reviewed Dec 15, 2018

krasi-georgiev merged commit 520ab7d into prometheus-junkyard:master Dec 18, 2018

krasi-georgiev deleted the wal-corrution-metric branch December 18, 2018 10:25

mikkeloscar mentioned this pull request Apr 7, 2019

Update prometheus to v2.8.1 zalando-incubator/kubernetes-on-aws#1979

Merged
