Skip to content
This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

re-add the missing prometheus_tsdb_wal_corruptions_total #473

Merged

Conversation

krasi-georgiev
Copy link
Contributor

@krasi-georgiev krasi-georgiev commented Dec 12, 2018

closes #471

after implementing the new WAL this metric was missing so adding it again.
Also added it in a test to make sure it works as expected.

Signed-off-by: Krasi Georgiev kgeorgie@redhat.com

@krasi-georgiev
Copy link
Contributor Author

@codesome mind having a quick look?

after implementing the new WAL this metric was missing so adding it
again.
Also added it in a test to make sure it works as expected.

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
@codesome
Copy link
Contributor

Bit occupied in KubeCon and AFK. Will take a look in couple of days.

@@ -480,10 +486,10 @@ func (h *Head) Init(minValidTime int64) error {
return nil
}
level.Warn(h.logger).Log("msg", "encountered WAL error, attempting repair", "err", err)
h.metrics.walCorruptionsTotal.Inc()
Copy link
Contributor

@codesome codesome Dec 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we detect WAL corruptions when we call loadWAL. So do we also need to increment it for this: https://github.com/prometheus/tsdb/blob/9e51d56e08958f22f55daf26795ee477def7797e/head.go#L471-L473

And also maybe a small test for that?

Copy link
Contributor Author

@krasi-georgiev krasi-georgiev Dec 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this happens Prometheus will exist, so why would we increment there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no way to recover this Inc() info after we return, then there is no need of adding it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the metrics directly to wal.Repair() so we can know when there is a corruption and whether or not it has been repaired?

Copy link
Contributor Author

@krasi-georgiev krasi-georgiev Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think it makes much difference and where we place it, but not a bad idea.

How would we know if the corruption has been repaired?

@codesome
Copy link
Contributor

LGTM 👍

@krasi-georgiev krasi-georgiev merged commit 520ab7d into prometheus-junkyard:master Dec 18, 2018
@krasi-georgiev krasi-georgiev deleted the wal-corrution-metric branch December 18, 2018 10:25
@krasi-georgiev
Copy link
Contributor Author

@dkalashnik how are you using the prometheus_tsdb_wal_corruptions_total metric?
asking so that I know if the current implementation would work for your use case.

@dkalashnik
Copy link

@krasi-georgiev We are using it as a part of the generic prometheus dashboard in grafana, so no specific case.

@krasi-georgiev
Copy link
Contributor Author

@dkalashnik how is your alerting defined based on this metric?

@dkalashnik
Copy link

@krasi-georgiev We don't have alerts based on that metric, just a panel in dashboard

mikkeloscar added a commit to zalando-incubator/kubernetes-on-aws that referenced this pull request Apr 7, 2019
https://github.com/prometheus/prometheus/releases

Some of these changes seem to be interesting enough to update

[ENHANCEMENT] Query performance improvements. prometheus-junkyard/tsdb#531
[BUGFIX] Scrape: catch errors when creating HTTP clients #5182. Adds new metrics: prometheus_target_scrape_pools_*
deprecating the flag storage.tsdb.retention -> use storage.tsdb.retention.time
[FEATURE] Add subqueries to PromQL.
[ENHANCEMENT] Kubernetes SD: Add service external IP and external name to the discovery metadata. #4940
[ENHANCEMENT] Add metric for number of rule groups loaded. #5090
BUGFIX] Make sure the retention period does not overflow. #5112
[BUGFIX] Make sure the blocks do not get very large. #5112
[BUGFIX] Do not generate blocks with no samples. prometheus-junkyard/tsdb#374
[BUGFIX] Reintroduce metric for WAL corruptions. prometheus-junkyard/tsdb#473

Signed-off-by: Mikkel Oscar Lyderik Larsen <mikkel.larsen@zalando.de>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

prometheus_tsdb_wal_corruptions_total is missing in new wal implementation
4 participants