
Add a monitoring/alert rule to capture etcd revision divergence issue #15606

Open
chaochn47 opened this issue Mar 31, 2023 · 6 comments

@chaochn47
Member

chaochn47 commented Mar 31, 2023

What would you like to be added?

Add a Prometheus/Grafana alert rule that notifies the operator when rate(max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision))[1m] > 0 holds for more than 1 hour.

It can be added in https://github.com/etcd-io/etcd/blob/main/contrib/mixin/mixin.libsonnet

It should be equivalent to the two attached screenshots (Screen Shot 2022-02-04 at 9 56 22 AM and Screen Shot 2022-02-04 at 9 56 37 AM).

Eventually this rule can be added to https://etcd.io/docs/v3.5/op-guide/data_corruption/ and https://etcd.io/docs/v3.4/op-guide/data_corruption/ after positive feedback from the etcd community.
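
A rough, untested sketch of how this could be wired into contrib/mixin/mixin.libsonnet (the alert name, severity, and message below are placeholders, and the expression uses delta over a subquery instead of rate since the exact query is still open for discussion):

  {
    prometheusAlerts+:: {
      groups+: [
        {
          name: 'etcd-revision-divergence',
          rules: [
            {
              alert: 'etcdRevisionDivergenceGrowing',
              // Fires when the spread between the highest and lowest member
              // revision keeps growing for a full hour.
              expr: 'delta((max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision))[1m:]) > 0',
              'for': '1h',
              labels: { severity: 'warning' },
              annotations: {
                message: 'etcd revision divergence across members has been growing for 1h; possible data inconsistency.',
              },
            },
          ],
        },
      ],
    },
  }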

Why is this needed?

  1. It was already configured by some etcd users, as described in one of the reports: etcd server DB out of sync undetected #12535 (comment)
  2. Our team uses an equivalent rule configured in CloudWatch, and it has successfully captured data inconsistency issues with no false positive alarms in prod for a couple of months. I think it is worth adding to the mixin rules (Prometheus and Grafana based) upstream as well.
  3. The server-side corruption check is an experimental feature and can be turned on independently. Not to mention that its 3.4 implementation does not reliably capture all cases.
chaochn47 changed the title from "Add a monitoring/alert rule to capture etcd revision divergence" to "Add a monitoring/alert rule to capture etcd revision divergence issue" on Apr 1, 2023
@chaochn47
Member Author

chaochn47 commented Apr 1, 2023

cc @serathius @ahrtr @ptabor

@ptabor
Contributor

ptabor commented Apr 1, 2023

Double-checking whether I interpret the query right:

For every minute (within a one-hour window), there was an observed growth of the divergence of storage revisions between the lagging and the leading etcd instance.

12:00:00   (3,4,2)  -> divergence 2
12:01:00   (7,5,9)  -> divergence 4

This minute is considered 'unhealthy'.

12:01:00   (334,454,234)  -> divergence 220
12:02:00   (762,690,765)  -> divergence 75

This minute is considered healthy. A healthy minute resets the alert.

Seems good to me.

@chaochn47
Member Author

chaochn47 commented Apr 1, 2023

Yeah, exactly. Thanks for confirming.

The idea is that continuous growth of revision divergence across members is not expected. A member can temporarily be slow at applying entries, but it cannot be strictly slower at every data point for hours.

To reduce false alarms, the sampling interval should be the same as the scrape interval to guarantee "continuous". I believe the syntax rate(xxx)[scrape_interval] is supported. I am not a Prometheus expert, but off the top of my head the default scrape interval is 15 seconds. With the currently proposed 1-minute setting, it may generate some false alarms.
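
For illustration only (untested, and assuming the default 15s scrape interval), one way to express "the divergence grew compared to the previous scrape" could be:

  (max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision)) > (max(etcd_debugging_mvcc_current_revision offset 15s) - min(etcd_debugging_mvcc_current_revision offset 15s))

combined with for: 1h in the alert rule. Label matching across jobs/clusters is ignored here for simplicity.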

@ahrtr
Member

ahrtr commented Apr 1, 2023

rate(max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision))[1m] > 0

It seems incorrect, because rate should only be used with counters and native histograms where the components behave like counters (per functions/#rate). Obviously max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision) isn't a counter.

@chaochn47
Member Author

chaochn47 commented Apr 2, 2023

> rate(max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision))[1m] > 0
>
> It seems incorrect, because rate should only be used with counters and native histograms where the components behave like counters (per functions/#rate). Obviously max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision) isn't a counter.

I haven't tested it yet, so the above query syntax could be wrong.

The following query could be another candidate (I have not tested it either):

changes((max_over_time(etcd_debugging_mvcc_current_revision[1m]) - min_over_time(etcd_debugging_mvcc_current_revision[1m])) > 0) for 60m

The smaller the scrape interval, the lower the probability of false alarms occurring.

@stale

stale bot commented Aug 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
