
Add a monitoring/alert rule to capture etcd revision divergence issue #15606

Open
chaochn47 opened this issue Mar 31, 2023 · 6 comments

@chaochn47
Member

chaochn47 commented Mar 31, 2023

What would you like to be added?

Add a Prometheus/Grafana alert rule that notifies the operator when rate(max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision))[1m] > 0 holds for more than 1 hour.

It can be added in https://github.com/etcd-io/etcd/blob/main/contrib/mixin/mixin.libsonnet

It should be equivalent to the two attached screenshots (Screen Shot 2022-02-04 at 9 56 22 AM and Screen Shot 2022-02-04 at 9 56 37 AM).

Eventually this rule can be added to https://etcd.io/docs/v3.5/op-guide/data_corruption/ and https://etcd.io/docs/v3.4/op-guide/data_corruption/ after positive feedback from the etcd community.
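
A rough, untested sketch of how this could be wired into contrib/mixin/mixin.libsonnet (the alert name, severity, and message below are placeholders, and the expression uses delta over a subquery instead of rate since the exact query is still open for discussion):

  {
    prometheusAlerts+:: {
      groups+: [
        {
          name: 'etcd-revision-divergence',
          rules: [
            {
              alert: 'etcdRevisionDivergenceGrowing',
              // Fires when the spread between the highest and lowest member
              // revision keeps growing for a full hour.
              expr: 'delta((max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision))[1m:]) > 0',
              'for': '1h',
              labels: { severity: 'warning' },
              annotations: {
                message: 'etcd revision divergence across members has been growing for 1h; possible data inconsistency.',
              },
            },
          ],
        },
      ],
    },
  }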

Why is this needed?

  1. It was already configured by some etcd users, as described in one of the reports: etcd server DB out of sync undetected #12535 (comment)
  2. Our team uses an equivalent rule configured in CloudWatch, and it has successfully captured data inconsistency issues with no false positive alarms in prod for a couple of months. I think it is worth adding to the mixin rules (Prometheus and Grafana based) upstream as well.
  3. The server-side corruption check is an experimental feature and can be turned on independently. Not to mention that its 3.4 implementation does not reliably capture all cases.
chaochn47 changed the title from "Add a monitoring/alert rule to capture etcd revision divergence" to "Add a monitoring/alert rule to capture etcd revision divergence issue" on Apr 1, 2023
@chaochn47
Member Author

chaochn47 commented Apr 1, 2023

cc @serathius @ahrtr @ptabor

@ptabor
Contributor

ptabor commented Apr 1, 2023

Double-checking whether I interpret the query right:

For every minute (within a one-hour window), there was an observed growth of the divergence of storage revisions between the lagging and the leading etcd instance.

12:00:00   (3,4,2)  -> divergence 2
12:01:00   (7,5,9)  -> divergence 4

This minute is considered 'unhealthy'.

12:01:00   (334,454,234)  -> divergence 220
12:02:00   (762,690,765)  -> divergence 75

This minute is considered healthy. A healthy minute resets the alert.

Seems good to me.

@chaochn47
Member Author

chaochn47 commented Apr 1, 2023

Yeah, exactly. Thanks for confirming.

The idea is that continuous growth of revision divergence across members is not expected. A member can temporarily be slow at applying entries, but it cannot be strictly slower at every data point for hours.

To reduce false alarms, the sampling interval should be the same as the scrape interval to guarantee "continuous". I believe the syntax rate(xxx)[scrape_interval] is supported. I am not a Prometheus expert, but off the top of my head the default scrape interval is 15 seconds. With the currently proposed 1-minute setting, it may generate some false alarms.
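
For illustration only (untested, and assuming the default 15s scrape interval), one way to express "the divergence grew compared to the previous scrape" could be:

  (max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision)) > (max(etcd_debugging_mvcc_current_revision offset 15s) - min(etcd_debugging_mvcc_current_revision offset 15s))

combined with for: 1h in the alert rule. Label matching across jobs/clusters is ignored here for simplicity.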

@ahrtr
Member

ahrtr commented Apr 1, 2023

rate(max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision))[1m] > 0

It seems incorrect, because rate should only be used with counters and native histograms where the components behave like counters (per functions/#rate). Obviously max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision) isn't a counter.

@chaochn47
Member Author

chaochn47 commented Apr 2, 2023

> rate(max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision))[1m] > 0
>
> It seems incorrect, because rate should only be used with counters and native histograms where the components behave like counters (per functions/#rate). Obviously max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision) isn't a counter.

I haven't tested it yet, so the above query syntax could be wrong.

The following query could be another candidate (I have not tested it either):

changes((max_over_time(etcd_debugging_mvcc_current_revision[1m]) - min_over_time(etcd_debugging_mvcc_current_revision[1m])) > 0) for 60m

The smaller the scrape interval, the lower the probability of false alarms occurring.

@stale

stale bot commented Aug 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
