Add a monitoring/alert rule to capture etcd revision divergence issue #15606
Comments
Double-checking that I interpret the query correctly: the alert fires if, for every minute within a one-hour window, the divergence in storage revision between the most lagging and the leading etcd instance was observed to grow.
A minute in which the divergence grew is considered 'unhealthy'; any other minute is considered healthy.
A healthy minute resets the alert. Seems good to me.
Yeah, exactly. Thanks for confirming. The idea is that continuous growth of revision divergence across members is not expected. A member can be temporarily slow to apply, but it should not fall further behind at every data point for hours. To reduce false alarms, the sampling interval should equal the scrape interval, so that growth is checked at every scraped data point and is truly "continuous". I believe such syntax
It does not seem correct, because
I haven't tested it yet, so the query syntax above could be wrong. The following query could be another candidate. (I did not test it either...)
The smaller the scrape interval, the lower the probability of a false alarm.
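To make the intended semantics concrete, here is a minimal Python sketch of the logic discussed in this thread: compute the divergence (max minus min revision across members) at each scrape, count consecutive scrapes where it grew, and reset the count on any healthy scrape. It is an illustration only, not a tested PromQL rule; the function names, the member names, and the `required_growth_steps` parameter are hypothetical.

```python
from typing import Dict, List


def divergence(revisions: Dict[str, int]) -> int:
    """Spread between the leading and the most lagging member's revision."""
    return max(revisions.values()) - min(revisions.values())


def should_alert(samples: List[Dict[str, int]], required_growth_steps: int) -> bool:
    """Return True if divergence grew on each of the most recent
    `required_growth_steps` consecutive scrapes.

    `samples` holds one dict per scrape, mapping a (hypothetical) member
    name to its current revision, ordered oldest to newest.
    """
    streak = 0
    previous = None
    for revisions in samples:
        current = divergence(revisions)
        if previous is not None:
            if current > previous:
                streak += 1   # unhealthy step: divergence grew
            else:
                streak = 0    # healthy step: resets the alert
        previous = current
    return streak >= required_growth_steps


# Divergence per scrape: 0, 5, 12, 20 (three consecutive growth steps).
growing = [
    {"m1": 10, "m2": 10},
    {"m1": 20, "m2": 15},
    {"m1": 30, "m2": 18},
    {"m1": 40, "m2": 20},
]
print(should_alert(growing, required_growth_steps=3))

# One scrape where the lagging member catches up resets the streak.
recovered = growing + [{"m1": 50, "m2": 45}]
print(should_alert(recovered, required_growth_steps=3))
```

With a 1-minute scrape interval, `required_growth_steps=60` would correspond to the "more than 1 hour" condition in the proposal, under the assumption that every scrape succeeds.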
What would you like to be added?
If the following condition holds for more than 1 hour, a Grafana alert rule notifies the operator:
rate(max(etcd_debugging_mvcc_current_revision) - min(etcd_debugging_mvcc_current_revision))[1m] > 0
The rule can be added in https://github.com/etcd-io/etcd/blob/main/contrib/mixin/mixin.libsonnet
It should be equivalent to
Eventually, this rule can be added to https://etcd.io/docs/v3.5/op-guide/data_corruption/ and https://etcd.io/docs/v3.4/op-guide/data_corruption/ after positive feedback from the etcd community.
Why is this needed?