Documentation/op-guide: fix failed RPC rate, leader election metrics #8093
Documentation/op-guide/grafana.json
Outdated
@@ -922,7 +922,7 @@
   "stack": false,
   "steppedLine": false,
   "targets": [{
-  "expr": "etcd_server_leader_changes_seen_total",
+  "expr": "delta(etcd_server_leader_changes_seen_total[1m])",
this is wrong. we still want to see the total.
@brancz can you do a double check on the rules? thanks!
943a113 to 01644b1
@xiang90 Removed the election query change; changed the title instead.
Documentation/op-guide/grafana.json
Outdated
@@ -924,15 +924,15 @@
   "targets": [{
   "expr": "etcd_server_leader_changes_seen_total",
   "intervalFactor": 2,
-  "legendFormat": "{{instance}} Leader Change Seen",
+  "legendFormat": "{{instance}} Total Leader Elections",
I think we can actually change this to "Leader Change Seen in one day", and change [1m] to [1d].
Just switched to 1 day with the changes() function.
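As a rough illustration of what switching to changes() over a one-day range means, here is a minimal Python sketch of the idea: count how many times a series' value differs between consecutive samples in the window. The sample values are hypothetical, and this is not Prometheus code, only a model of the behavior described in its documentation.

```python
def changes(samples):
    """Count how many times the value differs between consecutive samples.

    This mirrors the idea behind PromQL's changes(): it counts value
    transitions in the window, not the size of each jump.
    """
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


# Hypothetical readings of etcd_server_leader_changes_seen_total
# collected over one day (values are made up for illustration).
leader_changes_seen = [3, 3, 4, 4, 4, 6]

print(changes(leader_changes_seen))
```

Note that the jump from 4 to 6 counts as a single change, so if several elections happen between two scrapes this number can be lower than the counter's actual increase.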
3618884 to 02e7257
Codecov Report
@@ Coverage Diff @@
## master #8093 +/- ##
=========================================
Coverage ? 76.71%
=========================================
Files ? 342
Lines ? 26568
Branches ? 0
=========================================
Hits ? 20381
Misses ? 4739
Partials ? 1448
generally looks good, just the range selector is maybe a bit too small
Documentation/op-guide/grafana.json
Outdated
@@ -115,17 +115,17 @@
   "stack": false,
   "steppedLine": false,
   "targets": [{
-  "expr": "sum(rate(grpc_server_started_total{grpc_type=\"unary\"} [1m]))",
+  "expr": "sum(rate(grpc_server_started_total{grpc_type=\"unary\"}[1m])) by (grpc_method)",
A typical scrape interval is 15s, 30s, or even 60s; in such a case this range query would only rate over 4, 2, or 1 samples. I'd suggest making this 5m instead.
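The arithmetic behind this suggestion can be sketched in a few lines of Python. The helper name is hypothetical, and the count is approximate: it ignores how scrape timestamps align with the window edges.

```python
def samples_in_window(range_seconds, scrape_interval_seconds):
    """Approximate number of scraped samples that fall inside a range selector.

    Assumes evenly spaced scrapes and ignores alignment with the window edges.
    """
    return range_seconds // scrape_interval_seconds


# Compare a [1m] range with a [5m] range for typical scrape intervals.
for interval in (15, 30, 60):
    one_minute = samples_in_window(60, interval)
    five_minutes = samples_in_window(300, interval)
    print(f"scrape={interval}s  [1m]={one_minute} samples  [5m]={five_minutes} samples")
```

With a 60s scrape interval a [1m] range holds about one sample, and a rate needs at least two samples to compute a slope, which is why a wider range such as [5m] is the safer choice.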
@brancz thanks for the review.
@xiang90 always happy to help 🙂
lgtm
Documentation/op-guide/grafana.json
Outdated
   "intervalFactor": 2,
-  "legendFormat": "{{instance}} RPC Rate",
+  "legendFormat": "{{grpc_method}} RPC Rate",
probably also change this to {{instance}}{{grpc_method}} RPC Rate
we can change the view to a stack view to aggregate them
@xiang90 PTAL.
@gyuho Hmm... On second thought, I prefer an aggregated view for both RPC and failed RPC. As discussed with @heyitsanthony before, the dashboard should be as clear and simple as possible. If users want to see detailed information, they can create a perf/debugging page themselves.
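To make the trade-off between the aggregated view and the per-method view concrete, here is a small Python sketch of the two groupings: one total (like a plain sum(...)) versus one series per grpc_method (like sum(...) by (grpc_method)). The sum_by helper, the instance names, and the rate values are all hypothetical.

```python
from collections import defaultdict


def sum_by(series, label=None):
    """Sum per-series values, optionally grouped by one label.

    series: list of (labels_dict, value) pairs.
    label=None collapses everything into a single total under the key "".
    """
    totals = defaultdict(float)
    for labels, value in series:
        key = labels.get(label, "") if label else ""
        totals[key] += value
    return dict(totals)


# Hypothetical per-series RPC rates (requests per second).
rates = [
    ({"instance": "etcd-0", "grpc_method": "Range"}, 10.0),
    ({"instance": "etcd-1", "grpc_method": "Range"}, 5.0),
    ({"instance": "etcd-0", "grpc_method": "Put"}, 2.0),
]

print(sum_by(rates))                 # aggregated view: one line on the graph
print(sum_by(rates, "grpc_method"))  # detailed view: one line per method
```

The aggregated form gives the single, easy-to-read line preferred for the main dashboard, while the grouped form is the kind of detail that fits a separate perf/debugging page.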
This fixes the failed RPC rate query, where we do not need subtraction because we already query by the status code. It also adds grpc_method to make the query more specific. Most of the time, the failure recovers within 10 seconds, which is our Prometheus scrape interval, so the 'rate' query might not cover that time window, showing as 0s, but it still shows up in the graph. Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
lgtm
This fixes the failed RPC rate query, where we do not need
subtraction because we already query by the status code.
It also adds grpc_method to make the query more specific. Most of the
time, the failure recovers within 10 seconds, which is our
Prometheus scrape interval, so the 'rate' query might not cover
that time window, showing as 0s, but it still shows up in the graph.
Before
After
Also fixes RPC rate query.
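The note above about short failures and the 10-second scrape interval can be illustrated with a simplified per-second rate over counter samples. This Python sketch is an assumption-laden model, not Prometheus's implementation: the real rate() also extrapolates toward the window edges, which is omitted here, and the sample data is made up.

```python
def simple_rate(samples):
    """Simplified per-second rate of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when fewer than two samples are available.
    Counter resets (value going down) are treated as a restart from zero.
    Unlike Prometheus's rate(), no extrapolation to the window edges is done.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical failed-RPC counter scraped every 10 seconds: three failures
# happen between t=10 and t=20, then the cluster recovers.
failed_rpcs = [(0, 0), (10, 0), (20, 3), (30, 3), (40, 3), (50, 3), (60, 3)]

print(simple_rate(failed_rpcs))      # window that includes the failure burst
print(simple_rate(failed_rpcs[3:]))  # later window: counter is flat again
```

A window that contains the burst yields a small non-zero rate, while any window that starts after recovery yields 0, which matches the mostly-zero graph with brief spikes described in the commit message.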