Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation/op-guide: fix failed RPC rate, leader election metrics #8093

Merged
merged 1 commit into from
Jun 15, 2017

Conversation

gyuho
Copy link
Contributor

@gyuho gyuho commented Jun 14, 2017

This fixes failed RPC rate query, where we do not need
subtraction because we already query by the status code.
Also adds grpc_method to make it more specific. Most of the
time, the failure recovers within 10-second, which is our
Prometheus scrap interval, so 'rate' query might not cover
that time window, showing as 0s, but still shows up in the graph.

Before
screen shot 2017-06-14 at 3 59 16 am

After
2

Also fixes RPC rate query.

screen shot 2017-06-14 at 4 24 53 am

@@ -922,7 +922,7 @@
"stack": false,
"steppedLine": false,
"targets": [{
"expr": "etcd_server_leader_changes_seen_total",
"expr": "delta(etcd_server_leader_changes_seen_total[1m])",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wrong. we still want to see the total.

@xiang90
Copy link
Contributor

xiang90 commented Jun 14, 2017

@brancz can you do a double check on the rules? thanks!

@gyuho gyuho force-pushed the grafana branch 3 times, most recently from 943a113 to 01644b1 Compare June 14, 2017 15:33
@etcd-io etcd-io deleted a comment from codecov-io Jun 14, 2017
@gyuho
Copy link
Contributor Author

gyuho commented Jun 14, 2017

@xiang90 Removed election query change, instead changed the title.

@@ -924,15 +924,15 @@
"targets": [{
"expr": "etcd_server_leader_changes_seen_total",
"intervalFactor": 2,
"legendFormat": "{{instance}} Leader Change Seen",
"legendFormat": "{{instance}} Total Leader Elections",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can actually change this to Leader Change Seen in one day. and change [1m] to [1day]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just switched to 1 day with changes function.

@gyuho gyuho force-pushed the grafana branch 2 times, most recently from 3618884 to 02e7257 Compare June 14, 2017 18:11
@etcd-io etcd-io deleted a comment from codecov-io Jun 14, 2017
@codecov-io
Copy link

codecov-io commented Jun 14, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@e6d2667). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #8093   +/-   ##
=========================================
  Coverage          ?   76.71%           
=========================================
  Files             ?      342           
  Lines             ?    26568           
  Branches          ?        0           
=========================================
  Hits              ?    20381           
  Misses            ?     4739           
  Partials          ?     1448

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e6d2667...02e7257. Read the comment docs.

Copy link
Contributor

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally looks good, just the range selector is maybe a bit too small

@@ -115,17 +115,17 @@
"stack": false,
"steppedLine": false,
"targets": [{
"expr": "sum(rate(grpc_server_started_total{grpc_type=\"unary\"} [1m]))",
"expr": "sum(rate(grpc_server_started_total{grpc_type=\"unary\"}[1m])) by (grpc_method)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A typical scrape interval is 15s, 30s or even 60s, in such a case this range query would only rate over 4, 2 or 1 sample. I'd suggest to make this rather 5m.

@xiang90
Copy link
Contributor

xiang90 commented Jun 15, 2017

@brancz thanks for the review.

@brancz
Copy link
Contributor

brancz commented Jun 15, 2017

@xiang90 always happy to help 🙂

@xiang90
Copy link
Contributor

xiang90 commented Jun 15, 2017

lgtm

"intervalFactor": 2,
"legendFormat": "{{instance}} RPC Rate",
"legendFormat": "{{grpc_method}} RPC Rate",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably also change this to {{instance}}{{grpc_method}} RPC Rate

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can change the view to a stack view to aggregate them

@gyuho
Copy link
Contributor Author

gyuho commented Jun 15, 2017

screen shot 2017-06-15 at 11 39 37 am

@xiang90 PTAL.

@xiang90
Copy link
Contributor

xiang90 commented Jun 15, 2017

@gyuho Hmm... After second thought, I prefer an aggregated view for both RPC and failed RPC. As discussed with @heyitsanthony before, the dashboard should be as clear and easy as possible. If users want to see detailed information, they can create a perf/debugging page themselves.

@gyuho
Copy link
Contributor Author

gyuho commented Jun 15, 2017

screen shot 2017-06-15 at 11 52 08 am

All fixed.

This fixes failed RPC rate query, where we do not need
subtraction because we already query by the status code.
Also adds grpc_method to make it more specific. Most of the
time, the failure recovers within 10-second, which is our
Prometheus scrap interval, so 'rate' query might not cover
that time window, showing as 0s, but still shows up in the graph.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
@xiang90
Copy link
Contributor

xiang90 commented Jun 15, 2017

lgtm

@gyuho gyuho merged commit 037e33e into etcd-io:master Jun 15, 2017
@gyuho gyuho deleted the grafana branch June 15, 2017 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants