Adding some finer grained metrics around RPC processing #480
Conversation
Also increasing the buffer size we read network data into to 256KB
LGTM.
For visibility/historical record: this change has been well tested internally and by some users we've worked with on a specific issue, so we've seen these metrics in real environments and are happy that they have helped us diagnose specific performance symptoms that were otherwise hard to pin down.
For example, slow disks can be observed with the existing storeLogs timing metric; however, slow disks can then cause backpressure on this appendEntries handler, since we enqueue the request here and then wait for a response before we loop and read the next message off the connection.
If the disk gets slow enough compared with the write rate, the TCP buffers can fill up and start to cause TCP backpressure. The end result is that leader metrics show RPCs taking multiple seconds, while the existing appendEntriesRPC metric on the "slow" follower is still showing "only" the few milliseconds that the slow disk write takes, because it's not including any of the queuing time spent waiting here. This can falsely lead to concluding that the network is at fault for the delays.
These metrics help expose that, so network problems can be ruled out and the queuing delay between rpcEnqueue and rpcRespond can be captured. They also help rule out bad network connections causing the follower to be stuck reading the first byte from the network, as well as buffering/decoding delays, as possible causes for slow leader-observed updates.
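As a rough illustration of where these timers sit, here is a minimal sketch using the go-metrics library this package already depends on; the metric key names, helper signature, and structure are assumptions for illustration, not the PR's exact code.
package transport

import (
	"time"

	metrics "github.com/armon/go-metrics"
)

// handleRPC is a hypothetical wrapper showing where per-phase timers could
// sit. enqueue hands the decoded request to raft's consumer channel;
// waitForResponse blocks until the raft loop has produced a reply.
func handleRPC(rpcType string, enqueue, waitForResponse func()) {
	labels := []metrics.Label{{Name: "rpcType", Value: rpcType}}

	// Time to hand the decoded request off to the handler channel.
	start := time.Now()
	enqueue()
	metrics.MeasureSinceWithLabels([]string{"raft", "net", "rpcEnqueue"}, start, labels)

	// Time spent queued and being processed before the response comes back.
	// This is the gap that a slow disk inflates, and it is invisible to the
	// follower's existing appendEntriesRPC timer.
	queued := time.Now()
	waitForResponse()
	metrics.MeasureSinceWithLabels([]string{"raft", "net", "rpcRespond"}, queued, labels)
}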
net_transport.go
Name: "rpcType", | ||
Value: "AppendEntries", | ||
}, | ||
{ |
Prometheus doesn't like to see the same metric with different labelsets. Instead of having the heartbeat label present only for this rpcType, can we use two different values, e.g. AppendEntries and AppendEntriesHeartbeat?
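For illustration, a minimal sketch of the suggested shape, assuming heartbeats are detected by the absence of log entries; the helper name and detection logic are assumptions, not code from this PR.
package transport

import (
	metrics "github.com/armon/go-metrics"
	"github.com/hashicorp/raft"
)

// rpcTypeLabel is a hypothetical helper: rather than attaching a heartbeat
// label to only some AppendEntries samples, it folds the distinction into the
// rpcType value so every sample carries the same label set.
func rpcTypeLabel(req *raft.AppendEntriesRequest) metrics.Label {
	value := "AppendEntries"
	// Heartbeats are typically AppendEntries RPCs carrying no log entries;
	// the transport's real heartbeat detection may check more fields.
	if len(req.Entries) == 0 {
		value = "AppendEntriesHeartbeat"
	}
	return metrics.Label{Name: "rpcType", Value: value}
}
With a single rpcType label whose value distinguishes heartbeats, every sample carries the same label set, so sums, maxes, and joins over rpcType behave consistently.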
First, the proposed change would be easy enough to make and I can do that.
For my own knowledge, though, could you elaborate a bit on what Prometheus doesn't like about that and how it impacts using the metrics? I have been using these with Prometheus + Grafana and haven't run into any strange behavior so far.
It's been a while (years) since I ran into issues with this kind of thing, so I'm a little hazy on the details. Definitely no problem ingesting the data, but some kinds of queries (aggregates maybe?) are more awkward when you have inconsistent labelsets. Not a huge deal, just not a great pattern IMO.
Actually I think it was joins - you have to jump through some extra hoops to ensure the labelsets are consistent or your joins will drop data. But for sums it's a bit of a pain too: if a user sees one of the single-label metrics as an example, they might do sum by (rpcType) and not realize that there's a meaningful distinction between AppendEntries as a heartbeat and AppendEntries with actual logs, because they get lumped in together.
Not that you can really sum quantiles, but the same would apply to max.
That makes sense. I do know that when crafting some queries using max I had to add a {heartbeat="false"} filter to the query to get it to ignore those heartbeat AppendEntries calls. Having a separate type for the heartbeats altogether seems easier to manage, though.
…n increased buffer size
Also increasing the buffer size we read network data into to 256KB.
The extra metrics help show whether certain parts of the system are experiencing slowdowns. The increased buffer size is a minor performance optimization so that reading from the TCP connection is more efficient.
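As a rough sketch of what the buffer change amounts to; the constant and function names here are assumptions for illustration, not necessarily the PR's exact code.
package transport

import (
	"bufio"
	"net"
)

// connReceiveBufferSize is assumed here to match the 256KB mentioned above.
const connReceiveBufferSize = 256 * 1024

// newNetConnReader wraps an accepted connection in a larger buffered reader so
// each read from the kernel can pull more data at once, which helps most when
// a follower is ingesting large AppendEntries batches.
func newNetConnReader(conn net.Conn) *bufio.Reader {
	return bufio.NewReaderSize(conn, connReceiveBufferSize)
}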