Adding some finer grained metrics around RPC processing #480
Conversation
Also increasing the buffer size we read network data into to 256KB
LGTM.
For visibility/historical record: this change has been well tested internally and by some users we've worked with on a specific issue, so we've seen these metrics in real environments and are happy that they have helped us diagnose specific performance symptoms that were otherwise hard to pin down.
For example, slow disks can be observed with the existing storeLogs timing metric; however, slow disks can then cause backpressure on this appendEntries handler, since we enqueue the request here and then wait for a response before we loop and read the next message off the connection.
If the disk gets slow enough compared with the write rate, the TCP buffers can fill up and start to cause TCP backpressure. The end result is that leader metrics show RPCs taking multiple seconds, while the existing appendEntriesRPC metric on the "slow" follower is still showing "only" the few milliseconds that the slow disk write takes, because it's not including any of the queuing time spent waiting here. This can falsely lead to concluding that the network is at fault for the delays.
These metrics help expose that, so network problems can be ruled out and the queuing delay between rpcEnqueue and rpcRespond can be captured. They also help rule out bad network connections causing the follower to be stuck reading the first byte from the network, as well as buffering/decoding delays, as possible causes for slow leader-observed updates.
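As a rough illustration of where these timers sit, here is a minimal sketch using the go-metrics library this package already depends on; the metric key names, helper signature, and structure are assumptions for illustration, not the PR's exact code.
package transport

import (
	"time"

	metrics "github.com/armon/go-metrics"
)

// handleRPC is a hypothetical wrapper showing where per-phase timers could
// sit. enqueue hands the decoded request to raft's consumer channel;
// waitForResponse blocks until the raft loop has produced a reply.
func handleRPC(rpcType string, enqueue, waitForResponse func()) {
	labels := []metrics.Label{{Name: "rpcType", Value: rpcType}}

	// Time to hand the decoded request off to the handler channel.
	start := time.Now()
	enqueue()
	metrics.MeasureSinceWithLabels([]string{"raft", "net", "rpcEnqueue"}, start, labels)

	// Time spent queued and being processed before the response comes back.
	// This is the gap that a slow disk inflates, and it is invisible to the
	// follower's existing appendEntriesRPC timer.
	queued := time.Now()
	waitForResponse()
	metrics.MeasureSinceWithLabels([]string{"raft", "net", "rpcRespond"}, queued, labels)
}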
net_transport.go
Name: "rpcType", | ||
Value: "AppendEntries", | ||
}, | ||
{ |
Prometheus doesn't like to see the same metric with different labelsets. Instead of having the heartbeat label present only for this rpcType, can we use two different values, e.g. AppendEntries and AppendEntriesHeartbeat?
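For illustration, a minimal sketch of the suggested shape, assuming heartbeats are detected by the absence of log entries; the helper name and detection logic are assumptions, not code from this PR.
package transport

import (
	metrics "github.com/armon/go-metrics"
	"github.com/hashicorp/raft"
)

// rpcTypeLabel is a hypothetical helper: rather than attaching a heartbeat
// label to only some AppendEntries samples, it folds the distinction into the
// rpcType value so every sample carries the same label set.
func rpcTypeLabel(req *raft.AppendEntriesRequest) metrics.Label {
	value := "AppendEntries"
	// Heartbeats are typically AppendEntries RPCs carrying no log entries;
	// the transport's real heartbeat detection may check more fields.
	if len(req.Entries) == 0 {
		value = "AppendEntriesHeartbeat"
	}
	return metrics.Label{Name: "rpcType", Value: value}
}
With a single rpcType label whose value distinguishes heartbeats, every sample carries the same label set, so sums, maxes, and joins over rpcType behave consistently.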
First, the proposed change would be easy enough to make and I can do that.
For my own knowledge, though, could you elaborate a bit on what Prometheus doesn't like about that and how it impacts using the metrics? I have been using these with Prometheus + Grafana and haven't run into any strange behavior so far.
It's been a while (years) since I ran into issues with this kind of thing, so I'm a little hazy on the details. Definitely no problem ingesting the data, but some kinds of queries (aggregates maybe?) are more awkward when you have inconsistent labelsets. Not a huge deal, just not a great pattern IMO.
Actually I think it was joins - you have to jump through some extra hoops to ensure the labelsets are consistent or your joins will drop data. But for sums it's a bit of a pain too: if a user sees one of the single-label metrics as an example, they might do sum by (rpcType) and not realize that there's a meaningful distinction between AppendEntries as a heartbeat and AppendEntries with actual logs, because they get lumped in together.
Not that you can really sum quantiles, but the same would apply to max.
That makes sense. I do know that when crafting some queries using max I had to add a {heartbeat="false"} filter to the query to get it to ignore those heartbeat AppendEntries calls. Having a separate type for the heartbeats altogether seems easier to manage, though.
…n increased buffer size
Also increasing the buffer size we read network data into to 256KB.
The extra metrics help show whether certain parts of the system are experiencing slowdowns. The increased buffer size is a minor performance optimization so that reading from the TCP connection is more efficient.
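As a rough sketch of what the buffer change amounts to; the constant and function names here are assumptions for illustration, not necessarily the PR's exact code.
package transport

import (
	"bufio"
	"net"
)

// connReceiveBufferSize is assumed here to match the 256KB mentioned above.
const connReceiveBufferSize = 256 * 1024

// newNetConnReader wraps an accepted connection in a larger buffered reader so
// each read from the kernel can pull more data at once, which helps most when
// a follower is ingesting large AppendEntries batches.
func newNetConnReader(conn net.Conn) *bufio.Reader {
	return bufio.NewReaderSize(conn, connReceiveBufferSize)
}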