Adding some finer grained metrics around RPC processing #480

Merged: 2 commits into main from experiment-metrics on Dec 7, 2021

Conversation

@mkeeler (Member) commented Dec 2, 2021

Also increasing the buffer size we read network data into to 256KB.

The extra metrics can help to see if certain parts of the system are experiencing slowdowns. The increased buffer size is a minor performance optimization so that reading from the TCP conn is more efficient.
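As a rough illustration of the buffer change, here is a minimal sketch assuming the transport wraps each accepted connection in a bufio.Reader; the constant and function names are illustrative, not the PR's exact code.

package sketch

import (
	"bufio"
	"net"
)

// connReceiveBufferSize mirrors the 256KB read buffer described above.
const connReceiveBufferSize = 256 * 1024

// handleConn shows the idea: a larger bufio.Reader means fewer, larger reads
// from the kernel when streaming many RPCs off a single TCP connection.
func handleConn(conn net.Conn) {
	r := bufio.NewReaderSize(conn, connReceiveBufferSize)
	_ = r // decode RPCs from r in a loop (omitted)
}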

@banks (Member) left a comment


LGTM.

For visibility/historical record: this change has been well tested internally and by some users we've worked with on a specific issue, so we've seen these metrics in real environments and are happy that they have helped us diagnose specific performance symptoms that were otherwise hard to pin down.

For example, slow disks can be observed with the existing storeLogs timing metric. However, slow disks can then cause backpressure on this appendEntries handler, since we enqueue the request here and then wait for a response before we loop and read the next message off the connection.

If the disk gets slow enough compared with the write rate, the TCP buffers can fill up and start to cause TCP backpressure. The end result is that leader metrics show RPCs taking multiple seconds, while the existing appendEntriesRPC metric on the "slow" follower is still showing "only" the few milliseconds that the slow disk write takes, because it's not including any of the queuing time spent waiting here. This can falsely lead to concluding that the network is at fault for the delays.

These metrics help expose that: the queuing delay between rpcEnqueue and rpcRespond can be captured directly, and a bad network connection leaving the follower stuck reading the first byte, as well as buffering/decoding delays, can be ruled out as causes of slow leader-observed updates.
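For readers following along, here is a minimal sketch of where those two timings sit in the connection handler. The function and type names, and the exact metric keys, are assumptions for illustration; only the rpcEnqueue/rpcRespond idea and the armon/go-metrics calls reflect what the comment describes.

package sketch

import (
	"time"

	metrics "github.com/armon/go-metrics"
)

// rpcRequest stands in for the transport's RPC envelope.
type rpcRequest struct {
	RespChan chan interface{}
}

// handleRPCs hands each decoded request to the main raft loop and waits for
// its response before reading the next one, which is why a slow storeLogs on
// the follower backs up into this loop and, eventually, into TCP itself.
func handleRPCs(decoded <-chan rpcRequest, rpcCh chan<- rpcRequest) {
	for req := range decoded {
		// rpcEnqueue: time spent waiting for the main loop to accept the request.
		enqueueStart := time.Now()
		rpcCh <- req
		metrics.MeasureSince([]string{"raft", "net", "rpcEnqueue"}, enqueueStart)

		// rpcRespond: time until a response is produced; slow disk writes
		// show up here instead of being misattributed to the network.
		respondStart := time.Now()
		<-req.RespChan
		metrics.MeasureSince([]string{"raft", "net", "rpcRespond"}, respondStart)
	}
}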

net_transport.go (Outdated)
Name: "rpcType",
Value: "AppendEntries",
},
{
Contributor

Prometheus doesn't like to see the same metric with different labelsets. Instead of having the heartbeat label present only for this rpcType, can we use two different values - e.g. AppendEntries and AppendEntriesHeartbeat?
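A hedged sketch of what that suggestion could look like with go-metrics labels (the helper name and metric key are illustrative, not the PR's code): the labelset stays identical for every sample, and the heartbeat distinction moves into the rpcType value.

package sketch

import (
	"time"

	metrics "github.com/armon/go-metrics"
)

// recordAppendEntries is a hypothetical helper showing the suggested shape:
// one consistent rpcType label whose value distinguishes heartbeats, rather
// than a heartbeat label that exists only for this rpcType.
func recordAppendEntries(start time.Time, isHeartbeat bool) {
	rpcType := "AppendEntries"
	if isHeartbeat {
		rpcType = "AppendEntriesHeartbeat"
	}
	metrics.MeasureSinceWithLabels(
		[]string{"raft", "net", "rpcEnqueue"},
		start,
		[]metrics.Label{{Name: "rpcType", Value: rpcType}},
	)
}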

Member Author

First, the proposed change would be easy enough to make and I can do that.

For my own knowledge though, could you elaborate a bit on what Prometheus doesn't like about that and how it impacts using the metrics? I have been using these with Prometheus + Grafana and haven't run into any strange behavior so far.

Contributor

It's been a while (years) since I ran into issues with this kind of thing, so I'm a little hazy on the details. Definitely no problem ingesting the data, but some kinds of queries (aggregates maybe?) are more awkward when you have inconsistent labelsets. Not a huge deal, just not a great pattern IMO.

Contributor

Actually I think it was joins: you have to jump through some extra hoops to ensure the labelsets are consistent or your joins will drop data. But it's a bit of a pain for sums too. If a user sees one of the single-label metrics as an example, they might do sum by (rpcType) and not realize that there's a meaningful distinction between AppendEntries as a heartbeat and AppendEntries with actual logs, because the two get lumped together.

Contributor

Not that you can really sum quantiles, but the same would apply to max.

Member Author

That makes sense. I do know that when crafting some queries using max I did have to add a {heartbeat="false"} filter to the query to get it to ignore the heartbeat appendEntries calls. Having a separate type for the heartbeats altogether seems easier to manage.

@mkeeler merged commit aa1afe5 into main on Dec 7, 2021
@mkeeler deleted the experiment-metrics branch on December 7, 2021 at 18:52