Add API monitoring latency (info, validator) #370
Conversation
func NewAppRequestNetworkMetrics(cfg *config.Config, registerer prometheus.Registerer) (*AppRequestNetworkMetrics, error) {
	infoAPICallLatencyMS := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
Should this be a histogram?
That's a good idea. I've made the changes.
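For illustration, a minimal sketch of what the gauge-to-histogram switch might look like, assuming a metrics struct along the lines of the snippet above (the struct field and the simplified constructor signature here are assumptions, not the exact code in this PR):

package peers

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: the real struct likely carries more fields.
type AppRequestNetworkMetrics struct {
	infoAPICallLatencyMS *prometheus.HistogramVec
}

func NewAppRequestNetworkMetrics(registerer prometheus.Registerer) (*AppRequestNetworkMetrics, error) {
	// A histogram records the latency distribution rather than only the
	// most recently observed value, which is all a gauge would keep.
	infoAPICallLatencyMS := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "info_api_call_latency_ms",
			Help:    "Latency of calling info api in milliseconds",
			Buckets: prometheus.LinearBuckets(10, 10, 10), // bucket layout revisited below
		},
		[]string{"info_api_base_url"},
	)
	if err := registerer.Register(infoAPICallLatencyMS); err != nil {
		return nil, err
	}
	return &AppRequestNetworkMetrics{infoAPICallLatencyMS: infoAPICallLatencyMS}, nil
}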
Thanks for putting this together! At a high level, this all looks good. I have a handful of minor comments and asks.
peers/app_request_network.go
Outdated

func (n *AppRequestNetwork) setInfoAPICallLatencyMS(latency float64) {
	n.metrics.infoAPICallLatencyMS.WithLabelValues(n.metrics.infoAPIBaseURL).Observe(latency)
Given that the deployer has access to the base URLs for the info and P-Chain APIs, and that the config only contains a single value for each, let's use a static label value to keep the number of unique labels constant.
We should be able to remove the base URLs from the metrics struct as well.
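For illustration, a sketch of the label-free variant this suggestion points toward (placing the setter on the metrics struct here is only for brevity; in the PR it is a method on AppRequestNetwork):

package peers

import "github.com/prometheus/client_golang/prometheus"

type AppRequestNetworkMetrics struct {
	// With a single configured endpoint, a plain Histogram (no label vector)
	// keeps the series count constant, and the base URL no longer needs to
	// be stored on this struct at all.
	infoAPICallLatencyMS prometheus.Histogram
}

// setInfoAPICallLatencyMS records one observed info API latency in milliseconds.
func (m *AppRequestNetworkMetrics) setInfoAPICallLatencyMS(latency float64) {
	m.infoAPICallLatencyMS.Observe(latency)
}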
@@ -179,12 +189,15 @@ func (n *AppRequestNetwork) ConnectPeers(nodeIDs set.Set[ids.NodeID]) set.Set[id

	// If the Info API node is in nodeIDs, it will not be reflected in the call to info.Peers.
	// In this case, we need to manually track the API node.
	startInfoAPICall = time.Now()
We should measure this API call latency regardless of the error status.
Help: "Latency of calling info api in milliseconds", | ||
Buckets: prometheus.LinearBuckets(10, 10, 10), | ||
}, | ||
[]string{"info_api_base_url"}, |
We should update these labels when changing to static label values.
		prometheus.HistogramOpts{
			Name:    "info_api_call_latency_ms",
			Help:    "Latency of calling info api in milliseconds",
			Buckets: prometheus.LinearBuckets(10, 10, 10),
Rather than linearly spaced buckets, exponentially distributed buckets would be more useful for identifying latency spikes. I think a good distribution would be to measure between ~0.1s and ~10s latency (a reasonable timeout), with each bucket doubling the previous bucket's range. This can be done with ExponentialBucketsRange.
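For reference, a small sketch of what that bucket layout could look like; the exact bounds and bucket count below are assumptions, not values taken from this PR:

package peers

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func exampleLatencyBuckets() {
	// 8 buckets spanning 100ms to 10s; each upper bound is roughly double
	// the previous one (factor ≈ (10000/100)^(1/7) ≈ 1.93).
	buckets := prometheus.ExponentialBucketsRange(100, 10_000, 8)
	fmt.Println(buckets) // ≈ [100 193 373 720 1389 2683 5180 10000]

	_ = prometheus.HistogramOpts{
		Name:    "info_api_call_latency_ms",
		Help:    "Latency of calling info api in milliseconds",
		Buckets: buckets,
	}
}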
peers/app_request_network.go
Outdated

	n.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
This control flow is a bit hard to follow. I'd recommend moving the call to infoAPI.GetNodeID above the if statement to clean it up:
startInfoAPICall = time.Now()
apiNodeID, _, err := n.infoAPI.GetNodeID(context.Background())
n.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
if err != nil {
	...
} else if nodeIDs.Contains(apiNodeID) {
	...
}
Similarly, for the below call to infoAPI.GetNodeIP, let's move it out of the if statement so that we can put the latency measurement right next to the call:
startInfoAPICall = time.Now()
apiNodeIP, err := n.infoAPI.GetNodeIP(context.Background())
n.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
if err != nil {
	...
}
@@ -55,7 +55,7 @@ func NewNetwork(
		),
	)

-	metrics, err := NewAppRequestNetworkMetrics(cfg, registerer)
+	metrics, err := newAppRequestNetworkMetrics(registerer)
I can't leave a comment at the exact line, but let's also measure the latency of the call to infoAPI.GetNetworkID in the constructor.
Looks like this comment is not resolved, but on second thought I think it's fine to skip measuring this call since it's only called once at startup.
peers/app_request_network.go
Outdated
@@ -79,6 +88,7 @@ func NewNetwork(
		)
		return nil, err
	}
+	metrics.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
Let's measure the latency even for the error case so that we can detect timeouts.
peers/app_request_network.go
Outdated
@@ -169,12 +182,18 @@ func (n *AppRequestNetwork) ConnectPeers(nodeIDs set.Set[ids.NodeID]) set.Set[id

	// If the Info API node is in nodeIDs, it will not be reflected in the call to info.Peers.
	// In this case, we need to manually track the API node.
	startInfoAPICall = time.Now()
	if apiNodeID, _, err := n.infoAPI.GetNodeID(context.Background()); err != nil {
I think the previous changes you made to the control flow may have been lost in the rebase. To improve readability, let's tightly associate the metric measurement with the remote call. The code should look like this:
// If the Info API node is in nodeIDs, it will not be reflected in the call to info.Peers.
// In this case, we need to manually track the API node.
startInfoAPICall = time.Now()
apiNodeID, _, err := n.infoAPI.GetNodeID(context.Background())
n.metrics.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
if err != nil {
	n.logger.Error(
		"Failed to get API Node ID",
		zap.Error(err),
	)
} else if nodeIDs.Contains(apiNodeID) {
	startInfoAPICall = time.Now()
	apiNodeIPPort, err := n.infoAPI.GetNodeIP(context.Background())
	n.metrics.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
	if err != nil {
		n.logger.Error(
			"Failed to get API Node IP",
			zap.Error(err),
		)
	} else {
		trackedNodes.Add(apiNodeID)
		n.Network.ManuallyTrack(apiNodeID, apiNodeIPPort)
	}
}
Hey @won-js, do you mind solving the merge conflicts in this PR and addressing the last comment @cam-schultz made?
LGTM 👍
@won-js this looks good to go, just need to bring up to date with
Why this should be merged
Resolve #187
How this works
Added latency measurement logic for the info and validator APIs. I created a new Metrics type and made the necessary modifications.
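As a summary of the measurement pattern, a minimal sketch follows; the helper below is illustrative only and not part of this PR, which inlines the measurement at each call site:

package peers

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// observeLatencyMS records the time elapsed since start, in milliseconds,
// against the given histogram. It is invoked right after the remote call
// returns and before the error is inspected, so timeouts are measured too.
func observeLatencyMS(h prometheus.Histogram, start time.Time) {
	h.Observe(float64(time.Since(start).Milliseconds()))
}

// Usage at a call site (illustrative):
//   start := time.Now()
//   nodeID, _, err := infoAPI.GetNodeID(ctx)
//   observeLatencyMS(metrics.infoAPICallLatencyMS, start)
//   if err != nil { ... }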
How this was tested
How is this documented