Add API monitoring latency (info, validator) #370
Conversation
func NewAppRequestNetworkMetrics(cfg *config.Config, registerer prometheus.Registerer) (*AppRequestNetworkMetrics, error) {
	infoAPICallLatencyMS := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
Should this be a histogram?
That's a good idea. I've made the changes.
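For illustration, a minimal sketch of what the gauge-to-histogram switch might look like, assuming a metrics struct along the lines of the snippet above (the struct field and the simplified constructor signature here are assumptions, not the exact code in this PR):

package peers

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: the real struct likely carries more fields.
type AppRequestNetworkMetrics struct {
	infoAPICallLatencyMS *prometheus.HistogramVec
}

func NewAppRequestNetworkMetrics(registerer prometheus.Registerer) (*AppRequestNetworkMetrics, error) {
	// A histogram records the latency distribution rather than only the
	// most recently observed value, which is all a gauge would keep.
	infoAPICallLatencyMS := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "info_api_call_latency_ms",
			Help:    "Latency of calling info api in milliseconds",
			Buckets: prometheus.LinearBuckets(10, 10, 10), // bucket layout revisited below
		},
		[]string{"info_api_base_url"},
	)
	if err := registerer.Register(infoAPICallLatencyMS); err != nil {
		return nil, err
	}
	return &AppRequestNetworkMetrics{infoAPICallLatencyMS: infoAPICallLatencyMS}, nil
}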
Thanks for putting this together! At a high level, this all looks good. I have a handful of minor comments and asks.
peers/app_request_network.go
Outdated

func (n *AppRequestNetwork) setInfoAPICallLatencyMS(latency float64) {
	n.metrics.infoAPICallLatencyMS.WithLabelValues(n.metrics.infoAPIBaseURL).Observe(latency)
Given that the deployer has access to the base URLs for the info and P-Chain APIs, and that the config only contains a single value for each, let's use a static label value to keep the number of unique labels constant.
We should be able to remove the base URLs from the metrics struct as well.
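For illustration, a sketch of the label-free variant this suggestion points toward (placing the setter on the metrics struct here is only for brevity; in the PR it is a method on AppRequestNetwork):

package peers

import "github.com/prometheus/client_golang/prometheus"

type AppRequestNetworkMetrics struct {
	// With a single configured endpoint, a plain Histogram (no label vector)
	// keeps the series count constant, and the base URL no longer needs to
	// be stored on this struct at all.
	infoAPICallLatencyMS prometheus.Histogram
}

// setInfoAPICallLatencyMS records one observed info API latency in milliseconds.
func (m *AppRequestNetworkMetrics) setInfoAPICallLatencyMS(latency float64) {
	m.infoAPICallLatencyMS.Observe(latency)
}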
@@ -179,12 +189,15 @@ func (n *AppRequestNetwork) ConnectPeers(nodeIDs set.Set[ids.NodeID]) set.Set[id

	// If the Info API node is in nodeIDs, it will not be reflected in the call to info.Peers.
	// In this case, we need to manually track the API node.
	startInfoAPICall = time.Now()
We should measure this API call latency regardless of the error status.
Help: "Latency of calling info api in milliseconds", | ||
Buckets: prometheus.LinearBuckets(10, 10, 10), | ||
}, | ||
[]string{"info_api_base_url"}, |
We should update these labels when changing to static label values.
		prometheus.HistogramOpts{
			Name:    "info_api_call_latency_ms",
			Help:    "Latency of calling info api in milliseconds",
			Buckets: prometheus.LinearBuckets(10, 10, 10),
Rather than linearly spaced buckets, exponentially distributed buckets would be more useful for identifying latency spikes. I think a good distribution would be to measure between ~0.1s and ~10s latency (a reasonable timeout), with each bucket doubling the previous bucket's range. This can be done with ExponentialBucketsRange.
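For reference, a small sketch of what that bucket layout could look like; the exact bounds and bucket count below are assumptions, not values taken from this PR:

package peers

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func exampleLatencyBuckets() {
	// 8 buckets spanning 100ms to 10s; each upper bound is roughly double
	// the previous one (factor ≈ (10000/100)^(1/7) ≈ 1.93).
	buckets := prometheus.ExponentialBucketsRange(100, 10_000, 8)
	fmt.Println(buckets) // ≈ [100 193 373 720 1389 2683 5180 10000]

	_ = prometheus.HistogramOpts{
		Name:    "info_api_call_latency_ms",
		Help:    "Latency of calling info api in milliseconds",
		Buckets: buckets,
	}
}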
peers/app_request_network.go
Outdated

	n.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
This control flow is a bit hard to follow. I'd recommend moving the call to infoAPI.GetNodeID above the if statement to clean it up:
startInfoAPICall = time.Now()
apiNodeID, _, err := n.infoAPI.GetNodeID(context.Background())
n.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
if err != nil {
	...
} else if nodeIDs.Contains(apiNodeID) {
	...
}
Similarly, for the below call to infoAPI.GetNodeIP, let's move it out of the if statement so that we can put the latency measurement right next to the call:
startInfoAPICall = time.Now()
apiNodeIP, err := n.infoAPI.GetNodeIP(context.Background())
n.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
if err != nil {
	...
}
@@ -55,7 +55,7 @@ func NewNetwork(
		),
	)

-	metrics, err := NewAppRequestNetworkMetrics(cfg, registerer)
+	metrics, err := newAppRequestNetworkMetrics(registerer)
I can't leave a comment at the exact line, but let's also measure the latency of the call to infoAPI.GetNetworkID in the constructor.
Looks like this comment is not resolved, but on second thought I think it's fine to skip measuring this call since it's only called once at startup.
peers/app_request_network.go
Outdated
@@ -79,6 +88,7 @@ func NewNetwork(
		)
		return nil, err
	}
+	metrics.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
Let's measure the latency even for the error case so that we can detect timeouts.
peers/app_request_network.go
Outdated
@@ -169,12 +182,18 @@ func (n *AppRequestNetwork) ConnectPeers(nodeIDs set.Set[ids.NodeID]) set.Set[id

	// If the Info API node is in nodeIDs, it will not be reflected in the call to info.Peers.
	// In this case, we need to manually track the API node.
	startInfoAPICall = time.Now()
	if apiNodeID, _, err := n.infoAPI.GetNodeID(context.Background()); err != nil {
I think the previous changes you made to the control flow may have been lost in the rebase. To improve readability, let's tightly associate the metric measurement with the remote call. The code should look like this:
// If the Info API node is in nodeIDs, it will not be reflected in the call to info.Peers.
// In this case, we need to manually track the API node.
startInfoAPICall = time.Now()
apiNodeID, _, err := n.infoAPI.GetNodeID(context.Background())
n.metrics.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
if err != nil {
	n.logger.Error(
		"Failed to get API Node ID",
		zap.Error(err),
	)
} else if nodeIDs.Contains(apiNodeID) {
	startInfoAPICall = time.Now()
	apiNodeIPPort, err := n.infoAPI.GetNodeIP(context.Background())
	n.metrics.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
	if err != nil {
		n.logger.Error(
			"Failed to get API Node IP",
			zap.Error(err),
		)
	} else {
		trackedNodes.Add(apiNodeID)
		n.Network.ManuallyTrack(apiNodeID, apiNodeIPPort)
	}
}
Hey @won-js, do you mind solving the merge conflicts in this PR and addressing the last comment @cam-schultz made?
LGTM 👍
@won-js this looks good to go, just need to bring up to date with
Why this should be merged
Resolve #187
How this works
Added latency measurement logic for the info and validator APIs. I created a new Metrics type and made the necessary modifications.
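As a summary of the measurement pattern, a minimal sketch follows; the helper below is illustrative only and not part of this PR, which inlines the measurement at each call site:

package peers

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// observeLatencyMS records the time elapsed since start, in milliseconds,
// against the given histogram. It is invoked right after the remote call
// returns and before the error is inspected, so timeouts are measured too.
func observeLatencyMS(h prometheus.Histogram, start time.Time) {
	h.Observe(float64(time.Since(start).Milliseconds()))
}

// Usage at a call site (illustrative):
//   start := time.Now()
//   nodeID, _, err := infoAPI.GetNodeID(ctx)
//   observeLatencyMS(metrics.infoAPICallLatencyMS, start)
//   if err != nil { ... }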
How this was tested
How is this documented