
Add API monitoring latency (info, validator) #370

Merged 6 commits into ava-labs:main on Aug 22, 2024

Conversation

@won-js (Contributor) commented Jul 17, 2024

Why this should be merged

Resolves #187

How this works

Added latency measurement logic for the info and validator APIs (a rough sketch of how the pieces fit together follows this list):

  • Created a new Metrics type.
  • Made the other necessary modifications:
    • registerer should change to prometheus.DefaultRegisterer here
    • we should pass registerer into the constructor here
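
A rough sketch of how those pieces fit together, based on the snippets quoted in the review below (simplified: the validator-API histogram is omitted, the exact struct fields may differ, and the bucket layout changed during review):

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: the metrics struct holds the latency histogram, and the
// constructor registers it against whatever registerer the caller passes in
// (prometheus.DefaultRegisterer in practice).
type AppRequestNetworkMetrics struct {
	infoAPICallLatencyMS *prometheus.HistogramVec
}

func newAppRequestNetworkMetrics(registerer prometheus.Registerer) (*AppRequestNetworkMetrics, error) {
	infoAPICallLatencyMS := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "info_api_call_latency_ms",
			Help:    "Latency of calling info api in milliseconds",
			Buckets: prometheus.LinearBuckets(10, 10, 10),
		},
		[]string{"info_api_base_url"},
	)
	if err := registerer.Register(infoAPICallLatencyMS); err != nil {
		return nil, err
	}
	return &AppRequestNetworkMetrics{infoAPICallLatencyMS: infoAPICallLatencyMS}, nil
}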

How this was tested

How is this documented


func NewAppRequestNetworkMetrics(cfg *config.Config, registerer prometheus.Registerer) (*AppRequestNetworkMetrics, error) {
	infoAPICallLatencyMS := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
Contributor

Should it be a histogram?

Contributor Author

That's a good idea. I've made the changes.

won-js requested a review from najeal on July 18, 2024 at 05:54
@cam-schultz (Collaborator) left a comment

Thanks for putting this together! At a high level, this all looks good. I have a handful of minor comments and asks.

func (n *AppRequestNetwork) setInfoAPICallLatencyMS(latency float64) {
	n.metrics.infoAPICallLatencyMS.WithLabelValues(n.metrics.infoAPIBaseURL).Observe(latency)
Collaborator

Given that the deployer has access to the base URLs for the info and P-Chain APIs, and that the config only contains a single value for each, let's use a static label value to keep the number of unique labels constant.
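
For illustration only, one shape this could take (the constant label value "info_api" is an assumed placeholder, not taken from the PR):

// Sketch: record latency under a fixed label value so the label cardinality
// stays constant regardless of which base URL the deployment is configured with.
const infoAPILabel = "info_api" // assumed placeholder value

func (n *AppRequestNetwork) setInfoAPICallLatencyMS(latency float64) {
	n.metrics.infoAPICallLatencyMS.WithLabelValues(infoAPILabel).Observe(latency)
}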

Collaborator

We should be able to remove the base URLs from the metrics struct as well.

@@ -179,12 +189,15 @@ func (n *AppRequestNetwork) ConnectPeers(nodeIDs set.Set[ids.NodeID]) set.Set[id

// If the Info API node is in nodeIDs, it will not be reflected in the call to info.Peers.
// In this case, we need to manually track the API node.
startInfoAPICall = time.Now()
Collaborator

We should measure this API call latency regardless of the error status.

Help: "Latency of calling info api in milliseconds",
Buckets: prometheus.LinearBuckets(10, 10, 10),
},
[]string{"info_api_base_url"},
Collaborator

We should update these labels when changing to static label values.

prometheus.HistogramOpts{
Name: "info_api_call_latency_ms",
Help: "Latency of calling info api in milliseconds",
Buckets: prometheus.LinearBuckets(10, 10, 10),
Collaborator

Rather than linearly spaced buckets, exponentially distributed buckets would be more useful for identifying latency spikes. I think a good distribution would be to measure between ~0.1s and ~10s of latency (a reasonable timeout), with each bucket doubling the previous bucket's range. This can be done with ExponentialBucketsRange.
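
For example (the bounds and bucket count below are assumptions chosen so that each bucket roughly doubles the previous one, not values taken from the final diff):

prometheus.HistogramOpts{
	Name: "info_api_call_latency_ms",
	Help: "Latency of calling info api in milliseconds",
	// ~100 ms to ~10 s: with 7 buckets the spacing factor is
	// (10000/100)^(1/6) ≈ 2.15, so each bucket is roughly double the last:
	// ~100, 215, 464, 1000, 2154, 4642, 10000 ms.
	Buckets: prometheus.ExponentialBucketsRange(100, 10_000, 7),
},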

won-js requested a review from cam-schultz on July 19, 2024 at 04:30
Comment on lines 201 to 192
n.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))

Collaborator

This control flow is a bit hard to follow. I'd recommend moving the call to infoAPI.GetNodeID above the if statement to clean it up:

startInfoAPICall = time.Now()
apiNodeID, _, err := n.infoAPI.GetNodeID(context.Background())
n.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
if err != nil {
  ...
} else if nodeIDs.Contains(apiNodeID) {
  ...
} 

Similarly for the below call to infoAPI.GetNodeIP, let's move it out of the if statement so that we can put the latency measurement right next to the call:

startInfoAPICall = time.Now()
apiNodeIP, err := n.infoAPI.GetNodeIP(context.Background())
n.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
if err != nil {
 ...
}

@@ -55,7 +55,7 @@ func NewNetwork(
),
)

- metrics, err := NewAppRequestNetworkMetrics(cfg, registerer)
+ metrics, err := newAppRequestNetworkMetrics(registerer)
Collaborator

I can't leave a comment at the exact line, but let's also measure the latency of the call to infoAPI.GetNetworkID in the constructor.

Collaborator

Looks like this comment is not resolved, but on second thought I think it's fine to skip measuring this call, since it's only called once at startup.

@@ -79,6 +88,7 @@ func NewNetwork(
)
return nil, err
}
metrics.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
Collaborator

Let's measure the latency even for the error case so that we can detect timeouts.

@@ -169,12 +182,18 @@ func (n *AppRequestNetwork) ConnectPeers(nodeIDs set.Set[ids.NodeID]) set.Set[id

// If the Info API node is in nodeIDs, it will not be reflected in the call to info.Peers.
// In this case, we need to manually track the API node.
startInfoAPICall = time.Now()
if apiNodeID, _, err := n.infoAPI.GetNodeID(context.Background()); err != nil {
Collaborator

I think the previous changes you made to the control flow may have been lost in the rebase. To improve readability, let's tightly associate the metric measurement with the remote call. The code should look like this:

	// If the Info API node is in nodeIDs, it will not be reflected in the call to info.Peers.
	// In this case, we need to manually track the API node.
	startInfoAPICall = time.Now()
	apiNodeID, _, err := n.infoAPI.GetNodeID(context.Background())
	n.metrics.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
	if err != nil {
		n.logger.Error(
			"Failed to get API Node ID",
			zap.Error(err),
		)
	} else if nodeIDs.Contains(apiNodeID) {
		startInfoAPICall = time.Now()
		apiNodeIPPort, err := n.infoAPI.GetNodeIP(context.Background())
		n.metrics.setInfoAPICallLatencyMS(float64(time.Since(startInfoAPICall).Milliseconds()))
		if err != nil {
			n.logger.Error(
				"Failed to get API Node IP",
				zap.Error(err),
			)
		} else {
			trackedNodes.Add(apiNodeID)
			n.Network.ManuallyTrack(apiNodeID, apiNodeIPPort)
		}
	}

@geoff-vball (Contributor)

Hey @won-js, do you mind resolving the merge conflicts in this PR and addressing the last comment @cam-schultz made?

@geoff-vball (Contributor) left a comment

LGTM 👍

@cam-schultz (Collaborator) commented Aug 21, 2024

@won-js this looks good to go, just need to bring up to date with main.

cam-schultz merged commit 820041e into ava-labs:main on Aug 22, 2024
5 checks passed