Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT: hardware metrics #876

Merged
merged 19 commits into from
Oct 11, 2024
Merged

FEAT: hardware metrics #876

merged 19 commits into from
Oct 11, 2024

Conversation

kmaus-near
Copy link
Collaborator

Still need to actually test if this does anything more than compile, but here are some eyes on what I came up with.

@kmaus-near kmaus-near self-assigned this Oct 7, 2024
// Update CPU usage metric
let cpu_usage = system.global_cpu_usage() as i64;
crate::metrics::CPU_USAGE_PERCENTAGE
.with_label_values(&["global"])
Copy link
Contributor

@ppca ppca Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better also add account_id to the label values, like .with_label_values(&["global", my_account_id.as_str()]).
You'd likely need to pass in the account_id in the update_system_metrics() to be able to access it. The account_id value is available here:

let my_account_id = self.ctx.account_id.to_string();

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed now I think

});

// Total Memory Metric
pub(crate) static TOTAL_MEMORY_BYTES: Lazy<IntGaugeVec> = Lazy::new(|| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either this one or AVAILABLE_MEMORY_BYTES is redundant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we want a total to ensure partners have enough memory? I figured available would be useful to alert off of as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saying that available + used = total

@@ -227,6 +228,9 @@ impl MpcSignProtocol {
loop {
let protocol_time = Instant::now();
tracing::trace!("trying to advance chain signatures protocol");
// Hardware metric refresh
update_system_metrics(&my_account_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How often is this executed in practice?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ppca might know better than I 😂

Copy link
Contributor

@ppca ppca Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as often as a loop. Might be too often in this case. I think we can do once every minute?
add
let mut last_hardware_pull = Instant::now(); at around here:

let mut last_pinged = Instant::now();

and

if last_hardware_pull.elapsed() > Duration::from_secs(60) {
    update_system_metrics(&my_account_id);;
}

at around

tracing::trace!("trying to advance chain signatures protocol");

This will update the metrics every minute. @volovyks @kmaus-near

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do that every ~10 seconds? I'm afraid we can lose CPU spikes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, let's change to if last_hardware_pull.elapsed() > Duration::from_secs(10) then

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the example the crate uses is 5s, which would be pretty common practice for hardware metrics, and is the lowest refresh rate in grafana.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5s sounds good to me

});

// Disk Space Metric
pub(crate) static DISK_SPACE_BYTES: Lazy<IntGaugeVec> = Lazy::new(|| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also have available and used for the DISK_SPACE?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is Available disk space, mostly for alerting purposes. We can also add a total disk space just so there's visibility into it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's specify in the name if it is used or remaining.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, I will do a calculation in grafana for used disk space since only total and available is part of that library.

Copy link
Collaborator

@volovyks volovyks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any blockers here?

});

// Total Memory Metric
pub(crate) static TOTAL_MEMORY_BYTES: Lazy<IntGaugeVec> = Lazy::new(|| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saying that available + used = total

@kmaus-near
Copy link
Collaborator Author

Any blockers here?

I think we are alright. I was slightly concerned with CPU metrics being inaccurate, but it seems they are a pretty decent representation of actual host metrics after watching the grafana metrics vs the host metrics. Let me convert this to a PR and remove the Total memory metric, if we need it we can calculate it from the other two.

@kmaus-near kmaus-near marked this pull request as ready for review October 10, 2024 16:50
@ppca
Copy link
Contributor

ppca commented Oct 10, 2024

Please rebase latest develop and fix the test failures.

if last_hardware_pull.elapsed() > Duration::from_secs(5) {
update_system_metrics(&my_account_id);
}

tracing::debug!("trying to advance chain signatures protocol");
Copy link
Contributor

@ppca ppca Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe delete this one then? there are 2 tracing::debug!("trying to advance chain signatures protocol");

ppca
ppca previously approved these changes Oct 10, 2024
Copy link
Contributor

@ppca ppca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except that one repeated tracing :D

@volovyks volovyks merged commit 9b495c5 into develop Oct 11, 2024
3 checks passed
@volovyks volovyks deleted the kmaus-near/hardware-metrics branch October 11, 2024 07:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants