Define critical metrics of the agent and expose them. #293

xuezhaojun · 2023-10-10T08:04:51Z

No description provided.

DerekHeldtWerle · 2023-11-15T18:08:14Z

Hello, we have started looking into OCM and came up with a few metrics that we think could be valuable in addition to the existing ones for the Placement CR. Some of the metrics below might not be feasible in the current state of the code, but they could serve as a starting point for adding more metrics.

Current Metrics:

schedulingDuration: Time to schedule a placement.
bindDuration: Time to bind a placement to placementDecisions.
PluginDuration: Duration of plugin execution for a placement.

Proposed Additional Metrics:

1. ManifestWork Object Count:
   - Metric Name: manifestWorkObjectCount
   - Description: Counts the total number of ManifestWork objects. This metric can help in understanding the workload and distribution of resources managed by the hub.
   - Type: Gauge
2. Active Managed Clusters Count:
   - Metric Name: activeManagedClustersCount
   - Description: Tracks the number of active clusters managed by the hub. This is crucial for assessing the scale at which the hub is operating.
   - Type: Gauge
3. ManagedClusterSets Count:
   - Metric Name: managedClustersCount, managedClusterSetsCount
   - Description: Provides counts for ManagedClusterSets, offering insights into the cluster grouping and management efficiency.
   - Type: Gauge
4. Cluster Health Status:
   - Metric Name: clusterHealthStatus
   - Description: Reports on the health status of each managed cluster. This can be instrumental in proactive monitoring and maintenance.
   - Type: Gauge (with labels for different health statuses)
5. Resource Utilization Metrics:
   - Metric Name: clusterResourceUtilization
   - Description: Measures resource utilization (CPU, Memory, etc.) across managed clusters. This aids in resource planning and optimization.
   - Type: Histogram/Summary
6. API Request Latency:
   - Metric Name: apiRequestLatency
   - Description: Captures the latency of various API requests within OCM. This can help in identifying and resolving bottlenecks.
   - Type: Histogram
7. Cluster Sync Latency:
   - Metric Name: clusterSyncLatency
   - Description: Monitors the time taken for synchronization tasks between the hub and managed clusters.
   - Type: Histogram
8. Cluster Update Frequency:
   - Metric Name: clusterUpdateFrequency
   - Description: Tracks how frequently each cluster's configuration or state is updated.
   - Type: Counter
9. Error Counts:
   - Metric Name: errorCounts
   - Description: Records the number of errors encountered across various OCM operations, aiding in reliability analysis.
   - Type: Counter
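
To make the gauge proposals more concrete, here is a minimal sketch of how a few of them could be derived from objects the hub already stores. Everything in it is an assumption for illustration: the `ManagedCluster` dataclass, its `available` flag, the snake_case metric names, and `render_gauges` are hypothetical and are not OCM's actual types or metric names. It renders the values in the Prometheus text exposition format using only the Python standard library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagedCluster:
    """Hypothetical stand-in for a ManagedCluster object on the hub."""
    name: str
    available: bool  # assumed to mirror an availability condition


def render_gauges(clusters, manifestwork_count):
    """Render a few of the proposed gauges as Prometheus text exposition lines."""
    active = sum(1 for c in clusters if c.available)
    lines = [
        "# TYPE manifest_work_object_count gauge",
        f"manifest_work_object_count {manifestwork_count}",
        "# TYPE active_managed_clusters_count gauge",
        f"active_managed_clusters_count {active}",
        "# TYPE cluster_health_status gauge",
    ]
    # One series per cluster and status; the value is 1 for the current status, else 0.
    for c in sorted(clusters, key=lambda c: c.name):
        for status in ("available", "unavailable"):
            is_current = (status == "available") == c.available
            lines.append(
                f'cluster_health_status{{cluster="{c.name}",status="{status}"}} {int(is_current)}'
            )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [ManagedCluster("cluster1", True), ManagedCluster("cluster2", False)]
    print(render_gauges(demo, manifestwork_count=7), end="")
```

A per-cluster label keeps the health gauge easy to query, but its cardinality grows with the number of managed clusters, so a real implementation might prefer aggregate counts per status.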

berenss · 2023-11-16T19:42:54Z

cc @bjoydeep this looks very interesting

bjoydeep · 2023-11-17T20:15:15Z

Great points @DerekHeldtWerle. Yes, it absolutely makes sense to add more metrics. BTW, in Red Hat's productized version of ACM, we do add a few metrics: https://github.com/stolostron/metrics-chronicle/blob/main/docs/acm/component/server-foundation/metrics.md. You will see some overlap with what you suggested above.
Items 6, 7, and 8 (apiRequestLatency, clusterSyncLatency, clusterUpdateFrequency) would be very interesting. They would inform both the engineers of OCM and the consumers of OCM about key SLIs, which can be used to form SLOs. Huge +1 to that. We can think through how to measure these meaningfully from Kube controllers.
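
As a rough illustration of how a latency metric such as clusterSyncLatency could become an SLI, one option is to report the fraction of observations that finish within a target. This is only a sketch under stated assumptions: the function names are hypothetical, the input is a plain list of samples in seconds rather than a real histogram, and any threshold or objective passed in is chosen by the caller, not taken from OCM.

```python
def latency_sli(samples_seconds, threshold_seconds):
    """Fraction of observations at or below the threshold (1.0 when there are no samples)."""
    if not samples_seconds:
        return 1.0
    good = sum(1 for s in samples_seconds if s <= threshold_seconds)
    return good / len(samples_seconds)


def meets_slo(samples_seconds, threshold_seconds, objective):
    """True when the measured SLI reaches the objective, e.g. 0.99."""
    return latency_sli(samples_seconds, threshold_seconds) >= objective
```

With a real Prometheus histogram the same ratio would come from bucket counts rather than raw samples, so the threshold would need to line up with a bucket boundary.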

I am personally a little ambivalent about item 5. clusterResourceUtilization is ideally gathered by Prometheus itself, through its usual collection route, rather than by OCM. I would be eager to hear your take on it.

However, there is a different requirement, IMHO, that could be of great practical help to engineers maintaining the system. If we could collect key lifecycle changes in a cluster and publish them as events that can be consumed by non-Kube systems, that might be very helpful. For example, if a node has been added to or removed from a cluster, being informed about that explicitly can be an immense help when debugging problems. Just seeding this idea to see if it spawns some thoughts! The key here is that it should be a small, selected set.
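
To sketch what such lifecycle events might look like, the snippet below diffs two snapshots of a cluster's node names and serializes the changes as JSON lines. It is a hypothetical illustration only: the event fields, the `NodeAdded`/`NodeRemoved` type names, and both functions are assumptions, and how the snapshots are obtained or where the events are published is left open.

```python
import json


def node_lifecycle_events(cluster, previous_nodes, current_nodes):
    """Diff two snapshots of node names and describe what changed."""
    previous, current = set(previous_nodes), set(current_nodes)
    events = [
        {"cluster": cluster, "type": "NodeAdded", "node": n}
        for n in sorted(current - previous)
    ]
    events += [
        {"cluster": cluster, "type": "NodeRemoved", "node": n}
        for n in sorted(previous - current)
    ]
    return events


def to_json_lines(events):
    """Serialize events one per line so non-Kubernetes systems can consume them."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in events)
```

Restricting the diff to a short list of event types is one way to keep the published set small and selected.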

github-actions · 2024-03-17T01:44:52Z

This issue is stale because it has been open for 120 days with no activity. After 14 days of inactivity, it will be closed. Remove the stale label to prevent this issue from being closed.

xuezhaojun added this to OCM releases Oct 10, 2023

xuezhaojun converted this from a draft issue Oct 10, 2023

github-actions bot added the Stale label Mar 17, 2024

github-actions bot closed this as not planned Mar 31, 2024

github-project-automation bot moved this from To do to Done in OCM releases Mar 31, 2024
