Define critical metrics of the agent and expose them. #293

Closed
xuezhaojun opened this issue Oct 10, 2023 · 4 comments
@xuezhaojun

No description provided.

@xuezhaojun xuezhaojun converted this from a draft issue Oct 10, 2023
@DerekHeldtWerle

Hello, we have started looking into OCM and came up with a few metrics that we think could be valuable in addition to the existing ones for the Placement CR. Some of the metrics below might not be feasible in the current state of the code, but they could serve as a starting point for adding

Current Metrics:

  1. schedulingDuration: Time to schedule a placement.
  2. bindDuration: Time to bind a placement to placementDecisions.
  3. PluginDuration: Duration of plugin execution for a placement.
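
For readers less familiar with how duration metrics like these are typically recorded, here is a minimal sketch of a cumulative-bucket histogram in plain Python. It is not OCM's code (OCM is written in Go, and its actual instrumentation is not shown in this thread). The class name `DurationHistogram`, the `timed` helper, and the bucket bounds are all hypothetical and chosen only for illustration.

```python
import time
from bisect import bisect_left
from contextlib import contextmanager


class DurationHistogram:
    """Hypothetical stand-in for a Prometheus-style histogram (not OCM code)."""

    def __init__(self, name, bounds):
        self.name = name
        self.bounds = sorted(bounds)                 # bucket upper bounds, in seconds
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is the +Inf bucket
        self.total = 0.0                             # sum of all observed durations
        self.n = 0                                   # number of observations

    def observe(self, seconds):
        # bisect_left returns the first bound >= seconds, i.e. the smallest
        # bucket whose upper bound still contains the value ("le" semantics).
        self.counts[bisect_left(self.bounds, seconds)] += 1
        self.total += seconds
        self.n += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        out, running = [], 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            out.append((bound, running))
        return out

    @contextmanager
    def timed(self):
        """Observe the wall-clock duration of the wrapped block."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.observe(time.perf_counter() - start)


# Illustrative values only; the bucket bounds are an assumption.
scheduling = DurationHistogram("schedulingDuration", [0.1, 0.5, 1.0])
for s in (0.05, 0.1, 0.7, 3.0):
    scheduling.observe(s)
print(scheduling.cumulative())
```

Wrapping a piece of work in `with scheduling.timed(): ...` would record its duration in the same way.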

Proposed Additional Metrics:

  1. ManifestWork Object Count:

    • Metric Name: manifestWorkObjectCount
    • Description: Counts the total number of ManifestWork objects. This metric can help in understanding the workload and distribution of resources managed by the hub.
    • Type: Gauge
  2. Active Managed Clusters Count:

    • Metric Name: activeManagedClustersCount
    • Description: Tracks the number of active clusters managed by the hub. This is crucial for assessing the scale at which the hub is operating.
    • Type: Gauge
  3. ManagedClusterSets Count:

    • Metric Name: managedClustersCount, managedClusterSetsCount
    • Description: Provides counts for ManagedClusterSets, offering insights into the cluster grouping and management efficiency.
    • Type: Gauge
  4. Cluster Health Status:

    • Metric Name: clusterHealthStatus
    • Description: Reports on the health status of each managed cluster. This can be instrumental in proactive monitoring and maintenance.
    • Type: Gauge (with labels for different health statuses)
  5. Resource Utilization Metrics:

    • Metric Name: clusterResourceUtilization
    • Description: Measures resource utilization (CPU, Memory, etc.) across managed clusters. This aids in resource planning and optimization.
    • Type: Histogram/Summary
  6. API Request Latency:

    • Metric Name: apiRequestLatency
    • Description: Captures the latency of various API requests within OCM. This can help in identifying and resolving bottlenecks.
    • Type: Histogram
  7. Cluster Sync Latency:

    • Metric Name: clusterSyncLatency
    • Description: Monitors the time taken for synchronization tasks between the hub and managed clusters.
    • Type: Histogram
  8. Cluster Update Frequency:

    • Metric Name: clusterUpdateFrequency
    • Description: Tracks how frequently each cluster's configuration or state is updated.
    • Type: Counter
  9. Error Counts:

    • Metric Name: errorCounts
    • Description: Records the number of errors encountered across various OCM operations, aiding in reliability analysis.
    • Type: Counter
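
To make the Gauge versus Counter distinction in the list above concrete, here is a rough sketch in plain Python that keeps samples in memory and renders them in the Prometheus text exposition format. This is only an illustration under stated assumptions: the `Metric` class is hypothetical, the `operation` label is invented, and a real implementation in OCM would presumably use a Prometheus client library in Go rather than anything like this.

```python
from collections import defaultdict


class Metric:
    """Hypothetical in-memory gauge/counter keyed by label values (not OCM code)."""

    def __init__(self, name, help_text, kind):
        assert kind in ("gauge", "counter")
        self.name, self.help_text, self.kind = name, help_text, kind
        self.samples = defaultdict(float)

    def _key(self, labels):
        # Sort label pairs so the same label set always maps to the same sample.
        return tuple(sorted(labels.items()))

    def set(self, value, **labels):
        # A gauge can go up or down, e.g. the current number of objects.
        if self.kind != "gauge":
            raise TypeError("only gauges can be set to an arbitrary value")
        self.samples[self._key(labels)] = float(value)

    def inc(self, amount=1.0, **labels):
        # A counter only ever increases, e.g. a running total of errors.
        if self.kind == "counter" and amount < 0:
            raise ValueError("counters can only increase")
        self.samples[self._key(labels)] += amount

    def render(self):
        """Render the metric in the Prometheus text exposition format."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} {self.kind}"]
        for key, value in sorted(self.samples.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value:g}")
        return "\n".join(lines)


# Illustrative values only; the "operation" label is an assumption.
work_count = Metric("manifestWorkObjectCount", "Total ManifestWork objects.", "gauge")
work_count.set(42)
errors = Metric("errorCounts", "Errors across OCM operations.", "counter")
errors.inc(operation="apply")
errors.inc(operation="apply")
print(work_count.render())
print(errors.render())
```

Prometheus naming conventions generally favour snake_case names with unit suffixes, and a `_total` suffix for counters, so the final metric names would likely differ from the camelCase placeholders used here.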

@berenss

berenss commented Nov 16, 2023

cc @bjoydeep this looks very interesting

@bjoydeep

Great points, @DerekHeldtWerle. Yes, it absolutely makes sense to add more metrics. BTW, in Red Hat's productized version of ACM, we do add a few metrics: https://github.com/stolostron/metrics-chronicle/blob/main/docs/acm/component/server-foundation/metrics.md. You will see some overlap with what you suggested above.
Items 6, 7, and 8 would be very interesting. They would inform engineers of OCM as well as consumers of OCM about key SLIs, which can be used to form SLOs. Huge +1 on that. We can think through how to measure these meaningfully from Kube controllers.
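
As a rough illustration of how a latency histogram could feed an SLI and then an SLO check, here is a small sketch in plain Python. The function names, the bucket bounds, the counts, and the targets are all made up for the example; nothing here reflects measured OCM behavior or an agreed SLO.

```python
def sli_from_buckets(cumulative_buckets, threshold):
    """Fraction of observations at or below `threshold`.

    `cumulative_buckets` is a list of (upper_bound, cumulative_count) pairs
    ending with the +Inf bucket. `threshold` must equal one of the bounds,
    because a histogram cannot answer questions between bucket edges exactly.
    """
    total = cumulative_buckets[-1][1]
    if total == 0:
        return None  # no observations yet, so the SLI is undefined
    for bound, count in cumulative_buckets:
        if bound == threshold:
            return count / total
    raise ValueError("threshold must match a bucket bound")


def meets_slo(sli, target):
    """True when the SLI is defined and at least as good as the target."""
    return sli is not None and sli >= target


# Invented numbers: 98 of 100 sync operations finished within 5 seconds.
sync_latency = [(1.0, 90), (5.0, 98), (float("inf"), 100)]
sli = sli_from_buckets(sync_latency, 5.0)
print(sli, meets_slo(sli, 0.99), meets_slo(sli, 0.95))
```

The point of the sketch is that the SLO threshold has to line up with a bucket bound, so the bounds for metrics such as clusterSyncLatency would need to be chosen with the intended SLOs in mind.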

I am personally a little ambivalent about item 5: clusterResourceUtilization is ideally gathered by Prometheus and should take that route. I would be eager to hear your take on it.

However, there is a different requirement, IMHO, that could be of great practical help to engineers maintaining the system. If we can collect key life-cycle changes in a cluster and publish them as events that can be consumed by non-Kube systems, that may be very helpful. For example, if a node has been added to or removed from a cluster, being informed about that explicitly can be an immense help when debugging problems. Just seeding this idea to see if it spawns some thoughts! The key here is that it should be a small, selected set.
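
One possible shape for the node added/removed idea, sketched in plain Python: compare two snapshots of a cluster's node names and emit one small JSON-serializable event per change. The function name, the event type strings `NodeAdded` and `NodeRemoved`, and the event fields are all hypothetical. How the snapshots are obtained and where the events are published is left open.

```python
import json
from datetime import datetime, timezone


def node_lifecycle_events(cluster, previous_nodes, current_nodes, now=None):
    """Diff two node-name snapshots and return a list of event dicts."""
    now = now or datetime.now(timezone.utc)
    prev, curr = set(previous_nodes), set(current_nodes)
    events = []
    # Nodes only in the new snapshot were added; nodes only in the old one were removed.
    for kind, names in (("NodeAdded", curr - prev), ("NodeRemoved", prev - curr)):
        for name in sorted(names):
            events.append({
                "type": kind,            # hypothetical event type name
                "cluster": cluster,
                "node": name,
                "time": now.isoformat(),
            })
    return events


# Illustrative snapshots; cluster and node names are invented.
for event in node_lifecycle_events("cluster1", ["node-a", "node-b"], ["node-b", "node-c"]):
    print(json.dumps(event))  # a non-Kube consumer could read these lines
```

Keeping the event payload this small fits the "small, selected set" constraint: only a handful of event types would need to be defined.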


This issue is stale because it has been open for 120 days with no activity. After 14 days of inactivity, it will be closed. Remove the Stale label to prevent this issue from being closed.

@github-actions github-actions bot added the Stale label Mar 17, 2024
@github-actions github-actions bot closed this as not planned Mar 31, 2024
@github-project-automation github-project-automation bot moved this from To do to Done in OCM releases Mar 31, 2024