chore: add argocd_app_refresh_total counter metric #17178

jsolana · 2024-02-12T10:30:28Z

Introduce a new counter metric in the application controller, argocd_app_refresh_total, to track the number of refresh requests.

Labels:

namespace
dest_server
project
name
compare_with (possible values: "CompareWithLatestForceResolve", "CompareWithLatest", "CompareWithRecent", "ComparisonWithNothing", "Unknown")
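
To illustrate how a counter with these labels behaves, here is a minimal standard-library sketch. It is not the Prometheus client or Argo CD code: the `refreshLabels` and `refreshCounter` types and the application name `guestbook` are hypothetical stand-ins. It only shows that every distinct combination of label values becomes its own series.

```go
package main

import "fmt"

// refreshLabels mirrors the five labels listed above. The struct and the
// counter type below are hypothetical, not Argo CD or Prometheus code.
type refreshLabels struct {
	Namespace   string
	Name        string
	Project     string
	DestServer  string
	CompareWith string
}

// refreshCounter keeps one running total per distinct label combination,
// which is how a labeled counter behaves conceptually.
type refreshCounter struct {
	series map[refreshLabels]uint64
}

func newRefreshCounter() *refreshCounter {
	return &refreshCounter{series: make(map[refreshLabels]uint64)}
}

// Inc adds one to the series identified by the given label values.
func (c *refreshCounter) Inc(l refreshLabels) {
	c.series[l]++
}

// SeriesCount reports how many distinct series exist so far.
func (c *refreshCounter) SeriesCount() int {
	return len(c.series)
}

func main() {
	c := newRefreshCounter()
	app := refreshLabels{
		Namespace:  "argocd",
		Name:       "guestbook", // hypothetical application name
		Project:    "default",
		DestServer: "https://kubernetes.default.svc",
	}

	// Refreshes of the same app at different comparison levels
	// land in separate series.
	app.CompareWith = "CompareWithLatest"
	c.Inc(app)
	c.Inc(app)
	app.CompareWith = "CompareWithRecent"
	c.Inc(app)

	fmt.Println("series:", c.SeriesCount())
}
```

Because the application name is one of the labels, the number of series grows with the number of applications.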


Signed-off-by: Javier Solana Huertas <javier.solana@cabify.com>

blakepettersson

LGTM

pasha-codefresh

LGTM

agaudreault · 2024-02-13T13:55:04Z

controller/appcontroller.go

+	obj, exists, err := ctrl.appInformer.GetIndexer().GetByKey(key)
+	app, ok := obj.(*appv1.Application)
+	if exists && err == nil && ok {
+		ctrl.metricsServer.IncRefresh(app, compareWith.String())


I am wondering what additional value this will add?

The controller has metrics (#17013) on the refreshQueue that will give almost the same information as this metric.

Additionally, the refresh/reconciliation metric is already provided.

Maybe adding the compare_with to the existing metrics would help?

jsolana · 2024-02-14

> I am wondering what additional value this will add?
> The controller has metrics (#17013) on the refreshQueue that will give almost the same information as this metric.

Right now it is not clear how to monitor and diagnose application controller performance issues, for example those related to "Refresh requested by object updated" (including orphaned resources). The provided dashboard may not be up to date and has a lot of panels.

For example, I am calculating the average application reconciliation time in seconds with the following PromQL:

sum(rate(argocd_app_reconcile_sum{app_kubernetes_io_instance=~"$namespace",kubernetes_namespace=~"$namespace", app_kubernetes_io_name=~"$component"}[5m])) by (app_kubernetes_io_instance, app_kubernetes_io_name)
/
sum(rate(argocd_app_reconcile_count{app_kubernetes_io_instance=~"$namespace",kubernetes_namespace=~"$namespace", app_kubernetes_io_name=~"$component"}[5m])) by (app_kubernetes_io_instance, app_kubernetes_io_name)
> 0

The times returned are on the order of minutes.
I can also see high numbers for workqueue_unfinished_work_seconds related to applicationset, and for memory consumption, in the application controller metrics.

I assumed the root cause was undesired refresh requests, but I have no clear metric to visualize them and diagnose the root of the problem.

How can I observe application controller performance using the refreshQueue metrics?

> Additionally, the refresh/reconciliation metric is already provided.

A refresh request is not the same as a reconciliation, is it?

It consumes a lot of memory and requires manual intervention to fix apps in Unknown status.
We started detecting a lot of "Refresh requested by object updated" events, and that is why I suggested improving the refresh request total by including compare_with, to identify the requests that come from orphaned resources.

agaudreault · 2024-02-14

Based on the code path, this will be almost the same as the reconciliation metric. The reconciliation metric will be incremented less often than your new metric, because many refresh requests that do not require a reconciliation are discarded.

AFAIK, refresh requests that do not require a reconciliation have a very low impact on performance. (Reconciliation here is what happens when you click refresh in the UI, when an app's resources have changed, when the timeout expires, when an orphaned resource in the same namespace changes, etc.)

I think this metric is a duplication of workqueue_adds_total{name="app_reconciliation_queue"}, with a cardinality of app * reconciliation_level.

For reconciliation, there is already the argocd_app_reconcile histogram, but it does not have the app * reconciliation_level cardinality.

argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="0.25"} 61
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="0.5"} 124
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="1"} 124
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="2"} 246
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="4"} 252
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="8"} 315
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="16"} 315
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="+Inf"} 315
argocd_app_reconcile_sum{dest_server="https://kubernetes.default.svc",namespace="argocd"} 537.5461727870005
argocd_app_reconcile_count{dest_server="https://kubernetes.default.svc",namespace="argocd"} 315
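
As a rough illustration of how these histogram samples can be read, the sketch below derives the mean reconcile duration from the _sum and _count values, and the share of reconciliations that finished within 2 seconds from the cumulative le="2" bucket. The helper names `histogramMean` and `fractionAtOrBelow` are hypothetical. For the sample above the mean works out to roughly 1.7 seconds.

```go
package main

import "fmt"

// histogramMean returns the average observation of a Prometheus-style
// histogram, given its _sum and _count samples.
func histogramMean(sum float64, count uint64) float64 {
	if count == 0 {
		return 0
	}
	return sum / float64(count)
}

// fractionAtOrBelow returns the share of observations that fell into a
// cumulative bucket (buckets count everything at or below their "le" bound).
func fractionAtOrBelow(bucketCount, totalCount uint64) float64 {
	if totalCount == 0 {
		return 0
	}
	return float64(bucketCount) / float64(totalCount)
}

func main() {
	// Values copied from the sample output above.
	const sum = 537.5461727870005
	const count = 315
	const bucketLe2 = 246

	fmt.Printf("mean reconcile time: %.2fs\n", histogramMean(sum, count))
	fmt.Printf("share finished within 2s: %.2f\n", fractionAtOrBelow(bucketLe2, count))
}
```

This is a lifetime average over the counters. A rate-based query over a time window, like the PromQL earlier in this thread, gives the average for recent activity instead.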

I am not sure adding a metric with app * reconciliation_level cardinality is a good idea...

jsolana · 2024-02-15

Hi! After applying your fix, I honestly think this PR doesn't make sense and can be closed.

crenshaw-dev

Requesting change to give us some time to consider the increased cardinality from this change

leoluz

This PR needs to be redesigned to avoid high cardinality in the new metric. @agaudreault provided a suggestion to reuse existing metrics to achieve similar results. Please consider.

leoluz · 2024-02-14T22:05:20Z

controller/metrics/metrics.go

@@ -229,6 +240,11 @@ func (m *MetricsServer) IncSync(app *argoappv1.Application, state *argoappv1.Ope
 	m.syncCounter.WithLabelValues(app.Namespace, app.Name, app.Spec.GetProject(), app.Spec.Destination.Server, string(state.Phase)).Inc()
 }

+// IncRefresh increments the refresh counter for an application
+func (m *MetricsServer) IncRefresh(app *argoappv1.Application, compareWithStr string) {
+	m.refreshCounter.WithLabelValues(app.Namespace, app.Name, app.Spec.GetProject(), app.Spec.Destination.Server, compareWithStr).Inc()


We have Argo CD instances handling thousands of apps in hundreds of namespaces. Adding all those labels to the new refreshCounter metric will cause a large increase in cardinality, which could lead to serious issues in Prometheus.

More on the subject: https://blog.fourninecloud.com/understanding-cardinality-in-prometheus-monitoring-96ad082b6398
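
To make the cardinality concern concrete, here is a back-of-the-envelope sketch. The figures of 5000 apps and 300 namespace/dest_server combinations are hypothetical, chosen only to match "thousands of apps in hundreds of namespaces". Since each application has a single namespace, name, project, and destination server, the only label that varies per app is compare_with, so the counter's upper bound is apps times the number of compare_with values. The comparison with the argocd_app_reconcile histogram assumes the label set and eight buckets shown in the sample output earlier in this thread.

```go
package main

import "fmt"

// compareWithValues are the five values listed for the compare_with label.
var compareWithValues = []string{
	"CompareWithLatestForceResolve",
	"CompareWithLatest",
	"CompareWithRecent",
	"ComparisonWithNothing",
	"Unknown",
}

// perAppCounterSeries is an upper bound on the series a per-app counter can
// create: one series per application per compare_with value.
func perAppCounterSeries(apps int) int {
	return apps * len(compareWithValues)
}

// histogramSeries counts the series of a histogram: each label combination
// exposes one series per bucket plus the _sum and _count series.
func histogramSeries(labelCombos, buckets int) int {
	return labelCombos * (buckets + 2)
}

func main() {
	apps := 5000        // hypothetical number of applications
	labelCombos := 300  // hypothetical namespace/dest_server combinations
	buckets := 8        // le=0.25 ... le=+Inf, as in the sample output

	fmt.Println("per-app refresh counter, upper bound:", perAppCounterSeries(apps))
	fmt.Println("argocd_app_reconcile histogram:", histogramSeries(labelCombos, buckets))
}
```

The counter figure is an upper bound, since not every application will be refreshed at every comparison level.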

jsolana · 2024-02-15

Hi!
After applying this fix, I think this PR can be closed.

leoluz · 2024-02-15

Thank you for confirming.

leoluz · 2024-02-15T15:38:14Z

Closing due to #17178 (comment)

add git argocd_app_refresh_total counter metric

897b608

Signed-off-by: Javier Solana Huertas <javier.solana@cabify.com>

jsolana requested review from a team as code owners February 12, 2024 10:30

fix unit tests

f80a475

Signed-off-by: Javier Solana Huertas <javier.solana@cabify.com>

blakepettersson added the ready-for-review label Feb 12, 2024

blakepettersson approved these changes Feb 12, 2024


pasha-codefresh approved these changes Feb 12, 2024


Merge branch 'master' into js/improve-refresh-observability

f5d0bff

agaudreault reviewed Feb 13, 2024


jsolana changed the title ~~chore: add git argocd_app_refresh_total counter metric~~ chore: add argocd_app_refresh_total counter metric Feb 14, 2024

crenshaw-dev requested changes Feb 14, 2024


leoluz requested changes Feb 14, 2024


leoluz closed this Feb 15, 2024
