Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: add argocd_app_refresh_total counter metric #17178

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 24 additions & 0 deletions controller/appcontroller.go
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,24 @@ func (a CompareWith) Pointer() *CompareWith {
return &a
}

func (c *CompareWith) String() string {
if c == nil {
return "Unknown"
}
switch *c {
case CompareWithLatestForceResolve:
return "CompareWithLatestForceResolve"
case CompareWithLatest:
return "CompareWithLatest"
case CompareWithRecent:
return "CompareWithRecent"
case ComparisonWithNothing:
return "ComparisonWithNothing"
default:
return "Unknown"
}
}

// ApplicationController is the controller for application resources.
type ApplicationController struct {
cache *appstatecache.Cache
Expand Down Expand Up @@ -840,6 +858,12 @@ func (ctrl *ApplicationController) Run(ctx context.Context, statusProcessors int
func (ctrl *ApplicationController) requestAppRefresh(appName string, compareWith *CompareWith, after *time.Duration) {
key := ctrl.toAppKey(appName)

obj, exists, err := ctrl.appInformer.GetIndexer().GetByKey(key)
app, ok := obj.(*appv1.Application)
if exists && err == nil && ok {
ctrl.metricsServer.IncRefresh(app, compareWith.String())
Copy link
Member

@agaudreault agaudreault Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering what additional value this will add?

The controller has metrics (#17013) on the refreshQueue that will give almost the same information as this metrics.

Additionally, the refresh/reconciliation metric is already provided.

Maybe adding the compare_with to the existing metrics would help?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering what additional value this will add?
The controller has metrics (#17013) on the refreshQueue that will give almost the same information as this metrics.

Right now, It is not clear how to monitor and diagnose application controller performance issues eg related with "Refresh requested by object updated" (including orphan resources). The dashboard provided maybe is not up-to-date and has a lot of panels.

Eg: I am calculating the average in seconds application reconciliation with the next promQL:

sum(rate(argocd_app_reconcile_sum{app_kubernetes_io_instance=~"$namespace",kubernetes_namespace=~"$namespace", app_kubernetes_io_name=~"$component"}[5m])) by (app_kubernetes_io_instance, app_kubernetes_io_name) / sum(rate(argocd_app_reconcile_count{app_kubernetes_io_instance=~"$namespace",kubernetes_namespace=~"$namespace", app_kubernetes_io_name=~"$component"}[5m]))by (app_kubernetes_io_instance, app_kubernetes_io_name) > 0

The times returned are in order of minutes
I can see high numbers related to applicationset related to workqueue_unfinished_work_seconds and memory consumption in application controller metrics.

I was assuming the root cause is non desirable refresh requested but I have no clear metric to visualize it and diagnose the root of the problem

How is the way to observe the application controller performance using refreshQueue metrics?

Additionally, the refresh/reconciliation metric is already provided.

Refresh requested is not the same than reconciliation, no?

It consumes a lot of memory and require manual intervention to fix unknown status apps
We started detecting a lot of Refresh requested by object updated and thats why I suggested to improve the total refresh requested including the compare_with to identify those from orphan resources.

Copy link
Member

@agaudreault agaudreault Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the code path, this will be almost the same as the reconciliation. Reconciliation will be increased less often than your new metric because a lot of refreshRequest that does not require a reconciliation will be discarded.

Afaik, refreshRequest that does not require a reconciliation (reconciliation here is what happens when you click refresh in the ui, when an app resources has changed, when timeout expires, when orphan resource in the same namespace changes, etc) have a very low impact on performance.

I think this metrics is a duplication of
workqueue_adds_total{name="app_reconciliation_queue"}, with a cardinality for each app * reconciliation_level.

For the reconciliation, there is already argocd_app_reconcile histogram, but it does not have the app * reconciliation_level cardinality.

argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="0.25"} 61
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="0.5"} 124
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="1"} 124
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="2"} 246
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="4"} 252
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="8"} 315
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="16"} 315
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="+Inf"} 315
argocd_app_reconcile_sum{dest_server="https://kubernetes.default.svc",namespace="argocd"} 537.5461727870005
argocd_app_reconcile_count{dest_server="https://kubernetes.default.svc",namespace="argocd"} 315

I am not sure adding a metric with app * reconciliation_level cardinality is a good idea....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! After apply your fix, honestly I think this PR doesnt make sense and can be closed.

}

if compareWith != nil && after != nil {
ctrl.appComparisonTypeRefreshQueue.AddAfter(fmt.Sprintf("%s/%d", key, compareWith), *after)
} else {
Expand Down
17 changes: 17 additions & 0 deletions controller/metrics/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ import (
type MetricsServer struct {
*http.Server
syncCounter *prometheus.CounterVec
refreshCounter *prometheus.CounterVec
kubectlExecCounter *prometheus.CounterVec
kubectlExecPendingGauge *prometheus.GaugeVec
k8sRequestCounter *prometheus.CounterVec
Expand Down Expand Up @@ -90,6 +91,14 @@ var (
append(descAppDefaultLabels, "dest_server", "phase"),
)

refreshCounter = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "argocd_app_refresh_total",
Help: "Number of application refreshes.",
},
append(descAppDefaultLabels, "dest_server", "compare_with"),
)

k8sRequestCounter = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "argocd_app_k8s_request_total",
Expand Down Expand Up @@ -171,6 +180,7 @@ func NewMetricsServer(addr string, appLister applister.ApplicationLister, appFil
healthz.ServeHealthCheck(mux, healthCheck)

registry.MustRegister(syncCounter)
registry.MustRegister(refreshCounter)
registry.MustRegister(k8sRequestCounter)
registry.MustRegister(kubectlExecCounter)
registry.MustRegister(kubectlExecPendingGauge)
Expand All @@ -186,6 +196,7 @@ func NewMetricsServer(addr string, appLister applister.ApplicationLister, appFil
Handler: mux,
},
syncCounter: syncCounter,
refreshCounter: refreshCounter,
k8sRequestCounter: k8sRequestCounter,
kubectlExecCounter: kubectlExecCounter,
kubectlExecPendingGauge: kubectlExecPendingGauge,
Expand Down Expand Up @@ -229,6 +240,11 @@ func (m *MetricsServer) IncSync(app *argoappv1.Application, state *argoappv1.Ope
m.syncCounter.WithLabelValues(app.Namespace, app.Name, app.Spec.GetProject(), app.Spec.Destination.Server, string(state.Phase)).Inc()
}

// IncRefresh increments the refresh counter for an application
func (m *MetricsServer) IncRefresh(app *argoappv1.Application, compareWithStr string) {
m.refreshCounter.WithLabelValues(app.Namespace, app.Name, app.Spec.GetProject(), app.Spec.Destination.Server, compareWithStr).Inc()
Copy link
Collaborator

@leoluz leoluz Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have Argo CD instances handling thousands of apps in hundreds of namespaces. Adding all those labels in the new refreshCounter metric will cause a huge boost in cardinality which could lead to serious issues in Prometheus.

More on the subject: https://blog.fourninecloud.com/understanding-cardinality-in-prometheus-monitoring-96ad082b6398

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!
After applying this fix, I think this PR can be closed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for confirming.

}

func (m *MetricsServer) IncKubectlExec(command string) {
m.kubectlExecCounter.WithLabelValues(m.hostname, command).Inc()
}
Expand Down Expand Up @@ -288,6 +304,7 @@ func (m *MetricsServer) SetExpiration(cacheExpiration time.Duration) error {
_, err := m.cron.AddFunc(fmt.Sprintf("@every %s", cacheExpiration), func() {
log.Infof("Reset Prometheus metrics based on existing expiration '%v'", cacheExpiration)
m.syncCounter.Reset()
m.refreshCounter.Reset()
m.kubectlExecCounter.Reset()
m.kubectlExecPendingGauge.Reset()
m.k8sRequestCounter.Reset()
Expand Down
36 changes: 36 additions & 0 deletions controller/metrics/metrics_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -374,6 +374,34 @@ func assertMetricsNotPrinted(t *testing.T, expectedLines, body string) {
}
}

func TestMetricsRefreshCounter(t *testing.T) {
cancel, appLister := newFakeLister()
defer cancel()
metricsServ, err := NewMetricsServer("localhost:8082", appLister, appFilter, noOpHealthCheck, []string{})
assert.NoError(t, err)

appRefreshTotal := `
# HELP argocd_app_refresh_total Number of application refreshes.
# TYPE argocd_app_refresh_total counter
argocd_app_refresh_total{compare_with="CompareWithLatest",dest_server="https://localhost:6443",name="my-app",namespace="argocd",project="important-project"} 2
argocd_app_refresh_total{compare_with="ComparisonWithNothing",dest_server="https://localhost:6443",name="my-app",namespace="argocd",project="important-project"} 1
`

fakeApp := newFakeApp(fakeApp)
metricsServ.IncRefresh(fakeApp, "ComparisonWithNothing")
metricsServ.IncRefresh(fakeApp, "CompareWithLatest")
metricsServ.IncRefresh(fakeApp, "CompareWithLatest")

req, err := http.NewRequest(http.MethodGet, "/metrics", nil)
assert.NoError(t, err)
rr := httptest.NewRecorder()
metricsServ.Handler.ServeHTTP(rr, req)
assert.Equal(t, rr.Code, http.StatusOK)
body := rr.Body.String()
log.Println(body)
assertMetricsPrinted(t, appRefreshTotal, body)
}

func TestReconcileMetrics(t *testing.T) {
cancel, appLister := newFakeLister()
defer cancel()
Expand Down Expand Up @@ -421,13 +449,20 @@ argocd_app_sync_total{dest_server="https://localhost:6443",name="my-app",namespa
argocd_app_sync_total{dest_server="https://localhost:6443",name="my-app",namespace="argocd",phase="Succeeded",project="important-project"} 2
`

appRefreshTotal := `
# HELP argocd_app_refresh_total Number of application refreshes.
# TYPE argocd_app_refresh_total counter
argocd_app_refresh_total{compare_with="CompareWithLatest",dest_server="https://localhost:6443",name="my-app",namespace="argocd",project="important-project"} 2
argocd_app_refresh_total{compare_with="ComparisonWithNothing",dest_server="https://localhost:6443",name="my-app",namespace="argocd",project="important-project"} 1
`
req, err := http.NewRequest(http.MethodGet, "/metrics", nil)
assert.NoError(t, err)
rr := httptest.NewRecorder()
metricsServ.Handler.ServeHTTP(rr, req)
assert.Equal(t, rr.Code, http.StatusOK)
body := rr.Body.String()
assertMetricsPrinted(t, appSyncTotal, body)
assertMetricsPrinted(t, appRefreshTotal, body)

err = metricsServ.SetExpiration(time.Second)
assert.NoError(t, err)
Expand All @@ -440,6 +475,7 @@ argocd_app_sync_total{dest_server="https://localhost:6443",name="my-app",namespa
body = rr.Body.String()
log.Println(body)
assertMetricsNotPrinted(t, appSyncTotal, body)
assertMetricsNotPrinted(t, appRefreshTotal, body)
err = metricsServ.SetExpiration(time.Second)
assert.Error(t, err)
}
1 change: 1 addition & 0 deletions docs/operator-manual/metrics.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ Metrics about applications. Scraped at the `argocd-metrics:8082/metrics` endpoin
| `argocd_app_labels` | gauge | Argo Application labels converted to Prometheus labels. Disabled by default. See section below about how to enable it. |
| `argocd_app_reconcile` | histogram | Application reconciliation performance. |
| `argocd_app_sync_total` | counter | Counter for application sync history |
| `argocd_app_refresh_total` | counter | Counter for application refresh history |
| `argocd_cluster_api_resource_objects` | gauge | Number of k8s resource objects in the cache. |
| `argocd_cluster_api_resources` | gauge | Number of monitored Kubernetes API resources. |
| `argocd_cluster_cache_age_seconds` | gauge | Cluster cache age in seconds. |
Expand Down
Loading