chore: add argocd_app_refresh_total counter metric #17178

Closed

jsolana
Contributor

@jsolana jsolana commented Feb 12, 2024

Closes: #17122

Introduce a new counter metric in the application controller, named argocd_app_refresh_total, to track the number of refresh requests.

Labels:

  • namespace
  • dest_server
  • project
  • name
  • compare_with, possible values: "CompareWithLatestForceResolve", "CompareWithLatest", "CompareWithRecent", "ComparisonWithNothing", "Unknown"
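
As a rough illustration of what these labels imply, the sketch below models a labeled counter with the Go standard library only. It is not the Prometheus client and not the code in this PR; the type refreshCounter, its methods, and the app name "guestbook" are hypothetical. It shows that every distinct combination of label values becomes its own series.

```go
package main

import (
	"fmt"
	"strings"
)

// refreshCounter is a hypothetical, stdlib-only stand-in for a labeled
// counter: it keeps one integer per distinct combination of label values.
type refreshCounter struct {
	series map[string]int
}

func newRefreshCounter() *refreshCounter {
	return &refreshCounter{series: make(map[string]int)}
}

// Inc increments the series identified by the five labels listed in the PR.
func (c *refreshCounter) Inc(namespace, name, project, destServer, compareWith string) {
	key := strings.Join([]string{namespace, name, project, destServer, compareWith}, "\x00")
	c.series[key]++
}

// SeriesCount returns how many distinct series have been created so far.
func (c *refreshCounter) SeriesCount() int {
	return len(c.series)
}

func main() {
	c := newRefreshCounter()
	server := "https://kubernetes.default.svc"
	// Two refreshes with the same labels share one series.
	c.Inc("argocd", "guestbook", "default", server, "CompareWithLatest")
	c.Inc("argocd", "guestbook", "default", server, "CompareWithLatest")
	// A different compare_with value creates a second series for the same app.
	c.Inc("argocd", "guestbook", "default", server, "CompareWithRecent")
	fmt.Println("series:", c.SeriesCount())
}
```

In this model a single application can account for up to five series, one per compare_with value.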

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

Signed-off-by: Javier Solana Huertas <javier.solana@cabify.com>
@jsolana jsolana requested review from a team as code owners February 12, 2024 10:30
Signed-off-by: Javier Solana Huertas <javier.solana@cabify.com>
Member

@blakepettersson blakepettersson left a comment

LGTM

Member

@pasha-codefresh pasha-codefresh left a comment

LGTM

obj, exists, err := ctrl.appInformer.GetIndexer().GetByKey(key)
app, ok := obj.(*appv1.Application)
if exists && err == nil && ok {
	ctrl.metricsServer.IncRefresh(app, compareWith.String())
Member

@agaudreault agaudreault Feb 13, 2024

I am wondering what additional value this will add?

The controller has metrics (#17013) on the refreshQueue that will give almost the same information as this metric.

Additionally, the refresh/reconciliation metric is already provided.

Maybe adding the compare_with to the existing metrics would help?

Contributor Author

I am wondering what additional value this will add?
The controller has metrics (#17013) on the refreshQueue that will give almost the same information as this metric.

Right now, it is not clear how to monitor and diagnose application controller performance issues, e.g. those related to "Refresh requested by object updated" (including orphan resources). The provided dashboard may not be up to date and has a lot of panels.

For example, I am calculating the average application reconciliation time in seconds with the following PromQL:

sum(rate(argocd_app_reconcile_sum{app_kubernetes_io_instance=~"$namespace", kubernetes_namespace=~"$namespace", app_kubernetes_io_name=~"$component"}[5m])) by (app_kubernetes_io_instance, app_kubernetes_io_name)
/
sum(rate(argocd_app_reconcile_count{app_kubernetes_io_instance=~"$namespace", kubernetes_namespace=~"$namespace", app_kubernetes_io_name=~"$component"}[5m])) by (app_kubernetes_io_instance, app_kubernetes_io_name)
> 0

The times returned are in the order of minutes.
I can also see high numbers for workqueue_unfinished_work_seconds related to applicationset, and high memory consumption in the application controller metrics.

I was assuming the root cause is undesired refresh requests, but I have no clear metric to visualize this and diagnose the root of the problem.

How can the application controller performance be observed using the refreshQueue metrics?

Additionally, the refresh/reconciliation metric is already provided.

A refresh request is not the same as a reconciliation, no?

It consumes a lot of memory and requires manual intervention to fix apps in unknown status.
We started detecting a lot of "Refresh requested by object updated" events, and that is why I suggested improving the total refresh requested metric by including compare_with, to identify those coming from orphan resources.

Member

@agaudreault agaudreault Feb 14, 2024

Based on the code path, this will be almost the same as the reconciliation metric. The reconciliation metric will increase less often than your new metric, because many refresh requests that do not require a reconciliation are discarded.

As far as I know, refresh requests that do not require a reconciliation have a very low impact on performance. (Reconciliation here is what happens when you click refresh in the UI, when an app's resources have changed, when the timeout expires, when an orphan resource in the same namespace changes, etc.)

I think this metric duplicates workqueue_adds_total{name="app_reconciliation_queue"}, but with a cardinality of app * reconciliation_level.

For the reconciliation, there is already the argocd_app_reconcile histogram, but it does not have the app * reconciliation_level cardinality.

argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="0.25"} 61
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="0.5"} 124
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="1"} 124
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="2"} 246
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="4"} 252
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="8"} 315
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="16"} 315
argocd_app_reconcile_bucket{dest_server="https://kubernetes.default.svc",namespace="argocd",le="+Inf"} 315
argocd_app_reconcile_sum{dest_server="https://kubernetes.default.svc",namespace="argocd"} 537.5461727870005
argocd_app_reconcile_count{dest_server="https://kubernetes.default.svc",namespace="argocd"} 315

I am not sure adding a metric with app * reconciliation_level cardinality is a good idea.
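
To make the argocd_app_reconcile sample above easier to read, here is a minimal sketch that derives the average reconciliation time (sum divided by count, the same ratio the PromQL query earlier in the thread computes over a time window) and estimates quantiles by linear interpolation inside a bucket. The interpolation is a simplified approximation in the spirit of PromQL's histogram_quantile, not its exact implementation; the names bucket, mean, and quantile are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// bucket mirrors one argocd_app_reconcile_bucket line: a cumulative count
// of observations that took at most le seconds.
type bucket struct {
	le    float64
	count float64
}

// reconcileBuckets holds the sample values quoted in the comment above.
var reconcileBuckets = []bucket{
	{0.25, 61}, {0.5, 124}, {1, 124}, {2, 246},
	{4, 252}, {8, 315}, {16, 315}, {math.Inf(1), 315},
}

const (
	reconcileSum   = 537.5461727870005 // argocd_app_reconcile_sum
	reconcileCount = 315.0             // argocd_app_reconcile_count
)

// mean returns the average observation: total seconds over total count.
func mean(sum, count float64) float64 {
	return sum / count
}

// quantile estimates the q-quantile (0 <= q <= 1) from cumulative buckets,
// interpolating linearly inside the bucket that contains the target rank.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 {
		return math.NaN()
	}
	rank := q * buckets[len(buckets)-1].count
	lower, prev := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			if math.IsInf(b.le, 1) {
				return lower // simplification: fall back to the highest finite bound
			}
			if b.count == prev {
				return b.le // empty bucket, nothing to interpolate
			}
			return lower + (b.le-lower)*(rank-prev)/(b.count-prev)
		}
		lower, prev = b.le, b.count
	}
	return math.NaN()
}

func main() {
	fmt.Printf("mean: %.3fs\n", mean(reconcileSum, reconcileCount))
	fmt.Printf("p50:  %.3fs\n", quantile(0.5, reconcileBuckets))
	fmt.Printf("p95:  %.3fs\n", quantile(0.95, reconcileBuckets))
}
```

For this sample the mean comes out at roughly 1.7 seconds, while the estimated 95th percentile falls in the 4 to 8 second bucket, so a small share of slow reconciliations pulls the average up.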

Contributor Author

Hi! After applying your fix, I honestly think this PR doesn't make sense anymore and can be closed.

@jsolana jsolana changed the title from "chore: add git argocd_app_refresh_total counter metric" to "chore: add argocd_app_refresh_total counter metric" Feb 14, 2024
Collaborator

@crenshaw-dev crenshaw-dev left a comment

Requesting changes to give us some time to consider the increased cardinality from this change.

Collaborator

@leoluz leoluz left a comment

This PR needs to be redesigned to avoid high cardinality in the new metric. @agaudreault provided a suggestion to reuse existing metrics to achieve similar results. Please consider it.

@@ -229,6 +240,11 @@ func (m *MetricsServer) IncSync(app *argoappv1.Application, state *argoappv1.Ope
	m.syncCounter.WithLabelValues(app.Namespace, app.Name, app.Spec.GetProject(), app.Spec.Destination.Server, string(state.Phase)).Inc()
}

// IncRefresh increments the refresh counter for an application
func (m *MetricsServer) IncRefresh(app *argoappv1.Application, compareWithStr string) {
	m.refreshCounter.WithLabelValues(app.Namespace, app.Name, app.Spec.GetProject(), app.Spec.Destination.Server, compareWithStr).Inc()
Collaborator

@leoluz leoluz Feb 14, 2024

We have Argo CD instances handling thousands of apps in hundreds of namespaces. Adding all those labels to the new refreshCounter metric will cause a large increase in cardinality, which could lead to serious issues in Prometheus.

More on the subject: https://blog.fourninecloud.com/understanding-cardinality-in-prometheus-monitoring-96ad082b6398
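
The cardinality concern can be put into numbers with a small sketch. It assumes that an application's namespace, project, and dest_server are fixed at any given time, so the per-app labels add one series per compare_with value. The app counts used below are hypothetical and the function name maxRefreshSeries is illustrative.

```go
package main

import "fmt"

// compareWithValues are the compare_with label values listed in the PR description.
var compareWithValues = []string{
	"CompareWithLatestForceResolve",
	"CompareWithLatest",
	"CompareWithRecent",
	"ComparisonWithNothing",
	"Unknown",
}

// maxRefreshSeries returns an upper bound on the number of series for a
// counter labeled per application and per compare_with value, assuming each
// app maps to exactly one namespace, project, and destination server.
func maxRefreshSeries(apps int) int {
	return apps * len(compareWithValues)
}

func main() {
	// Hypothetical fleet sizes, chosen only to show how the bound scales.
	for _, apps := range []int{100, 1000, 5000} {
		fmt.Printf("%d apps -> up to %d series\n", apps, maxRefreshSeries(apps))
	}
}
```

The bound grows linearly with the number of applications, whereas the existing argocd_app_reconcile histogram shown earlier is labeled only by namespace and dest_server.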

Contributor Author

Hi!
After applying this fix, I think this PR can be closed.

Collaborator

Thank you for confirming.

@leoluz
Collaborator

leoluz commented Feb 15, 2024

Closing due to #17178 (comment)

@leoluz leoluz closed this Feb 15, 2024
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Improve refreshes observability
6 participants