Sharding algorithm constantly evaluated, wasting CPU and creating too many logs #14337

Closed
3 tasks done
agaudreault opened this issue Jul 4, 2023 · 6 comments · Fixed by #15237
Assignees: agaudreault
Labels: bug (Something isn't working)

Comments

@agaudreault
Member

agaudreault commented Jul 4, 2023

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

On version 2.8.0, Argo CD starts logging a lot. The principal source of logs is the new sharding algorithm, which is evaluated on every application refresh. The sharding results could easily be cached and re-evaluated only when a cluster is added, removed, or updated.

To Reproduce

Deploy 2.8.0-rc1

Caused by #13018

Expected behavior

  • Info logs about which shard will process which cluster should only be emitted:
    • when the controller is started;
    • when a cluster is added, removed, or updated.
  • There should not be constant debug logs.
  • The shard should not be re-evaluated on every application reconcile. The value can likely be cached until a cluster is added, removed, or updated (see the sketch after this list).
  • Logs should contain the cluster server name/URL, not the internal Argo CD ID value.
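For illustration, here is a minimal sketch of the caching idea, with hypothetical names (this is not the actual Argo CD code): the distribution function, and any logging around it, only runs the first time a cluster is seen, and the cache is invalidated from the cluster add/update/delete handlers.

```go
package sharding

import "sync"

// DistributionFunction maps a cluster ID to a shard number.
type DistributionFunction func(clusterID string) int

// shardCache memoizes the shard assignment per cluster ID.
type shardCache struct {
	mu     sync.Mutex
	shards map[string]int // cluster ID -> assigned shard
	fn     DistributionFunction
}

func newShardCache(fn DistributionFunction) *shardCache {
	return &shardCache{shards: map[string]int{}, fn: fn}
}

// ShardFor returns the cached shard for a cluster, computing it (and logging,
// if desired) only the first time the cluster is seen.
func (c *shardCache) ShardFor(clusterID string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if shard, ok := c.shards[clusterID]; ok {
		return shard
	}
	shard := c.fn(clusterID)
	c.shards[clusterID] = shard
	return shard
}

// Invalidate drops all cached assignments; the cluster add/update/delete
// handlers would call this so shards are recomputed on the next lookup.
func (c *shardCache) Invalidate() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.shards = map[string]int{}
}
```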

Version

v2.8.0-rc1

Logs

{"level":"debug","msg":"Calculating cluster shard for cluster id: ","time":"2023-07-04T20:44:17Z"}
{"level":"debug","msg":"Calculating cluster shard for cluster id: ","time":"2023-07-04T20:44:17Z"}
{"level":"debug","msg":"Calculating cluster shard for cluster id: 14a0dfbf-8fa6-4bf4-ae35-b1b7e2ebe948","time":"2023-07-04T20:44:17Z"}
{"level":"info","msg":"Cluster with id=14a0dfbf-8fa6-4bf4-ae35-b1b7e2ebe948 will be processed by shard 0","time":"2023-07-04T20:44:17Z"}
{"level":"debug","msg":"Calculating cluster shard for cluster id: 6d41f1db-fb4f-4883-b28f-4074159d1c6a","time":"2023-07-04T20:44:17Z"}
{"level":"info","msg":"Cluster with id=6d41f1db-fb4f-4883-b28f-4074159d1c6a will be processed by shard 1","time":"2023-07-04T20:44:17Z"}
{"level":"debug","msg":"Calculating cluster shard for cluster id: ","time":"2023-07-04T20:44:17Z"}
{"level":"debug","msg":"Calculating cluster shard for cluster id: 14a0dfbf-8fa6-4bf4-ae35-b1b7e2ebe948","time":"2023-07-04T20:44:17Z"}
{"level":"info","msg":"Cluster with id=14a0dfbf-8fa6-4bf4-ae35-b1b7e2ebe948 will be processed by shard 0","time":"2023-07-04T20:44:17Z"}
{"level":"debug","msg":"Calculating cluster shard for cluster id: 57d330be-ae91-4a23-9303-ab8dbcc306da","time":"2023-07-04T20:44:17Z"}
{"level":"info","msg":"Cluster with id=57d330be-ae91-4a23-9303-ab8dbcc306da will be processed by shard 1","time":"2023-07-04T20:44:17Z"}
{"level":"debug","msg":"Calculating cluster shard for cluster id: 57d330be-ae91-4a23-9303-ab8dbcc306da","time":"2023-07-04T20:44:17Z"}
{"level":"info","msg":"Cluster with id=57d330be-ae91-4a23-9303-ab8dbcc306da will be processed by shard 1","time":"2023-07-04T20:44:17Z"}
{"level":"debug","msg":"Calculating cluster shard for cluster id: 57d330be-ae91-4a23-9303-ab8dbcc306da","time":"2023-07-04T20:44:17Z"}
{"level":"info","msg":"Cluster with id=57d330be-ae91-4a23-9303-ab8dbcc306da will be processed by shard 1","time":"2023-07-04T20:44:17Z"}
@agaudreault agaudreault added the bug Something isn't working label Jul 4, 2023
@crenshaw-dev
Member

I'd happily do a quick review of a fix for this, if you have time to write one. :-)

@agaudreault
Member Author

I think the PR above can be cherry-picked into 2.8 to fix the logging issue.

I'll try to code something to cache the sharding results in another PR.

@akram let me know if you are developing something around sharding that would heavily conflict with a cached implementation of the cluster shards.

@Enclavet
Contributor

Enclavet commented Aug 2, 2023

FYI: the difference in CPU utilization between the round-robin and legacy sharding algorithms is significant. This is a test with Argo CD managing 99 clusters and 5000 applications; the first part is round-robin and the second part is after switching to legacy. Using 2.8.0-rc5.

[image: CPU utilization graph, round-robin followed by legacy sharding]
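As a rough, hypothetical sketch of why the two strategies can differ in cost (not the actual Argo CD implementation): the legacy approach only hashes the cluster ID, while a round-robin approach needs the full, ordered cluster list, which is expensive if it is recomputed on every application refresh instead of being cached.

```go
package sharding

import (
	"hash/fnv"
	"sort"
)

// legacyShard hashes the cluster ID and takes the result modulo the number
// of controller replicas; it needs no knowledge of the other clusters.
func legacyShard(clusterID string, replicas int) int {
	if replicas <= 0 {
		return 0
	}
	h := fnv.New32a()
	_, _ = h.Write([]byte(clusterID))
	return int(h.Sum32() % uint32(replicas))
}

// roundRobinShard assigns a shard from the cluster's position in the sorted
// list of all cluster IDs, so every call has to copy and sort the whole list
// unless the result is cached.
func roundRobinShard(clusterID string, allClusterIDs []string, replicas int) int {
	if replicas <= 0 {
		return 0
	}
	ids := append([]string(nil), allClusterIDs...)
	sort.Strings(ids)
	for i, id := range ids {
		if id == clusterID {
			return i % replicas
		}
	}
	return -1 // cluster not found
}
```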

@akram
Contributor

akram commented Aug 3, 2023

Hi @agaudreault-jive, thanks for sharing these findings. As I was working on something else, this only popped up on my radar today.
I will have a look at your PR and test it as well.
Regarding possible impacts of cached sharding results, I know that a colleague is working on a different implementation, and I will check that as well.

@agaudreault
Member Author

@akram Awesome! If you have questions, ping me on the CNCF Slack and I'll answer a bit faster; my handle is @agaudreault!

I hope my draft isn't too far-off!

@agaudreault agaudreault self-assigned this Aug 11, 2023
@agaudreault
Member Author

I started working on my draft PR. I am currently testing with multiple clusters that all have the same server URL and I am hitting issue #15027. I will check if it is possible to change my implementation to use a name/server key for the cache.
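A hypothetical sketch of what keying the cache by both cluster name and server URL could look like, so two cluster secrets pointing at the same server do not collide (illustrative names only, not the draft PR's code):

```go
package sharding

import "fmt"

// clusterKey identifies a cluster by both its name and its server URL, so
// clusters that share a server URL still get distinct cache entries.
type clusterKey struct {
	Name   string
	Server string
}

func example() {
	shards := map[clusterKey]int{}
	shards[clusterKey{Name: "cluster-a", Server: "https://kubernetes.default.svc"}] = 0
	shards[clusterKey{Name: "cluster-b", Server: "https://kubernetes.default.svc"}] = 1
	fmt.Println(shards) // two distinct entries despite the identical server URL
}
```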

@agaudreault agaudreault moved this from Backlog to Completed in Argo CD Roadmap Jul 4, 2024