Argocd-application-controller not sharding per cluster #9633
ArgoCD will return this error if it can't handle the cluster. Please inspect the controller logs for additional information. (See argo-cd/controller/sharding/sharding.go, lines 14 to 26 at d48c808.)
Hi @leoluz I don't see that anywhere in the logs. This is from a fresh start of the argocd-application-controller:
This means that replica 0 isn't managing any cluster. If you don't have huge clusters, maybe you can start the controller with 2 replicas and see how your clusters are grouped into the 2 shards? If you absolutely want to have 8 replicas, with each one managing one specific cluster, you can configure your shards manually. In order to do so you need to configure each controller replica with the env var ARGOCD_CONTROLLER_SHARD and have a matching cluster secret defined with the same shard value (see argo-cd/pkg/apis/application/v1alpha1/types.go, line 1260 at 3008b52).
@leoluz thanks, I was hoping for 4 controllers, with one explicitly running our "prod" cluster and the rest split between the remaining 7 clusters.
@leoluz so to this point: "In order to do so you need to configure each controller replica with the env var ARGOCD_CONTROLLER_SHARD and have a matching cluster secret defined with the same shard value". We would need 8 different deployments because each pod would need its own ARGOCD_CONTROLLER_SHARD value, correct?
I'm sorry, but I'm not sure what exactly you mean by "deployments". The ArgoCD controller is a StatefulSet. The other (simpler) alternative is the automatic sharding that distributes clusters across shards, like you were doing initially. In this case you just have to define the ARGOCD_CONTROLLER_REPLICAS env var to match the number of replicas.
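To make the two options concrete, here is a minimal, hypothetical sketch (not the actual ArgoCD source) of how a replica could decide whether it owns a cluster: the cluster's shard is either the value set manually on the cluster secret or, failing that, an FNV hash of the cluster UID modulo the replica count, and the replica manages the cluster only if that matches its own shard from ARGOCD_CONTROLLER_SHARD. The helper names and defaults below are invented for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"os"
	"strconv"
)

// clusterShard returns the shard that should manage a cluster: the shard value
// set manually on the cluster secret if present, otherwise the FNV-32a hash of
// the cluster UID modulo the replica count (the automatic placement).
func clusterShard(secretShard *int, clusterUID string, replicas int) int {
	if secretShard != nil {
		return *secretShard
	}
	h := fnv.New32a()
	_, _ = h.Write([]byte(clusterUID))
	return int(h.Sum32() % uint32(replicas))
}

func main() {
	// This replica's shard, read from the ARGOCD_CONTROLLER_SHARD env var
	// discussed above (defaulting to 0 here just for the example).
	myShard := 0
	if s := os.Getenv("ARGOCD_CONTROLLER_SHARD"); s != "" {
		if n, err := strconv.Atoi(s); err == nil {
			myShard = n
		}
	}

	const replicas = 8
	uid := "c6e30ba7-8f0c-443e-add4-9e70f031cbaf" // example cluster UID
	if clusterShard(nil, uid, replicas) == myShard {
		fmt.Println("this replica manages the cluster")
	} else {
		fmt.Println("this replica ignores the cluster")
	}
}
```

With 8 replicas and only 8 clusters, nothing guarantees each shard gets exactly one cluster, so some shards can end up empty, which matches the earlier observation that replica 0 wasn't managing anything.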
I apologize, I meant 8 different statefulsets in order to set different ARGOCD_CONTROLLER_SHARD env var values. Would this accomplish the task?
I haven't tried it and can't tell for sure if this is going to work. It could also break other features that might expect every controller pod name to match, with just the last ordinal number incremented (standard StatefulSet behaviour). I personally wouldn't recommend going in this direction.
Ok, thank you.
No problem.
I think the shard-inferring algorithm is not able to balance clusters among shards equally. I've taken the algorithm into a Playground script with 10 random UUIDs as follows:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

func main() {
	const replicas = 5
	uids := []string{
		"c6e30ba7-8f0c-443e-add4-9e70f031cbaf",
		"e4bcf18c-4d97-486d-be48-d49d8c0fb59c",
		"c41b2959-d465-4d68-8645-00140d2ab177",
		"31441d1c-6181-40ed-82de-52acf530be3b",
		"bc98e29f-ced1-487c-9625-a753789e03ab",
		"0aecff21-eb62-4b17-9557-92e7d430e604",
		"8c15c8ab-2116-4a2d-93f5-df8d8d6af08b",
		"0f193adb-352f-401b-bd4c-b31886eb2329",
		"f72dd0c6-610e-4951-a652-e78af6e499cc",
		"90017d9f-a570-4437-9dc2-a6189a3828cc",
	}
	for _, id := range uids {
		h := fnv.New32a()
		_, _ = h.Write([]byte(id))
		fmt.Printf("%s - Sum: %d - Shard: %d\n", id, int(h.Sum32()), int(h.Sum32()%uint32(replicas)))
	}
}
```

I think using the remainder of a modulo operation on the FNV sum of the UUID is the wrong approach to the problem. The output of the above yields:
We should fix it. The workaround is to assign clusters to shards manually, but that can become cumbersome.
@jannfis I agree that it can be improved. |
@leoluz Is there really a use-case that benefits from distributing clusters in such an uneven way as it does now? I always thought the reason behind automatically assigning clusters to shards was to distribute them evenly, so that you could scale application controller replicas on an informed basis.
Btw, manually assigning a cluster to a shard is possible (however, not documented) by patching the cluster secret with the desired shard value.
There is no need to patch the statefulset.
@jannfis Interesting. Now I understand what the InferShard function is doing. Thanks!
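For readers following along, the shard inference discussed here boils down to using the StatefulSet pod ordinal as the shard number. A rough sketch of that idea (an illustration of the concept, not the exact code in controller/sharding/sharding.go):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// inferShard sketches the idea behind InferShard: a StatefulSet pod is named
// <name>-<ordinal> (e.g. argocd-application-controller-3), so the trailing
// number of the hostname can serve as the replica's shard number.
func inferShard() (int, error) {
	hostname, err := os.Hostname()
	if err != nil {
		return 0, err
	}
	parts := strings.Split(hostname, "-")
	shard, err := strconv.Atoi(parts[len(parts)-1])
	if err != nil {
		return 0, fmt.Errorf("hostname %q does not end in a StatefulSet ordinal", hostname)
	}
	return shard, nil
}

func main() {
	shard, err := inferShard()
	if err != nil {
		fmt.Println("could not infer shard:", err)
		return
	}
	fmt.Println("inferred shard:", shard)
}
```

Deriving the shard from the pod name's trailing ordinal is also consistent with the earlier concern about splitting the controller into several differently named StatefulSets.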
I'm afraid I don't know the answer. Maybe @alexmt has a better understanding of the implementation history? Regarding my previous suggestion to:
I think we have 2 main use-cases:
Thanks for all the good info! We will keep following this issue. Currently, we're using the method @jannfis mentioned to manually configure sharding. May I know if there is a nicer way to balance the app controllers?
Taking the modulo of the hash result doesn't produce an even distribution across shards; it effectively balances placement over the hash's entire result space rather than over the actual set of clusters. To fix the algorithm, you just need to sort the clusters by their hash result in an array and then take each cluster's index modulo the number of replicas.
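A quick sketch of that suggestion (reusing some of the UUIDs from the Playground script above; the helper name shardByRank is invented): sorting by hash and then taking the index modulo the replica count balances over the actual set of clusters, so no shard gets more than ceil(N/R) of them.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// shardByRank assigns each cluster to a shard by first sorting the cluster
// UIDs by their FNV-32a hash and then taking each cluster's index in that
// ordering modulo the replica count, yielding an even distribution.
func shardByRank(uids []string, replicas int) map[string]int {
	hash := func(s string) uint32 {
		h := fnv.New32a()
		_, _ = h.Write([]byte(s))
		return h.Sum32()
	}
	sorted := append([]string(nil), uids...)
	sort.Slice(sorted, func(i, j int) bool { return hash(sorted[i]) < hash(sorted[j]) })

	assignment := make(map[string]int, len(sorted))
	for i, uid := range sorted {
		assignment[uid] = i % replicas
	}
	return assignment
}

func main() {
	uids := []string{
		"c6e30ba7-8f0c-443e-add4-9e70f031cbaf",
		"e4bcf18c-4d97-486d-be48-d49d8c0fb59c",
		"c41b2959-d465-4d68-8645-00140d2ab177",
		"31441d1c-6181-40ed-82de-52acf530be3b",
		"bc98e29f-ced1-487c-9625-a753789e03ab",
	}
	for uid, shard := range shardByRank(uids, 3) {
		fmt.Printf("%s -> shard %d\n", uid, shard)
	}
}
```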
@pharaujo indeed, this gives better results. We also need to think about what happens when a "cluster" is removed: the shard placement is recalculated, and many permutations could occur, temporarily increasing response times until the placement stabilizes and the CRD cache is rebuilt properly on every shard.
which gives:
Having a deeper look, it seems not that easy to implement: the cluster UID list cannot easily be sorted in the real-life scenario with the current code base, because clusterList is pulled in batches of 10. https://github.com/argoproj/argo-cd/blob/master/cmd/argocd/commands/admin/cluster.go#L107-L121 Another possible simple implementation, which could still be stateless, is to use a random distribution.
Even batches of 10 would still be an improvement. E.g. we have three clusters with
Being able to scale these 500 resources of Cluster 3 would be an immense improvement. We are currently using the non-documented manual shard assignment described above.
@akram That piece of code you are referring to is just a CLI command; it's not used within the application controller. @alexmt may know why he chose that batch size. I believe it may be for performance reasons within the CLI command. So I would not consider that to be an obstacle.
The docs at https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/#argocd-application-controller say:
If the controller is managing too many clusters and uses too much memory then you can shard clusters across multiple controller replicas. To enable sharding increase the number of replicas in argocd-application-controller StatefulSet and repeat number of replicas in ARGOCD_CONTROLLER_REPLICAS environment variable. The strategic merge patch below demonstrates changes required to configure two controller replicas.
Since we have 8 clusters, we have ours set to 8. However, when looking at the logs for several of the pods, they show all 8 clusters being ignored:
time="2022-06-10T19:21:10Z" level=info msg="Ignoring cluster redacted"
time="2022-06-10T19:21:10Z" level=info msg="Ignoring cluster redacted"
time="2022-06-10T19:21:10Z" level=info msg="Ignoring cluster redacted"
time="2022-06-10T19:21:10Z" level=info msg="Ignoring cluster redacted"
time="2022-06-10T19:21:10Z" level=info msg="Ignoring cluster redacted"
time="2022-06-10T19:21:10Z" level=info msg="Ignoring cluster redacted"
time="2022-06-10T19:21:10Z" level=info msg="Ignoring cluster redacted"
time="2022-06-10T19:21:10Z" level=info msg="Ignoring cluster redacted"
We're running the latest ArgoCD version, 2.4.0; however, this was also present in ArgoCD 2.3.3 and 2.3.4.
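For context on the log lines above: "Ignoring cluster" is what a replica reports for clusters whose computed shard doesn't match its own, so if every cluster happens to map to other shards, that replica logs all of them as ignored. A hypothetical sketch of that filtering (names and structure invented, not the upstream code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Sketch of the per-cluster filter behind the "Ignoring cluster ..." log lines
// quoted above: a replica only processes clusters whose computed shard equals
// its own shard number.
type cluster struct {
	Name string
	UID  string
}

func shardOf(uid string, replicas int) int {
	h := fnv.New32a()
	_, _ = h.Write([]byte(uid))
	return int(h.Sum32() % uint32(replicas))
}

func filterClusters(clusters []cluster, myShard, replicas int) []cluster {
	var managed []cluster
	for _, c := range clusters {
		if shardOf(c.UID, replicas) != myShard {
			fmt.Printf("Ignoring cluster %s\n", c.Name) // corresponds to the log lines above
			continue
		}
		managed = append(managed, c)
	}
	return managed
}

func main() {
	clusters := []cluster{
		{Name: "redacted-1", UID: "c6e30ba7-8f0c-443e-add4-9e70f031cbaf"},
		{Name: "redacted-2", UID: "e4bcf18c-4d97-486d-be48-d49d8c0fb59c"},
	}
	fmt.Println("managed:", filterClusters(clusters, 0, 8))
}
```

If a shard ends up with no clusters at all, the corresponding replica sits idle and ignores everything, which is consistent with the uneven hash-based distribution discussed throughout this thread.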