docs: proposal/dynamic rebalancing of clusters across shards #13221
Conversation
Codecov Report: Patch coverage has no change and project coverage change: -0.01%.
Additional details and impacted files:
@@ Coverage Diff @@
## master #13221 +/- ##
==========================================
- Coverage 49.57% 49.56% -0.01%
==========================================
Files 256 256
Lines 43920 43920
==========================================
- Hits 21773 21770 -3
- Misses 19985 19987 +2
- Partials 2162 2163 +1
☔ View full report in Codecov by Sentry.
This looks good! One question: I had a proposal in #12123 to be able to scale multiple independent deployments of the application controller, so that, say, different sets of AppProjects could be handled by separate controller deployments. Does this proposal assume that there's only one application controller deployment?
@blakepettersson Your idea seems to be more around having multiple deployments of application controllers and separating those out based on AppProjects. This proposal assumes there is only one deployment of the application controller. The goal of this proposal is to be able to dynamically redistribute clusters across shards upon addition/removal of shards. I think this proposal should be able to handle all shard groups.
This proposal specifically excludes the problem most often faced by Argo users, which is that it's impossible to horizontally scale to deal with many applications in a single cluster. Would this proposal at least consider controller "load" when rebalancing clusters in the case of multiple clusters? As I read it now, it seems to me that "load" would still be defined as a constant 1 per cluster, so we would still end up with some controllers doing nothing while others are overloaded. Lastly, is this meant as an intermediate improvement so the other asks are easier to design and implement, or is this the only part of the problem Argo is currently willing to solve?
Yes indeed, I was using that as a specific example of how it could be useful to support the use case of having multiple independent deployments of the application controller.
What I meant by shard group (or rather assumed) is that a group of shards is handled by a single deployment of the application controller.
@rouke-broersma Yes, the current proposal is not focusing on load; it is trying to balance clusters across shards for dynamic scaling. We are not including load on the shards as of now.
Yes, this proposal is an intermediate improvement to enable moving towards including load for scaling.
Yes, that is correct. We assume there to be a single deployment of the application controller.
Each shard replica knows about the total number of available shards by evaluating the environment variable ARGOCD_CONTROLLER_REPLICAS, which needs to be kept up-to-date with the actual number of available replicas (shards). If the number of replicas does not equal the number set in ARGOCD_CONTROLLER_REPLICAS, sharding will not work as intended, leading to both unused and overused replicas. As this environment variable is set on the StatefulSet and propagated to the pods, all the pods in the StatefulSet need to be restarted in order to pick up the new number of total shards.
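For reference, a minimal sketch of the kind of deterministic assignment this describes (function names are illustrative, not the actual Argo CD code): the replica count comes from the environment variable, and each cluster is hashed onto a shard so every replica computes the same mapping independently.

```go
package sharding

import (
	"hash/fnv"
	"os"
	"strconv"
)

// replicaCount reads ARGOCD_CONTROLLER_REPLICAS. If this value drifts from
// the real number of pods, some shards are never processed while others
// receive no clusters at all.
func replicaCount() int {
	n, err := strconv.Atoi(os.Getenv("ARGOCD_CONTROLLER_REPLICAS"))
	if err != nil || n < 1 {
		return 1
	}
	return n
}

// shardForCluster maps a cluster to a shard index deterministically, so every
// replica computes the same assignment without coordinating.
func shardForCluster(clusterID string, replicas int) int {
	h := fnv.New32a()
	_, _ = h.Write([]byte(clusterID))
	return int(h.Sum32() % uint32(replicas))
}
```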
The current sharding mechanism relies on predictable pod names for the application controller to determine which shard a given replica should impersonate, e.g. the first replica of the StatefulSet (argocd-application-controller-0) will be the first shard, the second replica (argocd-application-controller-1) will be the second and so forth.
Correct. This is the current approach, used to guarantee that certain clusters will always be associated with certain replicas by applying deterministic logic during cluster assignment.
How will this work?
* The application controller will query Redis for the status of all the application controllers and the last updated heartbeat timestamp.
* It will check if any application controller is flagged as Unhealthy or has not updated its status in Redis during the heartbeat process for a certain period of time.
* If the status for an application controller was already flagged as Unhealthy, it will pick one of the clusters from its list and assign the cluster to itself.
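For illustration, the steps above could look roughly like the following sketch; `ShardStatus` and its fields are hypothetical placeholders for whatever Redis (or, per the later discussion, a ConfigMap) ends up storing.

```go
package sharding

import "time"

// ShardStatus is a hypothetical per-controller record held by whichever
// backend stores the heartbeat data (Redis in this version of the proposal).
type ShardStatus struct {
	Controller string
	Healthy    bool
	Heartbeat  time.Time
	Clusters   []string
}

// takeOverOneCluster walks the steps above: scan every controller's status,
// flag stale heartbeats as Unhealthy, and claim one cluster from a controller
// that is already flagged. It returns the claimed cluster name, if any.
func takeOverOneCluster(self string, statuses []ShardStatus, staleAfter time.Duration) string {
	for i := range statuses {
		s := &statuses[i]
		if s.Controller == self {
			continue
		}
		switch {
		case !s.Healthy && len(s.Clusters) > 0:
			// Already flagged Unhealthy: move one of its clusters to ourselves.
			claimed := s.Clusters[0]
			s.Clusters = s.Clusters[1:]
			return claimed
		case time.Since(s.Heartbeat) > staleAfter:
			// Heartbeat is stale: flag it now; its clusters are claimed on a later pass.
			s.Healthy = false
		}
	}
	return ""
}
```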
I am afraid that this approach will lead to race condition issues. Maybe this should be done by all application controllers, but with just one of them dealing with analyzing heartbeats and redistributing clusters.
Wouldn't you then have the issue that if that controller goes down you no longer have a mechanism to perform the healthcheck, so the health check controller never gets marked as unhealthy?
I agree with @rouke-broersma here. Relying on a single controller has a bunch of issues as well, for example if the controller performing the redistribution goes down or starts to distribute in a faulty manner.
Here I agree with @leoluz about race conditions. Controller health might be flapping, and it is hard to predict what happens sooner: the controller instance recovers or the shards are redistributed among healthy controllers.
I'm proposing to just not detect unhealthy controllers and let Kubernetes/operators deal with it:
- healthy controllers just "claim" shards and update the ConfigMap
- each controller should have a readiness probe that returns healthy only if the controller successfully claimed a shard and updated its heartbeat within the allowed timeframe.
So if one controller instance is unhealthy, one shard won't be processed, but the operator will know about it because the readiness probe will be failing.
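A minimal sketch of that readiness idea, assuming a hypothetical `lastClaim` timestamp that the claim loop updates after every successful ConfigMap write:

```go
package sharding

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// lastClaim holds the UnixNano timestamp of the last successful shard claim /
// heartbeat update; the claim loop would call lastClaim.Store(time.Now().UnixNano()).
var lastClaim atomic.Int64

// allowedStaleness is an assumed window, e.g. a few missed heartbeats.
const allowedStaleness = 90 * time.Second

// readyz fails once the controller has not claimed its shard and refreshed
// its heartbeat within the allowed timeframe, so Kubernetes (not the other
// controllers) surfaces the problem.
func readyz(w http.ResponseWriter, _ *http.Request) {
	last := lastClaim.Load()
	if last == 0 || time.Since(time.Unix(0, last)) > allowedStaleness {
		http.Error(w, fmt.Sprintf("shard not claimed since %s", time.Unix(0, last)), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```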
Thanks @alexmt, the suggestion of not detecting unhealthy shards does make sense. In this case, we would need to re-introduce the liveness probe or update the readiness probe with a condition that checks whether the controller was able to claim a shard.
* If the status for an application controller was already flagged as Unhealthy, it will pick one of the clusters from its list and assign the cluster to itself.
* If the status is not flagged and an application controller has not updated its last active timestamp for a long time, then we mark the application controller as Unhealthy.
This will continue to happen until the list of clusters is empty. As soon as the list is empty, the entry for the shard/application-controller is removed from Redis.
What if in the middle of a cluster redistribution the failed application controller heals and starts updating the heartbeat again?
In that scenario, the redistribution will kick in again, as the number of healthy controllers will have increased.
### Cons
* Possibility of race conditions while flagging the shard as Unhealthy during the heartbeat process, although this can be handled using [distributed locks](https://redis.io/docs/manual/patterns/distributed-locks/) in Redis.
I just noticed that you considered race conditions as well. :)
I would be opposed to using distributed locks as it increases complexity quite a bit and we already have stability issues with our Redis HA setup.
There are many mixed feelings about distributed locks being an anti-pattern (and I agree that they are).
Example:
https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
* Possibility of race conditions while flagging the shard as Unhealthy during the heartbeat process, although this can be handled using [distributed locks](https://redis.io/docs/manual/patterns/distributed-locks/) in Redis.
* In scenarios where Redis becomes unavailable, the heartbeat mechanism will pause until Redis comes back online. This will also pause the dynamic redistribution of clusters, which will be triggered again once Redis is back online.
Maybe we could rely less on Redis if we had leader election in place. In that case we could instead maintain an in-memory cache of the sharding state. This would be synced to Redis mainly so it can be picked up by the new leader in case the current leader crashes.
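For reference, client-go already ships a Lease-based leader election helper, so a sketch of "only the leader rebalances" could look like the following (lock name, namespace, and timings are assumptions):

```go
package sharding

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runRebalancerWithLeaderElection makes sure only one controller replica runs
// the rebalance loop; the others stand by and take over if the leader dies.
func runRebalancerWithLeaderElection(ctx context.Context, client kubernetes.Interface, rebalance func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "argocd-application-controller-sharding", Namespace: "argocd"}, // placeholders
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("HOSTNAME")},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: rebalance, // only the leader analyzes heartbeats and redistributes clusters
			OnStoppedLeading: func() {}, // stop rebalancing; another replica takes over
		},
	})
}
```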
How about we use ConfigMaps for this instead of Redis? Kubernetes would prevent concurrent updates to the ConfigMap, so we can rely on that for handling race conditions while also removing the issue of relying too much on Redis.
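A sketch of how that could work, relying on the ConfigMap's resourceVersion for optimistic locking and client-go's conflict retry helper (ConfigMap name and data layout are hypothetical):

```go
package sharding

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// claimCluster records "this cluster is handled by this controller" in a
// shared ConfigMap. A concurrent writer gets a 409 Conflict (optimistic
// locking via resourceVersion), and RetryOnConflict re-reads and retries.
func claimCluster(ctx context.Context, kc kubernetes.Interface, controller, cluster string) error {
	cms := kc.CoreV1().ConfigMaps("argocd") // namespace assumed
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		cm, err := cms.Get(ctx, "argocd-app-controller-shard-cm", metav1.GetOptions{}) // name hypothetical
		if err != nil {
			return err
		}
		if cm.Data == nil {
			cm.Data = map[string]string{}
		}
		if owner, taken := cm.Data[cluster]; taken && owner != controller {
			return nil // another healthy controller already claimed it
		}
		cm.Data[cluster] = controller
		_, err = cms.Update(ctx, cm, metav1.UpdateOptions{})
		return err
	})
}
```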
### Pros of Leader Election
* We can refrain from performing multiple calls to Redis for the load and status of the shards, and instead store them in a local cache within the leader while updating the data in Redis periodically (e.g. every 10 minutes).
Duplicated sentence with line 134
In order to accurately know which shards are being managed by which application controller, especially in scenarios of load redistribution, addition/removal of `application controller` replicas, etc., we need a mechanism to assign clusters to the shards.
The service account used by the application controller has read access to all the resources in the cluster. Thus, instead of setting the environment variable ARGOCD_CONTROLLER_REPLICAS representing the number of replicas, the number of replicas can be read directly from the number of healthy replicas of the application controller deployment.
The application controller service account does not necessarily have access to all resources in the cluster.
Some users install the controller with only the argocd-application-controller-role and use it to manage remote clusters only.
So I suggest changing argocd-application-controller-role to allow the controller to inspect its own deployment and find out the number of replicas.
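A sketch of that approach, assuming the Role is extended to allow `get` on the controller's own Deployment (names below are assumptions):

```go
package sharding

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// controllerReplicas reads the replica count from the controller's own
// Deployment, so no pod restart is needed when it is scaled. This requires
// argocd-application-controller-role to allow get on deployments (apps),
// ideally restricted to this one resourceName.
func controllerReplicas(ctx context.Context, kc kubernetes.Interface) (int32, error) {
	dep, err := kc.AppsV1().Deployments("argocd").
		Get(ctx, "argocd-application-controller", metav1.GetOptions{}) // names assumed
	if err != nil {
		return 0, err
	}
	if dep.Spec.Replicas == nil {
		return 1, nil // replicas defaults to 1 when unset
	}
	return *dep.Spec.Replicas, nil
}
```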
That's a good point. I will add this point to the document.
### Cons
* Possibility of race conditions while flagging the shard as Unhealthy during the heartbeat process, although this can be handled using [distributed locks](https://redis.io/docs/manual/patterns/distributed-locks/) in Redis.
I'm also concerned about using Redis. Some users prefer cloud-hosted Redis (e.g. Amazon ElastiCache), so in that case the controller would need to rely on an external resource to work. HA Redis also has some known issues: under heavy load, the master might be unable to perform replication quickly enough, and the only known solution is to kill all Redis pods, so we would lose the information about shards.
So instead of Redis I would suggest just using a ConfigMap. It is more stable, and we can leverage Kubernetes conflict errors to deal with race conditions.
Isn't this what https://kubernetes.io/docs/concepts/architecture/leases/#custom-workload are for? Or is that too limited for the information you would need to store?
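For illustration, a per-replica heartbeat Lease along the lines of that custom-workload pattern might look like the sketch below; note it only carries identity and timing, so the cluster-to-shard mapping would still need to live elsewhere (names are made up):

```go
package sharding

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// renewHeartbeatLease creates or renews a Lease named after this replica.
// Other replicas (or the operator) can treat a stale spec.renewTime as
// "unhealthy".
func renewHeartbeatLease(ctx context.Context, kc kubernetes.Interface, replica string) error {
	leases := kc.CoordinationV1().Leases("argocd") // namespace assumed
	now := metav1.NewMicroTime(time.Now())
	lease, err := leases.Get(ctx, replica, metav1.GetOptions{})
	switch {
	case apierrors.IsNotFound(err):
		holder, seconds := replica, int32(30)
		_, err = leases.Create(ctx, &coordinationv1.Lease{
			ObjectMeta: metav1.ObjectMeta{Name: replica},
			Spec: coordinationv1.LeaseSpec{
				HolderIdentity:       &holder,
				LeaseDurationSeconds: &seconds,
				RenewTime:            &now,
			},
		}, metav1.CreateOptions{})
		return err
	case err != nil:
		return err
	}
	lease.Spec.RenewTime = &now
	_, err = leases.Update(ctx, lease, metav1.UpdateOptions{})
	return err
}
```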
I like the idea of using ConfigMaps to reduce the dependency on Redis. I will incorporate ConfigMaps in the proposal.
Added one comment about preserving backward compatibility, but we can discuss more in the PR that implements the feature.
LGTM!
### Security Considerations
* Converting the StatefulSet to a Deployment would be a breaking change. Any automation done by customers that is based on the assumption that the controller is modelled as a StatefulSet would break with this change.
I think we could easily preserve backward compatibility with the StatefulSet: the controller could check if HOSTNAME matches the regex argocd-application-controller-\d;
if so, it can just infer the shard number from the hostname, otherwise use the new logic.
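A sketch of that backward-compatibility check (the regex follows the suggestion above; `dynamicShard` stands in for the new assignment logic):

```go
package sharding

import (
	"os"
	"regexp"
	"strconv"
)

var statefulSetHostname = regexp.MustCompile(`^argocd-application-controller-(\d+)$`)

// inferShard keeps the old StatefulSet behaviour when the hostname still ends
// in an ordinal, and falls back to the new dynamic assignment otherwise.
func inferShard(dynamicShard func() int) int {
	if m := statefulSetHostname.FindStringSubmatch(os.Getenv("HOSTNAME")); m != nil {
		ordinal, _ := strconv.Atoi(m[1])
		return ordinal // legacy: shard number == pod ordinal
	}
	return dynamicShard() // new logic, e.g. ConfigMap-based assignment
}
```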
@alexmt thanks for the approval. We could surely discuss the point of backwards compatibility when I have a draft PR out for this proposal's implementation. Wanted to check if we could get the proposal merged as well?
LGTM
docs/proposals/rebalancing-clusters-across-shards-dynamically.md
LGTM
Tks Ishita!
…j#13221)
* Add proposal document for dynamic rebalancing of clusters across shards
* Use ConfigMap instead of Redis based on the feedback on the proposal
* address comments and add comments about not redistributing clusters for unhealthy controllers
* Address comments

Signed-off-by: ishitasequeira <ishiseq29@gmail.com>
Proposal for dynamic rebalancing of clusters across shards.
Narrowing down the scope of #8340.