Chart & Operator support for replica configuration #47

Open · 2 tasks

mrajashree opened this issue Sep 30, 2020 · 2 comments

mrajashree (Contributor) commented Sep 30, 2020

Context: Today the BRO chart and operator aren't designed or intended to run with multiple replicas. This is a known aspect of the BRO design; for now it is intended to run as a "singleton" pod.

#41 (comment)

To consider:

  • Manage a leader-election lease when more than one operator pod (controller) is running (see the leader-election sketch later in this thread)
  • Adjust the chart to allow the replica count to be configured (see the values sketch below)
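
For the chart half, a minimal sketch of what a configurable replica count could look like, assuming a conventional Helm layout; the `replicaCount` key and file names are hypothetical and not taken from the actual chart:

```yaml
# values.yaml (hypothetical key)
replicaCount: 1

# templates/deployment.yaml (excerpt)
spec:
  replicas: {{ .Values.replicaCount | default 1 }}
```

Bumping the replica count alone would be unsafe until the operator handles leader election, since multiple active controllers would race on the same backup/restore resources.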
MKlimuszka commented

This is a ticket to track tech debt that was pointed out in a review of an old PR that had been merged.

@zube zube bot added this to the v2.x - Backlog milestone Nov 6, 2023
@MKlimuszka MKlimuszka removed this from the v2.x - Backlog milestone Aug 21, 2024
@mallardduck mallardduck changed the title Manage lease when more than one operator pods(controllers) are running [RFE] Chart & Operator support for replica configuration Nov 1, 2024
@mallardduck mallardduck changed the title [RFE] Chart & Operator support for replica configuration Chart & Operator support for replica configuration Dec 5, 2024
alexandreLamarre (Contributor) commented Dec 5, 2024

It seems the consideration here is to have a failover/redundancy mechanism for BRO.

If we aim to accomplish redundancy, then adding leader election to the BRO operator is a three-line code change (sketched below), plus a trivial CRD change to deploy the desired number of replicas. However, I don't see this as useful.
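
For concreteness, here is a minimal sketch of the kind of change being described, using controller-runtime's built-in leader election (mentioned later in this comment); the Lease name and namespace are hypothetical, and BRO's actual entrypoint and controller framework may differ:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Enabling leader election is essentially these three option fields;
	// non-leader replicas block in Start() until they acquire the Lease,
	// acting as warm standbys for failover.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "backup-restore-operator-lock", // hypothetical Lease name
		LeaderElectionNamespace: "cattle-resources-system",      // hypothetical; defaults to the pod's namespace in-cluster
	})
	if err != nil {
		panic(err)
	}

	// ... register backup/restore controllers with mgr here ...

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```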

Although this would be a nice improvement, we haven't seen any of these issues in production. The scenarios where redundancy is useful as a failure mechanism are when the pod fails critically due to a sporadic software bug (in which case the bug should be fixed, and redundancy only speeds up recovery) or when resource saturation is reached.

When resource saturation is reached, either the cluster itself is having problems, which is outside the scope of what BRO can fix, or the resource limits are improperly configured, which is a user configuration issue. One genuinely useful application of redundancy is surviving node failures, e.g. by deploying BRO as a DaemonSet; but even a singleton will still be rescheduled onto healthy nodes, so the only gain there is speeding up failure recovery.

If the idea here is instead to scale up the BRO operator (i.e. optimize throughput), then redundancy isn't useful unless we shard the backup and restore workloads, and even that isn't especially useful since the workloads are IO (and thus CPU) bound. The only feasible speedup I see from sharding the operator's workloads is sharding across node resources, but that requires more sophisticated techniques than Kubernetes leases, controller-runtime leader election, or any kind of network load balancing.

I believe sharding will be the last resort for optimizing these types of workloads anyway, so even in this case I don't see it being useful.
