No HA support for some components in ManageIQ #583

Open · 5 tasks
gyliu513 opened this issue Jul 15, 2020 · 5 comments

Comments

gyliu513 (Contributor) commented Jul 15, 2020

Currently the ManageIQ operator hardcodes the replica count of some deployments to 1 (httpd, memcached, etc.). The problem is that this prevents an HA deployment of ManageIQ.

Can we let the customer set the replica count, or add a new field to the CR (e.g. enableHA) so the customer can decide whether they want HA or not? (A rough sketch of what such a field could look like follows the component list below.)

FYI @carbonin @Fryguy @chessbyte

Components:

  • httpd
  • memcached
  • orchestrator
  • postgresql
  • kafka/zookeeper
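
To make the suggestion concrete, here is a minimal sketch of what such a CR field could look like in the operator's Go API types. The field names (EnableHA, HttpdReplicas) are hypothetical and do not exist in the current ManageIQ CRD; they only illustrate the two options mentioned above (an explicit replica count vs. a single enableHA switch).

```go
// Hypothetical sketch only: these fields are not part of the ManageIQ CRD today.
package v1alpha1

type ManageIQSpec struct {
	// EnableHA would tell the operator to run multiple replicas of the
	// stateless components (httpd, memcached, ...) instead of one pod each.
	EnableHA bool `json:"enableHA,omitempty"`

	// HttpdReplicas would let the user pick the replica count directly;
	// nil keeps the current default of 1.
	HttpdReplicas *int32 `json:"httpdReplicas,omitempty"`
}
```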
miq-bot added the bug label Jul 15, 2020
carbonin added the enhancement label and removed the bug label Jul 15, 2020
carbonin (Member) commented:

I think we should be able to safely scale httpd; that just needs to be tested. Maybe we just remove the replica bit from reconcile to let the user set their own value (or maybe we just ensure it's >= 1).
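
A minimal sketch of that second option, assuming the reconcile loop hands us the Deployment it is about to update; the function name and placement are illustrative, not the operator's actual code.

```go
// Illustrative only: "leave the user's replica count alone, but make sure it is
// at least 1", using standard apps/v1 types.
package controllers

import appsv1 "k8s.io/api/apps/v1"

func ensureHttpdReplicas(deployment *appsv1.Deployment) {
	one := int32(1)
	// If the user never set a value, fall back to the current default of 1.
	if deployment.Spec.Replicas == nil {
		deployment.Spec.Replicas = &one
		return
	}
	// Otherwise respect the user's value, but never allow fewer than one replica.
	if *deployment.Spec.Replicas < 1 {
		deployment.Spec.Replicas = &one
	}
}
```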

I'm not sure about memcached. We store the user session in there so if it's getting load balanced it's possible the user will have to log in every time we hit a new memcached replica.

The orchestrator is also difficult because of the way the manageiq app works. Currently we have the one orchestrator pod watching for all of the "server" records. I think doing some kind of dynamic work distribution will be hard and having each replica map to a specific server record is harder so we would have to either move to a deployment per server (similar to the queue workers) or solve this in some other way.

Postgres is not mentioned here, but it's probably the most difficult if we want to roll our own container-based HA solution. I would sooner look for something that has already solved this problem, possibly https://github.com/CrunchyData/postgres-operator ?

That all said, I think we would need to nail down exactly what kind of HA we're looking to achieve with this. Something like active-active postgres is much harder than active-standby. Additionally the HA I'm talking about for PG is very different than just having two httpd pods.

Suffice it to say this is not a bug ... it's a rather large enhancement.

carbonin (Member) commented:

I think we would need to nail down exactly what kind of HA we're looking to achieve with this

First question is, do you consider "supports multiple replicas" HA? Or are we talking about full multi-cluster active-active or active-passive with data replication type HA?

gyliu513 (Contributor, Author) commented:

Thanks @carbonin for the detailed explanation here, really very helpful!!

First question is, do you consider "supports multiple replicas" HA? Or are we talking about full multi-cluster active-active or active-passive with data replication type HA?

Here it's just the single-cluster model, with "supports multiple replicas" HA.

carbonin (Member) commented:

Okay, so the goal here is to have very little interruption (less than the time to reschedule a pod) for something like a node failure.

Generally this would mean every service supporting multiple replicas, but in a case like a database with persistent storage multiple replicas are not going to be the solution.

This is probably a good enough description to go on for now. I'll add a checklist to the initial issue comment to cover the components that need work.

p-v-a commented Nov 18, 2020

I'm not sure about memcached. We store the user session in there so if it's getting load balanced it's possible the user will have to log in every time we hit a new memcached replica.

The way memcached works, it doesn't need a load balancer in front: load balancing is part of the client protocol, so to make it work memcached should be deployed behind a headless service.

This, however, spells trouble for ManageIQ, because it relies on the environment variables MEMCACHED_SERVICE_HOST and MEMCACHED_SERVICE_PORT to discover memcached. The catch is that when a service is headless, Kubernetes doesn't define those env variables, causing the whole deployment to collapse, as none of the pods the orchestrator deploys can start.
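
For illustration, here is a rough sketch of what a headless memcached Service could look like if built with client-go types; the names, labels, and namespace handling are assumptions, not the operator's actual code.

```go
// Sketch of a headless memcached Service. With ClusterIP set to "None", DNS
// returns the individual pod IPs so the memcached client can distribute keys
// itself, but Kubernetes will no longer inject MEMCACHED_SERVICE_HOST/PORT,
// so consumers would need DNS-based discovery instead
// (e.g. memcached.<namespace>.svc.cluster.local).
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func headlessMemcachedService(namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "memcached",
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			ClusterIP: "None", // headless: no virtual IP, no injected env vars
			Selector:  map[string]string{"name": "memcached"},
			Ports: []corev1.ServicePort{
				{Name: "memcached", Port: 11211},
			},
		},
	}
}
```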
