No HA support for some components in ManageIQ #583

Open · 5 tasks
gyliu513 opened this issue Jul 15, 2020 · 5 comments

Comments

gyliu513 (Contributor) commented Jul 15, 2020

Currently the ManageIQ operator hardcodes the replica count of some deployments to 1 (httpd, memcached, etc.). The problem is that this prevents an HA deployment of ManageIQ.

Can we let the customer set the replica count, or add a new field to the CR (e.g. enableHA) so the customer can decide whether they want HA or not? (A rough sketch of what such a field could look like follows the component list below.)

FYI @carbonin @Fryguy @chessbyte

Components:

  • httpd
  • memcached
  • orchestrator
  • postgresql
  • kafka/zookeeper
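
To make the suggestion concrete, here is a minimal sketch of what such a CR field could look like in the operator's Go API types. The field names (EnableHA, HttpdReplicas) are hypothetical and do not exist in the current ManageIQ CRD; they only illustrate the two options mentioned above (an explicit replica count vs. a single enableHA switch).

```go
// Hypothetical sketch only: these fields are not part of the ManageIQ CRD today.
package v1alpha1

type ManageIQSpec struct {
	// EnableHA would tell the operator to run multiple replicas of the
	// stateless components (httpd, memcached, ...) instead of one pod each.
	EnableHA bool `json:"enableHA,omitempty"`

	// HttpdReplicas would let the user pick the replica count directly;
	// nil keeps the current default of 1.
	HttpdReplicas *int32 `json:"httpdReplicas,omitempty"`
}
```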
miq-bot added the bug label Jul 15, 2020
carbonin added the enhancement label and removed the bug label Jul 15, 2020
carbonin (Member) commented:

I think we should be able to safely scale httpd; that just needs to be tested. Maybe we just remove the replica bit from reconcile to let the user set their own value (or maybe we just ensure it's >= 1).
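
A minimal sketch of that second option, assuming the reconcile loop hands us the Deployment it is about to update; the function name and placement are illustrative, not the operator's actual code.

```go
// Illustrative only: "leave the user's replica count alone, but make sure it is
// at least 1", using standard apps/v1 types.
package controllers

import appsv1 "k8s.io/api/apps/v1"

func ensureHttpdReplicas(deployment *appsv1.Deployment) {
	one := int32(1)
	// If the user never set a value, fall back to the current default of 1.
	if deployment.Spec.Replicas == nil {
		deployment.Spec.Replicas = &one
		return
	}
	// Otherwise respect the user's value, but never allow fewer than one replica.
	if *deployment.Spec.Replicas < 1 {
		deployment.Spec.Replicas = &one
	}
}
```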

I'm not sure about memcached. We store the user session in there so if it's getting load balanced it's possible the user will have to log in every time we hit a new memcached replica.

The orchestrator is also difficult because of the way the manageiq app works. Currently we have the one orchestrator pod watching for all of the "server" records. I think doing some kind of dynamic work distribution will be hard and having each replica map to a specific server record is harder so we would have to either move to a deployment per server (similar to the queue workers) or solve this in some other way.

Postgres is not mentioned here, but it's probably the most difficult if we want to roll our own container-based HA solution. I would sooner look for something that has already solved this problem, possibly https://github.com/CrunchyData/postgres-operator ?

That all said, I think we would need to nail down exactly what kind of HA we're looking to achieve with this. Something like active-active postgres is much harder than active-standby. Additionally the HA I'm talking about for PG is very different than just having two httpd pods.

Suffice it to say this is not a bug ... it's a rather large enhancement.

carbonin (Member) commented:

I think we would need to nail down exactly what kind of HA we're looking to achieve with this

First question is, do you consider "supports multiple replicas" HA? Or are we talking about full multi-cluster active-active or active-passive with data replication type HA?

gyliu513 (Contributor, Author) commented:

Thanks @carbonin for the detailed explanation here, really very helpful!!

First question is, do you consider "supports multiple replicas" HA? Or are we talking about full multi-cluster active-active or active-passive with data replication type HA?

Here it's just the single-cluster model, with "supports multiple replicas" HA.

carbonin (Member) commented:

Okay, so the goal here is to have very little interruption (less than the time to reschedule a pod) for something like a node failure.

Generally this would mean every service supporting multiple replicas, but in a case like a database with persistent storage multiple replicas are not going to be the solution.

This is probably a good enough description to go on for now. I'll add a checklist to the initial issue comment to cover the components that need work.

p-v-a commented Nov 18, 2020

I'm not sure about memcached. We store the user session in there so if it's getting load balanced it's possible the user will have to log in every time we hit a new memcached replica.

The way memcached works, it doesn't need a load balancer in front: load balancing is part of the client protocol, so to make it work memcached should be deployed behind a headless service.

This, however, spells trouble for ManageIQ, because it relies on the environment variables MEMCACHED_SERVICE_HOST and MEMCACHED_SERVICE_PORT to discover memcached. The catch is that when a service is headless, Kubernetes doesn't define those env variables, causing the whole deployment to collapse, as none of the pods the orchestrator deploys can start.
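
For illustration, here is a rough sketch of what a headless memcached Service could look like if built with client-go types; the names, labels, and namespace handling are assumptions, not the operator's actual code.

```go
// Sketch of a headless memcached Service. With ClusterIP set to "None", DNS
// returns the individual pod IPs so the memcached client can distribute keys
// itself, but Kubernetes will no longer inject MEMCACHED_SERVICE_HOST/PORT,
// so consumers would need DNS-based discovery instead
// (e.g. memcached.<namespace>.svc.cluster.local).
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func headlessMemcachedService(namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "memcached",
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			ClusterIP: "None", // headless: no virtual IP, no injected env vars
			Selector:  map[string]string{"name": "memcached"},
			Ports: []corev1.ServicePort{
				{Name: "memcached", Port: 11211},
			},
		},
	}
}
```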
