Requesting redesign of how federation is handled. #2236

Closed

cetex opened this issue Jan 24, 2017 · 4 comments

Comments


cetex commented Jan 24, 2017

I know this is no small request, but I do think I bring some valid points.
Problem description
As Nomad is designed right now, one misconfigured server can effectively destroy its state globally; see #1515 for a current issue and #1987 for some earlier problems I've had that trace back to this design choice. The bug may not be major by itself, but not being able to see whether hosts are registered in the datacenter is quite problematic to me, and the fact that this spreads globally to all Nomad datacenters in a federated Consul setup is quite bad, in my opinion.

The only solution I've found to this problem is to wipe all state and redeploy everything from scratch, which is highly problematic since we spin up and destroy datacenters quite often. If we were already using Nomad, we would have to wipe and recreate all Nomad datacenters globally every time a change was made at the datacenter level.

Fixing this specific bug would have made testing less painful, but the fact that this can happen globally whenever a bug shows up is a pretty big design problem, in my opinion.
I really think this approach of "automagically joining everything we can find together and keeping lots of state everywhere" is wrong. It brings very tight coupling and opens the door to lots of potential issues in the future.
Every time I've tried Nomad I've quickly hit a bug or some other issue that effectively makes it unusable because of this design choice. I get the feeling that any small bug in the federation setup could bring down all Nomad datacenters globally, forcing users into a complete wipe and redeploy of all datacenters to fix it.
It also seems very easy for a user of Nomad to accidentally upgrade or kill the wrong instance of a service in the wrong datacenter through this federation, which is another big issue for us, since Nomad will happily forward requests anywhere.

Consul and Vault don't work like this, and this design seems to break the pattern usually used at HashiCorp.
From the documentation of Consul:
'''
One of the key features of Consul is its support for multiple datacenters. The architecture of Consul is designed to promote a low coupling of datacenters so that connectivity issues or failure of any datacenter does not impact the availability of Consul in other datacenters. This means each datacenter runs independently, each having a dedicated group of servers and a private LAN gossip pool.
'''
From my experimentation, Nomad seems to work the opposite way: very tight coupling, so if anything breaks unexpectedly in one datacenter, or some small outage happens, it's almost guaranteed to affect functionality in the other datacenters, effectively defeating the purpose of multiple datacenters for HA (at least in our case).

Solution?
I think the federation logic should be moved into an external service, a "regional manager" of sorts. This service could depend on Consul to find all Nomad datacenters (or simply be told to look for Nomad datacenters X, Y, Z, either through Consul or directly by giving it a hostname to look up).
This regional manager could handle leader election and store its state in Consul's KV store, like Vault does for example, which should make those parts relatively simple to understand for anyone new looking at this; see the sketch below.
This way it would be a choice for us users (or potential users) of Nomad whether we'd like to use this functionality or not, and it would also be our choice how to scale this service: we could deploy a separate Consul cluster, spread out over multiple physical datacenters, just for keeping state and maintaining quorum for this global manager.
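To sketch what I mean (the paths and service names below are made up purely for illustration), Consul already provides the primitives such a regional manager would need: leader election via locks, and a KV store for state, much like Vault uses:

```sh
# Hypothetical sketch: run one regional-manager process per candidate node and
# let Consul elect a leader. `consul lock` blocks until the lock is acquired,
# runs the given command, and releases the lock if the process dies.
consul lock service/regional-manager/leader ./regional-manager

# The manager's own state could live in Consul's KV store, the same way Vault
# can use Consul as its storage backend.
consul kv put regional-manager/desired/dc1/web '{"count": 5}'
consul kv get regional-manager/desired/dc1/web
```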

I believe that by default each datacenter should be isolated, so that any failure or misconfiguration is contained at the datacenter level; to me this is the only real way to build HA setups that actually work. This design would be in line with how Consul and Vault function today, where you've done an excellent job of making sure that an issue in one datacenter doesn't affect or trigger outages or bugs in other datacenters. (At least we haven't hit any hard bugs in Consul or Vault related to this over the last few years we've been using them.)

This would most likely require some redesign of how services in datacenters are scaled and how outages are handled, but I'd much rather see this federation logic put into a service external to Nomad, which would tell each Nomad datacenter to scale up, scale down, or move tasks around when needed. Nomad would then only handle deploys of services within a datacenter and wouldn't hit any federation bugs, since it wouldn't even look outside its own datacenter; it wouldn't care.

I could find the design of automatic federation acceptable if the coupling between datacenters were very loose and basically only occurred on request, like how Consul handles it today.
For example, if Nomad in DC1 got a request to deploy something to DC2, it would search Consul for a datacenter matching the name DC2 and then refer (redirect) the client to DC2 directly. After that it shouldn't retain any state at all about what's happening outside the local datacenter.
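For reference, this is roughly how loose Consul's cross-datacenter coupling is today: the remote datacenter is only contacted when a request asks for it, and nothing about it is kept locally afterwards (service and datacenter names here are just examples):

```sh
# DNS interface: ask the local agent for a service in another datacenter.
# The query is forwarded over the WAN at request time.
dig @127.0.0.1 -p 8600 web.service.dc2.consul

# HTTP API: same idea, the target datacenter is chosen per request.
curl 'http://127.0.0.1:8500/v1/catalog/service/web?dc=dc2'
```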

I hope I haven't insulted anyone with this. I really like what Nomad could bring in terms of simplifying our setup and the management of deploys within a datacenter, as it could potentially replace our aging and quite complex, hard-to-manage Mesos + Aurora + ZooKeeper setup. Nomad seems very down to earth, nice and simple, but this automatic federation of everything sadly makes it unusable for us every time we try it out.

@varjoranta

+1

dadgar (Contributor) commented Jan 24, 2017

@cetex I have responded to all your points on the other issue. I would suggest you break this issue into the individual bugs you are hitting. I think there is a slight misunderstanding as to what is happening under the hood, which is totally fine! I have linked you to the architecture page in the other issue.

The solution you outlined is almost the same as how it actually works, minus some terminology. Our isolation boundary is at the region level, not the datacenter. Regions are fully separate isolation domains, and no configuration in one region can affect another.

We will hopefully fix the server-members bug in a coming release, and we can add an option for disabling automatic federation. However, automatic federation should not be causing you any configuration problems.
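For context, the region/datacenter split lives in the Nomad agent configuration, and servers only gossip and replicate state with other servers in the same region (the values below are only illustrative):

```hcl
# Minimal server configuration for one region. Servers in "eu-west" share
# state only with other "eu-west" servers; the datacenter is just a
# scheduling label within that region.
region     = "eu-west"
datacenter = "dc1"
data_dir   = "/var/lib/nomad"

server {
  enabled          = true
  bootstrap_expect = 3
}
```

Federation between regions is then a separate step, either explicit (e.g. `nomad server-join <address-of-a-server-in-the-other-region>`) or via the automatic Consul-based discovery discussed in this issue, which is what the option above would let you disable.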

I am going to close this issue but will happily respond to any further questions you have, or we can keep it in the other issue you have filed. Hope my answers have been helpful and that you have not insulted anyone 👍 Hope we can work together to get you happily using Nomad!

@dadgar dadgar closed this as completed Jan 24, 2017
cetex (Author) commented Jan 24, 2017

@dadgar
It is highly possible that I've misunderstood something about how it works under the hood and that this is just a couple of weird coincidences. It's just that the general feeling I get is that it's designed oddly, when the primary goals should be robustness and HA: I should be able to blow up one datacenter with whatever weird config I can think of, and it shouldn't affect another datacenter.

What I've seen is that this is not the case, as it breaks in weird ways every time I try. :)

Something which disables automatic federation would make me sleep much better, since then I'd know that no one can send config through another datacenter that kills all services. It would also mean that we who run the datacenters have full control over who can do what and can implement scenarios like: "limit access so new deploys from the automatic pipeline can only ever happen in one datacenter, and once they're proven to work we roll them out to all other datacenters after manually verifying that everything is fine".

Software that automatically tries to connect everything and "solve stuff for us" across datacenter borders makes such solutions tricky to manage, because any misconfiguration could join them together and potentially end in a disastrous global failure scenario, which is what we're trying to avoid at all costs.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022