Problem deploying system task/Consul auto-join questions #1987
Does it place 0 allocations in your cluster? e.g.:
Sadly, no.
Can you check the evaluation from the API and see if there is any additional information? Also, debug-wise, does Nomad detect the >1 Gbit network interface and the resources you require? e.g.:
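For reference, the link speed the scheduler sees comes from the client fingerprint and can be overridden in the client agent's config; a minimal sketch, assuming the documented `network_speed` client option (the value is a placeholder):

```hcl
# Nomad client agent config: advertise the real link speed (in Mbit/s) if the
# NIC is fingerprinted at a lower speed than it actually has.
client {
  enabled       = true
  network_speed = 1000
}
```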
Probably on to something there. (This is one of the two nodes in dc1 I want to run the job on.) And where does this filter come from?
I believe they are getting filtered by this constraint:
Yeah, I've done some tests and cut it down a bit; it seems like you're right. It works once I removed those constraints and just set datacenters = ["dc1"]. When I had datacenters = ["dc1", "dc2"] it tried to place the job on all nodes in both DCs, which is not what I had in mind (which is why I had that consul.datacenter constraint there, which didn't seem to work at all).
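For reference, a minimal sketch of the two approaches mentioned here (the job name and image are placeholders; whether the Consul attribute resolves depends on the Consul fingerprint, so treat that part as an assumption):

```hcl
# Restrict placement by Nomad datacenter rather than by a Consul attribute:
# only nodes whose agent was started with datacenter = "dc1" are eligible,
# while ["dc1", "dc2"] would make nodes in both DCs eligible.
job "example" {
  datacenters = ["dc1"]

  # The constraint that was removed would look roughly like this:
  # constraint {
  #   attribute = "${attr.consul.datacenter}"
  #   value     = "dc1"
  # }

  group "web" {
    task "server" {
      driver = "docker"
      config {
        image = "nginx:1.11"
      }
    }
  }
}
```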
Hey, if your Consul clusters are federated, Nomad will find other Nomad servers by searching through the various Consul datacenters! To do that type of deployment I see two ways:
And then you can run the same job and each task group would have the parameters for the DC it is targeting.
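A sketch of that layout, using placeholder group names and the built-in ${node.datacenter} attribute to pin each group (counts and the image are also placeholders):

```hcl
job "web" {
  # Nodes in both Nomad datacenters are eligible at the job level...
  datacenters = ["dc1", "dc2"]

  group "web-dc1" {
    count = 2
    # ...but this group only places on dc1 nodes.
    constraint {
      attribute = "${node.datacenter}"
      value     = "dc1"
    }
    task "nginx" {
      driver = "docker"
      config {
        image = "nginx:1.11"
      }
    }
  }

  group "web-dc2" {
    count = 2
    constraint {
      attribute = "${node.datacenter}"
      value     = "dc2"
    }
    task "nginx" {
      driver = "docker"
      config {
        image = "nginx:1.11"
      }
    }
  }
}
```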
OK. Is there any way to disable federation but still use Consul to find local neighbours? We use "consul join -wan" to get DNS lookups to work from our jumpstation to the different datacenters (to find Mesos, Aurora and similar), and have production, integration and testing tied together like this so we can reach all of them from our jumpstations. Maybe this is what "region" in Nomad is about? (Set region to dc1 or dc2 instead of default, maybe?) It would also be pretty nice to be able to tell Consul not to build a full mesh between all DCs and instead just set up point-to-point connectivity between some DCs (think "consul join -p2pwan" to build a hub-and-spoke setup where one DC knows all other DCs and every other DC only knows about the hub DC/cluster).
Since Nomad can manage many datacenters in a single region, we search through federated Consul datacenters for the relevant Nomad servers, as it is a common scenario to have many Consul DCs in a single Nomad region. Is there a reason you would not want Nomad to federate?
Yes, an important issue is that I want to limit the failure domain in case someone screws up. And stuff like this makes me worried: I wiped out the nodes and restarted Nomad with a new region set, to try to isolate it entirely from the other clusters.
I expect a setting which only uses the local Consul cluster to find local Nomad neighbours, and which doesn't try to (or refuses to) talk to any other cluster unless it is explicitly told to talk to someone else. What if that snapshot is broken or compromised in the other datacenter, or there's a bug that gets triggered in a couple of months when we upgrade to 0.6.0 or similar which brings down "*"? That simply can't happen.

Consul has local leaders in each datacenter and doesn't try to build a global cluster unless told to specifically; that global cluster (to my knowledge) isn't used for much else than allowing certain queries to be passed between the datacenters directly. Each datacenter still has its own leaders, and those leaders are autonomous: if connectivity to the central Consul is lost, I know that the datacenter won't lose quorum, die or anything else. I have to go out of my way quite a bit to accidentally write or delete in all key-value stores globally, or change the registered services in each datacenter. The only downside currently with Consul is that it expects to have full connectivity between all datacenters and then complains when we drop traffic between all datacenters that aren't the "hub" datacenter, but that's about it.

And then we have stuff like this which just started occurring, and I'm not sure why, but Nomad simply won't start. It seems like it's trying to connect to the other datacenter and expects some leadership election to happen, which simply doesn't happen for some reason. If it doesn't try to connect to the other datacenter (if I block all traffic between the datacenters), it just works as I expect it to.
I wiped these new nodes again, stopped Nomad in the other datacenter, restarted, and it "just works" and launches properly. I'm not sure why it breaks right now, actually; could it be the new version, or some state about this datacenter that the other datacenter believes it has? For us, things like this are a dealbreaker. I thought we could get this deployed and tested relatively soon (I really like the idea behind Nomad and that we could drop our legacy way of deploying the platform-level jobs), but in the current state we either can't run Nomad at all, or we can't use Consul as we use it today (which means we can't "federate" Consul, which in turn means we'd have to redesign large parts of our internal DNS infrastructure to be able to run Nomad, which probably won't happen).
@dadgar any thoughts on this?
@cetex Sorry I didn't see your response. Nomad has two concepts that are important for understanding what we are doing with Consul.
We use Consul so that servers within a region discover each other and so that clients can discover their servers. As we do this scan through Consul, we can also detect servers that are part of separate Nomad regions, and we automatically federate them. This does not mean we compromise the configuration of the servers (we won't elect a leader across the world unless you have placed servers with the same region across the world, which will lead to problems regardless). As for Nomad not being able to start, that looks like a configuration problem, and I would be more than happy to help if you want to file a new issue with the relevant configs and Nomad/Consul network topology! Thanks,
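For context, this Consul-based discovery is controlled by the agent's consul stanza; a minimal sketch, with option names as documented for Nomad 0.5+ (treat them as assumptions on earlier builds, and the address as a placeholder):

```hcl
# Nomad agent configuration (applies to both servers and clients).
consul {
  # Local Consul agent to talk to.
  address = "127.0.0.1:8500"

  # Register this Nomad agent as a Consul service so peers can find it.
  auto_advertise = true

  # Service names looked up in Consul, and whether to auto-join through them.
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  server_auto_join    = true
  client_auto_join    = true
}
```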
Alright! So in our case we should set region=datacenter_name to have Nomad only elect leaders from the local datacenter? It would be nice to be able to disable federation entirely but still retain the ability to look up other Nomad nodes within the local Consul datacenter, though: if Nomad in dc1 (region1) doesn't even know about Nomad in dc2 (region2), accidental screw-ups don't happen as easily, but Nomad agents and servers would still find each other in the local datacenter. When it comes to deploying a job defined for dc1 by contacting Nomad in dc2, I consider the intent of that deploy unclear, and the deploy should be denied instead of Nomad trying to work around the error. The only way to do what we want today seems to be to disable Consul integration entirely in Nomad, which means we'd lose a lot of functionality, or to remove all federation between Consul datacenters, which would also remove a lot of functionality. Another option would be to make Consul support a hub-and-spoke design, where the hub (our jumpstation) can reach and knows of all spokes (datacenters: prod, dev) and all spokes know of the hub, but where two spokes don't know about each other. Regarding starting Nomad, I'm willing to try to recreate the problem later on if we can resolve these issues first.
This is correct.
Can you please file a separate issue for this?
I think there may be some confusion between Consul's and Nomad's topology. Please take a look at our architecture page. At a high level, a set of Nomad servers manages a region. A region can consist of multiple datacenters. Jobs can span datacenters but not regions. This is useful if you have a job that should be resilient to a single datacenter failure, as Nomad can detect the failure and reschedule onto a separate DC. There is no state shared across regions. The Nomad servers communicate via Serf (a gossip protocol), which allows federated regions to forward jobs to the appropriate region. This is useful if you have a service creating jobs: it can submit jobs to its local server and have the server forward them to the appropriate regional servers.
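A sketch of how that maps onto agent configuration if, as discussed above, each Consul datacenter is run as its own Nomad region (names, paths and the server count are placeholders):

```hcl
# Server agent in dc1, run as its own region so no state is shared with dc2.
region     = "dc1"
datacenter = "dc1"
data_dir   = "/var/lib/nomad"

server {
  enabled          = true
  bootstrap_expect = 3   # servers expected in this region before electing a leader
}
```

Jobs submitted with a matching region field are then scheduled only by that region's servers.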
Going to close this as the relevant issue has been filed separately.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
I'm trying to deploy nginx through Nomad as a system task and I'm failing to understand why it doesn't work; maybe I'm just blind, or maybe I've seriously misunderstood something.
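For illustration only (a sketch under assumptions, not the original job file; the image, port and resource values are placeholders), a system job of that shape could look like:

```hcl
# System jobs place one allocation on every client node that is eligible
# after datacenter and constraint filtering.
job "nginx" {
  datacenters = ["dc1"]
  type        = "system"

  group "web" {
    task "nginx" {
      driver = "docker"
      config {
        image        = "nginx:1.11"   # placeholder image
        network_mode = "host"         # assumption; adjust to your setup
      }
      resources {
        cpu    = 500    # MHz
        memory = 256    # MB
        network {
          mbits = 10    # keep below the fingerprinted link speed
          port "http" {
            static = 80
          }
        }
      }
    }
  }
}
```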
When trying to run it:
I'm trying to deploy this to s01 and s02 in dc1.
As you can see, I'm basically trying to run jobs on the Nomad masters and failing pretty badly.
Command line we use to run Nomad: