
Nomad client deregistering from consul #525

Closed
BSick7 opened this issue Dec 2, 2015 · 18 comments

@BSick7

BSick7 commented Dec 2, 2015

We are running consul server + nomad (running in server and client mode) on the same boxes that we call "managers". We also create "workers" that run consul client + nomad client.

We place a service registration in the consul config directory so that nomad.service.consul resolves to nomad servers. This lets us configure nomad clients with servers = ["nomad.service.consul:4647"].
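For reference, a minimal sketch of that kind of service definition, assuming a JSON file such as /etc/consul.d/nomad.json (path and exact contents illustrative, not our literal config):

{
  "service": {
    "name": "nomad",
    "port": 4647
  }
}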

Consul boots up, registers nomad, then deregisters nomad. After disabling nomad client mode and restarting consul and nomad, nomad remained registered in consul. Since nomad.service.consul doesn't resolve, the worker nodes are never able to connect to the nomad cluster.
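A quick way to check whether the name resolves is an SRV lookup against the local agent's Consul DNS interface (port 8600 by default); once the deregistration happens this comes back with no answer:

$ dig @127.0.0.1 -p 8600 nomad.service.consul SRV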

I believe I traced the culprit to https://github.com/hashicorp/nomad/blob/master/client/task_runner.go#L239-L240. This seems to be very intentional, yet it makes little sense to me.

Could you add some documentation about this and better strategies for joining nomad clients?

@cbednarski
Contributor

Our expectation is that in a production cluster the nomad server and client don't run on the same node, so I suspect some weirdness is caused by that. (We should still fix this, though.) I'm curious whether this problem persists if you run these on separate nodes.

I recall from reviewing this code that Nomad should only deregister services that it is tracking (that it has started), so if there are host-level services also registered with consul, they should be left alone.

If you are in AWS or another virtualized network environment you can use floating IPs for your nomad servers. Also, you only need to join the nodes to the cluster once; they will check in and get the latest list of servers periodically, so IIRC they only need one valid IP once ever, and provided your cluster stays healthy they will maintain an updated list of servers over time.

@BSick7
Author

BSick7 commented Dec 2, 2015

Thanks @cbednarski. It would seem that it's deregistering more than nomad-tracked services.

In production, I would imagine it could be common that nomad servers are idle. This means 3+ m4.large aws instances are sitting idle burning holes in your pockets. By compacting nomad clients on nomad servers, it becomes cost-effective to run small clusters.

@dadgar
Contributor

dadgar commented Dec 2, 2015

@diptanu

The current design assumes that the Nomad Client is the sole user of the local Consul Agent. As @cbednarski mentioned we expect you would not run both the server and client on the same node. To support that would require a re-architecture of how Consul registration takes place.

@diptanu
Contributor

diptanu commented Dec 2, 2015

So like @dadgar said, in the current design the Nomad client registers/de-registers the running services with Consul on a node.

The reason we de-register everything and re-register only the processes we know are running is that we don't want any zombie services left registered that aren't running anymore. Say, for example, we register a Redis container with Consul, the client crashes and restarts, and meanwhile the Redis container dies: we would be left with a zombie service unless we de-register everything on the node that isn't running anymore.
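A quick way to see what the local agent currently has registered (and spot any such zombies) is the agent's services endpoint:

$ curl http://127.0.0.1:8500/v1/agent/services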

I would suggest running Nomad client on a separate node.

@diptanu diptanu removed the type/bug label Dec 2, 2015
@diptanu
Contributor

diptanu commented Dec 2, 2015

@BSick7 We discussed this, and we will change the current implementation to de-register only services that the Nomad client doesn't know about and that are tagged or have an ID prefixed according to a convention.
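As a rough sketch of what such a convention might look like (the tag and ID prefix below are hypothetical, not the final implementation), Nomad-managed registrations would carry a distinguishing ID/tag and anything without it would be left alone:

{
  "service": {
    "id": "nomad-registered-redis-cache",
    "name": "redis-cache",
    "tags": ["nomad"],
    "port": 6379
  }
}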

@cbednarski
Contributor

In production, I would imagine it could be common that nomad servers are idle. This means 3+ m4.large aws instances are sitting idle burning holes in your pockets. By compacting nomad clients on nomad servers, it becomes cost-effective to run small clusters.

In a smaller cluster you could possibly run Nomad and Consul server nodes on the same machines, but there's currently no way to account for the resource utilization here, so Nomad can't effectively schedule workloads around this. As workload size grows, both Nomad and Consul become fairly RAM-hungry since they keep state in memory, and both are subject to varying CPU load depending on scheduling events, outages, partitions, recovery, etc. For stability and manageability, the servers for both of these should have dedicated nodes.

I think at the scale where it makes sense to operate a scheduler, 3 dedicated nodes for Nomad should not be a significant cost. For instance, if you run 3x m4.large servers and 15x c3.8xlarge or c4.8xlarge workers, Nomad is approximately 1.5% of your infrastructure spend. The consolidation benefits here will easily save you more than 1.5%, so you're going to spend less overall.
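Back-of-the-envelope, assuming roughly $0.126/hr for m4.large and $1.68/hr for c3.8xlarge (us-east-1 on-demand pricing of the time; your prices will vary):

 3 x $0.126/hr =  $0.38/hr  (Nomad servers)
15 x $1.68/hr  = $25.20/hr  (workers)
$0.38 / ($0.38 + $25.20) ≈ 1.5%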

@BSick7
Author

BSick7 commented Dec 2, 2015

I could easily see zombie services getting out of control with little visibility.

If it's really detrimental to run nomad in server mode and client mode, perhaps the following configuration would emit a warning.

server {
  enabled = true
}
client {
  enabled = true
}

Once I configured the following, nomad worked brilliantly, showing that node in both nomad server-members and nomad node-status.

client {
  servers = ["nomad.<subdomain>:4647"]
}

@cbednarski
Contributor

If it's really detrimental to run nomad in server mode and client mode, perhaps the following configuration would emit a warning.

Agreed, that makes sense.

Once I configured the following...

@BSick7 I take it you used non-consul DNS for that? Or did I miss a step?

@cbednarski
Contributor

Related to #510

@steve-jansen
Contributor

@BSick7 I take it you used non-consul DNS for that? Or did I miss a step?

@cbednarski

@BSick7 and I are using Consul DNS (nomad.service.consul:4647) for Nomad join operations. Each Nomad server runs a Consul agent. The Consul agent config includes a service entry for nomad on port 4647.

This is a temporary approach for us until Nomad integrates with Atlas/Scada for discovery.

@adrianlop
Contributor

@diptanu thanks for the fast response in #529!
I see the point now. Like you said in a previous comment, this behaviour could easily be overridden if the service or check has a "not_managed_by_nomad" tag, so Nomad doesn't touch it.
Are you planning to implement something like this?

@cbednarski
Contributor

Are you planning to implement something like this?

@poll0rz Yes this is in-flight.

@cbednarski
Contributor

BSick7 and I are using Consul DNS (nomad.service.consul:4647) for Nomad join operations. Each Nomad server runs a Consul agent. The Consul agent config includes a service entry for nomad on port 4647.

@steve-jansen Thanks for the explanation. After Brad said he got it working I thought maybe he had changed something in his config but I wasn't sure of the details.

@BSick7
Author

BSick7 commented Dec 3, 2015

@cbednarski I did change the config a little to use our Route 53 DNS instead of Consul DNS.

@dadgar dadgar added this to the v0.3.0 milestone Dec 8, 2015
@agy

agy commented Dec 9, 2015

As a workaround until this is fixed, I found that you can manually register the Nomad service under different node names (but with the correct IP addresses) for your Nomad servers.

This has the benefit of having Consul monitoring and DNS resolution. The downside is that you'll have additional "duplicate" nodes in your Consul node listing.

Example:

$ curl -X PUT -d '{"Datacenter": "dc1", "Node": "nodeA_", "Address": "10.210.197.22", "Service": {"Service": "nomad", "Port": 4647}}' http://127.0.0.1:8500/v1/catalog/register
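If you want to remove one of those duplicate nodes again later, the matching catalog deregister call is (same node name as above):

$ curl -X PUT -d '{"Datacenter": "dc1", "Node": "nodeA_"}' http://127.0.0.1:8500/v1/catalog/deregister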

@diptanu
Contributor

diptanu commented Dec 11, 2015

@BSick7 We have a fix for this in master, and it will be released soon with 0.2.2.

@diptanu diptanu closed this as completed Dec 11, 2015
@BSick7
Author

BSick7 commented Dec 11, 2015

Great news! Thanks @diptanu

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 28, 2022