
ServiceAddress in Consul isn't updated when Nomad's client_interface changes #8732

Open
snh opened this issue Aug 25, 2020 · 4 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/consul theme/service-discovery/consul type/bug

Comments


snh commented Aug 25, 2020

Hello 👋

I run Nomad in a configuration where I occasionally need to change the Nomad client's network_interface without any downtime for the services under Nomad's control. The purpose of changing it is to influence the address advertised for these services in Consul.

The system in question undergoes occasional changes where the network interface and address used for these service advertisements need to be updated without interrupting the underlying services.

I have observed that the registrations in Consul aren't updated to reflect this change unless I completely stop and re-run the relevant jobs. As a result, clients keep trying to contact these services via the address of the previous interface rather than the updated one.

Is this the intended behaviour? Is there any way to update these Consul service registrations without interrupting these services or registrations?

Apologies if this is a duplicate of an existing issue. I did find #4815, which is related and could possibly serve as a workaround if it were available.

Happy to provide further information and background on this use-case and how to replicate this if needed! Thanks!

Nomad version

    vagrant@node-1:~$ nomad version
    Nomad v0.12.3 (2db8abd9620dd41cb7bfe399551ba0f7824b3f61)
    vagrant@node-1:~$ consul version
    Consul v1.8.3
    Revision a9322b9c7
    Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Operating system and Environment details

Debian GNU/Linux 10 (buster) AMD64 running in VirtualBox via Vagrant.

Issue

The ServiceAddress and ServiceTaggedAddresses of the service registration in Consul, as well as the address used by its associated checks, are not updated when network_interface is changed in Nomad's client configuration.

Reproduction steps

  1. Deploy a Consul-integrated Nomad instance with two Ethernet interfaces (eth0 and eth1), each with a unique address. Start Nomad with network_interface set to eth0:

    client {
        enabled = true
        network_interface = "eth0"
    }
    
  2. Run a new job which contains at least one service and check.

  3. Confirm that the ServiceAddress and ServiceTaggedAddresses of the service registration in Consul, as well as the address used by the associated check(s), reflect the IP address of eth0.

  4. Update the Nomad configuration to use eth1:

    client {
        enabled = true
        network_interface = "eth1"
    }
    
  5. Restart Nomad.

  6. Observe that the ServiceAddress and ServiceTaggedAddresses of the service registration in Consul, as well as the address used by the associated check(s), still reflect the IP address of eth0.

  7. Re-run the already allocated job.

  8. Observe that these have still not been updated.

  9. Stop and re-run the job.

  10. Observe that these have now been updated.

shoenig (Member) commented Aug 25, 2020

Thanks for the detailed writeup, @snh!

I believe what's happening is that we accidentally short-circuit the check for whether the service address has changed: we first compare the hash of the service definition, and that hash only covers the address_mode parameter rather than the actual address.
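To make the suspected failure mode concrete, here is a minimal, hypothetical Go sketch. The types `service` and `registration` and the functions `definitionHash`, `needsUpdateShortCircuit`, and `needsUpdate` are illustrative assumptions, not Nomad's actual code, and the addresses are example values. When the early-return check hashes only the definition (which includes address_mode but not the resolved address), a move from eth0 to eth1 is never detected.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// service is a hypothetical, heavily simplified service definition.
// It is NOT Nomad's real struct.
type service struct {
	Name        string
	Port        int
	AddressMode string // e.g. "auto", "host", "driver"
}

// registration pairs the definition with the address resolved from the
// client's network_interface at registration time.
type registration struct {
	Service service
	Address string
}

// definitionHash hashes only the job-level definition. The resolved
// address is not an input, so changing interfaces leaves it unchanged.
func definitionHash(s service) [32]byte {
	return sha256.Sum256([]byte(fmt.Sprintf("%s|%d|%s", s.Name, s.Port, s.AddressMode)))
}

// needsUpdateShortCircuit models the suspected bug: an unchanged hash is
// taken to mean "nothing changed", so the address is never compared.
func needsUpdateShortCircuit(old, cur registration) bool {
	return definitionHash(old.Service) != definitionHash(cur.Service)
}

// needsUpdate also compares the resolved address, so moving from one
// interface to another triggers a re-registration.
func needsUpdate(old, cur registration) bool {
	return definitionHash(old.Service) != definitionHash(cur.Service) ||
		old.Address != cur.Address
}

func main() {
	svc := service{Name: "web", Port: 8080, AddressMode: "auto"}
	before := registration{Service: svc, Address: "10.0.0.10"} // eth0
	after := registration{Service: svc, Address: "10.0.0.11"}  // eth1

	fmt.Println(needsUpdateShortCircuit(before, after)) // false
	fmt.Println(needsUpdate(before, after))             // true
}
```

In this model, the fix is either to stop treating an unchanged definition hash as proof that nothing changed or to include the resolved address in what is compared.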

shoenig added the type/bug and stage/accepted labels Aug 25, 2020
nickethier (Member) commented

Hey @snh, I started thinking about this one and I'm confused about the intended behavior here.

If a job is already running and bound to an address on eth0, and Nomad did as you asked and updated all the services/checks to reflect the new IP of eth1, wouldn't that break your application without a restart? For example, let's say eth0 and eth1 have addresses 10.0.0.10 and 10.0.0.11 respectively. If an allocation has bound an HTTP server to 10.0.0.10:80 and Nomad re-registers the service/check with eth1's address of 10.0.0.11, Consul would have the incorrect IP/port until the allocation restarts and rebinds to the new IP address.

I'm always hesitant to make a change like this, where an in-place networking update to the client is assumed to behave the same way across users' various network configurations. The safest way to do this is to drain the node, make the change, and put the node back into service.
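The 10.0.0.10 / 10.0.0.11 scenario above can be captured in a small, simplified Go model. `servesAdvertised` is a hypothetical helper, not part of Nomad or Consul, and it ignores ports and IPv4/IPv6 dual-stack subtleties: a listener only answers on the advertised IP if it was bound to that exact IP or to the wildcard address.

```go
package main

import (
	"fmt"
	"net"
)

// servesAdvertised is a hypothetical, simplified model: a listener bound
// to bindIP accepts traffic sent to advertisedIP only if it was bound to
// the unspecified (wildcard) address or to exactly that IP.
func servesAdvertised(bindIP, advertisedIP string) bool {
	b := net.ParseIP(bindIP)
	a := net.ParseIP(advertisedIP)
	if b == nil || a == nil {
		return false // treat unparseable input as "not served"
	}
	return b.IsUnspecified() || b.Equal(a)
}

func main() {
	// Allocation bound to eth0's address, service re-registered with eth1's.
	fmt.Println(servesAdvertised("10.0.0.10", "10.0.0.11")) // false
	// A wildcard bind would still answer on eth1's address.
	fmt.Println(servesAdvertised("0.0.0.0", "10.0.0.11")) // true
}
```

The first case is the breakage described above; the second is the situation where re-registering in place would be harmless.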

nickethier self-assigned this Sep 24, 2020
snh (Author) commented Sep 28, 2020

Hey @nickethier

> If a job is already running and bound to an addresses on eth0 and Nomad did as you asked and updated all the services/checks to reflect the new IP of eth1, wouldn't that break your application without a restart?

We use Docker host networking for all of our containers, and don't bind the services to a specific interface, so this isn't an issue in our use case. We effectively bind the services to all interfaces (*).

Even if we did bind them to a specific interface, we have observed that updating the job so that a new allocation is created (and the service restarted) still doesn't appear to update the Consul service registration. We have found that we have to stop the existing job and run it again from scratch for the registration to update, which results in a considerable period of downtime.

nickethier (Member) commented Sep 28, 2020 via email

Projects
None yet
Development

No branches or pull requests

4 participants