
Duplicate Node IDs after upgrading to 1.2.3 #4741

Closed
mgresser opened this issue Oct 2, 2018 · 47 comments · Fixed by #5485
Labels
type/enhancement Proposed improvement or new feature
Milestone

Comments

@mgresser

mgresser commented Oct 2, 2018

After upgrading to 1.2.3, hosts begin reporting the following when registering with the catalog. The node name of the host making the new reservation and the node name in the conflict always match.

[ERR] consul: "Catalog.Register" RPC failed to server xxx.xxx.xxx.xxx:8300: rpc error making call: failed inserting node: Error while renaming Node ID: "4833aa15-8428-1d1a-46d8-9dba157dbc60": Node name xxx.xxx.xxx is reserved by node c5dc5b48-f105-79f0-7910-3de6629fddd0 with name xxx.xxx.xxx

Restarting Consul on the host seems to resolve the issue, at least temporarily. I was able to reproduce this in a mixed environment of CentOS hosts ranging from major release 5 through 7, with most hosts on CentOS 7.

@pierresouchay
Contributor

Was your node cleaned up and re-installed? (I mean, is the scenario the message describes actually possible?)

Do you use the leave_on_terminate option?

Do you set https://www.consul.io/docs/agent/options.html#disable_host_node_id to false (that is, do you have a fixed ID per node? I assume you do not)?

@mgresser
Author

mgresser commented Oct 3, 2018

I upgraded from 1.2.2 to 1.2.3, so the Consul agent restarted for that upgrade. I don't have the leave_on_terminate or disable_host_node_id options enabled. I've upgraded many times with my current settings without issue.

@JackXu2

JackXu2 commented Oct 3, 2018

Yes, I just installed 1.2.3. When node2 restarts, it also hits this issue:

2018/10/03 23:12:28 [WARN] consul.fsm: EnsureRegistration failed: failed inserting node: Error while renaming Node ID: "af57f171-44ed-b905-1452-8dfc775afb40": Node name node2 is reserved by node cc5f253c-7498-4d8b-519b-6ad213139023 with name node2
(the same WARN line repeats many times)

@pierresouchay
Contributor

Hello @JackXu2 & @mgresser

Actually, we also hit this issue, for a specific reason: we re-install nodes from scratch. When re-installing a node, starting with version 0.8.5, the node ID is no longer predictable by default, so the node ID may differ, which leads to this message. We fixed this (on bare-metal hosts) by forcing a predictable node ID per host; when we re-install the node, Consul then sees it as the same machine. Another way is to properly leave the cluster when the "first" node is being decommissioned.

The other way to trigger this is with VMs: if you re-run a node with an identical node name, its ID is also different, so Consul refuses to let the new node "steal" the node name, because the previous node was seen not long ago.

Since I worked on this feature in #4415 (after we had 2 nodes stealing each other's names), I am quite interested in your specific usage: are you in one of those cases, or is it another corner case?

@shantanugadgil
Contributor

Hi, I observed node ID clashes (a long time ago, v0.9?) when I was deploying VMs from a common image without changing the SMBIOS field (what "dmidecode" fetches from inside the VM). I use Proxmox VE.

This was also observed (quite some time ago, v0.9?) when I was using Consul inside LXD containers: each LXD container returned the same dmidecode value, so all the node IDs ended up identical.

@pearkes added the needs-investigation label (The issue described is detailed and complex) Oct 8, 2018
@davidkarlsen

davidkarlsen commented Oct 10, 2018

I'm seeing the same - we're running with the Docker image from Docker Hub. We pin the node name with the -node option.

[ERR] consul: "Catalog.Register" RPC failed to server 10.246.104.20:8300: rpc error making call: failed inserting node: Error while renaming Node ID: "f282db6c-f22d-d183-b078-fb5a821dd7ce": Node name alp-aot-ccm10.mydomain is reserved by node 2450fa93-3bae-f22e-3a82-08c4d71e5719 with name alp-aot-ccm10.mydomain

@jgornowich

I am seeing the same issue with 1.2.3 as previously mentioned, getting this error:

[ERR] agent: failed to sync remote state: failed inserting node: Error while renaming Node ID: "b2a874d2-4be5-246d-11d9-8d118af78d32": Node name "myNode" is reserved by node e374b7a1-4ab2-71a2-6fa2-cea6eaf7a1b4 with name "myNode"

After some investigation and trying the disable_host_node_id setting, I decided to go back to Consul 1.2.2 in our cluster, and everything works as expected again. Looking through previous issues, could this new behavior be a result of #4415?

A little more information about my setup: it's not uncommon for nodes to be restarted and brought back relatively quickly from time to time, on the order of maybe a minute of total downtime for any particular node. Our nodes also have fixed node names, and each node is on the same physical hardware when taken down and brought back online.

@mgresser
Author

@pierresouchay I am on bare metal so my host names and IPs tend to be long lasting in my environment. My configs for the consul client look something like this:

{
  "datacenter": "east",
  "domain": "consul",
  "data_dir": "/var/lib/consul",
  "enable_syslog": true,
  "enable_script_checks": true,
  "log_level": "ERR",
  "node_name": "foo.bar.pvt",
  "advertise_addr": "10.20.20.101",
  "retry_join": ["consul1.bar.pvt", "consul2.bar.pvt", "consul3.bar.pvt"],
  "performance": {
    "raft_multiplier": 1
  }
}

@planetxpress

I'm going to amend my previous comment after doing some extensive testing of the issue we are now hitting with 1.2.3. If there is a suggested configuration path to avoid this issue for this version and beyond, it would be much appreciated.

Our Consul clients are typically Linux hosts in various cloud providers (OpenStack, AWS, GCP). While we typically re-use hostnames when we delete and re-create clients, they will usually get a new random IP address from the subnet. We leave disable_host_node_id at its default (true), as we can't use deterministic node IDs with hosts that get new IP addresses. We also have leave_on_terminate=true configured for our Consul clients.

If a client gets a clean shutdown, we have no problem, because it leaves the cluster. If a client does not get a clean shutdown and remains registered in the cluster, but is rebuilt with a different IP, we also have no problem. The problem happens if a client does not get a clean shutdown, remains registered in the cluster, and is rebuilt with the same IP. Now we run into the node renaming error.

Since we can't guarantee a clean shutdown, nor guarantee that the IP will be different when rebuilding a host in an environment where we can't use deterministic node IDs, what would be the recommended configuration path in this case?

@planetxpress

I was able to reproduce this issue on a node that did not get a clean shutdown and was rebuilt with a different IP address, so the IP it gets does not actually seem to matter in this case, only that it has the same hostname.

At this point we aren't sure of the proper way to handle Consul clients that need to be re-created after an unclean shutdown. Shouldn't this be a common, expected deployment pattern that doesn't require manually forcing clients to leave?

@pierresouchay
Contributor

@planetxpress well, on our side, we are using bare-metal nodes with fixed node-ID generation.

It means that when a node is renamed, it gets its new name, but this protects against having 2 nodes registering with the same name (which broke our production twice).

When we replace a node with another one, we also give it another name, so this is not a real issue for us. However, it might cause trouble for users who want to quickly reuse names for different machines with different IDs that did not properly leave the cluster before shutting down.

The solution might be to allow skipping this check via a new configuration option. @mkeeler, do you have an opinion on this?

@rafaelmagu

So, to confirm, there's no fix for an existing node in a cloud environment that runs into this issue?

pierresouchay added a commit to pierresouchay/gopsutil that referenced this issue Nov 6, 2018
On Linux, most Go programs do not run as root (or at least, they should not).
By default, the kernel uses strict permissions, so most userland programs cannot
read `/sys/class/dmi/id/product_uuid`. However, programs such as Consul rely
on it to get fixed IDs; instead they get a different ID on each boot.

We propose to use `/etc/machine-id` as a fallback: https://www.freedesktop.org/software/systemd/man/machine-id.html

In order to fix this, this patch does the following:
 - if `/sys/class/dmi/id/product_uuid` can be read, use it for HostID
 - else, if `/etc/machine-id` exists and has 32 chars, use it and add '-' separators to match the product_uuid format
 - finally, if nothing works, use the `kernel.random.boot_id`

This will greatly increase the number of programs that behave correctly when
they rely on having a fixed HostID.

This will fix the following issues:
 - shirou#350
 - hashicorp/consul#4741
@pierresouchay
Contributor

There are actually several issues:

  • I discovered that when not run as root, Consul cannot have a stable ID, even when using "disable_host_node_id": false => when re-installing a node, its ID will change; hopefully fixed by Have a real fixed HostID on Linux shirou/gopsutil#603
  • for those not using this flag, we should find something more clever.

I was thinking about adding an option, for instance "allow_node_overwrite": true|false, that would allow renaming only when the existing node is effectively DOWN (meaning its Serf health status is DOWN). This option would keep the existing protection, but would allow quickly replacing a node with one of the same name when the previous node is already dead (very useful for cloud-based workloads).

Still, I am wondering why disable_host_node_id was set to true... because it was not working, or for another reason?

@banks @mkeeler What do you think?
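
For illustration, the option proposed above might be set in agent configuration like this (a sketch only: allow_node_overwrite is a name proposed in this discussion, not an existing Consul setting, and the node_name is a placeholder):

```json
{
  "node_name": "web-01.example.internal",
  "allow_node_overwrite": true
}
```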

@TylerLubeck

Any advice on downgrading to 1.2.2? When we do so, we get errors like

raft: Failed to restore snapshot 13931-7840952-1542219582505: Unrecognized msg type 15

I think this is because #4535 landed in 1.2.3 as well.

@banks
Member

banks commented Nov 15, 2018

I think this is because #4535 landed in 1.2.3 as well.

@TylerLubeck you are exactly right. Unfortunately, downgrading is not easy when new message types have been added to the state store that the older version can't decode.

In some rare cases where we really know it can't be a problem, we do have a way to mark message types we add as "safe to ignore", which means older versions can still downgrade.

In this specific case that wasn't possible, because the change in 1.2.3 fixed a bug where we forgot to store some pretty critical bits of state in the snapshot!

We can't make downgrades automatically safe in those cases, because the state could become inconsistent if some newly written objects were dropped and lost while other state written after them, or in the same transaction, was kept.

This is why we recommend taking a snapshot before any upgrade, so that if you do need to revert you can use it; but I appreciate that's not always possible, especially if you don't find the issue that needs a downgrade until much later.

Overall to others, we have this issue on our radar and it highlights how tricky some of the reasoning around "host identity" is.

The hard thing is trying to reconcile all of these common patterns: initially Consul was closely tied to hostnames and IPs alone, and that caused problems for folks running on containers or VMs where the IP could change. It also caused problems for anyone who had to rename their hosts without changing IPs.

We introduced IDs to try to get around this, but then there is no good way to uniquely identify a host. The closest options are hardware identifiers or state on disk (how we manage node IDs now), but those change if your VM dies and gets replaced, which is kind of exactly what they are meant to do.

It feels a bit like whenever we fix a "bug" for one use case, it causes other legitimate use cases trouble in ways that were hard to anticipate (e.g. only if you fail to leave gracefully).

That said, there is clearly an issue here to resolve (or a few) so we will consider it more.

It might be that we just make the node name stealing protection configurable like Pierre suggested but that would likely come with caveats.

Optionally allowing stealing for nodes that are failing serf checks is also reasonable although it possibly leads to issues if the host is not really down.

@rafaelmagu there is no current configuration that just always works, it seems, even if your hosts go away permanently without gracefully leaving. But as far as I understand the situation, I think it would be possible to make this work if whatever automation/human is creating your replacement host could also issue a force-leave for the host name. In that case you know for sure something that Consul doesn't (that the old host isn't coming back), and force-leave is the way to tell it that. Alternatively, Consul will forget about the old node after 72 hours, after which its name is free. This is more or less working as designed; it was only by fixing a "bug" that previously allowed node stealing that this behaviour became more apparent.

By that I don't mean we won't change behaviour here - we'll think some more but I think it's at least possible to operate Consul in a cloud environment if you ensure your process for replacing a node can forcefully remove the old one if it left ungracefully. Does that help you?

@pierresouchay
Contributor

pierresouchay commented Nov 27, 2018

@mgresser @davidkarlsen @planetxpress @rafaelmagu what do you think about my proposal in #5008?

@planetxpress

@pierresouchay thank you, having that option exposed would be fantastic.

@rafaelmagu

@pierresouchay that would work for me as well.

@linydquantil

How can we avoid this problem temporarily?

@pierresouchay
Contributor

`consul leave; service consul restart` does the trick when we have this issue.

The other way, as we did (for bare-metal servers), is to compute a fixed node ID per node. We do this with the output of dmidecode (which is the same way Consul was supposed to work on Linux): #4914 (comment)

If you want something very simple, compute the node ID from the hostname, for example (in Ruby):

#!/usr/bin/env ruby
require 'digest'

def generate_node_id_for_consul(baseid)
  h = Digest::SHA512.hexdigest(baseid)
  "#{h[0..7]}-#{h[8..11]}-#{h[12..15]}-#{h[16..19]}-#{h[20..31]}"
end

nodeid=generate_node_id_for_consul(`hostname -f`.strip.downcase)

if ARGV.count > 0
  dst=ARGV[0]
  begin
    file = File.open(dst, "w")
    file.write(nodeid)
  rescue StandardError => e
    STDERR.puts "[ERROR] Cannot write to #{dst}: #{e}"
    exit 1
  ensure
    file.close unless file.nil?
  end
else
  puts nodeid
end
exit 0

Copy this Ruby script somewhere and call it like this before starting Consul:

/my/path/to/script/node-id-gen.rb /var/lib/consul/node-id

=> it will generate a predictable /var/lib/consul/node-id file based on the FQDN of your machine
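
For those without Ruby, the same derivation can be sketched in Python (a convenience port, not official tooling); since both versions slice the SHA-512 hex digest the same way, they produce identical IDs for a given FQDN:

```python
import hashlib
import socket

def generate_node_id(base_id: str) -> str:
    # Mirror the Ruby script: SHA-512 hex digest sliced into UUID shape
    # (8-4-4-4-12 hex characters).
    h = hashlib.sha512(base_id.encode("utf-8")).hexdigest()
    return f"{h[0:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:32]}"

if __name__ == "__main__":
    # Same input always yields the same ID, so it survives reboots and rebuilds.
    print(generate_node_id(socket.getfqdn().strip().lower()))
```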

@linydquantil

Got it, thanks

@pierresouchay
Contributor

@aashitvyas Vote for it, but I am not from HashiCorp, so I cannot decide :)

@aashitvyas

@pierresouchay oops!! Sorry, I didn't know; I am a newbie on GitHub. I will vote for it to get some attention.

@EugenMayer

EugenMayer commented Dec 19, 2018

Same issue here, specifically when upgrading from 1.2.2 to 1.2.3 and also using -disable-host-node-id=false. In my case it happens when I stop/rm a Docker container (service) and then re-add it, while the Consul server keeps its state.

I will add a test case here for that (I actually already have one: https://github.com/EugenMayer/consul-docker-stability-tests)

@a-zagaevskiy
Contributor

Hi, everyone! We've also bumped into this issue. What was strange: every subsequent attempt to register a client failed with an error saying the node name was reserved by a previously created client.

Here's filtered logs from a normally working client:

==> Starting Consul agent...
==> Consul agent running!
           Version: 'bf500a2'
           Node ID: 'b1431f6d-5337-27e1-b045-4ed8615642af'
         Node name: 'QA-T49'
        Datacenter: 'primary' (Segment: '')
...
    2019/01/30 13:48:58 [DEBUG] agent: Using random ID "b1431f6d-5337-27e1-b045-4ed8615642af" as node ID

But after some problems with cluster state, next client's launches from this host ended up with errors:

==> Starting Consul agent...
==> Consul agent running!
           Version: 'bf500a2'
           Node ID: '4683b4c4-e50d-63f0-efca-ff4e5da9b0e6'
         Node name: 'QA-T49'
        Datacenter: 'primary' (Segment: '')
...
    2019/01/30 13:51:09 [DEBUG] agent: Using random ID "4683b4c4-e50d-63f0-efca-ff4e5da9b0e6" as node ID
    2019/01/30 13:51:12 [ERR] consul: "Catalog.Register" RPC failed to server 172.17.18.40:8300: rpc error making call: failed inserting node: Error while renaming Node ID: "4683b4c4-e50d-63f0-efca-ff4e5da9b0e6": Node name QA-T49 is reserved by node b1431f6d-5337-27e1-b045-4ed8615642af with name QA-T49
...

==> Starting Consul agent...
==> Consul agent running!
           Version: 'bf500a2'
           Node ID: '2b4f6ba0-20ff-2557-30cf-444d8460fb68'
         Node name: 'QA-T49'
        Datacenter: 'primary' (Segment: '')
...
    2019/01/30 14:21:56 [DEBUG] agent: Using random ID "2b4f6ba0-20ff-2557-30cf-444d8460fb68" as node ID
    2019/01/30 14:21:56 [ERR] consul: "Catalog.Register" RPC failed to server 172.17.9.30:8300: rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: "2b4f6ba0-20ff-2557-30cf-444d8460fb68": Node name QA-T49 is reserved by node 4683b4c4-e50d-63f0-efca-ff4e5da9b0e6 with name QA-T49

... and so on. As you can see, each new node could not register because of the previous one.

@pierresouchay
Contributor

@AlexanderZagaevskiy yes, that's the same issue. You can either vote for
#5008 or simply generate a fixed node ID as explained here:
#4741 (comment) and #4741 (comment)

@pearkes pearkes modified the milestones: Upcoming, 1.4.3 Feb 19, 2019
@pearkes added the type/enhancement label (Proposed improvement or new feature) and removed the needs-discussion label (Topic needs discussion with the larger Consul maintainers before committing to it for a release) Feb 20, 2019
@pearkes
Contributor

pearkes commented Feb 20, 2019

We're likely going to address this by utilizing Serf, but will also consider the full discussion in #5008. See this comment for more on that approach: #5008 (comment)

@koshatul

I use this to generate the node ID, keeping node IDs the same across reboots and rebuilds.

uuidgen -N "$(hostname -s)" --namespace "@dns" --sha1
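
For hosts without util-linux's uuidgen, the same derivation can be sketched in Python: uuid.uuid5 over the DNS namespace computes the same name-based SHA-1 UUID that `uuidgen --sha1 --namespace "@dns"` does (the hostname lookup below is just an illustration; substitute whatever name you key on):

```python
import socket
import uuid

def stable_node_id(name: str) -> str:
    # Name-based (version 5, SHA-1) UUID over the DNS namespace:
    # the same input always yields the same UUID, so the node ID is
    # stable across reboots and rebuilds.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, name))

if __name__ == "__main__":
    # Equivalent of uuidgen -N "$(hostname -s)" --namespace "@dns" --sha1
    print(stable_node_id(socket.gethostname().split(".")[0]))
```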

@avoidik

avoidik commented Mar 27, 2019

May I ask for the status? Has it been addressed by #4415?

@wimfabri

wimfabri commented Apr 9, 2019

To generate a node ID with Ansible, I have this in the template for the config file:
"node_id": "{{ ansible_hostname | to_uuid }}",

@deeco

deeco commented Apr 19, 2019

I am seeing this issue with 1.4.4. An instance is destroyed without deregistering; when it spins up again with the same name and IP, I get the error below. I am testing this in a lab and am fairly new to Consul setup.

2019/04/19 14:24:04 [ERR] agent: failed to sync remote state: rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: "e1ba7d76-2da8-46ae-cc32-9b54c7713a01": Node name consultest is reserved by node e48aca4a-49d7-9207-744b-4224dcd25916 with name consultest

config.json contents are below. Also, the HTTP status does not refresh when I stop the service; it still reports OK.

{ "bootstrap": false, "datacenter": "dc1", "data_dir": "/var/consul", "leave_on_terminate": true, "disable_host_node_id": true, "encrypt": "Dt3P9SpKGAR/DIUN1cDirg==", "log_level": "INFO", "enable_syslog": true, "bind_addr": "172.20.20.50", "client_addr": "172.20.20.50", "start_join": ["172.20.20.10", "172.20.20.20", "172.20.20.30"], "service": { "name": "mywebservice", "port": 80, "meta": { "lb_type": "http", "service": "MyWebService" }, "checks": [ { "name": "HTTP /mywebservice on port 80", "http": "http://localhost:80", "tls_skip_verify": true, "interval": "30s", "timeout": "25s" } ] } }

@banks
Member

banks commented Apr 26, 2019

@avoidik status is that #5485 should fix this, which is blocked on hashicorp/memberlist#189, which is blocked on us all racing to get other stuff done for a release deadline!

So the fix is in sight but just needs last few tweaks and merges. I suspect it will be fixed in a patch release after 1.5.0, probably 1.5.1.
