Duplicate Node IDs after upgrading to 1.2.3 #4741
Comments
Was your node cleaned up and re-installed? (I mean, is the message plausible?) Do you use the leave_on_terminate option? Do you use https://www.consul.io/docs/agent/options.html?#disable_host_node_id with a false value (that is, do you have a fixed ID per node? I assume you do not)?
I upgraded from 1.2.2 to 1.2.3 so the consul agent restarted for that upgrade. I don't have the leave on terminate or disable_host_node_id options on. I've upgraded many times with my current settings without issue. |
Yes, I just installed 1.2.3; when node2 restarted, it also hit this issue.
Actually, we also had this issue, for a precise reason: we re-install nodes from scratch. When re-installing a node, starting with version 0.8.5, the node ID is no longer predictable by default, so the node ID might be different, which leads to this message. We fixed this (on bare-metal hosts) by forcing a predictable node ID per host, so that when we re-install the node, Consul sees it as the same machine. Another way is to properly leave the cluster when the "first" node is being decommissioned. The other way to trigger this is with VMs: if you re-run a node with an identical node name, its ID is also different, so Consul refuses to let the new node "steal" the node name, because the previous node was seen only recently. Since I worked on this feature in #4415 (because we had 2 nodes stealing names), I am quite interested in your specific usage: are you in one of those cases, or is it another corner case?
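To make the name-reservation behaviour described above concrete, here is a deliberately simplified toy model in Ruby. It is not Consul's implementation: the class, its methods, and the shortened error text are hypothetical, and the real check involves more state (for example, how recently the previous node was seen). It only illustrates why a rebuilt host with a new ID and an old name is rejected until the old entry is gone.

```ruby
# Toy model of the node-name reservation check; NOT Consul's actual code.
class ToyCatalog
  class NameReserved < StandardError; end

  def initialize
    @id_by_name = {} # node name => node ID currently holding that name
  end

  # Register a node. The same ID may re-register under its name; a different
  # ID asking for a name that is already held is rejected.
  def register(node_id, name)
    owner = @id_by_name[name]
    if owner && owner != node_id
      raise NameReserved, "Node name #{name} is reserved by node #{owner}"
    end
    @id_by_name[name] = node_id
  end

  # Model a graceful leave (or a forced removal): the name becomes free again.
  def deregister(name)
    @id_by_name.delete(name)
  end
end
```

In this model a host that keeps a fixed ID can always re-register, while a host that comes back with a fresh ID must wait until the old entry has been removed.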
Hi, I observed node ID clashes (a long time ago, v0.9?) when I was deploying VMs from a common image without changing its SMBIOS field (what "dmidecode" fetches from inside the VM). This was also observed (quite some time ago, v0.9?) when I was using Consul inside LXD containers.
I'm seeing the same - we're running with the docker image from docker-hub. We set the node name fixed with the -node option.
I am seeing the same issue with 1.2.3 as previously mentioned, getting error:
After some investigation and trying the disable_host_node_id setting, I decided to go back to Consul 1.2.2 in our cluster, and everything works as expected again. Looking through previous issues, could this new behavior be a result of #4415? A little more information about my setup: it's not uncommon for nodes to be restarted and brought back online relatively quickly from time to time, with on the order of a minute of total downtime for any particular node. Our nodes also have fixed node names, and each node is on the same physical hardware when taken down and brought back online.
@pierresouchay I am on bare metal so my host names and IPs tend to be long lasting in my environment. My configs for the consul client look something like this: { |
Amending my previous comment after some extensive testing of the issue we are now hitting with 1.2.3; if there is a suggested config path to avoid this issue for this version and beyond, it would be much appreciated.
Our Consul clients are typically Linux hosts in various cloud providers (OpenStack, AWS, GCP). While we typically reuse hostnames when we delete and re-create clients, they typically get a new random IP address from the subnet. We leave disable_host_node_id at its default (true) because we can't use deterministic node IDs when these hosts get new IP addresses. We also have leave_on_terminate=true configured for our Consul clients.
- If a client gets a clean shutdown, we have no problem, because it leaves the cluster.
- If a client does not get a clean shutdown and remains registered in the cluster, but is rebuilt with a different IP, we also have no problem.
- If a client does not get a clean shutdown, remains registered in the cluster, and is rebuilt with the same IP, we run into the node renaming error.
Since we can guarantee neither a clean shutdown nor a different IP when rebuilding a host, in an environment where we can't use deterministic node IDs, what would be the recommended configuration path in this case?
I was able to reproduce this issue on a node that did not get a clean shutdown and was rebuilt with a different IP address, so the IP it gets does not seem to matter in this case, only that it has the same hostname. At this point we aren't sure of the proper way to handle Consul clients that don't get a clean shutdown and need to be re-created. Shouldn't this be a common, expected deployment pattern that does not require manually forcing clients to leave?
@planetxpress well, on our side, we are using bare-metal nodes with fixed node-ID generation. It means that when a node is renamed, it gets its new name, but this protects against having 2 nodes registering with the same name (which broke our production twice). When we replace a node with another one, we also give it another name, so it is not a real issue for us. For users who wish to quickly reuse names for different machines with different IDs, without properly leaving the cluster before shutting down, it might cause trouble, however. The solution might be to allow skipping this test via a new configuration option. @mkeeler do you have an opinion on this?
So, to confirm, there's no fix for an existing node in a cloud environment that runs into this issue? |
On Linux, most Go programs do not run as root (or at least, they should not). By default, the kernel uses strict permissions, so most userland programs cannot read `/sys/class/dmi/id/product_uuid`. Programs such as Consul rely on it to get fixed IDs; when they cannot read it, they get a different ID on each boot. We propose to use `/etc/machine-id` as a fallback: https://www.freedesktop.org/software/systemd/man/machine-id.html
To fix this, this patch does the following:
- if `/sys/class/dmi/id/product_uuid` can be read, use it for HostID
- else if `/etc/machine-id` exists and has 32 chars, use it and add '-' to have the same format as product_uuid
- finally, if nothing works, use the `kernel.random.boot_id`
This will greatly increase the number of programs that behave correctly when they rely on having a fixed HostID.
This will fix the following issues:
- shirou#350
- hashicorp/consul#4741
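The fallback order in that patch description can be sketched in a few lines of Ruby. This is an illustration, not the patch itself (which is Go): the function names and keyword arguments are hypothetical, and `/proc/sys/kernel/random/boot_id` is used as the file behind the `kernel.random.boot_id` sysctl. The paths are parameters so the logic can be tried without root.

```ruby
# Sketch of the fallback order described above; helper names are illustrative.

# Turn 32 hex characters into the 8-4-4-4-12 layout used by product_uuid.
def format_as_uuid(hex32)
  "#{hex32[0, 8]}-#{hex32[8, 4]}-#{hex32[12, 4]}-#{hex32[16, 4]}-#{hex32[20, 12]}"
end

# Return the stripped file content, or nil if the file is missing or unreadable
# (for example, product_uuid when not running as root).
def read_or_nil(path)
  File.read(path).strip
rescue SystemCallError
  nil
end

def host_id(product_uuid_path: "/sys/class/dmi/id/product_uuid",
            machine_id_path: "/etc/machine-id",
            boot_id_path: "/proc/sys/kernel/random/boot_id")
  # 1. Hardware UUID: stable across reinstalls, but usually root-only.
  uuid = read_or_nil(product_uuid_path)
  return uuid if uuid && !uuid.empty?

  # 2. machine-id: 32 hex chars, stable across reboots of the same install.
  machine_id = read_or_nil(machine_id_path)
  return format_as_uuid(machine_id) if machine_id && machine_id.length == 32

  # 3. boot_id: last resort, changes on every boot.
  read_or_nil(boot_id_path)
end
```

Only the last step yields an ID that changes on every boot, which is the situation the patch tries to make rare.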
There are actually several issues:
I was thinking about adding an option for instance Still, I am wondering why |
Any advice on downgrading to 1.2.2? When we do so, we get errors like
I think this is because #4535 landed in 1.2.3 as well. |
@TylerLubeck you are exactly right. Unfortunately, downgrading is not easy when new messages have been added to the state store that the older version can't decode. In some rare cases where we really know it can't be a problem, we do have a way to mark some message types we add as "safe to ignore", which means older versions can downgrade. In this specific case that wasn't possible, because the change in 1.2.3 was to fix a bug where we forgot to store some pretty critical bits of state in the snapshot! We can't make it safe to downgrade automatically in those cases, because the state could become inconsistent if some new objects that were written get dropped and lost while other state that was written afterwards, or in the same transaction, is left. This is why we recommend taking a snapshot before any upgrade, so that if you do need to revert you can use it, but I appreciate that's not always possible, especially if you don't find the issue that needs a downgrade until much later.
Overall, to others: we have this issue on our radar, and it highlights how tricky some of the reasoning around "host identity" is. The hard thing is trying to reconcile all of these common patterns. Initially Consul was closely tied to hostnames and IPs alone, and that caused problems for folks running on containers or VMs where the IP could change. It also caused problems for anyone who had to rename their hosts without changing IP. We introduced IDs to try to get around this, but then there is no good way to uniquely identify a host. The closest are the hardware identifiers or state on disk (how we manage node IDs now), but that changes if your VM dies and gets replaced, which is kind of exactly what it's meant to do.
It feels a bit like whenever we fix a "bug" for one use case, it causes trouble for other legitimate use cases in ways that were hard to anticipate (e.g. only if you fail to leave gracefully). That said, there is clearly an issue here to resolve (or a few), so we will consider it more.
It might be that we just make the node name stealing protection configurable, like Pierre suggested, but that would likely come with caveats. Optionally allowing stealing for nodes that are failing serf checks is also reasonable, although it possibly leads to issues if the host is not really down.
@rafaelmagu it seems there is no current configuration that just always works if your hosts go away permanently without gracefully leaving, but as far as I understand the situation, I think it would be possible to have this work if whatever automation/human is creating your replacement host could also issue a
By that I don't mean we won't change behaviour here (we'll think some more), but I think it's at least possible to operate Consul in a cloud environment if you ensure your process for replacing a node can forcefully remove the old one if it left ungracefully. Does that help you?
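One way replacement automation could forcefully remove an old member is Consul's `consul force-leave <node>` subcommand. The Ruby helper below is a minimal sketch under that assumption: the function names and the `dry_run` flag are hypothetical, it assumes the `consul` binary is on the PATH and can reach a cluster agent, and whether this clears the name reservation quickly enough in a given setup would need to be verified.

```ruby
# Hypothetical helper for host-replacement automation. Only the
# `consul force-leave <node>` subcommand itself is a real Consul CLI call;
# the function names and the dry_run flag are illustrative.

# Build the argument vector for removing a member that left ungracefully.
def force_leave_command(node_name)
  raise ArgumentError, "node name must not be empty" if node_name.to_s.strip.empty?

  ["consul", "force-leave", node_name]
end

# Run the command, or just return it when dry_run is true (the default here,
# so that trying the sketch never touches a real cluster).
def force_leave(node_name, dry_run: true)
  cmd = force_leave_command(node_name)
  return cmd if dry_run

  # The array form of system avoids shell interpolation of the node name.
  system(*cmd) or raise "force-leave failed for #{node_name}"
  cmd
end
```

A provisioning step would call `force_leave("old-node-name", dry_run: false)` before starting the agent on the replacement host.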
@mgresser @davidkarlsen @planetxpress @rafaelmagu what do you think about my proposal in #5008?
@pierresouchay thank you, having that option exposed would be fantastic.
@pierresouchay that would work for me as well. |
How can I avoid this problem temporarily?
Another way, as we did (for bare-metal servers), is to compute the node-id from a fixed ID for each node. We do this from the output of dmidecode (which is how Consul was supposed to work on Linux): #4914 (comment)
If you want something very simple, compute the node ID from the hostname. Example (in Ruby):
#!/usr/bin/env ruby
require 'digest'

# Derive a UUID-shaped node ID from the first 32 hex chars of a SHA-512 digest.
def generate_node_id_for_consul(baseid)
  h = Digest::SHA512.hexdigest(baseid)
  "#{h[0..7]}-#{h[8..11]}-#{h[12..15]}-#{h[16..19]}-#{h[20..31]}"
end

# The lowercased fully qualified hostname is the stable input.
nodeid = generate_node_id_for_consul(`hostname -f`.strip.downcase)
if ARGV.count > 0
  # Write the ID to the file given as the first argument.
  dst = ARGV[0]
  begin
    File.write(dst, nodeid)
  rescue StandardError => e
    STDERR.puts "[ERROR] Cannot write to #{dst}: #{e}"
    exit 1
  end
else
  # No argument: print the ID.
  puts nodeid
end
exit 0
Copy this Ruby script somewhere and call it this way before starting Consul: /my/path/to/script/node-id-gen.rb /var/lib/consul/node-id => it will generate a predictable
Got it, thanks |
@aashitvyas Vote for it, but I am not from Hashicorp, I cannot decide :) |
@pierresouchay oops !! sorry didn't know, I am newbie on Github. I will vote for it to get some attention |
Same issue here, specifically when upgrading from 1.2.2 to 1.2.3 and also using I will add a test case here for that ( i actually already have one https://github.com/EugenMayer/consul-docker-stability-tests ) |
Hi, everyone! We've also bumped into this issue. What was strange: every subsequent attempt to register a client finished with an error saying the node name was reserved by a previously created client. Here are filtered logs from a normally working client:
But after some problems with the cluster state, subsequent client launches from this host ended with errors:
... and so on. As you can see, each new node could not register because of the previous one.
@AlexanderZagaevskiy yes, that's the same issue. You can either vote for |
We're likely going to address this by utilizing Serf, but will also consider the full discussion in #5008. See this comment for more on that approach: #5008 (comment) |
I use this to generate the node-id to keep the node-ids the same across reboots and rebuilds.
May I ask for the status? Has it been addressed with #4415?
to generate a nodeid with ansible, I have this in the template for the config file: |
I am seeing this issue with 1.4.4: an instance is destroyed without deregistering, and when it is spun up again with the same name and IP, I get the error below. I am testing this in a lab and am fairly new to Consul setup.
config.json contents are below. Also, the HTTP status does not refresh: when I stop the service, it still says OK.
@avoidik the status is that #5485 should fix this; it is blocked on hashicorp/memberlist#189, which is blocked on us all racing to get other stuff done for a release deadline! So the fix is in sight but just needs the last few tweaks and merges. I suspect it will be fixed in a patch release after 1.5.0, probably 1.5.1.
After upgrading to 1.2.3, hosts begin reporting the following when registering with the catalog. The node name of the host making the new reservation and the node name of the conflicting node always match.
[ERR] consul: "Catalog.Register" RPC failed to server xxx.xxx.xxx.xxx:8300: rpc error making call: failed inserting node: Error while renaming Node ID: "4833aa15-8428-1d1a-46d8-9dba157dbc60": Node name xxx.xxx.xxx is reserved by node c5dc5b48-f105-79f0-7910-3de6629fddd0 with name xxx.xxx.xxx
Restarting Consul on the host seems to resolve the issue, at least temporarily. I was able to reproduce this in a mixed environment of CentOS hosts ranging from major release 5 through 7, with most hosts on CentOS 7.