Allow to rename nodes with IDs, will fix #3974 and #4413 #4415

pierresouchay · 2018-07-18T22:27:59Z

This change allow to rename any well behaving recent agent with an
ID to be renamed safely, ie: without taking the name of another one
with case insensitive comparison.

Deprecated behaviour warning

Due to asceding compatibility, it is still possible however to
"take" the name of another name by not providing any ID.

Note that when not providing any ID, it is possible to have 2 nodes
having similar names with case differences, ie: myNode and mynode
which might lead to DB corruption on Consul server side and
lead to server not properly restarting.

See #3983 and #4399 for Context about this change.

Disabling registration of nodes without IDs as specified in #4414
should probably be the way to go eventually.

…#4413 This change allow to rename any well behaving recent agent with an ID to be renamed safely, ie: without taking the name of another one with case insensitive comparison. Deprecated behaviour warning ---------------------------- Due to asceding compatibility, it is still possible however to "take" the name of another name by not providing any ID. Note that when not providing any ID, it is possible to have 2 nodes having similar names with case differences, ie: myNode and mynode which might lead to DB corruption on Consul server side and lead to server not properly restarting. See hashicorp#3983 and hashicorp#4399 for Context about this change. Disabling registration of nodes without IDs as specified in hashicorp#4414 should probably be the way to go eventually.

pierresouchay · 2018-07-18T22:36:10Z

@banks @mkeeler Ok, I implemented exactly what you said in #4399 (comment)

The only differences are (apart from being a single commit):

another test when performing a registration with an ID (see https://github.com/hashicorp/consul/pull/4415/files#diff-2271ca70cf81490eff6dd6060d3f35b5R396 ) => this test in the else was missing, so we still could create a node with a similar name but different case !
more unit tests coverage, with explicit comments about deprecated behavior

Hope you will be satisfied with this

Kind regards

block since it breaks the test TestAgentAntiEntropy_Services While the else case is probably legit, it will be fixed with hashicorp#4414 in a later release.

pierresouchay · 2018-07-18T22:50:20Z

@mkeeler I had to remove the else test for case insensitive search in my second commit since it broke test TestAgentAntiEntropy_Services. While that's probably fine for renaming, it means it won't fix properly #4413 :(

Still some room for further PRs then

enforce this test only for nodes having IDs. Thus most tests without any ID will work, and allows us fixing

pierresouchay · 2018-07-18T23:36:49Z

@mkeeler I think I got it right now. I just added && enode.ID != "" to avoid collisions with nodes without IDs and all tests do pass.

So, it basically fixes #4413 and #3974 again !

`TestStateStore_EnsureNode` now test registration and renaming with IDs `TestStateStore_EnsureNodeDeprecated` tests registration without IDs and tests removing an ID from a node as well as updated a node without its ID (deprecated behaviour kept for backwards compatibility)

mkeeler · 2018-07-19T14:18:26Z

agent/consul/state/catalog.go

+						node.ID, n.Node, node.Node)
+				}
+			}
+		} else {


The else statement here is one of the problematic ones. I would recommend replacing with this statement.

} // TODO: else statement missing to ensure name is not already taken by another node // If added right now it could break existing behavior which allows for adding // a UUID to a node that doesn't have one when the name is the same.

@mkeeler I think the else statement is now SAFE since s.ensureNoNodeWithSimilarNameTxn will not match any node without an ID. It means it will break ONLY if: a node having an ID clashes (case insensitive) with another node having an ID.

In all other cases, the behaviour is unchanged as shown in tests deprecatedEnsureNodeWithoutIDCanRegister of agent/consul/state/catalog_test.go.

What works:

When request has an ID:

renaming a node when having an ID, without colliding with the name of another node (case insensitive)

creating a node with an ID for which the name is not taken by anybody (case insensitive)

When request has not ID (more permissive regarding case insensitivity)

Create a new node without ID with name not colliding

Remove a node ID of a node with an ID if name EXACTLY (case sensitive) matches

Create a new node without ID with name colliding with case insensitive differences

What does not work anymore due to this change:

Create a node with an ID, but this name collides with another one (case insensitive) having an ID (works if other node DOES not have any ID)

Previously, given the following configuration

id42 - node1

Registering same name with different ID is blocked
id43 - node2 => previously would 'steal' node2 to id42 ==> now blocked
id43 - NoDe2 => previously would create a second node (id42: node2 ; id43: NoDe2) => the big crash I had described here #4413 => now the registration is blocked and id43 won't be able to join

Conclusion

A request with an ID can "steal" a node without any ID, but cannot steal a node with an ID if ID is different

A request without any ID can do whatever it wants

So, basically, the ONLY new possible issue we have is:

if you change manually the ID of a node, this node won't be able to join the cluster

if 2 nodes have different IDs and case insensitive differences, only ONE can join, which is good new for DNS as well as it avoid the DB corruption I had.

What do you think of this ?

I think I understand now.

We will error out if both if an existing node with a uuid != new uuid where there is a name match but we will still allow adding a uuid to a node that doesn't have one. You just moved the logic to only enforce it when they both have uuids into that function.

mkeeler · 2018-07-20T13:44:42Z

agent/consul/state/catalog.go

-				return fmt.Errorf("node ID %q for node %q aliases existing node %q",
-					node.ID, node.Node, n.Node)
+				// Lets first get all nodes and check whether name do match
+				dupNameError := s.ensureNoNodeWithSimilarNameTxn(tx, node)


I think there is a bug here.

Because the name check only will match if the other node has a UUID and the UUID doesn't match as well, this will potentially allow merging two existing nodes into 1.

An example of this is if there are two existing nodes: node foo has a uuid and node bar does not. If you attempt to rename foo to bar it will not error out for the name match since the bar node has no uuid. It will delete node foo and then add the uuid to the bar node and update all its info. This is definitely not desired.

Adding a uuid to a node (the else case where we had no uuid match) is okay to allow for upgrading to uuids. However we need to prevent this case where we have two existing nodes and we merge them into one.

@mkeeler Well spotted, I introduced this bug after I saw TestAgentAntiEntropy_Services not working => should be fixed in next patch

Renaming is not allowed when another node, including without nodeID exists in next patch.

…has no ID

pierresouchay · 2018-07-21T10:40:18Z

@mkeeler Something is wrong... I saw by adding some tests for renaming. It seems that index ("nodes", "uuid") is not properly updated... I think indexing using "uuid" for nodes does not work well.

I'll keep you updated

pierresouchay · 2018-07-21T16:57:41Z

@mkeeler The last patch fix an existing bug: GetNodeID() and similar lookups were not working properly since index uuid was UUIDIndex while lookups were performed on strings. Thus, we were never entering in the test for renaming.

This fixes both and include tests cases that includes all we discussed, this time should be the good one ! :-)

…searching node from its ID Thus, all tests about renaming were not working properly. Added the full test cas that allowed me to detect it.

pierresouchay · 2018-07-24T07:45:22Z

@mkeeler the last patch explains why I never got the "aliases existing node" in prod when renaming nodes, basically, the feature was not working at all since lookup per node uuid did not work at all, what do you think about the change now?

…GetNodeID()

pierresouchay · 2018-07-25T23:13:52Z

Hello @banks & @mkeeler

In my last patches, I just improved the error messages when GetNodeId() get a wrong UUID and improved the test coverage for GetNodeId() function

So basically this review:

"node ID %q for node %q aliases existing node %q" => was never working due to wrong lookup by UUID since GetNodeID() nor the test for entering this condition was working=> always in the case of renaming anyway (and not protected against case insensitivity)
GetNodeID() is now working as expected (was not working due to indexation working on UUID binary representation while lookup performed on string
Changed the behavior as explained above: Allow to rename nodes with IDs, will fix #3974 and #4413 #4415 (comment)

I will wait for a review anyway now, I think we are good

pierresouchay · 2018-07-25T23:26:09Z

Travis test are not stable once again :(

More complete test coverage for GetNodeID

banks

Hey @pierresouchay I could be wrong still but this is getting pretty complicated to understand. I think one of the specific breaking changes from before is at-best untested here and potentially broken again (although the tests it would incidentally have broken before seem to pass for now).

Can we add an explicit test both to document the intent and to be convinced this is safe now? I tried to explain what I mean inline - hope it's clear enough!

banks · 2018-07-26T15:10:17Z

agent/consul/state/catalog.go

+	for nodeIt := enodes.Next(); nodeIt != nil; nodeIt = enodes.Next() {
+		enode := nodeIt.(*structs.Node)
+		if strings.EqualFold(node.Node, enode.Node) && node.ID != enode.ID {
+			if !(enode.ID == "" && allowClashWithoutID) {


Does this cover the case where the new node (node.ID == "") but clashing node isn't? That is specifically what broke tests before.

I'll read on and may well find a reason why that can't happen here but just noting. Even if that is a true invariant might be worth commenting because we want to allow either the existing or the new node to have blank ID without changing behaviour.

This function does not, but basically yes since this function is called ONLY when node.ID != ""

In my next and (hopefuly) last patch, I did not change anything but added tests to this function.

To summarize

When ID is provided:

if the node already exists, you rename the node if no conflicts with another node

if node does not exists, check for conflicts, but ONLY for nodes having ID (thus we allow to "upgrade" a node without ID and add an ID)

When ID is not provided:

free lunch, you can "downgrade" a node by removing its ID - BUT ONLY if you have the exact same name as an existing node without ID (Kept for your tests and compatibility reasons)

you can create a node of course

So, it means this PR is fixing #4413 (not being able to have two different agents with with that differs only by their case) ONLY in clusters where everybody is playing nice and has an ID (which is your case in our clusters :-)

As long as you have agents that register without any ID, you are in danger, you may have node2 and NoDe2. But we keep all of this for the ascending compatibility with old clusters as well as your tests

banks · 2018-07-26T15:11:52Z

agent/consul/state/catalog.go

 			}
 		}
+		// TODO: a else statement is missing here to check we are not stealing
+		// a similar case-insensitive name


Is this comment still accurate? I thought that is what this PR is meant to fix?

I am removing it in next patch.

banks · 2018-07-26T15:15:31Z

agent/consul/state/catalog_test.go

+	if out.CreateIndex != 1 || out.ModifyIndex != 7 {
+		t.Fatalf("[DEPRECATED] bad CreateIndex/ModifyIndex returned: %#v", out)
+	}
+}


There is no test for the reverse case which is behaviour our test suite (and possibly other clients) rely on:

A node is registered with an ID (by a real agent starting)

Later tooling re-registers the same node (say to update it's node meta) but doesn't bother to specify the ID field (because it never had to before)

The above should continue to work and replace the existing node with the new one without an id.

All that said, CI shows that the same tests are not broken right now but they were only reliably broken before due to another change we reverted. They could still be non-deterministically broken since timing was involved.

Specifically in tests like TestPrepaparedQuery_Execute:

The test agent starts up assigns itself a UUID and in the background Anti-entroypy eventually registers that with the server (normally in milliseconds but that is enough of a delay to make it non-deterministic...)

Then actual test code like this:

consul/agent/consul/prepared_query_endpoint_test.go

Line 1992 in fa0d8af

if err := msgpackrpc.CallWithCodec(codec1, "Catalog.Register", &req, &reply); err != nil {

Re-registers the same node name to change it's health status. This registration happens not to have an ID so any of the following orderings can happen:

Agent AE gets there first so the existing node in state does have an ID while the incoming registration doesn't - this is the case that is missing an explicit test here I think

test code registers first with no ID and then the AE upgrades it to have an ID. Then subsequent code in setHealth registers again without an ID.

It's pretty clear that the test is racey anyway regardless of the ID thing, but this is why we think it's important that we don't break the case where and existing node DOES have an ID but a registration doesn't specify one. We can always fix this test and arguably should, but other people might have legitimately built something that does something similar to this test and we can't break them.

Is that clear?

Now it might be that your code is correct already and will always pass that test regardless of timing/interleaving - esp. given that it's not failing now. but I don't see an explicit test case here that verifies this behaviour and the logic is complicated enough I'm not sure I trust my self to reason about if it works alone 😄.

Could we add that?

@banks

Also fixed comments to be clearer after remarks from @banks

banks

Thanks Pierre, this is looking good. I still had to think really hard here and a minor tweak noted. I think Matt has another suggestion but looks like we are getting close so I'll approve my concerns here.

banks · 2018-07-30T13:26:11Z

agent/consul/state/catalog_test.go

+		t.Fatalf("Should return an error since another name with similar name exists")
+	}
+	if err := s.ensureNoNodeWithSimilarNameTxn(tx, node, true); err != nil {
+		t.Fatalf("Should return an error since another name with similar name exists")


Heh I was about to write a whole other thing about how this still doesn't exercise the no-ID free lunch case we've been talking about but then I looked closer and spotted this final case has err != nil 😄 .

The message here though still asserts the opposite of the condition though which is what confused me - it should be "Should not return an error since re-registering an identical node name without an ID should not be broken" or something.

Other than that I think this is good now.

@banks Thank you, fixed the error message in test case in next patch!

pierresouchay · 2018-07-30T18:19:45Z

@mkeeler Do you feel comfortable with the PR now?

Kind Regards

pierresouchay · 2018-08-03T09:35:13Z

@mkeeler Is the review Ok now?

We had 3 incidents (including one major one) to to the "renaming", this PR is really important to us.

mkeeler

Just the one remaining thing with using go-uuid to parse the UUID instead of custom code.

mkeeler · 2018-07-30T13:25:12Z

agent/consul/state/catalog.go

@@ -421,6 +463,32 @@ func (s *Store) GetNode(id string) (uint64, *structs.Node, error) {
 	return idx, nil, nil
 }

+func getNodeIDTxn(tx *memdb.Txn, id types.NodeID) (*structs.Node, error) {
+	strnode := string(id)


The uuid parsing should be done with the ParseUUID function from github.com/hashicorp/go-uuid. An example of this is in the Catalog Register RPC endpoint: https://github.com/hashicorp/consul/blob/master/agent/consul/catalog_endpoint.go#L37

mkeeler · 2018-08-03T11:37:42Z

Sorry about that @pierresouchay . Apparently I had done everything besides hit the "Submit Review" button on Monday

pierresouchay · 2018-08-03T12:31:26Z

Sorry about that @pierresouchay . Apparently I had done everything besides hit the "Submit Review" button on Monday

@mkeeler No problem :-) The fix is DONE regarding the parsing of uuid anyway.

Have a great Week-End or Holidays

mkeeler · 2018-08-03T13:59:53Z

Before merging this I am going to pull down the branch and make sure all unit tests run as expected. Travis seems to always have different errors each time.

pierresouchay · 2018-08-03T14:39:05Z

@mkeeler Sure... I am trying to fix Travis, but it takes time: #4472 :( almost there

pierresouchay · 2018-08-08T11:18:20Z

@mkeeler Since #4472 has been merged, do you want me to rebase so we can see whether the tests do pass now?

pierresouchay · 2018-08-10T14:42:46Z

@mkeeler any news on your tests?
That's currently the Consul bug that causes the most incidents in our clusters, we really would like to include this fix in next releases, that's would be great

mkeeler · 2018-08-10T15:29:38Z

@pierresouchay All tests pass locally. So I will go ahead and merge this.

pierresouchay · 2018-08-10T15:31:36Z

@mkeeler \o/ Thank you so much

pierresouchay mentioned this pull request Jul 18, 2018

Node renaming - Support for allow_node_renaming #4399

Closed

Removed the case-insensitive search when adding a node within the else

6161266

block since it breaks the test TestAgentAntiEntropy_Services While the else case is probably legit, it will be fixed with hashicorp#4414 in a later release.

Added again the test in the else to avoid duplicated names, but

1a837dc

enforce this test only for nodes having IDs. Thus most tests without any ID will work, and allows us fixing

mkeeler requested changes Jul 19, 2018

View reviewed changes

mkeeler approved these changes Jul 20, 2018

View reviewed changes

mkeeler requested changes Jul 20, 2018

View reviewed changes

Do not allow renaming in case of conflict, including when other node …

bbf5822

…has no ID

Fixed function GetNodeID that was not working due to wrong type when …

6f52314

…searching node from its ID Thus, all tests about renaming were not working properly. Added the full test cas that allowed me to detect it.

pierresouchay force-pushed the node_renaming_simple branch from aff4a82 to 6f52314 Compare July 21, 2018 17:35

Better error messages, more tests when nodeID is not a valid UUID in …

e323155

…GetNodeID()

pierresouchay force-pushed the node_renaming_simple branch from 49132e3 to 77f8781 Compare July 25, 2018 23:12

pierresouchay force-pushed the node_renaming_simple branch from 77f8781 to 8cba97c Compare July 25, 2018 23:24

pierresouchay force-pushed the node_renaming_simple branch 5 times, most recently from e602ccc to 4ec58fb Compare July 26, 2018 10:50

Added separate TestStateStore_GetNodeID to test GetNodeID.

55d3380

More complete test coverage for GetNodeID

pierresouchay force-pushed the node_renaming_simple branch from 4ec58fb to 55d3380 Compare July 26, 2018 11:02

banks requested changes Jul 26, 2018

View reviewed changes

Added new unit test TestStateStore_ensureNoNodeWithSimilarNameTxn

a58a977

Also fixed comments to be clearer after remarks from @banks

banks approved these changes Jul 30, 2018

View reviewed changes

Fixed error message in unit test to match test case

c717608

pierresouchay mentioned this pull request Aug 2, 2018

BUGFIX: Unit test relying on WaitForLeader() did not work due to wrong test #4472

Merged

mkeeler requested changes Aug 3, 2018

View reviewed changes

Use uuid.ParseUUID to parse Node.ID as requested by @mkeeler

05af3a1

mkeeler approved these changes Aug 3, 2018

View reviewed changes

mkeeler merged commit ef3b81a into hashicorp:master Aug 10, 2018

pierresouchay mentioned this pull request Sep 4, 2018

Allow to rename nodes with IDs criteo-forks/consul#5

Merged

pierresouchay mentioned this pull request Sep 14, 2018

Ensure not 2 nodes can have the same name with different case #4413

Closed

pierresouchay mentioned this pull request Oct 3, 2018

Duplicate Node IDs after upgrading to 1.2.3 #4741

Closed

pierresouchay mentioned this pull request Nov 16, 2018

disable_host_node_id": false does not work as advertised on Linux #4914

Closed

pierresouchay mentioned this pull request Feb 5, 2019

Case sensitive issue on snapshot restore #5316

Closed

Allow to rename nodes with IDs, will fix #3974 and #4413 #4415

Allow to rename nodes with IDs, will fix #3974 and #4413 #4415

Conversation

pierresouchay commented Jul 18, 2018

Deprecated behaviour warning

pierresouchay commented Jul 18, 2018 • edited Loading

pierresouchay commented Jul 18, 2018

pierresouchay commented Jul 18, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

What works:

When request has an ID:

When request has not ID (more permissive regarding case insensitivity)

What does not work anymore due to this change:

Previously, given the following configuration

Conclusion

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pierresouchay commented Jul 21, 2018

pierresouchay commented Jul 21, 2018

pierresouchay commented Jul 24, 2018

pierresouchay commented Jul 25, 2018

pierresouchay commented Jul 25, 2018

banks left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

To summarize

When ID is provided:

When ID is not provided:

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

banks left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pierresouchay commented Jul 30, 2018

pierresouchay commented Aug 3, 2018

mkeeler left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mkeeler commented Aug 3, 2018

pierresouchay commented Aug 3, 2018

mkeeler commented Aug 3, 2018

pierresouchay commented Aug 3, 2018

pierresouchay commented Aug 8, 2018

pierresouchay commented Aug 10, 2018

mkeeler commented Aug 10, 2018 • edited Loading

pierresouchay commented Aug 10, 2018

pierresouchay commented Jul 18, 2018 •

edited

Loading

mkeeler commented Aug 10, 2018 •

edited

Loading