-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow to rename nodes with IDs, will fix #3974 and #4413 #4415
Allow to rename nodes with IDs, will fix #3974 and #4413 #4415
Conversation
…#4413 This change allow to rename any well behaving recent agent with an ID to be renamed safely, ie: without taking the name of another one with case insensitive comparison. Deprecated behaviour warning ---------------------------- Due to asceding compatibility, it is still possible however to "take" the name of another name by not providing any ID. Note that when not providing any ID, it is possible to have 2 nodes having similar names with case differences, ie: myNode and mynode which might lead to DB corruption on Consul server side and lead to server not properly restarting. See hashicorp#3983 and hashicorp#4399 for Context about this change. Disabling registration of nodes without IDs as specified in hashicorp#4414 should probably be the way to go eventually.
@banks @mkeeler Ok, I implemented exactly what you said in #4399 (comment) The only differences are (apart from being a single commit):
Hope you will be satisfied with this Kind regards |
block since it breaks the test TestAgentAntiEntropy_Services While the else case is probably legit, it will be fixed with hashicorp#4414 in a later release.
enforce this test only for nodes having IDs. Thus most tests without any ID will work, and allows us fixing
`TestStateStore_EnsureNode` now test registration and renaming with IDs `TestStateStore_EnsureNodeDeprecated` tests registration without IDs and tests removing an ID from a node as well as updated a node without its ID (deprecated behaviour kept for backwards compatibility)
node.ID, n.Node, node.Node) | ||
} | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The else statement here is one of the problematic ones. I would recommend replacing with this statement.
}
// TODO: else statement missing to ensure name is not already taken by another node
// If added right now it could break existing behavior which allows for adding
// a UUID to a node that doesn't have one when the name is the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkeeler I think the else statement is now SAFE since s.ensureNoNodeWithSimilarNameTxn
will not match any node without an ID. It means it will break ONLY if: a node having an ID clashes (case insensitive) with another node having an ID.
In all other cases, the behaviour is unchanged as shown in tests deprecatedEnsureNodeWithoutIDCanRegister
of agent/consul/state/catalog_test.go.
What works:
When request has an ID:
- renaming a node when having an ID, without colliding with the name of another node (case insensitive)
- creating a node with an ID for which the name is not taken by anybody (case insensitive)
When request has not ID (more permissive regarding case insensitivity)
- Create a new node without ID with name not colliding
- Remove a node ID of a node with an ID if name EXACTLY (case sensitive) matches
- Create a new node without ID with name colliding with case insensitive differences
What does not work anymore due to this change:
- Create a node with an ID, but this name collides with another one (case insensitive) having an ID (works if other node DOES not have any ID)
Previously, given the following configuration
id42 - node1
Registering same name with different ID is blocked
id43 - node2 => previously would 'steal' node2 to id42 ==> now blocked
id43 - NoDe2 => previously would create a second node (id42: node2 ; id43: NoDe2) => the big crash I had described here #4413 => now the registration is blocked and id43 won't be able to join
Conclusion
- A request with an ID can "steal" a node without any ID, but cannot steal a node with an ID if ID is different
- A request without any ID can do whatever it wants
So, basically, the ONLY new possible issue we have is:
- if you change manually the ID of a node, this node won't be able to join the cluster
- if 2 nodes have different IDs and case insensitive differences, only ONE can join, which is good new for DNS as well as it avoid the DB corruption I had.
What do you think of this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand now.
We will error out if both if an existing node with a uuid != new uuid where there is a name match but we will still allow adding a uuid to a node that doesn't have one. You just moved the logic to only enforce it when they both have uuids into that function.
agent/consul/state/catalog.go
Outdated
return fmt.Errorf("node ID %q for node %q aliases existing node %q", | ||
node.ID, node.Node, n.Node) | ||
// Lets first get all nodes and check whether name do match | ||
dupNameError := s.ensureNoNodeWithSimilarNameTxn(tx, node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a bug here.
Because the name check only will match if the other node has a UUID and the UUID doesn't match as well, this will potentially allow merging two existing nodes into 1.
An example of this is if there are two existing nodes: node foo has a uuid and node bar does not. If you attempt to rename foo to bar it will not error out for the name match since the bar node has no uuid. It will delete node foo and then add the uuid to the bar node and update all its info. This is definitely not desired.
Adding a uuid to a node (the else case where we had no uuid match) is okay to allow for upgrading to uuids. However we need to prevent this case where we have two existing nodes and we merge them into one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkeeler Well spotted, I introduced this bug after I saw TestAgentAntiEntropy_Services
not working => should be fixed in next patch
Renaming is not allowed when another node, including without nodeID exists in next patch.
@mkeeler Something is wrong... I saw by adding some tests for renaming. It seems that index ("nodes", "uuid") is not properly updated... I think indexing using "uuid" for nodes does not work well. I'll keep you updated |
@mkeeler The last patch fix an existing bug: GetNodeID() and similar lookups were not working properly since index uuid was UUIDIndex while lookups were performed on strings. Thus, we were never entering in the test for renaming. This fixes both and include tests cases that includes all we discussed, this time should be the good one ! :-) |
…searching node from its ID Thus, all tests about renaming were not working properly. Added the full test cas that allowed me to detect it.
aff4a82
to
6f52314
Compare
@mkeeler the last patch explains why I never got the "aliases existing node" in prod when renaming nodes, basically, the feature was not working at all since lookup per node uuid did not work at all, what do you think about the change now? |
49132e3
to
77f8781
Compare
In my last patches, I just improved the error messages when So basically this review:
I will wait for a review anyway now, I think we are good |
77f8781
to
8cba97c
Compare
Travis test are not stable once again :( |
e602ccc
to
4ec58fb
Compare
More complete test coverage for GetNodeID
4ec58fb
to
55d3380
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @pierresouchay I could be wrong still but this is getting pretty complicated to understand. I think one of the specific breaking changes from before is at-best untested here and potentially broken again (although the tests it would incidentally have broken before seem to pass for now).
Can we add an explicit test both to document the intent and to be convinced this is safe now? I tried to explain what I mean inline - hope it's clear enough!
for nodeIt := enodes.Next(); nodeIt != nil; nodeIt = enodes.Next() { | ||
enode := nodeIt.(*structs.Node) | ||
if strings.EqualFold(node.Node, enode.Node) && node.ID != enode.ID { | ||
if !(enode.ID == "" && allowClashWithoutID) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this cover the case where the new node (node.ID == "") but clashing node isn't? That is specifically what broke tests before.
I'll read on and may well find a reason why that can't happen here but just noting. Even if that is a true invariant might be worth commenting because we want to allow either the existing or the new node to have blank ID without changing behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function does not, but basically yes since this function is called ONLY when node.ID != ""
In my next and (hopefuly) last patch, I did not change anything but added tests to this function.
To summarize
When ID is provided:
- if the node already exists, you rename the node if no conflicts with another node
- if node does not exists, check for conflicts, but ONLY for nodes having ID (thus we allow to "upgrade" a node without ID and add an ID)
When ID is not provided:
- free lunch, you can "downgrade" a node by removing its ID - BUT ONLY if you have the exact same name as an existing node without ID (Kept for your tests and compatibility reasons)
- you can create a node of course
So, it means this PR is fixing #4413 (not being able to have two different agents with with that differs only by their case) ONLY in clusters where everybody is playing nice and has an ID (which is your case in our clusters :-)
As long as you have agents that register without any ID, you are in danger, you may have node2
and NoDe2
. But we keep all of this for the ascending compatibility with old clusters as well as your tests
agent/consul/state/catalog.go
Outdated
} | ||
} | ||
// TODO: a else statement is missing here to check we are not stealing | ||
// a similar case-insensitive name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment still accurate? I thought that is what this PR is meant to fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am removing it in next patch.
if out.CreateIndex != 1 || out.ModifyIndex != 7 { | ||
t.Fatalf("[DEPRECATED] bad CreateIndex/ModifyIndex returned: %#v", out) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no test for the reverse case which is behaviour our test suite (and possibly other clients) rely on:
- A node is registered with an ID (by a real agent starting)
- Later tooling re-registers the same node (say to update it's node meta) but doesn't bother to specify the
ID
field (because it never had to before)
The above should continue to work and replace the existing node with the new one without an id.
All that said, CI shows that the same tests are not broken right now but they were only reliably broken before due to another change we reverted. They could still be non-deterministically broken since timing was involved.
Specifically in tests like TestPrepaparedQuery_Execute
:
- The test agent starts up assigns itself a UUID and in the background Anti-entroypy eventually registers that with the server (normally in milliseconds but that is enough of a delay to make it non-deterministic...)
- Then actual test code like this:
if err := msgpackrpc.CallWithCodec(codec1, "Catalog.Register", &req, &reply); err != nil { - Re-registers the same node name to change it's health status. This registration happens not to have an ID so any of the following orderings can happen:
- Agent AE gets there first so the existing node in state does have an ID while the incoming registration doesn't - this is the case that is missing an explicit test here I think
- test code registers first with no ID and then the AE upgrades it to have an ID. Then subsequent code in
setHealth
registers again without an ID.
It's pretty clear that the test is racey anyway regardless of the ID thing, but this is why we think it's important that we don't break the case where and existing node DOES have an ID but a registration doesn't specify one. We can always fix this test and arguably should, but other people might have legitimately built something that does something similar to this test and we can't break them.
Is that clear?
Now it might be that your code is correct already and will always pass that test regardless of timing/interleaving - esp. given that it's not failing now. but I don't see an explicit test case here that verifies this behaviour and the logic is complicated enough I'm not sure I trust my self to reason about if it works alone 😄.
Could we add that?
Also fixed comments to be clearer after remarks from @banks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Pierre, this is looking good. I still had to think really hard here and a minor tweak noted. I think Matt has another suggestion but looks like we are getting close so I'll approve my concerns here.
agent/consul/state/catalog_test.go
Outdated
t.Fatalf("Should return an error since another name with similar name exists") | ||
} | ||
if err := s.ensureNoNodeWithSimilarNameTxn(tx, node, true); err != nil { | ||
t.Fatalf("Should return an error since another name with similar name exists") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh I was about to write a whole other thing about how this still doesn't exercise the no-ID free lunch case we've been talking about but then I looked closer and spotted this final case has err != nil
😄 .
The message here though still asserts the opposite of the condition though which is what confused me - it should be "Should not return an error since re-registering an identical node name without an ID should not be broken" or something.
Other than that I think this is good now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@banks Thank you, fixed the error message in test case in next patch!
@mkeeler Do you feel comfortable with the PR now? Kind Regards |
@mkeeler Is the review Ok now? We had 3 incidents (including one major one) to to the "renaming", this PR is really important to us. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the one remaining thing with using go-uuid to parse the UUID instead of custom code.
@@ -421,6 +463,32 @@ func (s *Store) GetNode(id string) (uint64, *structs.Node, error) { | |||
return idx, nil, nil | |||
} | |||
|
|||
func getNodeIDTxn(tx *memdb.Txn, id types.NodeID) (*structs.Node, error) { | |||
strnode := string(id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The uuid parsing should be done with the ParseUUID function from github.com/hashicorp/go-uuid. An example of this is in the Catalog Register RPC endpoint: https://github.com/hashicorp/consul/blob/master/agent/consul/catalog_endpoint.go#L37
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE
Sorry about that @pierresouchay . Apparently I had done everything besides hit the "Submit Review" button on Monday |
@mkeeler No problem :-) The fix is DONE regarding the parsing of uuid anyway. Have a great Week-End or Holidays |
Before merging this I am going to pull down the branch and make sure all unit tests run as expected. Travis seems to always have different errors each time. |
@mkeeler any news on your tests? |
@pierresouchay All tests pass locally. So I will go ahead and merge this. |
@mkeeler \o/ Thank you so much |
This change allow to rename any well behaving recent agent with an
ID to be renamed safely, ie: without taking the name of another one
with case insensitive comparison.
Deprecated behaviour warning
Due to asceding compatibility, it is still possible however to
"take" the name of another name by not providing any ID.
Note that when not providing any ID, it is possible to have 2 nodes
having similar names with case differences, ie: myNode and mynode
which might lead to DB corruption on Consul server side and
lead to server not properly restarting.
See #3983 and #4399 for Context about this change.
Disabling registration of nodes without IDs as specified in #4414
should probably be the way to go eventually.