-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cli,rpc,server: verify cluster names #39270
Conversation
@bdarnell I notice that the heartbeat services' Was there a specific reason to have the service check the cluster ID and maxOffset and not the requestor? Why is the version checked on both sides? |
@mberhault I have not yet implemented the TLS field validation as I have still to figure out the flow of data that serves to verify node-node certs. Once I understand that, I plan to extend the PR with an additional commit to do the TLS validation too. |
I think it's fair to keep the use of the cluster name in cert validation for another PR. There's still a question of where it belongs (which field) and even whether said field is customizable. |
ok thanks |
aaf944f
to
9f4a2d6
Compare
added the missing tests |
9f4a2d6
to
710c373
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This will definitely need some new docs. Though the stricter requirements (and difficulty of changing the name) will come with more uses. As is, it only impacts new nodes which is quite nice.
pkg/cli/flags.go
Outdated
// clusterNameRe matches valid cluster names. | ||
// For example, "a", "a1.23" and "a-b" are OK, | ||
// but "0123", "a-" and "123a" are not OK. | ||
var clusterNameRe = regexp.MustCompile(`^[_a-zA-Z](?:[-._a-zA-Z0-9]*[_a-zA-Z0-9]|)$`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a good reason for it, just something fairly common: maybe keep _
out of the first character group?
I'm also wondering if we will eventually have the cluster name show up in DNS. If so, that would remove _
entirely. I can't currently think of a reason we would though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/cli/cliflags/flags.go
Outdated
Description: ` | ||
Sets a name to verify the identity of a remote node or cluster, when | ||
either the node or cluster, or both, have not yet been initialized | ||
yet.`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you mention that all nodes must have the same cluster name if set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/rpc/context_test.go
Outdated
wg.Wait() | ||
}) | ||
|
||
t.Run("different names", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should work as expected from looking at the code, but could you add a test for one side with a cluster name, the other without?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/rpc/heartbeat.go
Outdated
func checkClusterName(clusterName string, peerName string) error { | ||
if clusterName != peerName { | ||
err := errors.Errorf( | ||
"local cluster name %q does not match peer cluster name %q", clusterName, peerName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we want clearer error messages in the case of peer cluster name is %q, but local cluster name is not set
(and vice versa). The quoted name should be sufficient, so up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
710c373
to
44ba8c9
Compare
44ba8c9
to
d858989
Compare
No that's not accurate. I don't think the code as-is can introduce a cluster name into a cluster that has been started without one. Should I update the code to make it possible? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As is, it only impacts new nodes which is quite nice.
No that's not accurate. I don't think the code as-is can introduce a cluster name into a cluster that has been started without one. Should I update the code to make it possible?
Sorry, that's what I meant.
With the current use, mismatches in --cluster-name
only matter when attempting to add a node. All mismatches (empty vs not, different strings) fail the node join. As long as no nodes are being added, mismatches do not cause an issue so rolling restarts with different settings are ok. This situation will either change or need special logic once certificate checking is in place.
pkg/cli/cliflags/flags.go
Outdated
Sets a name to verify the identity of a remote node or cluster. The | ||
value must match between this node and the remote node(s) specified | ||
via --join. | ||
This can be used as an additional verification when either either the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"either either"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
d858989
to
c2dd65b
Compare
I have changed the logic to support rolling upgrades via a temporary flag |
c2dd65b
to
9c15373
Compare
pkg/rpc/heartbeat.go
Outdated
@@ -56,6 +60,31 @@ type HeartbeatService struct { | |||
testingAllowNamedRPCToAnonymousServer bool | |||
} | |||
|
|||
func checkClusterName(disablePeerNameVerification bool, clusterName string, peerName string) error { | |||
if disablePeerNameVerification && (clusterName == "" || peerName == "") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you want to OR the local and remote disablePeerNameVerification
?
I'm thinking of the following situation:
- node A has flags with
--cluster-name=foo
and--disable-peer-name-verification
- node B has not had the new flags applied yet
node A is restarted as part of the name rollout.
node B restarts on its own, but still has not had the new flags applied.
At this point, node B will ping node A, but will fail due to disablePeerNameVerification = false
. This will go into the clusterName != peerName
check and fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OR is there, performed by the caller.
It's in the RFC and also checked in the unit test in the same package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops. My bad. I had it in my head that checkClusterName
's signature was (localDisable, remoteDisable, localName, remoteName)
. Sorry about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np. so good to go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good, but I'd like @bdarnell to take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 21 files at r1, 8 of 8 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @mberhault)
pkg/base/config.go, line 171 at r2 (raw file):
// both sides match. This is meant for use while rolling an // existing cluster into using a new cluster name. DisablePeerNameVerification bool
s/Peer/Cluster/
So this flag permits combining empty and non-empty, but non-empty values must match? This does make it possible to change from one name to another, but it's a very awkward process:
- Restart all nodes with disable cluster name verification flag (and the old name)
- Restart all nodes with disable flag and no name
- Restart all nodes with disable flag and new name
- Restart all nodes with new name without disable flag
I think this flag should disable the checking completely so it can be used to go directly from the old name to the new one. (which still leaves this a 3-step process, but I think it's more understandable without the "no name" step in the middle)
Maybe we should just permit a mix of empty and non-empty cluster names at all times, so that cluster names only prevent mixups when both clusters use them. I'm not sure it's worth having extra configuration to disable checks here.
pkg/cli/flags.go, line 219 at r2 (raw file):
// For example, "a", "a1.23" and "a-b" are OK, // but "0123", "a-" and "123a" are not OK. var clusterNameRe = regexp.MustCompile(`^[a-zA-Z](?:[-.a-zA-Z0-9]*[a-zA-Z0-9]|)$`)
Ooh, if these names are DNS-legal then maybe we can do some nice things with mDNS. I'd remove .
from the middle block too; we only need one separator character (and then we can translate -
to .
or _
when required by other formats, without ambiguity)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @mberhault)
pkg/base/config.go, line 171 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
s/Peer/Cluster/
So this flag permits combining empty and non-empty, but non-empty values must match? This does make it possible to change from one name to another, but it's a very awkward process:
- Restart all nodes with disable cluster name verification flag (and the old name)
- Restart all nodes with disable flag and no name
- Restart all nodes with disable flag and new name
- Restart all nodes with new name without disable flag
I think this flag should disable the checking completely so it can be used to go directly from the old name to the new one. (which still leaves this a 3-step process, but I think it's more understandable without the "no name" step in the middle)
Maybe we should just permit a mix of empty and non-empty cluster names at all times, so that cluster names only prevent mixups when both clusters use them. I'm not sure it's worth having extra configuration to disable checks here.
The name change is document in the RFC and 1) there's one less step needed than you say here 2) this can be later automated, should we want this to become a common operation, by introducing a new cluster RPC (which is also explained in the RFC).
Additonnally, the reason why I chose to not allow empty and non-empty names by default is to:
- prevent someone from mistakenly adding a node without a name to a cluster that has a name (feature ask by marc)
- prevent someone from mistakenly adding a node with a name to a cluster without a name (e.g. a pre-19.2 cluster). This was also a feature ask by marc.
- to ensure that names are consistent across the cluster.
Really we need to prevent a realistic mistake: two or more 'active" nodes started with conflicting cluster names. If you have 2+ nodes with different cluster names, it becomes impossible to reliably activate the cluster name verification throughout the cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @mberhault)
pkg/base/config.go, line 171 at r2 (raw file):
Previously, knz (kena) wrote…
The name change is document in the RFC and 1) there's one less step needed than you say here 2) this can be later automated, should we want this to become a common operation, by introducing a new cluster RPC (which is also explained in the RFC).
Additonnally, the reason why I chose to not allow empty and non-empty names by default is to:
- prevent someone from mistakenly adding a node without a name to a cluster that has a name (feature ask by marc)
- prevent someone from mistakenly adding a node with a name to a cluster without a name (e.g. a pre-19.2 cluster). This was also a feature ask by marc.
- to ensure that names are consistent across the cluster.
Really we need to prevent a realistic mistake: two or more 'active" nodes started with conflicting cluster names. If you have 2+ nodes with different cluster names, it becomes impossible to reliably activate the cluster name verification throughout the cluster.
In the RFC the disable flag completely disables the check (as I'm suggesting here), but in this implementation it only disables the check if either the old or the new name is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @mberhault)
pkg/base/config.go, line 171 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
In the RFC the disable flag completely disables the check (as I'm suggesting here), but in this implementation it only disables the check if either the old or the new name is empty.
let's discuss on the RFC then.
pkg/cli/flags.go, line 219 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Ooh, if these names are DNS-legal then maybe we can do some nice things with mDNS. I'd remove
.
from the middle block too; we only need one separator character (and then we can translate-
to.
or_
when required by other formats, without ambiguity)
Done.
I don't remember why it's organized this way, and it's probably just how the code evolved instead of having a good reason. |
9c15373
to
ba5e557
Compare
modified as discussed, rfal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 21 of 22 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @mberhault)
pkg/rpc/heartbeat.go, line 66 at r3 (raw file):
disableClusterNameVerification bool, clusterName string, peerName string, ) error { if disableClusterNameVerification {
Nit: Since this argument now just makes the method a no-op, I'd remove the argument and change the call site to if !disableClusterNameVerification { checkClusterName(...) }
pkg/rpc/heartbeat.proto, line 69 at r3 (raw file):
optional string cluster_name = 4 [(gogoproto.nullable) = false]; // Skip cluster name check if either side's name is empty / not configured. optional bool disable_cluster_name_verification = 5 [(gogoproto.nullable) = false];
I think I'd get rid of this - it's a little bit easier to think about things when each node makes its decisions based on its own flag instead of passing this over the network. It's a little less convenient for the "name change" scenario since you have to rollout the disable flag and then change the name in a separate pass, but I think that's OK.
ba5e557
to
d447250
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @mberhault)
pkg/rpc/heartbeat.go, line 66 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Nit: Since this argument now just makes the method a no-op, I'd remove the argument and change the call site to
if !disableClusterNameVerification { checkClusterName(...) }
Done.
pkg/rpc/heartbeat.proto, line 69 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I think I'd get rid of this - it's a little bit easier to think about things when each node makes its decisions based on its own flag instead of passing this over the network. It's a little less convenient for the "name change" scenario since you have to rollout the disable flag and then change the name in a separate pass, but I think that's OK.
I tried that and it does not work (reliably). That's because the originator of the first heartbeat is not always the node you're just joining or restarting, oftentimes another node connects into the new node before that. We want the new node with the --disable flag to be able to make that other-initiated heartbeat succeed.
d447250
to
e52e523
Compare
This patch introduces a "cluster name" checked in addition to the cluster ID (and especially when the cluster ID is not known yet) to ensure that nodes don't join the wrong cluster by mistake. This cluster name is not persisted and only configurable via the command line. Also it cannot be modified as long as one node in the cluster is still running (as this node will be checking connections from any new peer). It can only be modified by stopping all nodes, changing the command-line parameter value, and restarting the nodes. It is also accessible from SQL via `crdb_internal.cluster_name()` (each node will report its local view on the value, which may differ across nodes) and `crdb_internal.gossip_nodes` (where different values on different nodes can be distinguished). Release note (cli change): CockroachDB will now check that the value passed via `--cluster-name`, if provided, matches when a node connects to a cluster. This feature is meant for use in combination with the cluster ID, which is still checked in any case, to protect newly created nodes (that don't have a cluster ID yet) from joining the wrong cluster. To introduce a cluster name into an existing cluster without one, the new flag can be temporarily paired with `--disable-cluster-name-verification`. See the documentation for details.
e52e523
to
0236b25
Compare
TFYRs bors r+ |
39196: rfc: cluster name r=knz a=knz 39270: cli,rpc,server: verify cluster names r=knz a=knz Fixes #16784. Implements #39196. This patch introduces a "cluster name" checked in addition to the cluster ID (and especially when the cluster ID is not known yet) to ensure that nodes don't join the wrong cluster by mistake. This cluster name is not persisted and only configurable via the command line. Also it cannot be modified as long as one node in the cluster is still running (as this node will be checking connections from any new peer). It can only be modified by stopping all nodes, changing the command-line parameter value, and restarting the nodes. It is also accessible from SQL via `crdb_internal.cluster_name()` (each node will report its local view on the value, which may differ across nodes) and `crdb_internal.gossip_nodes` (where different values on different nodes can be distinguished). Release note (cli change): CockroachDB will now check that the value passed via `--cluster-name`, if provided, matches when a node connects to a cluster. This feature is meant for use in combination with the cluster ID, which is still checked in any case, to protect newly created nodes (that don't have a cluster ID yet) from joining the wrong cluster. Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
Build succeeded |
Fixes #16784.
Implements #39196.
This patch introduces a "cluster name" checked in addition to the
cluster ID (and especially when the cluster ID is not known yet) to
ensure that nodes don't join the wrong cluster by mistake.
This cluster name is not persisted and only configurable via the
command line. Also it cannot be modified as long as one node in the
cluster is still running (as this node will be checking connections
from any new peer). It can only be modified by stopping all nodes,
changing the command-line parameter value, and restarting the nodes.
It is also accessible from SQL via
crdb_internal.cluster_name()
(each node will report its local view on the value, which may differ
across nodes) and
crdb_internal.gossip_nodes
(where differentvalues on different nodes can be distinguished).
Release note (cli change): CockroachDB will now check that the value
passed via
--cluster-name
, if provided, matches when a node connectsto a cluster. This feature is meant for use in combination with the
cluster ID, which is still checked in any case, to protect newly
created nodes (that don't have a cluster ID yet) from joining the
wrong cluster.