
cli,rpc,server: verify cluster names #39270

Merged · 1 commit · Aug 6, 2019

Conversation

@knz (Contributor) commented Aug 2, 2019

Fixes #16784.
Implements #39196.

This patch introduces a "cluster name" checked in addition to the
cluster ID (and especially when the cluster ID is not known yet) to
ensure that nodes don't join the wrong cluster by mistake.

This cluster name is not persisted and only configurable via the
command line. Also it cannot be modified as long as one node in the
cluster is still running (as this node will be checking connections
from any new peer). It can only be modified by stopping all nodes,
changing the command-line parameter value, and restarting the nodes.

It is also accessible from SQL via crdb_internal.cluster_name()
(each node will report its local view on the value, which may differ
across nodes) and crdb_internal.gossip_nodes (where different
values on different nodes can be distinguished).

Release note (cli change): CockroachDB will now check that the value
passed via --cluster-name, if provided, matches when a node connects
to a cluster. This feature is meant for use in combination with the
cluster ID, which is still checked in any case, to protect newly
created nodes (that don't have a cluster ID yet) from joining the
wrong cluster.
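
For illustration, the new built-in can be queried from any SQL client; here is a minimal sketch in Go (the lib/pq driver and the connection string are illustrative, not part of this change):

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
    // Illustrative connection string; adjust host, port, and TLS settings.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    var name string
    // Each node reports its local view of the value, which may differ
    // across nodes while a name is being rolled out.
    if err := db.QueryRow("SELECT crdb_internal.cluster_name()").Scan(&name); err != nil {
        log.Fatal(err)
    }
    fmt.Println("cluster name:", name)
}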

@knz knz requested review from bdarnell and mberhault August 2, 2019 17:39
@knz knz requested a review from a team as a code owner August 2, 2019 17:39
@knz knz requested a review from a team August 2, 2019 17:39
@cockroach-teamcity (Member)

This change is Reviewable

@knz (Contributor, Author) commented Aug 2, 2019

@bdarnell I notice that the heartbeat service's Ping() method verifies the cluster ID, version, and maxOffset, and that separately the requestor (rpc.Context.runHeartbeat()) also checks the version.

Was there a specific reason to have the service check the cluster ID and maxOffset and not the requestor?

Why is the version checked on both sides?

@knz (Contributor, Author) commented Aug 2, 2019

@mberhault I have not yet implemented the TLS field validation, as I still need to figure out the flow of data used to verify node-node certs. Once I understand that, I plan to extend the PR with an additional commit to do the TLS validation too.

@mberhault (Contributor) commented:

I think it's fair to keep the use of the cluster name in cert validation for another PR. There's still a question of where it belongs (which field) and even whether said field is customizable.

@knz (Contributor, Author) commented Aug 2, 2019

ok thanks

@knz (Contributor, Author) commented Aug 2, 2019

added the missing tests

@knz knz changed the title from "cli,rpc,server: verify cluster names before the cluster ID is known" to "cli,rpc,server: verify cluster names" Aug 2, 2019
@mberhault (Contributor) left a comment:

Nice! This will definitely need some new docs, though the stricter requirements (and the difficulty of changing the name) will come with more uses. As is, it only impacts new nodes, which is quite nice.

pkg/cli/flags.go Outdated
// clusterNameRe matches valid cluster names.
// For example, "a", "a1.23" and "a-b" are OK,
// but "0123", "a-" and "123a" are not OK.
var clusterNameRe = regexp.MustCompile(`^[_a-zA-Z](?:[-._a-zA-Z0-9]*[_a-zA-Z0-9]|)$`)
@mberhault (Contributor) commented Aug 2, 2019:

I don't have a good reason for it, just something fairly common: maybe keep _ out of the first character group?

I'm also wondering if we will eventually have the cluster name show up in DNS. If so, that would remove _ entirely. I can't currently think of a reason we would though.

@knz (Contributor, Author) replied:

Done.
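
As an aside, here is a runnable sketch of what the pattern above accepts and rejects, using the examples from its comment (this is the original pattern, before `_` and `.` were dropped):

package main

import (
    "fmt"
    "regexp"
)

// The pattern as originally proposed in pkg/cli/flags.go.
var clusterNameRe = regexp.MustCompile(`^[_a-zA-Z](?:[-._a-zA-Z0-9]*[_a-zA-Z0-9]|)$`)

func main() {
    for _, name := range []string{"a", "a1.23", "a-b", "0123", "a-", "123a"} {
        fmt.Printf("%q valid: %v\n", name, clusterNameRe.MatchString(name))
    }
    // The first three names match; the last three do not.
}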

Description: `
Sets a name to verify the identity of a remote node or cluster, when
either the node or cluster, or both, have not yet been initialized
yet.`,
A reviewer (Contributor) commented:

Could you mention that all nodes must have the same cluster name, if set?

@knz (Contributor, Author) replied:

Done.

wg.Wait()
})

t.Run("different names", func(t *testing.T) {
A reviewer (Contributor) commented:

It should work as expected from looking at the code, but could you add a test for one side with a cluster name, the other without?

@knz (Contributor, Author) replied:

Done.

func checkClusterName(clusterName string, peerName string) error {
if clusterName != peerName {
err := errors.Errorf(
"local cluster name %q does not match peer cluster name %q", clusterName, peerName)
A reviewer (Contributor) commented:

I'm wondering if we want clearer error messages for the case where the peer cluster name is %q but the local cluster name is not set (and vice versa). The quoted name should be sufficient, though, so up to you.

@knz (Contributor, Author) replied:

Done.
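
A hypothetical sketch of the kind of message differentiation suggested above; the exact wording the PR settled on may differ:

package main

import (
    "fmt"

    "github.com/pkg/errors" // stand-in for the errors package used in this file
)

// A variant of checkClusterName with distinct messages for the
// "one side unset" cases.
func checkClusterName(clusterName, peerName string) error {
    if clusterName == peerName {
        return nil
    }
    switch {
    case clusterName == "":
        return errors.Errorf("peer has cluster name %q, but no local cluster name is set", peerName)
    case peerName == "":
        return errors.Errorf("local cluster name is %q, but peer has no cluster name set", clusterName)
    default:
        return errors.Errorf("local cluster name %q does not match peer cluster name %q",
            clusterName, peerName)
    }
}

func main() {
    fmt.Println(checkClusterName("foo", ""))
    fmt.Println(checkClusterName("", "bar"))
    fmt.Println(checkClusterName("foo", "bar"))
}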

@knz knz added the docs-todo label Aug 2, 2019
@knz knz requested review from a team August 2, 2019 18:30
@knz (Contributor, Author) commented Aug 2, 2019

As is, it only impacts new nodes which is quite nice.

No, that's not accurate. I don't think the code as-is can introduce a cluster name into a cluster that has been started without one. Should I update the code to make it possible?

@mberhault (Contributor) left a comment:

As is, it only impacts new nodes which is quite nice.

No, that's not accurate. I don't think the code as-is can introduce a cluster name into a cluster that has been started without one. Should I update the code to make it possible?

Sorry, that's what I meant.
With the current use, mismatches in --cluster-name only matter when attempting to add a node. All mismatches (empty vs not, different strings) fail the node join. As long as no nodes are being added, mismatches do not cause an issue so rolling restarts with different settings are ok. This situation will either change or need special logic once certificate checking is in place.

Sets a name to verify the identity of a remote node or cluster. The
value must match between this node and the remote node(s) specified
via --join.
This can be used as an additional verification when either either the
A reviewer (Contributor) commented:

"either either"

@knz (Contributor, Author) replied:

Fixed

@knz (Contributor, Author) commented Aug 2, 2019

I have changed the logic to support rolling upgrades via a temporary flag --disable-peer-name-verification. I also have updated the RFC to explain why it exists and how it works.

@@ -56,6 +60,31 @@ type HeartbeatService struct {
testingAllowNamedRPCToAnonymousServer bool
}

func checkClusterName(disablePeerNameVerification bool, clusterName string, peerName string) error {
if disablePeerNameVerification && (clusterName == "" || peerName == "") {
A reviewer (Contributor) commented:

Don't you want to OR the local and remote disablePeerNameVerification?
I'm thinking of the following situation:

  • node A has flags with --cluster-name=foo and --disable-peer-name-verification
  • node B has not had the new flags applied yet

node A is restarted as part of the name rollout.
node B restarts on its own, but still has not had the new flags applied.

At this point, node B will ping node A, but will fail due to disablePeerNameVerification = false. This will go into the clusterName != peerName check and fail.

@knz (Contributor, Author) replied:

The OR is there, performed by the caller.
It's in the RFC and also checked in the unit test in the same package.
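
A self-contained toy of that caller-side OR, mirroring the check from the diff above (variable names and wiring are assumed, not the actual rpc package code):

package main

import "fmt"

// Mirrors the checkClusterName signature from the diff: the disable flag
// only waives the check when one side has no name set.
func checkClusterName(disablePeerNameVerification bool, clusterName, peerName string) error {
    if disablePeerNameVerification && (clusterName == "" || peerName == "") {
        return nil
    }
    if clusterName != peerName {
        return fmt.Errorf("local cluster name %q does not match peer cluster name %q",
            clusterName, peerName)
    }
    return nil
}

func main() {
    // Node A: --cluster-name=foo --disable-peer-name-verification.
    // Node B: not yet restarted with the new flags.
    localDisable, localName := true, "foo"
    remoteDisable, remoteName := false, ""

    // The caller ORs its own flag with the one carried in the request,
    // so a ping initiated by either side succeeds during the rollout.
    err := checkClusterName(localDisable || remoteDisable, localName, remoteName)
    fmt.Println("mixed-rollout ping error:", err) // <nil>
}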

A reviewer (Contributor) replied:

oops. My bad. I had it in my head that checkClusterName's signature was (localDisable, remoteDisable, localName, remoteName). Sorry about that.

@knz (Contributor, Author) replied:

np. so good to go?

A reviewer (Contributor) replied:

I'm good, but I'd like @bdarnell to take a look.

@bdarnell (Contributor) left a comment:

Reviewed 13 of 21 files at r1, 8 of 8 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @mberhault)


pkg/base/config.go, line 171 at r2 (raw file):

	// both sides match. This is meant for use while rolling an
	// existing cluster into using a new cluster name.
	DisablePeerNameVerification bool

s/Peer/Cluster/

So this flag permits combining empty and non-empty, but non-empty values must match? This does make it possible to change from one name to another, but it's a very awkward process:

  1. Restart all nodes with disable cluster name verification flag (and the old name)
  2. Restart all nodes with disable flag and no name
  3. Restart all nodes with disable flag and new name
  4. Restart all nodes with new name without disable flag

I think this flag should disable the checking completely so it can be used to go directly from the old name to the new one. (which still leaves this a 3-step process, but I think it's more understandable without the "no name" step in the middle)

Maybe we should just permit a mix of empty and non-empty cluster names at all times, so that cluster names only prevent mixups when both clusters use them. I'm not sure it's worth having extra configuration to disable checks here.


pkg/cli/flags.go, line 219 at r2 (raw file):

// For example, "a", "a1.23" and "a-b" are OK,
// but "0123", "a-" and "123a" are not OK.
var clusterNameRe = regexp.MustCompile(`^[a-zA-Z](?:[-.a-zA-Z0-9]*[a-zA-Z0-9]|)$`)

Ooh, if these names are DNS-legal then maybe we can do some nice things with mDNS. I'd remove . from the middle block too; we only need one separator character (and then we can translate - to . or _ when required by other formats, without ambiguity)

@knz (Contributor, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @mberhault)


pkg/base/config.go, line 171 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/Peer/Cluster/

So this flag permits combining empty and non-empty, but non-empty values must match? This does make it possible to change from one name to another, but it's a very awkward process:

  1. Restart all nodes with disable cluster name verification flag (and the old name)
  2. Restart all nodes with disable flag and no name
  3. Restart all nodes with disable flag and new name
  4. Restart all nodes with new name without disable flag

I think this flag should disable the checking completely so it can be used to go directly from the old name to the new one. (which still leaves this a 3-step process, but I think it's more understandable without the "no name" step in the middle)

Maybe we should just permit a mix of empty and non-empty cluster names at all times, so that cluster names only prevent mixups when both clusters use them. I'm not sure it's worth having extra configuration to disable checks here.

The name change is documented in the RFC, and 1) there's one less step needed than you say here; 2) this can later be automated, should we want this to become a common operation, by introducing a new cluster RPC (which is also explained in the RFC).

Additionally, the reason I chose not to allow mixing empty and non-empty names by default is to:

  • prevent someone from mistakenly adding a node without a name to a cluster that has a name (feature ask by marc)
  • prevent someone from mistakenly adding a node with a name to a cluster without a name (e.g. a pre-19.2 cluster). This was also a feature ask by marc.
  • ensure that names are consistent across the cluster.

Really, we need to prevent a realistic mistake: two or more "active" nodes started with conflicting cluster names. If you have 2+ nodes with different cluster names, it becomes impossible to reliably activate cluster name verification throughout the cluster.

@bdarnell (Contributor) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @mberhault)


pkg/base/config.go, line 171 at r2 (raw file):

Previously, knz (kena) wrote…

The name change is documented in the RFC, and 1) there's one less step needed than you say here; 2) this can later be automated, should we want this to become a common operation, by introducing a new cluster RPC (which is also explained in the RFC).

Additionally, the reason I chose not to allow mixing empty and non-empty names by default is to:

  • prevent someone from mistakenly adding a node without a name to a cluster that has a name (feature ask by marc)
  • prevent someone from mistakenly adding a node with a name to a cluster without a name (e.g. a pre-19.2 cluster). This was also a feature ask by marc.
  • ensure that names are consistent across the cluster.

Really, we need to prevent a realistic mistake: two or more "active" nodes started with conflicting cluster names. If you have 2+ nodes with different cluster names, it becomes impossible to reliably activate cluster name verification throughout the cluster.

In the RFC the disable flag completely disables the check (as I'm suggesting here), but in this implementation it only disables the check if either the old or the new name is empty.

@knz (Contributor, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @mberhault)


pkg/base/config.go, line 171 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

In the RFC the disable flag completely disables the check (as I'm suggesting here), but in this implementation it only disables the check if either the old or the new name is empty.

let's discuss on the RFC then.


pkg/cli/flags.go, line 219 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Ooh, if these names are DNS-legal then maybe we can do some nice things with mDNS. I'd remove . from the middle block too; we only need one separator character (and then we can translate - to . or _ when required by other formats, without ambiguity)

Done.

@knz knz mentioned this pull request Aug 5, 2019
@bdarnell (Contributor) commented Aug 5, 2019

@bdarnell I notice that the heartbeat service's Ping() method verifies the cluster ID, version, and maxOffset, and that separately the requestor (rpc.Context.runHeartbeat()) also checks the version.

Was there a specific reason to have the service check the cluster ID and maxOffset and not the requestor?

Why is the version checked on both sides?

I don't remember why it's organized this way, and it's probably just how the code evolved instead of having a good reason.

@knz (Contributor, Author) commented Aug 5, 2019

Modified as discussed, RFAL (ready for another look).

@bdarnell (Contributor) left a comment:

Reviewed 21 of 22 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, and @mberhault)


pkg/rpc/heartbeat.go, line 66 at r3 (raw file):

	disableClusterNameVerification bool, clusterName string, peerName string,
) error {
	if disableClusterNameVerification {

Nit: Since this argument now just makes the method a no-op, I'd remove the argument and change the call site to if !disableClusterNameVerification { checkClusterName(...) }
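
A sketch of the suggested shape; the receiver and field names here are assumed:

package main

import "errors"

// After the refactor, checkClusterName no longer takes the flag...
func checkClusterName(clusterName, peerName string) error {
    if clusterName != peerName {
        return errors.New("cluster name mismatch")
    }
    return nil
}

// ...and the Ping handler consults it once at the call site.
func ping(disableClusterNameVerification bool, localName, peerName string) error {
    if !disableClusterNameVerification {
        if err := checkClusterName(localName, peerName); err != nil {
            return err
        }
    }
    return nil // rest of the heartbeat handling elided
}

func main() {
    _ = ping(true, "foo", "") // verification disabled: no error
}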


pkg/rpc/heartbeat.proto, line 69 at r3 (raw file):

  optional string cluster_name = 4 [(gogoproto.nullable) = false];
  // Skip cluster name check if either side's name is empty / not configured.
  optional bool disable_cluster_name_verification = 5 [(gogoproto.nullable) = false];

I think I'd get rid of this - it's a little bit easier to think about things when each node makes its decisions based on its own flag instead of passing this over the network. It's a little less convenient for the "name change" scenario since you have to rollout the disable flag and then change the name in a separate pass, but I think that's OK.

@knz (Contributor, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @mberhault)


pkg/rpc/heartbeat.go, line 66 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Nit: Since this argument now just makes the method a no-op, I'd remove the argument and change the call site to if !disableClusterNameVerification { checkClusterName(...) }

Done.


pkg/rpc/heartbeat.proto, line 69 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think I'd get rid of this - it's a little bit easier to think about things when each node makes its decisions based on its own flag instead of passing this over the network. It's a little less convenient for the "name change" scenario since you have to rollout the disable flag and then change the name in a separate pass, but I think that's OK.

I tried that and it does not work reliably. That's because the originator of the first heartbeat is not always the node you're joining or restarting; oftentimes another node connects to the new node before that. We want the new node with the --disable flag to be able to make that other-initiated heartbeat succeed.

This patch introduces a "cluster name" checked in addition to the
cluster ID (and especially when the cluster ID is not known yet) to
ensure that nodes don't join the wrong cluster by mistake.

This cluster name is not persisted and only configurable via the
command line. Also it cannot be modified as long as one node in the
cluster is still running (as this node will be checking connections
from any new peer). It can only be modified by stopping all nodes,
changing the command-line parameter value, and restarting the nodes.

It is also accessible from SQL via `crdb_internal.cluster_name()`
(each node will report its local view on the value, which may differ
across nodes) and `crdb_internal.gossip_nodes` (where different
values on different nodes can be distinguished).

Release note (cli change): CockroachDB will now check that the value
passed via `--cluster-name`, if provided, matches when a node connects
to a cluster. This feature is meant for use in combination with the
cluster ID, which is still checked in any case, to protect newly
created nodes (that don't have a cluster ID yet) from joining the
wrong cluster. To introduce a cluster name into an existing cluster
without one, the new flag can be temporarily paired with
`--disable-cluster-name-verification`. See the documentation for details.
@knz (Contributor, Author) commented Aug 6, 2019

TFYRs

bors r+

craig bot pushed a commit that referenced this pull request Aug 6, 2019
39196: rfc: cluster name r=knz a=knz



39270: cli,rpc,server: verify cluster names r=knz a=knz

Fixes #16784.
Implements #39196.


Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
@craig (bot) commented Aug 6, 2019

Build succeeded

@craig craig bot merged commit 0236b25 into cockroachdb:master Aug 6, 2019
@knz knz deleted the 20190802-cluster-name branch August 6, 2019 16:43

Successfully merging this pull request may close these issues.

core: user-provided cluster name.