Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc,security: use the tenant client cert for pod-pod communication #71248

Merged
merged 1 commit into from
Oct 11, 2021

Conversation

knz
Copy link
Contributor

@knz knz commented Oct 6, 2021

Fixes #71106
Alternative design to #71190
Epic: SEC-665

As of this patch, we have the following file usage:

  • KV nodes on host cluster:

    • ui.crt (optional):
      • used as server cert for HTTP
    • ui-ca.crt (optional):
      • used in unit tests to verify the server's identity for HTTP conns
    • node.crt:
      • used as client cert for node-to-node comms
      • used as server cert for node-to-node comms
      • used as server cert for SQL clients
      • used as server cert for incoming conns from SQL tenant servers
      • used as server cert for HTTP, if ui.crt doesn't exist
    • tenant-client-ca.crt (optional):
      • used to verify client certs for SQL tenant servers
    • client-ca.crt (optional);
      • used to verify client certs for SQL clients
      • used to verify client certs for SQL tenant servers, if tenant-client-ca.crt doesn't exist
    • ca.crt:
      • used to verify other node client certs for node-to-node comms
      • used in unit tests to verify the server's identity for SQL and RPC conns
      • used to verify client certs for SQL clients, if client-ca.crt doesn't exist
      • used to verify client certs for SQL tenant servers, if neither tenant-client.ca.crt nor client-ca.crt exist
  • SQL servers:

    • ui.crt (optional):
      • used as server cert for HTTP
    • ui-ca.crt (optional):
      • used in unit tests to verify the server's identity for HTTP conns
    • client-tenant.NN.crt:
      • used as client cert for node-to-node comms (SQL server to SQL server)
      • used as server cert for node-to-node comms (SQL server to SQL server)
      • used as client cert for conns to KV nodes
      • used as server cert for SQL clients
      • used as server cert for HTTP, if ui.crt doesn't exist
    • tenant-client-ca.crt (optional):
      • used to verify client certs for SQL tenant servers
    • client-ca.crt (optional);
      • used to verify client certs for SQL clients
      • used to verify client certs for SQL tenant servers, if tenant-client-ca.crt doesn't exist
    • ca.crt:
      • used to verify other SQL server certs for node-to-node comms, if tenant-client-ca.crt doens't exist
      • used to verify client certs for SQL clients, if client-ca.crt doesn't exist
      • used to verify client certs for SQL tenant servers, if neither tenant-client.ca.crt nor client-ca.crt exist
      • used in unit tests to verify the server's identity for SQL and RPC conns

Release note (security update): Multitenant SQL servers now reuse
the tenant client certificate (client-tenant.NN.crt) for SQL-to-SQL
communication. Existing deployments must regenerate the certificates
with dual purpose (client and server authentication).

@knz knz requested a review from a team as a code owner October 6, 2021 22:32
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@catj-cockroach catj-cockroach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass looks good to me!

I have a few suggestions:

  • let's rename every instance of TenantClient to Tenant
  • let's assume that the CA used for tenants is the same as the CA used for the SQL layer and remove support for the TenantCACertificate. I'd prefer we add it back if we need it rather than having a distinct CA for every potential layer of the infrastructure.

We then end up with:

  • ca.crt - used for infrastructure
  • client-ca.crt - used for SQL clients

Reviewed 29 of 31 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @catj-cockroach and @knz)


pkg/rpc/auth.go, line 150 at r1 (raw file):

		// Is this a connection from another SQL tenant server?
		subj := tlsInfo.State.PeerCertificates[0].Subject
		if !security.Contains(subj.OrganizationalUnit, security.TenantsOU) {

Could we throw this logic and the above logic into an exported function in the security package?

For example:

// IsTenantCertificate returns true if the passed certificate indicates an
// inbound Tenant connection.
func IsTenantCertificate(cert *x509.Certificate) bool {
		return Contains(cert.Subject.OrganizationalUnit, TenantsOU)
}

pkg/security/certificate_manager.go, line 778 at r1 (raw file):

	var nodeCert *CertInfo
	if cm.tenantClientCert == nil {

Could I suggest that we do a search and replace for all instances of "tenantClient", replacing it with "tenant"?


pkg/security/certificate_manager.go, line 873 at r1 (raw file):

	if cm.clientCACert == nil {
		// No client CA: use general CA.
		if cm.tenantClientCert == nil {

I am not 100% certain what's going on here so I'll layout what my assumption is:

This function checks to see if the CA certificate exists in the CertificateManager.

  • If it does not, it checks to see if the Tenant Client Certificate exists.
  • If the Tenant Certificate has not been loaded into the certificate manager (it points to nil), it returns the global CA certificate.
  • If the Tenant Certificate has been loaded (is not nil), it returns the Tenant CA certificate.
  • If the Client CA certificate does exist, we verify that it is valid. If it is valid, we return it. If it is not valid we return an error.

Is this correct?


pkg/security/certificate_manager.go, line 963 at r1 (raw file):

// certificates. Use the tenant client CA if it exists, otherwise fall back to
// client CA. cm.mu must be held.
func (cm *CertificateManager) getTenantClientCACertLocked() (*CertInfo, error) {

Given that the Tenant Client Certificate is no longer just used as a client, I'd think it would make sense to rename this to getTenantCACertLocked()


pkg/security/certificate_manager.go, line 976 at r1 (raw file):

// getTenantClientCertLocked returns the tenant node cert.
// cm.mu must be held.
func (cm *CertificateManager) getTenantClientCertLocked() (*CertInfo, error) {

Likewise with above, could we rename this to getTenantCertLocked()


pkg/security/certificate_manager.go, line 986 at r1 (raw file):

// GetTenantClientTLSConfig returns the most up-to-date tenant client
// tls.Config.
func (cm *CertificateManager) GetTenantClientTLSConfig() (*tls.Config, error) {

Is this still intended for outbound connections as a client?


pkg/security/certificate_manager.go, line 1018 at r1 (raw file):

// GetClientTLSConfig returns the most up-to-date client tls.Config.
// Returns the dual-purpose node certs if user == NodeUser and there is no

Should this say dual-purpose node certificates (or tenant certificates)?


pkg/security/certs.go, line 488 at r1 (raw file):

	}

	// Load the tenant client CA cert info. Note that this falls back to the regular client CA which in turn falls

What do we intend on using as ca.crt in the SQL servers? Is it the tenant CA certificate?

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's assume that the CA used for tenants is the same as the CA used for the SQL layer and remove support for the TenantCACertificate.

I've tried that and there is waay too much code changes needed. I think I want to de-risk this and avoid boiling the ocean so close to the release.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @catj-cockroach)


pkg/rpc/auth.go, line 150 at r1 (raw file):

Could we throw this logic and the above logic into an exported function in the security package?

Probably we can yes. Do you want me to do it too?


pkg/security/certificate_manager.go, line 778 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

Could I suggest that we do a search and replace for all instances of "tenantClient", replacing it with "tenant"?

Yes you could suggest that. Is this also what you want to happen?


pkg/security/certificate_manager.go, line 873 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

I am not 100% certain what's going on here so I'll layout what my assumption is:

This function checks to see if the CA certificate exists in the CertificateManager.

  • If it does not, it checks to see if the Tenant Client Certificate exists.
  • If the Tenant Certificate has not been loaded into the certificate manager (it points to nil), it returns the global CA certificate.
  • If the Tenant Certificate has been loaded (is not nil), it returns the Tenant CA certificate.
  • If the Client CA certificate does exist, we verify that it is valid. If it is valid, we return it. If it is not valid we return an error.

Is this correct?

that sounds about right with the current code. I'm not sure that the code we should have there though.


pkg/security/certificate_manager.go, line 963 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

Given that the Tenant Client Certificate is no longer just used as a client, I'd think it would make sense to rename this to getTenantCACertLocked()

Done.


pkg/security/certificate_manager.go, line 976 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

Likewise with above, could we rename this to getTenantCertLocked()

probably we could yes. if you want it to happen, please spell it out.


pkg/security/certificate_manager.go, line 986 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

Is this still intended for outbound connections as a client?

it's used for outbound connections from tenant to KV node.


pkg/security/certificate_manager.go, line 1018 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

Should this say dual-purpose node certificates (or tenant certificates)?

I don't know?
If you have a specific idea, please spell it out.


pkg/security/certs.go, line 488 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

What do we intend on using as ca.crt in the SQL servers? Is it the tenant CA certificate?

I don't know - what do you think?
If you need this to change in any particular way, please spell it out.

@knz knz requested a review from a team as a code owner October 7, 2021 17:00
@knz
Copy link
Contributor Author

knz commented Oct 7, 2021

RFAL

I also updated the commit message to indicate how the various files are used, based on (my understanding of) my reverse engineering of the code.

Copy link
Contributor

@catj-cockroach catj-cockroach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


-- commits, line 18 at r2:
Can we reword this to:

- used to verify certificates from SQL tenant servers connecting as clients

-- commits, line 40 at r2:
Can we reword this to:

used to verify certs from other SQL tenant servers

pkg/rpc/auth.go, line 150 at r1 (raw file):

Previously, knz (kena) wrote…

Could we throw this logic and the above logic into an exported function in the security package?

Probably we can yes. Do you want me to do it too?

Yes please!


pkg/security/certificate_manager.go, line 778 at r1 (raw file):

Previously, knz (kena) wrote…

Yes you could suggest that. Is this also what you want to happen?

Yes please!


pkg/security/certificate_manager.go, line 873 at r1 (raw file):

Previously, knz (kena) wrote…

that sounds about right with the current code. I'm not sure that the code we should have there though.

I'd argue for this specific function, let's return client-ca.crt or ca.crt if clientCACert doesn't exist. Let's treat Tenant Client CA as only used for tenant certificates.

What if we have CertificateManager have a IsTenant function? We could have it return true if we have a tenant certificate (and if we want to share the tenant ID with the certificate manager, we could also require that be set as well), and then use that to determine what certificate to return? (This is part of why I suggested we remove the TenantClientCACert).


pkg/security/certificate_manager.go, line 894 at r1 (raw file):

			return cm.getCACertLocked()
		}
		return cm.getTenantClientCACertLocked()

To simplify things, let's only use ui-ca.crt, client-ca.crt, or ca.crt for the UI everywhere.

Could you remove references to the Tenant CA from the UI related functions in this file?


pkg/security/certificate_manager.go, line 921 at r1 (raw file):

			return cm.getNodeCertLocked()
		}
		return cm.getTenantClientCertLocked()

And here!


pkg/security/certificate_manager.go, line 976 at r1 (raw file):

Previously, knz (kena) wrote…

probably we could yes. if you want it to happen, please spell it out.

OK! Let's rename it to getTenantCertLocked() and add a function that calls it directly called getTenantClientCertLocked.


pkg/security/certificate_manager.go, line 986 at r1 (raw file):

Previously, knz (kena) wrote…

it's used for outbound connections from tenant to KV node.

Are we also using it for Tenant-to-Tenant communication? If so, let's rename this to GetTenantTLSConfig(), if not, let's rename it to GetTenantToKVTLSConfig()


pkg/security/certificate_manager.go, line 1018 at r1 (raw file):

Previously, knz (kena) wrote…

I don't know?
If you have a specific idea, please spell it out.

I was looking at this locally and I think we should split this function into two functions.

func (cm *CertificateManager) getUserTLSConfig(user SQLUsername) (*tls.Config, error) {
	cm.mu.Lock()
	defer cm.mu.Unlock()

	// We always need the CA cert.
	var ca *CertInfo
	var err error

	ca, err = cm.getCACertLocked()
	if err != nil {
		return nil, err
	}
	
	clientCert, err := cm.getClientCertLocked(user)
	if err != nil {
		return nil, err
	}

	cfg, err := newClientTLSConfig(
		cm.tlsSettings,
		clientCert.FileContents,
		clientCert.KeyFileContents,
		ca.FileContents)
	if err != nil {
		return nil, err
	}

	return cfg, nil
}

// exported so we can refactor code that always wants the NodeTLSConfig directly.
func (cm *CertificateManager) GetNodeTLSConfig() (*tls.Config, error) {
	cm.mu.Lock()
	defer cm.mu.Unlock()

	// We always need the CA cert.
	ca, err = cm.getCACertLocked()
	if err != nil {
		return nil, err
	}

	// We're the node user:
	// Return the cached config if we have one.
	if cm.clientConfig != nil {
		return cm.clientConfig, nil
	}

	clientCert, err := cm.getNodeClientCertLocked()
	if err != nil {
		return nil, err
	}

	cfg, err := newClientTLSConfig(
		cm.tlsSettings,
		clientCert.FileContents,
		clientCert.KeyFileContents,
		ca.FileContents)
	if err != nil {
		return nil, err
	}

	// Cache the config.
	cm.clientConfig = cfg
	return cfg, nil

}

// GetClientTLSConfig returns the most up-to-date client tls.Config.
// Returns the dual-purpose node certs if user == NodeUser and there is no
// separate client cert for 'node'.
func (cm *CertificateManager) GetClientTLSConfig(user SQLUsername) (*tls.Config, error) {
	if !user.IsNodeUser() {
		return cm.getUserTLSConfig(user)
	}

	return cm.GetNodeTLSConfig()
}

pkg/security/certificate_manager.go, line 954 at r2 (raw file):

			return cm.getNodeCertLocked()
		}
		return cm.getTenantClientCertLocked()

We're only using Node to refer to KV layer/non-multi-tenant nodes, could we remove the SQL tenant logic from this function? Below I've a request to rename and alias the existing getTenantClientCertLocked(), and I'd prefer we explicitly use that function rather than having this function try to return the correct certificate.


pkg/security/certs.go, line 488 at r1 (raw file):

Previously, knz (kena) wrote…

I don't know - what do you think?
If you need this to change in any particular way, please spell it out.

Oh, sorry I meant this as a "how do we intend on this working" question

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've processed your review comments, but still working to fix a few thigns. I'll give a signal when this is ready to review again.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @catj-cockroach and @chrisseto)


-- commits, line 18 at r2:

Previously, catj-cockroach (Cat J) wrote…

Can we reword this to:

- used to verify certificates from SQL tenant servers connecting as clients

Done.


-- commits, line 40 at r2:

Previously, catj-cockroach (Cat J) wrote…

Can we reword this to:

used to verify certs from other SQL tenant servers

Done.


pkg/rpc/auth.go, line 150 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

Yes please!

Done.


pkg/security/certificate_manager.go, line 778 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

Yes please!

Done.


pkg/security/certificate_manager.go, line 873 at r1 (raw file):

What if we have CertificateManager have a IsTenant function?

Done.


pkg/security/certificate_manager.go, line 894 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

To simplify things, let's only use ui-ca.crt, client-ca.crt, or ca.crt for the UI everywhere.

Could you remove references to the Tenant CA from the UI related functions in this file?

@chrisseto suggests this wouldn't be smart, because our deployment code does not yet provision separate certs/CA for the HTTP service.


pkg/security/certificate_manager.go, line 921 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

And here!

ditto above


pkg/security/certificate_manager.go, line 976 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

OK! Let's rename it to getTenantCertLocked() and add a function that calls it directly called getTenantClientCertLocked.

Done.


pkg/security/certificate_manager.go, line 986 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

Are we also using it for Tenant-to-Tenant communication? If so, let's rename this to GetTenantTLSConfig(), if not, let's rename it to GetTenantToKVTLSConfig()

Done.


pkg/security/certificate_manager.go, line 1018 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

I was looking at this locally and I think we should split this function into two functions.

func (cm *CertificateManager) getUserTLSConfig(user SQLUsername) (*tls.Config, error) {
	cm.mu.Lock()
	defer cm.mu.Unlock()

	// We always need the CA cert.
	var ca *CertInfo
	var err error

	ca, err = cm.getCACertLocked()
	if err != nil {
		return nil, err
	}
	
	clientCert, err := cm.getClientCertLocked(user)
	if err != nil {
		return nil, err
	}

	cfg, err := newClientTLSConfig(
		cm.tlsSettings,
		clientCert.FileContents,
		clientCert.KeyFileContents,
		ca.FileContents)
	if err != nil {
		return nil, err
	}

	return cfg, nil
}

// exported so we can refactor code that always wants the NodeTLSConfig directly.
func (cm *CertificateManager) GetNodeTLSConfig() (*tls.Config, error) {
	cm.mu.Lock()
	defer cm.mu.Unlock()

	// We always need the CA cert.
	ca, err = cm.getCACertLocked()
	if err != nil {
		return nil, err
	}

	// We're the node user:
	// Return the cached config if we have one.
	if cm.clientConfig != nil {
		return cm.clientConfig, nil
	}

	clientCert, err := cm.getNodeClientCertLocked()
	if err != nil {
		return nil, err
	}

	cfg, err := newClientTLSConfig(
		cm.tlsSettings,
		clientCert.FileContents,
		clientCert.KeyFileContents,
		ca.FileContents)
	if err != nil {
		return nil, err
	}

	// Cache the config.
	cm.clientConfig = cfg
	return cfg, nil

}

// GetClientTLSConfig returns the most up-to-date client tls.Config.
// Returns the dual-purpose node certs if user == NodeUser and there is no
// separate client cert for 'node'.
func (cm *CertificateManager) GetClientTLSConfig(user SQLUsername) (*tls.Config, error) {
	if !user.IsNodeUser() {
		return cm.getUserTLSConfig(user)
	}

	return cm.GetNodeTLSConfig()
}

That's a slightly too ambitious refactor for this time. I'll leave a TODO.


pkg/security/certificate_manager.go, line 954 at r2 (raw file):

Previously, catj-cockroach (Cat J) wrote…

We're only using Node to refer to KV layer/non-multi-tenant nodes, could we remove the SQL tenant logic from this function? Below I've a request to rename and alias the existing getTenantClientCertLocked(), and I'd prefer we explicitly use that function rather than having this function try to return the correct certificate.

Done.


pkg/security/certs.go, line 488 at r1 (raw file):

Previously, catj-cockroach (Cat J) wrote…

Oh, sorry I meant this as a "how do we intend on this working" question

I think the commit message describes what the current behavior is. I don't know what a good endpoint should be, but then it's not exactly in-scope for this PR.

@knz
Copy link
Contributor Author

knz commented Oct 7, 2021

Ok this is ready for another look. There was a major "oops" in rpc/auth.go in my previous implementation.

Copy link
Contributor

@catj-cockroach catj-cockroach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 15 of 17 files at r3, 6 of 7 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/security/certificate_manager.go, line 611 at r4 (raw file):

		case NodePem:
			// Don't even bother loading the node cert if we are not in the
			// host cluster.

Oooo good call!


pkg/security/certificate_manager.go, line 938 at r5 (raw file):

// cm.mu must be held.
//
// TODO(catj): split the logic here into two functions.

Thank you! :)


pkg/security/certificate_manager.go, line 957 at r5 (raw file):

		// but only if we are in the host cluster.
		if cm.IsForTenant() {
			return nil, errors.New("no node client cert for a SQL server")

Oh this is much clearer, thank you!


pkg/security/certificate_manager.go, line 1116 at r5 (raw file):

	caBlob := uiCA.FileContents

	// If it's available, we also include the tenant CA.

Out of curiosity, why do we include the tenant CA as well? Is it signing anything related to the UI?


pkg/security/certs.go, line 488 at r1 (raw file):

Previously, knz (kena) wrote…

I think the commit message describes what the current behavior is. I don't know what a good endpoint should be, but then it's not exactly in-scope for this PR.

kk!


pkg/security/tls.go, line 44 at r5 (raw file):

// Embedded certificates specific to multi-tenancy testing.
const (
	EmbeddedTenantCACert = "ca-client-tenant.crt" // CA for client connections

This doesn't need to happen now but we should rename this to ca-tenant.crt as well!

Copy link
Contributor

@catj-cockroach catj-cockroach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @catj-cockroach)


pkg/security/certificate_manager.go, line 1116 at r5 (raw file):

Previously, catj-cockroach (Cat J) wrote…

Out of curiosity, why do we include the tenant CA as well? Is it signing anything related to the UI?

In unit tests, we have the HTTP client validate the server's identity.
Since the server may be using the tenant cert, we need it to know about the tenant CA too.


pkg/security/tls.go, line 44 at r5 (raw file):

Previously, catj-cockroach (Cat J) wrote…

This doesn't need to happen now but we should rename this to ca-tenant.crt as well!

I fear there are hardcoded file paths in other places. Let's not boil the ocean.

@chrisseto
Copy link
Contributor

@knz I've tried out this build in a test cluster and it seems that I'm getting panics due to the lack of node certificates. I wonder if the server path is lacking unit tests or if using the same certificate path has hidden this issue?

It seems that the issue stems from this call to rpc.NewServer

E211008 16:39:27.830694 1 1@util/log/logcrash/crash_reporting.go:140  [-] 5 +  | github.com/cockroachdb/cockroach/pkg/cli.Main
E211008 16:39:27.830694 1 1@util/log/logcrash/crash_reporting.go:140  [-] 5 +  |     /go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:59
E211008 16:39:27.830694 1 1@util/log/logcrash/crash_reporting.go:140  [-] 5 +  | main.main
E211008 16:39:27.830694 1 1@util/log/logcrash/crash_reporting.go:140  [-] 5 +  |     /go/src/github.com/cockroachdb/cockroach/pkg/cmd/cockroach/main.go:26
E211008 16:39:27.830694 1 1@util/log/logcrash/crash_reporting.go:140  [-] 5 +  | runtime.main
E211008 16:39:27.830694 1 1@util/log/logcrash/crash_reporting.go:140  [-] 5 +  |     /usr/local/go/src/runtime/proc.go:225
E211008 16:39:27.830694 1 1@util/log/logcrash/crash_reporting.go:140  [-] 5 +  | runtime.goexit
E211008 16:39:27.830694 1 1@util/log/logcrash/crash_reporting.go:140  [-] 5 +  |     /usr/local/go/src/runtime/asm_amd64.s:1371
E211008 16:39:27.830694 1 1@util/log/logcrash/crash_reporting.go:140  [-] 5 +Wraps: (2) problem with node certificate: not found
E211008 16:39:27.830694 1 1@util/log/logcrash/crash_reporting.go:140  [-] 5 +Error types: (1) *withstack.withStack (2) *security.Error
fluent:tcp://172.19.0.8:5170: error dialing network logger: dial tcp 172.19.0.8:5170: connect: connection refused
{"tag":"cockroach.ops","channel_numeric":1,"channel":"OPS","timestamp":"1633711167.830694823","severity_numeric":3,"severity":"ERROR","goroutine":1,"file":"util/log/logcrash/crash_reporting.go","line":140,"e
E211008 16:39:28.853216 1 1@util/log/channels.go:58 ⋮ [-]   logging error: dial tcp ‹172.19.0.8:5170›: connect: connection refused
fluent:tcp://172.19.0.8:5170: error dialing network logger: dial tcp 172.19.0.8:5170: connect: connection refused
{"tag":"cockroach.ops","channel_numeric":1,"channel":"OPS","timestamp":"1633711168.853216867","severity_numeric":3,"severity":"ERROR","goroutine":1,"file":"util/log/channels.go","line":58,"entry_counter":0,"
panic: problem with node certificate: not found [recovered]
    panic: problem with node certificate: not found [recovered]
    panic: problem with node certificate: not found
goroutine 1 [running]:
github.com/cockroachdb/cockroach/pkg/util/log/logcrash.RecoverAndReportPanic(0x8b098a0, 0xc000078048, 0xc000b9a000)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/log/logcrash/crash_reporting.go:90 +0xa5
panic(0x4a4b240, 0xc000d08440)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop(0xc000164a00, 0x8b098a0, 0xc000078048)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:502 +0x2d3
panic(0x4a4b240, 0xc000d08440)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/cockroachdb/cockroach/pkg/rpc.NewServer(0xc0007038c0, 0x0, 0x0, 0x0, 0x8b71998)
    /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:149 +0xe17
github.com/cockroachdb/cockroach/pkg/server.makeTenantSQLServerArgs(0xc000164a00, 0x0, 0x0, 0xc000b9a000, 0xc0001d2b60, 0xc000ef7f80, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/tenant.go:425 +0x1065
github.com/cockroachdb/cockroach/pkg/server.StartTenant(0x8b098a0, 0xc000078048, 0xc000164a00, 0x0, 0x0, 0xc000b9a000, 0xc0001d2b60, 0xc000ef7f80, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/tenant.go:66 +0x13b
github.com/cockroachdb/cockroach/pkg/cli.runStartSQL(0xb127360, 0xc0009d1dc0, 0x0, 0xe, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/cli/mt_start_sql.go:114 +0x37a
github.com/cockroachdb/cockroach/pkg/cli/clierrorplus.MaybeDecorateError.func1(0xb127360, 0xc0009d1dc0, 0x0, 0xe, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/cli/clierrorplus/decorate_error.go:67 +0x79
github.com/spf13/cobra.(*Command).execute(0xb127360, 0xc0009d1ce0, 0xe, 0xe, 0xb127360, 0xc0009d1ce0)
    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0xb122860, 0xc000078048, 0xc0007243e2, 0xc)
    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
github.com/cockroachdb/cockroach/pkg/cli.Run(...)
    /go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:278
github.com/cockroachdb/cockroach/pkg/cli.doMain(0xb127360, 0xc0007243e2, 0xc, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:125 +0x11a
github.com/cockroachdb/cockroach/pkg/cli.Main()
    /go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:59 +0x13a
main.main()
    /go/src/github.com/cockroachdb/cockroach/pkg/cmd/cockroach/main.go:26 +0x25

@knz
Copy link
Contributor Author

knz commented Oct 11, 2021

@chrisseto can you provide repro steps?

I just tried myself with a mt server that doesn't have node certs, and it appears that the SQL server starts fine.

@knz
Copy link
Contributor Author

knz commented Oct 11, 2021

@chrisseto I think it's vaguely possible you're trying to mix and match different crdb executable versions?

As of this patch, we have the following file usage:

- KV nodes on host cluster:
  - ui.crt (optional):
    - used as server cert for HTTP
  - ui-ca.crt (optional):
    - used in unit tests to verify the server's identity for HTTP conns
  - node.crt:
    - used as client cert for node-to-node comms
    - used as server cert for node-to-node comms
    - used as server cert for SQL clients
    - used as server cert for incoming conns from SQL tenant servers
    - used as server cert for HTTP, if ui.crt doesn't exist
  - tenant-client-ca.crt (optional):
    - used to verify certificates from SQL tenant servers connecting as clients
  - client-ca.crt (optional);
    - used to verify client certs for SQL clients
    - used to verify client certs for SQL tenant servers, if tenant-client-ca.crt doesn't exist
  - ca.crt:
    - used to verify other node client certs for node-to-node comms
    - used in unit tests to verify the server's identity for SQL and RPC conns
    - used to verify client certs for SQL clients, if client-ca.crt doesn't exist
    - used to verify client certs for SQL tenant servers, if neither tenant-client.ca.crt nor client-ca.crt exist

- SQL servers:
  - ui.crt (optional):
    - used as server cert for HTTP
  - ui-ca.crt (optional):
    - used in unit tests to verify the server's identity for HTTP conns
  - client-tenant.NN.crt:
    - used as client cert for node-to-node comms (SQL server to SQL server)
    - used as server cert for node-to-node comms (SQL server to SQL server)
    - used as client cert for conns to KV nodes
    - used as server cert for SQL clients
    - used as server cert for HTTP, if ui.crt doesn't exist
  - tenant-client-ca.crt (optional):
    - used to verify certs from other SQL tenant servers
  - client-ca.crt (optional);
    - used to verify client certs for SQL clients
    - used to verify client certs for SQL tenant servers, if tenant-client-ca.crt doesn't exist
  - ca.crt:
    - used to verify other SQL server certs for node-to-node comms, if tenant-client-ca.crt doens't exist
    - used to verify client certs for SQL clients, if client-ca.crt doesn't exist
    - used to verify client certs for SQL tenant servers, if neither tenant-client.ca.crt nor client-ca.crt exist
    - used in unit tests to verify the server's identity for SQL and  RPC conns

Release note (security update): Multitenant SQL servers now reuse
the tenant client certificate (`client-tenant.NN.crt`) for SQL-to-SQL
communication. Existing deployments must regenerate the certificates
with dual purpose (client and server authentication).
@knz
Copy link
Contributor Author

knz commented Oct 11, 2021

I'm going to go ahead and merge this, on account of Catherine's positive review and that CI is satisfied.

My purpose in merging this now is to give a chance to our roachtests to exercise the changes thoroughly in the nightly run.

We can adjust the logic further in followup PRs.

bors r=catj-cockroach

@craig
Copy link
Contributor

craig bot commented Oct 11, 2021

Build succeeded:

@craig craig bot merged commit 8f546a6 into cockroachdb:master Oct 11, 2021
@knz
Copy link
Contributor Author

knz commented Oct 11, 2021

blathers backport 21.2

@knz knz deleted the 20211006-tls3 branch October 11, 2021 15:49
craig bot pushed a commit that referenced this pull request Oct 15, 2021
71548: security: prevent PEM decoding errors by careful newline insertation r=catj-cockroach a=catj-cockroach

Fixes #71588

> Previously, we assumed that the Tenant CA certificate and the CA
> certificate would not be the same, and that they would be saved as
> files with an end of file newline character.
> 
> This change adds support for not loading the Tenant CA Certificate
> if it will just be a second copy of the CA certificate.
> 
> This change also adds a helper function to concatenate certificates
> with a newline separating each, to ensure that the easiest way to
> make a certificate bundle is the correct way.
> 
> Release note: bugfix for certificate loading logic

The bug in question was introduced in #71248 (and back ported in #71402).

Co-authored-by: Cat J <87989074+catj-cockroach@users.noreply.github.com>
rafiss added a commit to rafiss/cockroach-go that referenced this pull request Nov 15, 2021
Since cockroachdb/cockroach#71248 the server
address must be passed in as a positional argument.

Technically, the version check makes cockroach-go tenant servers not
work for commits between a0c04c620ddb67cf44265eaef61bcd38c13e4fe7 and
d7569dae28d405437f5af670346d45c2d6deb42a, but I think this is an
acceptable breakage since those are unreleased commits.
rafiss added a commit to rafiss/cockroach-go that referenced this pull request Nov 15, 2021
Since cockroachdb/cockroach#71248 the server
address must be passed in as a positional argument.

Technically, the version check makes cockroach-go tenant servers not
work for commits between a0c04c620ddb67cf44265eaef61bcd38c13e4fe7 and
d7569dae28d405437f5af670346d45c2d6deb42a, but I think this is an
acceptable breakage since those are unreleased commits.
rafiss added a commit to rafiss/cockroach-go that referenced this pull request Nov 15, 2021
Since cockroachdb/cockroach#71248 the server
address must be passed in as a positional argument.

Technically, the version check makes cockroach-go tenant servers not
work for commits between a0c04c620ddb67cf44265eaef61bcd38c13e4fe7 and
d7569dae28d405437f5af670346d45c2d6deb42a, but I think this is an
acceptable breakage since those are unreleased commits.
rafiss added a commit to rafiss/cockroach-go that referenced this pull request Nov 16, 2021
Since cockroachdb/cockroach#71248 the server
address must be passed in as a positional argument.

Technically, the version check makes cockroach-go tenant servers not
work for commits between a0c04c620ddb67cf44265eaef61bcd38c13e4fe7 and
d7569dae28d405437f5af670346d45c2d6deb42a, but I think this is an
acceptable breakage since those are unreleased commits.
martingallagher pushed a commit to martingallagher/cockroach-go that referenced this pull request Jan 19, 2022
Since cockroachdb/cockroach#71248 the server
address must be passed in as a positional argument.

Technically, the version check makes cockroach-go tenant servers not
work for commits between a0c04c620ddb67cf44265eaef61bcd38c13e4fe7 and
d7569dae28d405437f5af670346d45c2d6deb42a, but I think this is an
acceptable breakage since those are unreleased commits.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

server: pod-to-pod communication should use same TLS cert as pod-kv comms, not node.crt
4 participants