Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes using externally managed certs for kubeadm #68296

Merged

Conversation

liztio
Copy link
Contributor

@liztio liztio commented Sep 5, 2018

What this PR does / why we need it:
The certificates overhaul caused a regression when using external certificates. This fixes that issue so external CAs no longer require a key if all certificates exist.

Walk the certificate tree, at each step checking for a CACert.
If the CACert is found, try to use it to generate certificates.
Otherwise, generate a new CA cert.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes kubernetes/kubeadm#918

Special notes for your reviewer:

Release note:

External CAs can now be used for kubeadm with only a certificate, as long as all required certificates already exist.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 5, 2018
@liztio
Copy link
Contributor Author

liztio commented Sep 5, 2018

/sig cluster-lifecycle

@liztio
Copy link
Contributor Author

liztio commented Sep 5, 2018

/cc @timothysc

@neolit123
Copy link
Member

a release note is possible needed.

wasn't the cert overhaul after this issue chronologically:

Fixes kubernetes/kubeadm#918

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 5, 2018
Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some nits.
thanks for the bug fix @liztio


writePKIFiles(t, tmpdir, test.files)

err := CreatePKIAssets(cfg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err := CreatePKIAssets(cfg)
if err != nil {
	if test.expectError {
		return
	}
	t.Fatalf("Unexpected error: %v", err)
}
if test.expectError {
	t.Fatal("Expected error from CreatePKIAssets, got none")
}

can this be refactored to the following:

if err := CreatePKIAssets(cfg); (err != nil) != test.expectError {
	t.Fatalf("expected error: %v, got: %v, error: %v", test.expectError, (err != nil), err)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This turns out to not quite be the case: if there's an error and we expect it, we don't want to continue to run assertCertsExist


for caCert, certs := range tree {
if err := validateCACert(certKeyLocation{dir, caCert.BaseName, "", caCert.Name}); err != nil {
t.Errorf("Couldn't validate CA certificate %v: %v", caCert.Name, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't -> couldn't

uxName: leaf.Name,
}
if err := validateSignedCertWithCA(cl, caCert); err != nil {
return fmt.Errorf("couldn't load expected cert %q and key %q does not exist: %v", leaf.Name, ca.Name, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly mod the message to:
could not load expected certificate %q or validate the existence of key %q for it: %v
🤔 ?

if err == nil {
// Cert exists already, make sure it's valid
if !caCert.IsCA {
return fmt.Errorf("certificate %q should be a CA but is not", ca.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

certificate %q should be a CA but is not
->
certificate %q is not a CA
?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with the statement.

@timothysc timothysc added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Sep 5, 2018
@timothysc timothysc added this to the v1.12 milestone Sep 5, 2018
Copy link
Member

@timothysc timothysc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@liztio needs a release note if it was originally reported in 1.11 and fixed in 1.12

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 6, 2018
@liztio
Copy link
Contributor Author

liztio commented Sep 6, 2018

@neolit123 huh, maybe it was! I honestly just assumed that I broke it!

@liztio liztio changed the title Fixes regression in using externally managed certs for kubeadm Fixes using externally managed certs for kubeadm Sep 6, 2018
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Sep 6, 2018
Copy link
Member

@timothysc timothysc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 6, 2018
Walk the certificate tree, at each step checking for a CACert.
If the CACert is found, try to use it to generate certificates.
Otherwise, generate a new CA cert.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 6, 2018
}

// validateSignedCertWithCA tries to load a certificate and validate it with the given caCert
func validateSignedCertWithCA(l certKeyLocation, caCert *x509.Certificate) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not urgent, but might be good to get test cases for this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a slight generalisation of the existing validateSignedCert so I didn't bother.

@timothysc
Copy link
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 6, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liztio, timothysc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 68087, 68256, 64621, 68299, 68296). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

@k8s-github-robot k8s-github-robot merged commit dd14bc5 into kubernetes:master Sep 6, 2018
lkpdn pushed a commit to lkpdn/kubernetes that referenced this pull request Sep 8, 2018
PR kubernetes#68296 got ahead of this.

This reverts:
- commit 17b4c19
- commit 2fd01a0

----------[corresponding planned issue]---------
TITLE: kubeadm HA setup: silently broken certificate chains issue, leading to failed etcd clustering

<!-- This form is for bug reports and feature requests ONLY!

If you're looking for help check [Stack Overflow](https://stackoverflow.com/questions/tagged/kubernetes) and the [troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/).

If the matter is security related, please disclose it privately via https://kubernetes.io/security/.
-->

**Is this a BUG REPORT or FEATURE REQUEST?**:
/kind bug

**What happened**:
Following the [HA setup documentation](https://kubernetes.io/docs/setup/independent/high-availability/) and having done the part `5. Run the kubeadm phase commands to bootstrap the kubelet`, etcd starts complaining about the certificate issue, with the cluster health never be reaching ok.

   - CP0 (CP0_IP: 192.168.122.27)
   ```
     ----(snip)----
     2018-09-08 01:50:36.892124 I | etcdserver/membership: added member 10309eb7b886aec3 [https://192.168.122.30:2380] to cluster 50244b6b5ec9586c
     2018-09-08 01:50:36.892205 I | rafthttp: starting peer 10309eb7b886aec3...
     2018-09-08 01:50:36.892446 I | rafthttp: started HTTP pipelining with peer 10309eb7b886aec3
     2018-09-08 01:50:36.894198 I | rafthttp: started peer 10309eb7b886aec3
     2018-09-08 01:50:36.894236 I | rafthttp: added peer 10309eb7b886aec3
     2018-09-08 01:50:36.894429 I | rafthttp: started streaming with peer 10309eb7b886aec3 (writer)
     2018-09-08 01:50:36.894443 I | rafthttp: started streaming with peer 10309eb7b886aec3 (writer)
     2018-09-08 01:50:36.894474 I | rafthttp: started streaming with peer 10309eb7b886aec3 (stream MsgApp v2 reader)
     2018-09-08 01:50:36.894663 I | rafthttp: started streaming with peer 10309eb7b886aec3 (stream Message reader)
     2018-09-08 01:50:37.845351 W | etcdserver: failed to reach the peerURL(https://192.168.122.30:2380) of member 10309eb7b886aec3 (Get https://192.168.122.30:2380/version: dial tcp 192.168.122.30:2380: getsockopt: connection refused)
     2018-09-08 01:50:37.845466 W | etcdserver: cannot get the version of member 10309eb7b886aec3 (Get https://192.168.122.30:2380/version: dial tcp 192.168.122.30:2380: getsockopt: connection refused)
     2018-09-08 01:50:37.900635 W | raft: d9e6643beae8a97f stepped down to follower since quorum is not active
     2018-09-08 01:50:37.900702 I | raft: d9e6643beae8a97f became follower at term 2
     2018-09-08 01:50:37.900741 I | raft: raft.node: d9e6643beae8a97f lost leader d9e6643beae8a97f at term 2
     2018-09-08 01:50:38.901030 I | raft: d9e6643beae8a97f is starting a new election at term 2
     2018-09-08 01:50:38.901193 I | raft: d9e6643beae8a97f became candidate at term 3
     2018-09-08 01:50:38.901261 I | raft: d9e6643beae8a97f received MsgVoteResp from d9e6643beae8a97f at term 3
     2018-09-08 01:50:38.901319 I | raft: d9e6643beae8a97f [logterm: 2, index: 553] sent MsgVote request to 10309eb7b886aec3 at term 3
     /*
      * etcd on CP1 boots up, but its client cert is invalid..
      */
> 2018-09-08 01:50:39.673266 I | etcdmain: rejected connection from "192.168.122.30:41466" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")", ServerName "")
     2018-09-08 01:50:40.500956 I | raft: d9e6643beae8a97f is starting a new election at term 3
     2018-09-08 01:50:40.501314 I | raft: d9e6643beae8a97f became candidate at term 4
     2018-09-08 01:50:40.501751 I | raft: d9e6643beae8a97f received MsgVoteResp from d9e6643beae8a97f at term 4
     2018-09-08 01:50:40.505545 I | raft: d9e6643beae8a97f [logterm: 2, index: 553] sent MsgVote request to 10309eb7b886aec3 at term 4
> 2018-09-08 01:50:40.938004 I | etcdmain: rejected connection from "192.168.122.30:41468" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")", ServerName "")
     2018-09-08 01:50:41.801044 I | raft: d9e6643beae8a97f is starting a new election at term 4
     2018-09-08 01:50:41.801112 I | raft: d9e6643beae8a97f became candidate at term 5
     2018-09-08 01:50:41.801137 I | raft: d9e6643beae8a97f received MsgVoteResp from d9e6643beae8a97f at term 5
     2018-09-08 01:50:41.801221 I | raft: d9e6643beae8a97f [logterm: 2, index: 553] sent MsgVote request to 10309eb7b886aec3 at term 5
     ----(snip)----
```

  - CP1 (CP1_IP: 192.168.122.30)
   ```
     ----(snip)----
     2018-09-08 14:53:32.763499 W | etcdserver: could not get cluster response from https://192.168.122.27:2380: Get https://192.168.122.27:2380/members: remote error: tls: bad certificate
     2018-09-08 14:53:32.768630 I | etcdmain: rejected connection from "192.168.122.27:33996" (error  remote error: tls: bad certificate", ServerName "")
     2018-09-08 14:53:32.771694 C | etcdmain: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given urls
     /*
      * and k8s_etcd_etcd dies..
      */
     ----(snip)----
```

This is because the scp-ed CA root certificate for etcd clustering is not going to be used when generating the daughter certificate so etcd in the first stacked control plane cannot verify it.

This problem is not confined to etcd, but all the other PKI assets which are expected to be signed by corresponding scp-ed CAs as well. `kubeadm init` and `kubeadm join` paths seem not to be affected.

In my opinion, there are two options we can take:
- ensure the capability to utilize the pre-installed CA with kubeadm certs command, to be consistent with the current documentation.
- log some warning or error message saying that the certificate chains has been broken with the command.

and I think the former is reasonable, so I will post PR for that later.

**What you expected to happen**:
etcd clustering successfully starts with properly setup certificates chains.

**How to reproduce it (as minimally and precisely as possible)**:
As mentioned in **What happened:** section above.

**Anything else we need to know?**:
Nothing in particular.

**Environment**:
- Kubernetes version (use `kubectl version`): v1.13.0-alpha.0.1072+2c933695fa61d5
- Cloud provider or hardware configuration: hardware configuration (qemu-kvm)
- OS (e.g. from /etc/os-release): fedora 28
- Kernel (e.g. `uname -a`): 4.19.0-rc2
- Install tools: kubeadm
- Others:
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"kubeadm init" doesn't accommodate an external front proxy CA
6 participants