
etcd : Gen_certs | run cert generation script fails on SSL #2343

Closed
matq007 opened this issue Feb 14, 2018 · 21 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@matq007

matq007 commented Feb 14, 2018

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG

Environment:

  • Cloud provider or hardware configuration:
    hardware

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 3.10.0-693.2.2.el7.x86_64 x86_64
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Version of Ansible (ansible --version):
    ansible 2.4.2.0

Kubespray version (commit) (git rev-parse --short HEAD):
v2.4.0

Network plugin used:
calico

Copy of your inventory file:

nc-kub-m01 ansible_ssh_host=10.0.55.165 ip=10.0.55.165
nc-kub-s01 ansible_ssh_host=10.0.55.163 ip=10.0.55.163

[kube-master]
nc-kub-m01

[etcd]
nc-kub-m01

[kube-node]
nc-kub-s01

[k8s-cluster:children]
kube-node
kube-master

Command used to invoke ansible:
ansible-playbook -i inventory/hosts.ini cluster.yml -b -K -v --user=kubernetes --private-key=~/.ssh/kubernetes.pem --ask-sudo-pass

Output of ansible run:

{
	"changed": true,
	"cmd": [
		"bash",
		"-x",
		"/usr/local/bin/etcd-scripts/make-ssl-etcd.sh",
		"-f",
		"/etc/ssl/etcd/openssl.conf",
		"-d",
		"/etc/ssl/etcd/ssl"
	],
	"delta": "0:00:00.403411",
	"end": "2018-02-14 19:53:14.477595",
	"msg": "non-zero return code",
	"rc": 1,
	"start": "2018-02-14 19:53:14.074184",
	"stderr": "+ set -o errexit\n+ set -o pipefail\n+ (( 4 ))\n+ case \"$1\" in\n+ CONFIG=/etc/ssl/etcd/openssl.conf\n+ shift 2\n+ (( 2 ))\n+ case \"$1\" in\n+ SSLDIR=/etc/ssl/etcd/ssl\n+ shift 2\n+ (( 0 ))\n+ '[' -z /etc/ssl/etcd/openssl.conf ']'\n+ '[' -z /etc/ssl/etcd/ssl ']'\n++ mktemp -d /tmp/etcd_cacert.XXXXXX\n+ tmpdir=/tmp/etcd_cacert.Vfkpq5\n+ trap 'rm -rf \"${tmpdir}\"' EXIT\n+ cd /tmp/etcd_cacert.Vfkpq5\n+ mkdir -p /etc/ssl/etcd/ssl\n+ '[' -e /etc/ssl/etcd/ssl/ca-key.pem ']'\n+ cp /etc/ssl/etcd/ssl/ca.pem /etc/ssl/etcd/ssl/ca-key.pem .\n+ '[' -n '  ' ']'\n+ '[' -n '  nc-kub-s01    ' ']'\n+ for host in '$HOSTS'\n+ cn=nc-kub-s01\n+ openssl genrsa -out node-nc-kub-s01-key.pem 2048\n+ openssl req -new -key node-nc-kub-s01-key.pem -out node-nc-kub-s01.csr -subj /CN=etcd-node-nc-kub-s01\n+ openssl x509 -req -in node-nc-kub-s01.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out node-nc-kub-s01.pem -days 3650 -extensions ssl_client -extfile /etc/ssl/etcd/openssl.conf\n+ rm -rf /tmp/etcd_cacert.Vfkpq5",
	"stderr_lines": [
		"+ set -o errexit",
		"+ set -o pipefail",
		"+ (( 4 ))",
		"+ case \"$1\" in",
		"+ CONFIG=/etc/ssl/etcd/openssl.conf",
		"+ shift 2",
		"+ (( 2 ))",
		"+ case \"$1\" in",
		"+ SSLDIR=/etc/ssl/etcd/ssl",
		"+ shift 2",
		"+ (( 0 ))",
		"+ '[' -z /etc/ssl/etcd/openssl.conf ']'",
		"+ '[' -z /etc/ssl/etcd/ssl ']'",
		"++ mktemp -d /tmp/etcd_cacert.XXXXXX",
		"+ tmpdir=/tmp/etcd_cacert.Vfkpq5",
		"+ trap 'rm -rf \"${tmpdir}\"' EXIT",
		"+ cd /tmp/etcd_cacert.Vfkpq5",
		"+ mkdir -p /etc/ssl/etcd/ssl",
		"+ '[' -e /etc/ssl/etcd/ssl/ca-key.pem ']'",
		"+ cp /etc/ssl/etcd/ssl/ca.pem /etc/ssl/etcd/ssl/ca-key.pem .",
		"+ '[' -n '  ' ']'",
		"+ '[' -n '  nc-kub-s01    ' ']'",
		"+ for host in '$HOSTS'",
		"+ cn=nc-kub-s01",
		"+ openssl genrsa -out node-nc-kub-s01-key.pem 2048",
		"+ openssl req -new -key node-nc-kub-s01-key.pem -out node-nc-kub-s01.csr -subj /CN=etcd-node-nc-kub-s01",
		"+ openssl x509 -req -in node-nc-kub-s01.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out node-nc-kub-s01.pem -days 3650 -extensions ssl_client -extfile /etc/ssl/etcd/openssl.conf",
		"+ rm -rf /tmp/etcd_cacert.Vfkpq5"
	],
	"stdout": "",
	"stdout_lines": []
}

Anything else do we need to know:

Also related to issue #1445.

@woopstar
Member

Could you try running the command manually:

/usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl

@matq007
Author

matq007 commented Feb 15, 2018

The command worked fine :). What I found out is that when running the playbook multiple times, it doesn't seem to create new certificates. I removed the old certificates in /etc/ssl/etcd manually, reran the playbook, and everything worked perfectly. I'm not sure, but maybe adding a preinstall rule to check whether the certificates are already generated would fix this?
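
A minimal shell sketch of that preinstall idea, assuming the paths seen elsewhere in this thread and a hypothetical node name; the actual logic in Kubespray lives in check_certs.yml (quoted further down), so this is only an illustration:

#!/usr/bin/env bash
# Hypothetical preinstall check: skip generation when this node's etcd certs
# already exist. Paths and the node name are taken from this thread.
SSLDIR=/etc/ssl/etcd/ssl
NODE=nc-kub-s01

if [ -e "${SSLDIR}/node-${NODE}.pem" ] && [ -e "${SSLDIR}/node-${NODE}-key.pem" ]; then
  echo "certs for ${NODE} already present in ${SSLDIR}, skipping generation"
else
  bash -x /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d "${SSLDIR}"
fi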

@woopstar
Member

Ah. So you already had certificates. I see that now from the log output.

The failing command is openssl x509 -req -in node-nc-kub-s01.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out node-nc-kub-s01.pem -days 3650 -extensions ssl_client -extfile /etc/ssl/etcd/openssl.conf

The rm -rf /tmp/etcd_cacert.Vfkpq5 happens because of the EXIT trap that cleans up the temp directory.

@woopstar
Member

This possibly happens because of set -o errexit or set -o pipefail, since the openssl command exits with a non-zero code.
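
For illustration only (not Kubespray code): with errexit set, the first command that returns non-zero aborts the script, and the EXIT trap still fires, which is why the rm -rf of the temp directory is the last thing in the trace even though the run failed.

#!/usr/bin/env bash
set -o errexit
set -o pipefail

tmpdir=$(mktemp -d /tmp/demo.XXXXXX)
trap 'echo "cleanup: removing ${tmpdir}"; rm -rf "${tmpdir}"' EXIT

false   # stands in for the failing openssl x509 command
echo "never reached: errexit already aborted the script"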

@matq007
Author

matq007 commented Feb 15, 2018

Awesome, I guess we can close the issue then :)

@woopstar
Member

Well. The issue is not really fixed. It should not stop the process.

@missnebun

missnebun commented Mar 10, 2018

I have the same problem. I removed the SSL keys ... I reset the cluster ... it still fails. Not sure what is going on there.

@missnebun

[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
subjectAltName = @alt_names
[alt_names]
DNS.1 = kubernetes
DNS.2 = kubernetes.default
DNS.3 = kubernetes.default.svc
DNS.4 = kubernetes.default.svc.cluster.local
DNS.5 = localhost
DNS.6 = nyvm3331
DNS.7 = nyvm3332
DNS.8 = nyvm3333
DNS.9 = k8sapi.a2tz.com
IP.1 = 10.250.69.67
IP.2 = 10.250.69.67
IP.3 = 10.250.69.68
IP.4 = 10.250.69.68
IP.5 = 10.250.69.69
IP.6 = 10.250.69.69
IP.7 = False
IP.8 = 10.250.69.73
IP.9 = 127.0.0.1
IP.10 = 10.250.69.74

For some reason the openssl.conf is getting False as an IP address, and this makes certificate generation fail. Now I need to see where that's coming from.
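
A hypothetical sanity check one could run before the generation script, to catch non-IP values like that False in the SAN list (the config path is the one used in this thread):

#!/usr/bin/env bash
# Report any IP.N entry in the etcd openssl config that is not a plausible
# IPv4 address; the "IP.7 = False" entry above would be flagged here.
CONF=/etc/ssl/etcd/openssl.conf

awk -F'= *' '/^IP\./ {print $2}' "$CONF" | while read -r ip; do
  if ! [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
    echo "suspicious SAN entry in ${CONF}: '${ip}'" >&2
  fi
done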

@woopstar
Member

How did you configure

#loadbalancer_apiserver:
#  address: 1.2.3.4
#  port: 1234

?

@missnebun

External LB example config

apiserver_loadbalancer_domain_name: "k8sapi.a2tz.com"
loadbalancer_apiserver:
  address: 10.250.69.73
  port: 6443

Internal loadbalancers for apiservers

loadbalancer_apiserver_localhost: false

@missnebun

@woopstar Any idea ?

@woopstar
Member

No, it seems right.

Are you running the latest Kubespray from master branch?

It seems odd that the IPs are listed multiple times when you only have two hosts in your inventory file. Are you sure you are running the right inventory file? The DNS entries in the openssl.conf mention the hostname nyvm3331, but that is not in your inventory file?

@nwsparks

nwsparks commented Jul 11, 2018

Not sure if this is the cause of your issue, but I ran into the same error as well, and it was due to an invalid entry in the hosts file. I had specified a DNS name for the API load balancer, and it added a hosts file entry that looks like this:

/etc/hosts
internal-blah-123123123.us-east-1.elb.amazonaws.com blah.com

group_vars/all.yml
apiserver_loadbalancer_domain_name: "blah.com"
loadbalancer_apiserver:
  address: internal-blah-123123123.us-east-1.elb.amazonaws.com
  port: 6443

The documentation should probably be updated to explain how to properly define AWS load balancers, which need to be referenced by name and not by IP.
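
For context, /etc/hosts only accepts an IP address in the first field, so an ELB DNS name written there is simply invalid. A hedged way to inspect what actually got written on a node (the names below are the placeholders from the comment above):

# Show the entry the playbook generated for the LB name (placeholder from above)
grep -n 'blah.com' /etc/hosts

# Show what the ELB name currently resolves to
getent hosts internal-blah-123123123.us-east-1.elb.amazonaws.com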

@ykfq

ykfq commented Mar 29, 2019

Same issue. I have no idea why the MASTERS and HOSTS variables are both empty (whitespace only), so no etcd node and member certs get generated, and then mv '*.pem' /etc/ssl/etcd/ssl/ exits with an error:

  "stderr_lines": [
    "+ set -o errexit",
    "+ set -o pipefail",
    "+ (( 4 ))",
    "+ case \"$1\" in",
    "+ CONFIG=/etc/ssl/etcd/openssl.conf",
    "+ shift 2",
    "+ (( 2 ))",
    "+ case \"$1\"in",
    "+ SSLDIR=/etc/ssl/etcd/ssl",
    "+ shift 2",
    "+ (( 0 ))",
    "+ '[' -z /etc/ssl/etcd/openssl.conf ']'",
    "+ '[' -z /etc/ssl/etcd/ssl ']'",
    "++ mktemp -d /tmp/etcd_cacert.XXXXXX",
    "+ tmpdir=/tmp/etcd_cacert.uIu8DZ",
    "+ trap 'rm -rf \"${tmpdir}\"' EXIT",
    "+ cd /tmp/etcd_cacert.uIu8DZ",
    "+ mkdir -p /etc/ssl/etcd/ssl",
    "+ '[' -e /etc/ssl/etcd/ssl/ca-key.pem ']'",
    "+ cp /etc/ssl/etcd/ssl/ca.pem /etc/ssl/etcd/ssl/ca-key.pem .",
    "+ '[' -n '      ' ']'",
    "+ '[' -n '                  ' ']'",
    "+ '[' -e /etc/ssl/etcd/ssl/ca-key.pem ']'",
    "+ rm -f ca.pem ca-key.pem",
    "+ mv '*.pem' /etc/ssl/etcd/ssl/",
    "mv: cannot stat ‘*.pem’: No such file or directory",
    "+ rm -rf /tmp/etcd_cacert.uIu8DZ"
  ],

@ykfq

ykfq commented Apr 3, 2019

Well, I finally found and solved the problem.

The script make-ssl-etcd.sh itself works fine and has no logic issue. So why did this happen?

Why

Kubespray has used kubeadm as the default deployment mode since v2.8.0, and there is a switch kubeadm_enabled: true in all.yml. However, many tasks only execute when: not kubeadm_enabled. This includes the role kubernetes/node, which depends on the role kubernetes/secrets, which runs the task Gen_certs | run cert generation script; in that task, all master and host certs are generated if they do not already exist.

So, since kubeadm_enabled defaults to true, these roles are never executed and this error is never triggered.
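
One hedged way to see this gating in a Kubespray checkout (exact matches depend on your clone and version):

# List tasks and role dependencies that only run in non-kubeadm mode
grep -rn "not kubeadm_enabled" roles/ | head -n 20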

I hit this error when I ran the playbook cluster.yml with extra parameters to scale the master nodes:

 --extra-vars "k8s-secrets=true, gen_certs=true,sync_certs=true, gen_master_certs=true"

This made the task run, but check_certs.yml set gen_node_certs to false, so the MASTERS and HOSTS variables were empty and no pem certs were generated; the error mv: cannot stat ‘*.pem’: No such file or directory then occurred and the whole task exited.
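
A minimal reproduction of that failure mode, assuming (as described above) that HOSTS expands to whitespace only; it mimics the relevant lines of make-ssl-etcd.sh rather than running the real script:

#!/usr/bin/env bash
set -o errexit
set -o pipefail

HOSTS="                  "   # what the Jinja loop expands to when gen_node_certs
                             # is false for every host
tmpdir=$(mktemp -d)
cd "$tmpdir"

for host in $HOSTS; do       # word splitting of whitespace yields zero iterations
  echo "would generate node-${host}.pem here"
done

mv *.pem /tmp/demo-ssl/      # no *.pem exists -> "mv: cannot stat '*.pem'",
                             # and errexit aborts the whole script here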

How

I think Kubespray did not fully support the master-node scaling feature: when I upgraded my k8s cluster with Kubespray from v2.6.0 to v2.8.3, the old deployment mode and the kubeadm mode were both used, which messed up my environment (old files existed, some tasks were skipped, and some services loaded old-style config files) and caused my scaling to fail.

If I ran the playbook without any extra parameters, I would instead hit the x509: certificate is valid for xx.x.x.x, not yy.y.y.y error, because the newly added master node is not included in the old certs since gen_certs is not executed. What a dead loop!

Two ways to solve my problem (see the command sketch after this list):

  • Back up and delete the /etc/kubernetes directory and re-run the playbook with kubeadm_enabled: false
  • Back up etcd, reset the cluster, and re-run the playbook with kubeadm_enabled: true
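
A hedged sketch of how either option could be driven from the command line; the inventory path and the kubeadm_enabled variable are the ones already mentioned in this thread, and the backup/reset steps above still come first:

# Option 1: non-kubeadm deployment mode
ansible-playbook -i inventory/hosts.ini cluster.yml -b -e kubeadm_enabled=false

# Option 2: kubeadm mode after resetting the cluster
ansible-playbook -i inventory/hosts.ini cluster.yml -b -e kubeadm_enabled=true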

Related code:

  • ./roles/kubernetes/node/meta/main.yml
---
dependencies:
  - role: kubernetes/secrets
    when: not kubeadm_enabled
    tags:
      - k8s-secrets
  • ./roles/etcd/tasks/gen_certs_script.yml
- name: Gen_certs | run cert generation script
  command: "bash -x {{ etcd_script_dir }}/make-ssl-etcd.sh -f {{ etcd_config_dir }}/openssl.conf -d {{ etcd_cert_dir }}"
  environment:
    - MASTERS: "{% for m in groups['etcd'] %}
                  {% if gen_node_certs[m] %}
                    {{ m }}
                  {% endif %}
                {% endfor %}"
    - HOSTS: "{% for h in (groups['k8s-cluster'] + groups['calico-rr']|default([]))|unique %}
                {% if gen_node_certs[h] %}
                    {{ h }}
                {% endif %}
              {% endfor %}"
  run_once: yes
  delegate_to: "{{groups['etcd'][0]}}"
  when:
    - gen_certs|default(false)
    - inventory_hostname == groups['etcd'][0]
  notify: set etcd_secret_changed
  • ./roles/etcd/tasks/check_certs.yml
- name: "Check_certs | Set 'gen_node_certs' to true"
  set_fact:
    gen_node_certs: |-
      {
      {% set all_etcd_hosts = groups['k8s-cluster']|union(groups['etcd'])|union(groups['calico-rr']|default([]))|unique|sort -%}
      {% set existing_certs = etcdcert_master.files|map(attribute='path')|list|sort %}
      {% for host in all_etcd_hosts -%}
        {% set host_cert = "%s/node-%s-key.pem"|format(etcd_cert_dir, host) %}
        {% if host_cert in existing_certs -%}
        "{{ host }}": False,
        {% else -%}
        "{{ host }}": True,
        {% endif -%}
      {% endfor %}
      }
  run_once: true

@juliorenner

In our case, just executing the workarounds above was not enough. Even after deleting and regenerating the certificates the error remained. Then we figured out that there is a Docker container running etcd, and its certificates are not updated when we re-execute the playbook. So if the error remains, delete the etcd container and rerun the playbook.
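
A hedged sketch of that workaround; the etcd container name is not confirmed in this thread, so look it up on the node first:

# Find the etcd container on the etcd node
ETCD_CONTAINER=$(docker ps --format '{{.Names}}' | grep -i etcd | head -n 1)
echo "removing ${ETCD_CONTAINER}"

# Remove it, then re-run the playbook so it is recreated with the new certs
docker rm -f "${ETCD_CONTAINER}"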

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 16, 2019
@rnwolfe

rnwolfe commented Jul 24, 2019

I am also running into this, or something similar, saying the file does not exist. When looking on the node manually, the directory /usr/local/bin/etcd-scripts/ does not exist.

TASK [etcd : Gen_certs | run cert generation script] ***********************************************************************************************************************************************************
Wednesday 24 July 2019  11:45:52 -0400 (0:00:00.061)       0:07:12.863 ********
fatal: [master1 -> 10.12.100.105]: FAILED! => {"changed": true, "cmd": ["bash", "-x", "/usr/local/bin/etcd-scripts/make-ssl-etcd.sh", "-f", "/etc/ssl/etcd/openssl.conf", "-d", "/etc/ssl/etcd/ssl"], "delta": "0:00:00.010068", "end": "2019-07-24 11:45:52.997581", "msg": "non-zero return code", "rc": 127, "start": "2019-07-24 11:45:52.987513", "stderr": "bash: /usr/local/bin/etcd-scripts/make-ssl-etcd.sh: No such file or directory", "stderr_lines": ["bash: /usr/local/bin/etcd-scripts/make-ssl-etcd.sh: No such file or directory"], "stdout": "", "stdout_lines": []}

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 23, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
