
etcd : Gen_certs | run cert generation script fails on SSL #2343

Closed
matq007 opened this issue Feb 14, 2018 · 21 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@matq007

matq007 commented Feb 14, 2018

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG

Environment:

  • Cloud provider or hardware configuration:
    hardware

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 3.10.0-693.2.2.el7.x86_64 x86_64
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Version of Ansible (ansible --version):
    ansible 2.4.2.0

Kubespray version (commit) (git rev-parse --short HEAD):
v2.4.0

Network plugin used:
calico

Copy of your inventory file:

nc-kub-m01 ansible_ssh_host=10.0.55.165 ip=10.0.55.165
nc-kub-s01 ansible_ssh_host=10.0.55.163 ip=10.0.55.163

[kube-master]
nc-kub-m01

[etcd]
nc-kub-m01

[kube-node]
nc-kub-s01

[k8s-cluster:children]
kube-node
kube-master

Command used to invoke ansible:
ansible-playbook -i inventory/hosts.ini cluster.yml -b -K -v --user=kubernetes --private-key=~/.ssh/kubernetes.pem --ask-sudo-pass

Output of ansible run:

{
	"changed": true,
	"cmd": [
		"bash",
		"-x",
		"/usr/local/bin/etcd-scripts/make-ssl-etcd.sh",
		"-f",
		"/etc/ssl/etcd/openssl.conf",
		"-d",
		"/etc/ssl/etcd/ssl"
	],
	"delta": "0:00:00.403411",
	"end": "2018-02-14 19:53:14.477595",
	"msg": "non-zero return code",
	"rc": 1,
	"start": "2018-02-14 19:53:14.074184",
	"stderr": "+ set -o errexit\n+ set -o pipefail\n+ (( 4 ))\n+ case \"$1\" in\n+ CONFIG=/etc/ssl/etcd/openssl.conf\n+ shift 2\n+ (( 2 ))\n+ case \"$1\" in\n+ SSLDIR=/etc/ssl/etcd/ssl\n+ shift 2\n+ (( 0 ))\n+ '[' -z /etc/ssl/etcd/openssl.conf ']'\n+ '[' -z /etc/ssl/etcd/ssl ']'\n++ mktemp -d /tmp/etcd_cacert.XXXXXX\n+ tmpdir=/tmp/etcd_cacert.Vfkpq5\n+ trap 'rm -rf \"${tmpdir}\"' EXIT\n+ cd /tmp/etcd_cacert.Vfkpq5\n+ mkdir -p /etc/ssl/etcd/ssl\n+ '[' -e /etc/ssl/etcd/ssl/ca-key.pem ']'\n+ cp /etc/ssl/etcd/ssl/ca.pem /etc/ssl/etcd/ssl/ca-key.pem .\n+ '[' -n '  ' ']'\n+ '[' -n '  nc-kub-s01    ' ']'\n+ for host in '$HOSTS'\n+ cn=nc-kub-s01\n+ openssl genrsa -out node-nc-kub-s01-key.pem 2048\n+ openssl req -new -key node-nc-kub-s01-key.pem -out node-nc-kub-s01.csr -subj /CN=etcd-node-nc-kub-s01\n+ openssl x509 -req -in node-nc-kub-s01.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out node-nc-kub-s01.pem -days 3650 -extensions ssl_client -extfile /etc/ssl/etcd/openssl.conf\n+ rm -rf /tmp/etcd_cacert.Vfkpq5",
	"stderr_lines": [
		"+ set -o errexit",
		"+ set -o pipefail",
		"+ (( 4 ))",
		"+ case \"$1\" in",
		"+ CONFIG=/etc/ssl/etcd/openssl.conf",
		"+ shift 2",
		"+ (( 2 ))",
		"+ case \"$1\" in",
		"+ SSLDIR=/etc/ssl/etcd/ssl",
		"+ shift 2",
		"+ (( 0 ))",
		"+ '[' -z /etc/ssl/etcd/openssl.conf ']'",
		"+ '[' -z /etc/ssl/etcd/ssl ']'",
		"++ mktemp -d /tmp/etcd_cacert.XXXXXX",
		"+ tmpdir=/tmp/etcd_cacert.Vfkpq5",
		"+ trap 'rm -rf \"${tmpdir}\"' EXIT",
		"+ cd /tmp/etcd_cacert.Vfkpq5",
		"+ mkdir -p /etc/ssl/etcd/ssl",
		"+ '[' -e /etc/ssl/etcd/ssl/ca-key.pem ']'",
		"+ cp /etc/ssl/etcd/ssl/ca.pem /etc/ssl/etcd/ssl/ca-key.pem .",
		"+ '[' -n '  ' ']'",
		"+ '[' -n '  nc-kub-s01    ' ']'",
		"+ for host in '$HOSTS'",
		"+ cn=nc-kub-s01",
		"+ openssl genrsa -out node-nc-kub-s01-key.pem 2048",
		"+ openssl req -new -key node-nc-kub-s01-key.pem -out node-nc-kub-s01.csr -subj /CN=etcd-node-nc-kub-s01",
		"+ openssl x509 -req -in node-nc-kub-s01.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out node-nc-kub-s01.pem -days 3650 -extensions ssl_client -extfile /etc/ssl/etcd/openssl.conf",
		"+ rm -rf /tmp/etcd_cacert.Vfkpq5"
	],
	"stdout": "",
	"stdout_lines": []
}

Anything else do we need to know:

Also related to issue #1445.

@woopstar
Member

Could you try running the command manually:

/usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl

@matq007
Author

matq007 commented Feb 15, 2018

The command worked fine :). What I found out is that when running the playbook multiple times, it doesn't seem to create new certificates. I removed the old certificates in /etc/ssl/etcd manually, reran the playbook, and everything worked perfectly. I'm not sure, but maybe adding a preinstall rule to check whether the certificates are already generated would fix this?
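
A minimal shell sketch of that preinstall idea, assuming the paths seen elsewhere in this thread and a hypothetical node name; the actual logic in Kubespray lives in check_certs.yml (quoted further down), so this is only an illustration:

#!/usr/bin/env bash
# Hypothetical preinstall check: skip generation when this node's etcd certs
# already exist. Paths and the node name are taken from this thread.
SSLDIR=/etc/ssl/etcd/ssl
NODE=nc-kub-s01

if [ -e "${SSLDIR}/node-${NODE}.pem" ] && [ -e "${SSLDIR}/node-${NODE}-key.pem" ]; then
  echo "certs for ${NODE} already present in ${SSLDIR}, skipping generation"
else
  bash -x /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d "${SSLDIR}"
fi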

@woopstar
Member

Ah. So you already had certificates. I see that now from the log output.

The failing command is openssl x509 -req -in node-nc-kub-s01.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out node-nc-kub-s01.pem -days 3650 -extensions ssl_client -extfile /etc/ssl/etcd/openssl.conf

The rm -rf /tmp/etcd_cacert.Vfkpq5 happens because of the EXIT trap that cleans up the temp directory.

@woopstar
Member

This possibly happens because of set -o errexit or set -o pipefail, since the openssl command exits with a non-zero code.
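
For illustration only (not Kubespray code): with errexit set, the first command that returns non-zero aborts the script, and the EXIT trap still fires, which is why the rm -rf of the temp directory is the last thing in the trace even though the run failed.

#!/usr/bin/env bash
set -o errexit
set -o pipefail

tmpdir=$(mktemp -d /tmp/demo.XXXXXX)
trap 'echo "cleanup: removing ${tmpdir}"; rm -rf "${tmpdir}"' EXIT

false   # stands in for the failing openssl x509 command
echo "never reached: errexit already aborted the script"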

@matq007
Author

matq007 commented Feb 15, 2018

Awesome, I guess we can close the issue then :)

@woopstar
Member

Well. The issue is not really fixed. It should not stop the process.

@missnebun

missnebun commented Mar 10, 2018

I have the same problem. I removed the SSL keys ... I reset the cluster ... it still fails. Not sure what is going on there.

@missnebun

[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
subjectAltName = @alt_names
[alt_names]
DNS.1 = kubernetes
DNS.2 = kubernetes.default
DNS.3 = kubernetes.default.svc
DNS.4 = kubernetes.default.svc.cluster.local
DNS.5 = localhost
DNS.6 = nyvm3331
DNS.7 = nyvm3332
DNS.8 = nyvm3333
DNS.9 = k8sapi.a2tz.com
IP.1 = 10.250.69.67
IP.2 = 10.250.69.67
IP.3 = 10.250.69.68
IP.4 = 10.250.69.68
IP.5 = 10.250.69.69
IP.6 = 10.250.69.69
IP.7 = False
IP.8 = 10.250.69.73
IP.9 = 127.0.0.1
IP.10 = 10.250.69.74

For some reason the openssl.conf is getting False as an IP address, and this makes certificate generation fail. Now I need to see where that's coming from.
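
A hypothetical sanity check one could run before the generation script, to catch non-IP values like that False in the SAN list (the config path is the one used in this thread):

#!/usr/bin/env bash
# Report any IP.N entry in the etcd openssl config that is not a plausible
# IPv4 address; the "IP.7 = False" entry above would be flagged here.
CONF=/etc/ssl/etcd/openssl.conf

awk -F'= *' '/^IP\./ {print $2}' "$CONF" | while read -r ip; do
  if ! [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
    echo "suspicious SAN entry in ${CONF}: '${ip}'" >&2
  fi
done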

@woopstar
Member

How did you configure

#loadbalancer_apiserver:
#  address: 1.2.3.4
#  port: 1234

?

@missnebun

External LB example config

apiserver_loadbalancer_domain_name: "k8sapi.a2tz.com"
loadbalancer_apiserver:
  address: 10.250.69.73
  port: 6443

Internal loadbalancers for apiservers

loadbalancer_apiserver_localhost: false

@missnebun

@woopstar Any idea ?

@woopstar
Member

No, it seems right.

Are you running the latest Kubespray from master branch?

It seems odd that the IPs are listed multiple times when you only have two hosts in your inventory file. Are you sure you are running the right inventory file? The DNS entries in the openssl.conf mention the hostname nyvm3331, but that is not in your inventory file?

@nwsparks

nwsparks commented Jul 11, 2018

Not sure if this is the cause of your issue, but I ran into the same error as well, and it was due to an invalid entry in the hosts file. I had specified a DNS name for the API load balancer, and it added a hosts file entry that looks like this:

/etc/hosts
internal-blah-123123123.us-east-1.elb.amazonaws.com blah.com

group_vars/all.yml
apiserver_loadbalancer_domain_name: "blah.com"
loadbalancer_apiserver:
  address: internal-blah-123123123.us-east-1.elb.amazonaws.com
  port: 6443

The documentation should probably be updated to explain how to properly define AWS load balancers, which need to be referenced by name and not by IP.
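
For context, /etc/hosts only accepts an IP address in the first field, so an ELB DNS name written there is simply invalid. A hedged way to inspect what actually got written on a node (the names below are the placeholders from the comment above):

# Show the entry the playbook generated for the LB name (placeholder from above)
grep -n 'blah.com' /etc/hosts

# Show what the ELB name currently resolves to
getent hosts internal-blah-123123123.us-east-1.elb.amazonaws.com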

@ykfq

ykfq commented Mar 29, 2019

Same issue. I have no idea why the MASTERS and HOSTS variables are both empty (whitespace only), so no etcd node and member certs get generated, and then mv '*.pem' /etc/ssl/etcd/ssl/ exits with an error:

  "stderr_lines": [
    "+ set -o errexit",
    "+ set -o pipefail",
    "+ (( 4 ))",
    "+ case \"$1\" in",
    "+ CONFIG=/etc/ssl/etcd/openssl.conf",
    "+ shift 2",
    "+ (( 2 ))",
    "+ case \"$1\"in",
    "+ SSLDIR=/etc/ssl/etcd/ssl",
    "+ shift 2",
    "+ (( 0 ))",
    "+ '[' -z /etc/ssl/etcd/openssl.conf ']'",
    "+ '[' -z /etc/ssl/etcd/ssl ']'",
    "++ mktemp -d /tmp/etcd_cacert.XXXXXX",
    "+ tmpdir=/tmp/etcd_cacert.uIu8DZ",
    "+ trap 'rm -rf \"${tmpdir}\"' EXIT",
    "+ cd /tmp/etcd_cacert.uIu8DZ",
    "+ mkdir -p /etc/ssl/etcd/ssl",
    "+ '[' -e /etc/ssl/etcd/ssl/ca-key.pem ']'",
    "+ cp /etc/ssl/etcd/ssl/ca.pem /etc/ssl/etcd/ssl/ca-key.pem .",
    "+ '[' -n '      ' ']'",
    "+ '[' -n '                  ' ']'",
    "+ '[' -e /etc/ssl/etcd/ssl/ca-key.pem ']'",
    "+ rm -f ca.pem ca-key.pem",
    "+ mv '*.pem' /etc/ssl/etcd/ssl/",
    "mv: cannot stat ‘*.pem’: No such file or directory",
    "+ rm -rf /tmp/etcd_cacert.uIu8DZ"
  ],

@ykfq

ykfq commented Apr 3, 2019

Well, I finally found and solved the problem.

The script make-ssl-etcd.sh itself works fine and has no logic issue. So why did this happen?

Why

Kubespray has used kubeadm as the default deployment mode since v2.8.0, and there is a switch kubeadm_enabled: true in all.yml. However, many tasks only execute when: not kubeadm_enabled. This includes the role kubernetes/node, which depends on the role kubernetes/secrets, which runs the task Gen_certs | run cert generation script; in that task, all master and host certs are generated if they do not already exist.

So, since kubeadm_enabled defaults to true, these roles are never executed and this error is never triggered.
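
One hedged way to see this gating in a Kubespray checkout (exact matches depend on your clone and version):

# List tasks and role dependencies that only run in non-kubeadm mode
grep -rn "not kubeadm_enabled" roles/ | head -n 20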

I hit this error when I ran the playbook cluster.yml with extra parameters to scale the master nodes:

 --extra-vars "k8s-secrets=true, gen_certs=true,sync_certs=true, gen_master_certs=true"

This made the task run, but check_certs.yml set gen_node_certs to false, so the MASTERS and HOSTS variables were empty and no pem certs were generated; the error mv: cannot stat ‘*.pem’: No such file or directory then occurred and the whole task exited.
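
A minimal reproduction of that failure mode, assuming (as described above) that HOSTS expands to whitespace only; it mimics the relevant lines of make-ssl-etcd.sh rather than running the real script:

#!/usr/bin/env bash
set -o errexit
set -o pipefail

HOSTS="                  "   # what the Jinja loop expands to when gen_node_certs
                             # is false for every host
tmpdir=$(mktemp -d)
cd "$tmpdir"

for host in $HOSTS; do       # word splitting of whitespace yields zero iterations
  echo "would generate node-${host}.pem here"
done

mv *.pem /tmp/demo-ssl/      # no *.pem exists -> "mv: cannot stat '*.pem'",
                             # and errexit aborts the whole script here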

How

I think Kubespray did not fully support the master-node scaling feature: when I upgraded my k8s cluster with Kubespray from v2.6.0 to v2.8.3, the old deployment mode and the kubeadm mode were both used, which messed up my environment (old files existed, some tasks were skipped, and some services loaded old-style config files) and caused my scaling to fail.

If I ran the playbook without any extra parameters, I would instead hit the x509: certificate is valid for xx.x.x.x, not yy.y.y.y error, because the newly added master node is not included in the old certs since gen_certs is not executed. What a dead loop!

Two ways to solve my problem (see the command sketch after this list):

  • Back up and delete the /etc/kubernetes directory and re-run the playbook with kubeadm_enabled: false
  • Back up etcd, reset the cluster, and re-run the playbook with kubeadm_enabled: true
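
A hedged sketch of how either option could be driven from the command line; the inventory path and the kubeadm_enabled variable are the ones already mentioned in this thread, and the backup/reset steps above still come first:

# Option 1: non-kubeadm deployment mode
ansible-playbook -i inventory/hosts.ini cluster.yml -b -e kubeadm_enabled=false

# Option 2: kubeadm mode after resetting the cluster
ansible-playbook -i inventory/hosts.ini cluster.yml -b -e kubeadm_enabled=true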

Related code:

  • ./roles/kubernetes/node/meta/main.yml
---
dependencies:
  - role: kubernetes/secrets
    when: not kubeadm_enabled
    tags:
      - k8s-secrets
  • ./roles/etcd/tasks/gen_certs_script.yml
- name: Gen_certs | run cert generation script
  command: "bash -x {{ etcd_script_dir }}/make-ssl-etcd.sh -f {{ etcd_config_dir }}/openssl.conf -d {{ etcd_cert_dir }}"
  environment:
    - MASTERS: "{% for m in groups['etcd'] %}
                  {% if gen_node_certs[m] %}
                    {{ m }}
                  {% endif %}
                {% endfor %}"
    - HOSTS: "{% for h in (groups['k8s-cluster'] + groups['calico-rr']|default([]))|unique %}
                {% if gen_node_certs[h] %}
                    {{ h }}
                {% endif %}
              {% endfor %}"
  run_once: yes
  delegate_to: "{{groups['etcd'][0]}}"
  when:
    - gen_certs|default(false)
    - inventory_hostname == groups['etcd'][0]
  notify: set etcd_secret_changed
  • ./roles/etcd/tasks/check_certs.yml
- name: "Check_certs | Set 'gen_node_certs' to true"
  set_fact:
    gen_node_certs: |-
      {
      {% set all_etcd_hosts = groups['k8s-cluster']|union(groups['etcd'])|union(groups['calico-rr']|default([]))|unique|sort -%}
      {% set existing_certs = etcdcert_master.files|map(attribute='path')|list|sort %}
      {% for host in all_etcd_hosts -%}
        {% set host_cert = "%s/node-%s-key.pem"|format(etcd_cert_dir, host) %}
        {% if host_cert in existing_certs -%}
        "{{ host }}": False,
        {% else -%}
        "{{ host }}": True,
        {% endif -%}
      {% endfor %}
      }
  run_once: true

@juliorenner

In our case, just executing the workarounds above was not enough. Even after deleting and regenerating the certificates the error remained. Then we figured out that there is a Docker container running etcd, and its certificates are not updated when we re-execute the playbook. So if the error remains, delete the etcd container and rerun the playbook.
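
A hedged sketch of that workaround; the etcd container name is not confirmed in this thread, so look it up on the node first:

# Find the etcd container on the etcd node
ETCD_CONTAINER=$(docker ps --format '{{.Names}}' | grep -i etcd | head -n 1)
echo "removing ${ETCD_CONTAINER}"

# Remove it, then re-run the playbook so it is recreated with the new certs
docker rm -f "${ETCD_CONTAINER}"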

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 16, 2019
@rnwolfe

rnwolfe commented Jul 24, 2019

I am also running into this, or something similar, saying the file does not exist. When looking on the node manually, the directory /usr/local/bin/etcd-scripts/ does not exist.

TASK [etcd : Gen_certs | run cert generation script] ***********************************************************************************************************************************************************
Wednesday 24 July 2019  11:45:52 -0400 (0:00:00.061)       0:07:12.863 ********
fatal: [master1 -> 10.12.100.105]: FAILED! => {"changed": true, "cmd": ["bash", "-x", "/usr/local/bin/etcd-scripts/make-ssl-etcd.sh", "-f", "/etc/ssl/etcd/openssl.conf", "-d", "/etc/ssl/etcd/ssl"], "delta": "0:00:00.010068", "end": "2019-07-24 11:45:52.997581", "msg": "non-zero return code", "rc": 127, "start": "2019-07-24 11:45:52.987513", "stderr": "bash: /usr/local/bin/etcd-scripts/make-ssl-etcd.sh: No such file or directory", "stderr_lines": ["bash: /usr/local/bin/etcd-scripts/make-ssl-etcd.sh: No such file or directory"], "stdout": "", "stdout_lines": []}

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 23, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
