Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ETCD regenerated its certificate within 15 days of installation #10436

Closed
vpramod0310 opened this issue Jan 29, 2019 · 23 comments
Closed

ETCD regenerated its certificate within 15 days of installation #10436

vpramod0310 opened this issue Jan 29, 2019 · 23 comments

Comments

@vpramod0310
Copy link

Problem:
I am running the etcd version 3.3.9 on top of coreos with rkt pods. ETCD was up and running from Nov 7th and it had 1 year of validity. But on Nov 22nd after 15 days etcd automatically regenerated its certificates and was accepting the connections with old client certs till yesterday. Yesterday at some point in time it started to reject connection with "tls:bad certificate" I don't have much logs at that point of time when certs are regenerated.

Some logs which i have are pasted below.

-- Logs begin at Thu 2018-11-22 15:46:05 UTC, end at Tue 2019-01-29 02:16:46 UTC. -- Nov 22 15:46:05 rkt[27757]: Downloading sha256:0dc0248c936 651 KB / 651 KB Nov 22 15:46:05 rkt[27757]: Downloading sha256:60d10af8752 369 B / 369 B Nov 22 15:46:05 rkt[27757]: Downloading sha256:c99d4145780 5.83 MB / 110 MB Nov 22 15:46:05 rkt[27757]: Downloading sha256:291f3eef6bb 5.59 MB / 45.7 MB Nov 22 15:46:05 rkt[27757]: Downloading sha256:d093dc0d702 8.04 MB / 17.9 MB Nov 22 15:46:05 rkt[27757]: Downloading sha256:5f2ca80930a 4.3 KB / 4.3 KB Nov 22 15:46:05 rkt[27757]: Downloading sha256:2b6ac55109d 398 B / 398 B Nov 22 15:46:05 rkt[27757]: Downloading sha256:7460c29cea5 7.47 MB / 15.7 MB Nov 22 15:46:05 rkt[27757]: Downloading sha256:0dc0248c936 651 KB / 651 KB Nov 22 15:46:05 rkt[27757]: Downloading sha256:60d10af8752 369 B / 369 B .... Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.000265 W | etcdserver: read-only range request "key:\"/registry/*******\" count_only:true " took too long (1.760117035s) to execute Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.000676 W | etcdserver: read-only range request "key:\"/registry/\" " took too long (1.012883935s) to execute Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.000692 W | etcdserver: read-only range request "key:\"/registry/*****" " took too long (1.039691933s) to execute Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.000733 W | etcdserver: read-only range request "key:\"/registry/*****\" " took too long (1.571963253s) to execute Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.000768 W | etcdserver: read-only range request "key:\"/registry/****\" " took too long (177.812904ms) to execute Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.000799 W | etcdserver: read-only range request "key:\"/registry/****** took too long (1.57192529s) to execute Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.000839 W | etcdserver: read-only range request "key:\"/registry/******\" " took too long (989.986857ms) to execute Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.000873 W | etcdserver: read-only range request "key:\"/registry/******\"/registry/validatingwebhookconfigurationt\" count_only> Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.000950 W | etcdserver: read-only range request "key:\"/registry/******\" " took too long (984.033524ms) to execute Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.001016 W | etcdserver: read-only range request "key:\"/registry/*****\" " took too long (1.57193116s) to e> Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.001043 W | etcdserver: read-only range request "key:\"/registry/******\" " took too long (1.010760633s) to execute Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.001095 W | etcdserver: read-only range request "key:\"/registry/****\" " took too long (1.57192198s) to execute Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.001148 W | etcdserver: read-only range request "key:\"/registry/******\" " took too long (984.109373ms) to execute Nov 22 15:47:23 etcd-wrapper[2167]: 2018-11-22 15:47:23.001190 W | etcdserver: read-only range request "key:\"/registry/**********\" " took too long (1.571964522s) to execute Nov 22 15:47:23 rkt[27757]: sha512-d20266172092543414f77999b0a43614 Nov 22 15:47:23 systemd[1]: Started Download bootkube. Nov 22 15:50:17 etcd-wrapper[2167]: 2018-11-22 15:50:17.085915 I | mvcc: store.index: compact 2792125 Nov 22 15:50:17 etcd-wrapper[2167]: 2018-11-22 15:50:17.088152 I | mvcc: finished scheduled compaction at 2792125 (took 1.327087ms) Nov 22 15:52:18 systemd-networkd[1116]: enp0s29u1u1u5: DHCP: No routes received from DHCP server: No data available Nov 22 15:55:17 etcd-wrapper[2167]: 2018-11-22 15:55:17.089263 I | mvcc: store.index: compact 2792622 Nov 22 15:55:17 etcd-wrapper[2167]: 2018-11-22 15:55:17.091663 I | mvcc: finished scheduled compaction at 2792622 (took 1.408709ms) Nov 22 15:57:18 systemd-networkd[1116]: enp0s29u1u1u5: DHCP: No routes received from DHCP server: No data available Nov 22 16:00:17 etcd-wrapper[2167]: 2018-11-22 16:00:17.092611 I | mvcc: store.index: compact 2793121 Nov 22 16:00:17 etcd-wrapper[2167]: 2018-11-22 16:00:17.094868 I | mvcc: finished scheduled compaction at 2793121 (took 1.368126ms)

etcd logs

Jan 28 05:00:10 etcd-wrapper[2167]: 2019-01-28 05:00:10.845957 I | embed: rejected connection from "x.x.x.x:52938" (error "remote error: tls: bad certificate", ServerName "server") Jan 28 05:00:42 etcd-wrapper[2167]: 2019-01-28 05:00:42.085582 I | embed: rejected connection from "x.x.x.x:53916" (error "remote error: tls: bad certificate", ServerName "server") Jan 28 05:01:15 etcd-wrapper[2167]: 2019-01-28 05:01:15.517653 I | embed: rejected connection from "x.x.x.x:55844" (error "remote error: tls: bad certificate", ServerName "server") Jan 28 05:02:04 etcd-wrapper[2167]: 2019-01-28 05:02:04.328073 I | embed: rejected connection from "x.x.x.x:36194" (error "remote error: tls: bad certificate", ServerName "server") Jan 28 05:03:17 etcd-wrapper[2167]: 2019-01-28 05:03:17.267019 I | embed: rejected connection from "x.x.x.x:52518" (error "remote error: tls: bad certificate", ServerName "server")

old cert

Validity Not Before: Nov 7 17:56:36 2018 GMT Not After : Nov 7 17:56:36 2019 GMT Subject: O=etcd, CN=etcd-ca

New cert

Validity Not Before: Nov 22 15:40:27 2018 GMT Not After : Nov 22 15:40:27 2019 GMT Subject: O=etcd, CN=etcd-ca

@hexfusion
Copy link
Contributor

hexfusion commented Jan 29, 2019 via email

@hexfusion
Copy link
Contributor

hexfusion commented Jan 29, 2019 via email

@vpramod0310
Copy link
Author

We are not using auto-tls.
`[Unit]
Description=etcd (System Application Container)
Documentation=https://github.com/coreos/etcd
Wants=network-online.target network.target
After=network-online.target
Conflicts=etcd.service
Conflicts=etcd2.service

[Service]
Type=notify
Restart=on-failure
RestartSec=10s
TimeoutStartSec=0
LimitNOFILE=40000

Environment="ETCD_IMAGE_TAG=v3.3.9"
Environment="ETCD_NAME=%m"
Environment="ETCD_USER=etcd"
Environment="ETCD_DATA_DIR=/var/lib/etcd"
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid"

ExecStartPre=/usr/bin/mkdir --parents /var/lib/coreos
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/lib/coreos/etcd-member-wrapper.uuid
ExecStart=/usr/lib/coreos/etcd-wrapper $ETCD_OPTS
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/lib/coreos/etcd-member-wrapper.uuid

[Install]
WantedBy=multi-user.target`

This is my startup script.

Also one thing even though certs were changed 2 months back how etcd was working with old certs until yesterday?

@hexfusion
Copy link
Contributor

hexfusion commented Jan 29, 2019 via email

@vpramod0310
Copy link
Author

vpramod0310 commented Jan 29, 2019

I have not upgraded manually . Probably an auto upgrade?

Its clear that we were running 3.3.4 before

quay.io/coreos/etcd:v3.3.4 running 4 months ago 4 months ago

does 3.3.4 also has this issue?

@hexfusion
Copy link
Contributor

The fix was released with v3.3.4. So it would affect etcd prior in 3.3.x.

@vpramod0310
Copy link
Author

Ok then we were running the fixed version at that time.

@hexfusion
Copy link
Contributor

hexfusion commented Jan 29, 2019

Well as I said etcd does not generate these certs, what changed them and can you be sure that was not the cause of the problem? There seemed some question as to what changed the certs hence the issue so can you elaborate on that?

Jan 28 05:03:17 etcd-wrapper[2167]: 2019-01-28 05:03:17.267019 I | embed: rejected connection from "x.x.x.x:52518" (error "remote error: tls: bad certificate", ServerName "server")

Basically if you ran openssl x509 -in $cert -text -noout on those 2 certs I suspect one of them has "server" in the SAN. "Since v3.2.0, server resolves TLS DNSNames when checking SAN." So if before you had all IPs in SAN then changed to "server" and it could not resolve the DNS for that IP and resulted in the errors above.

@hexfusion
Copy link
Contributor

Also this could probably happen if DNS changed or was not available giving the impression it was the cert.

@vpramod0310
Copy link
Author

We are not using SAN. Also no DNS updates. These are certs.

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            XXXXXXXXXXXXXXXXXXXX
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: O=etcd, CN=etcd-ca
        Validity
            Not Before: Nov 22 15:40:27 2018 GMT
            Not After : Nov 22 15:40:27 2019 GMT
        Subject: O=etcd, CN=etcd-ca
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)

Old One

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            XXXXXXXXXXXXXXXXXXXXX
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: O=etcd, CN=etcd-ca
        Validity
            Not Before: Nov  7 17:56:36 2018 GMT
            Not After : Nov  7 17:56:36 2019 GMT
        Subject: O=etcd, CN=etcd-ca
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:

@vpramod0310
Copy link
Author

vpramod0310 commented Jan 29, 2019

Yes
Not sure on what changed it though still could not see any logs or symptoms on that. Thought it could be because of an etcd auto update initially by rkt ? As i saw some downloading stuff with the logs available

@hexfusion
Copy link
Contributor

So your certs have no X509v3 extensions:? I am most curious in the cert info below where you show.

@vpramod0310
Copy link
Author

Sorry Pasting the complete certs here

Certificate:
Data:
Version: 3 (0x2)
Serial Number:
XXXXXXXXXXXXXXXXXXXXX
Signature Algorithm: sha256WithRSAEncryption
Issuer: O=etcd, CN=etcd-ca
Validity
Not Before: Nov 22 15:40:27 2018 GMT
Not After : Nov 22 15:40:27 2019 GMT
Subject: O=etcd, CN=etcd-ca
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)
Modulus:
XXXXXXXXXXXXXXXXXXXXXXX
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Key Usage: critical
Digital Signature, Key Encipherment, Certificate Sign
X509v3 Basic Constraints: critical
CA:TRUE
X509v3 Subject Key Identifier:
BA:CC:76:BE:55:80:71:84:F5:FC:96:CD:5C:FE:D7:51:D0:1B:E6:C2
Signature Algorithm: sha256WithRSAEncryption
XXXXXXXXXXXXXXXXXXXXXXX


Certificate:
Data:
Version: 3 (0x2)
Serial Number:
XXXXXXXXXXXXXXXXXXXX
Signature Algorithm: sha256WithRSAEncryption
Issuer: O=etcd, CN=etcd-ca
Validity
Not Before: Nov 7 17:56:36 2018 GMT
Not After : Nov 7 17:56:36 2019 GMT
Subject: O=etcd, CN=etcd-ca
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)
Modulus:
XXXXXXXXXXXXXXXXX
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Key Usage: critical
Digital Signature, Key Encipherment, Certificate Sign
X509v3 Basic Constraints: critical
CA:TRUE
X509v3 Subject Key Identifier:
E6:FA:86:5A:DB:9E:10:C5:17:48:87:22:5C:40:B4:91:6F:DD:99:D6
Signature Algorithm: sha256WithRSAEncryption
XXXXXXXXXXXXXXXXXXXXXXX

@hexfusion
Copy link
Contributor

Not sure on what changed it though still could not see any logs or symptoms on that. Thought it could be because of an etcd auto update initially by rkt ? As i saw some downloading stuff with the logs available

rkt was probably downloading etcd, but I am not a rkt expert.

rkt[27757]: sha512-d20266172092543414f77999b0a43614 Nov 22 15:47:23 systemd[1]: Started Download bootkube.

Or maybe bootkube you have not talked about that. Basically etcd is reporting it sees "server" in SAN and it doesn't match the calling IP. this is expected, the rest of the equation is not etcd in my opinion so you need to figure that out.

@hexfusion
Copy link
Contributor

Digital Signature, Key Encipherment, Certificate Sign

This is your CA we need your cert output

@vpramod0310
Copy link
Author

vpramod0310 commented Jan 29, 2019

Ok Here you go. There is the cert

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            b6:30:8c:a5:17:a9:cd:b5:af:cd:88:dd:ab:31:1e:a9
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: O=etcd, CN=etcd-ca
        Validity
            Not Before: Nov 22 15:40:27 2018 GMT
            Not After : Nov 22 15:40:27 2019 GMT
        Subject: O=etcd, CN=etcd-client
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
					XXXXXXXXXXXXXXXXXX
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Authority Key Identifier:
                keyid:BA:CC:76:BE:55:80:71:84:F5:FC:96:CD:5C:FE:D7:51:D0:1B:E6:C2

            X509v3 Subject Alternative Name:
                DNS:server, DNS:localhost, IP Address:127.0.0.1
    Signature Algorithm: sha256WithRSAEncryption
				XXXXXXXXXXXXXXXXX
---------------------

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            4c:eb:13:1e:95:1e:24:6a:a8:fa:26:2c:e6:21:b5:61
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: O=etcd, CN=etcd-ca
        Validity
            Not Before: Nov  7 17:56:36 2018 GMT
            Not After : Nov  7 17:56:36 2019 GMT
        Subject: O=etcd, CN=etcd-client
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    XXXXXXXXXXXXXXX
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Authority Key Identifier:
                keyid:E6:FA:86:5A:DB:9E:10:C5:17:48:87:22:5C:40:B4:91:6F:DD:99:D6

            X509v3 Subject Alternative Name:
                DNS:server, DNS:localhost, IP Address:127.0.0.1
    Signature Algorithm: sha256WithRSAEncryption
         XXXXXXXXXXXXXXXXXXX

@hexfusion
Copy link
Contributor

Here is your SAN

 X509v3 Subject Alternative Name:
            DNS:server, DNS:localhost, IP Address:127.0.0.1

@hexfusion
Copy link
Contributor

"server" needs to resolve correctly to the calling IP dig around and see why this is failing.

dig +short server @$DNSIP

@vpramod0310
Copy link
Author

Seems to be working fine. No issue with resolving to IP. and there is only one record.

Or maybe bootkube you have not talked about that. Basically etcd is reporting it sees "server" in SAN and it doesn't match the calling IP. this is expected, the rest of the equation is not etcd in my opinion so you need to figure that out.

This error was gone once i replaced with the new certs on calling service (kubernetes apiserver in my case).

@hexfusion
Copy link
Contributor

OK not sure what the issue is then. We can be sure of a few things.

  • etcd itself did not change the certs
  • it is working now

Is there anything that etcd did that you do not understand? My theory with the bad certificate is some sort of DNS issue, perhaps intermittent not sure.

@hexfusion
Copy link
Contributor

@vpramod0310 have we answered your questions?

@vpramod0310
Copy link
Author

@hexfusion Thanks. One thing still with 3.3.4 version tls auto reload should have been solved but why did this occur in our case.

@hexfusion
Copy link
Contributor

@hexfusion Thanks. One thing still with 3.3.4 version tls auto reload should have been solved but why did this occur in our case.

@vpramod0310 nothing here clearly tells me that etcd is to blame for the errors seen in the logs. The bug patched in 3.3.4 was for certs with IP only in SAN. Your certs have both IP and domain so would not be to blame.

From what I see here I feel the blame is probably on bootkube https://coreos.com/tectonic/docs/latest/troubleshooting/bootkube_recovery_tool.html#bootkube-recover

Unless you can recreate this I am afraid there is not much more we can do. Feel free to reopen if you get more details or have full reproduction steps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants