
Nomad client RPC communication errors after cert rotation + SIGHUP #4413

Closed
SoMuchToGrok opened this issue Jun 13, 2018 · 9 comments

SoMuchToGrok commented Jun 13, 2018

Nomad version

Nomad v0.8.3 (c85483d)

Operating system and Environment details

Ubuntu 16.04
Consul v1.0.6
Vault v0.10.2

Issue

Since upgrading from v0.7.1 to v0.8.3, nomad client -> server RPC communication appears to error out continuously after I rotate my TLS certificates and send a SIGHUP to the nomad client PID. Additionally, all jobs with Vault integrations that are scheduled after the SIGHUP has been sent fail with similar RPC EOF errors. I have been able to reproduce this in two different environments.

Reproduction steps

  1. Deploy a cluster of nomad servers and nomad clients (>1 each).

  2. Generate TLS certificates and enable TLS for all servers and clients:

tls {
  http = true
  rpc  = true

  ca_file   = "/usr/local/share/ca-certificates/vault.crt"
  cert_file = "/secrets/nomad-client/server.crt"
  key_file  = "/secrets/nomad-client/server.key"

  verify_server_hostname = true
}
  3. Generate a new certificate for a nomad client (signed by the same CA as the previous certificate) and overwrite cert_file and key_file with the new contents (overwrite the files completely; don't append). The previous certificate should still be valid (not expired).

  4. Send the nomad client process a SIGHUP:

/bin/kill -HUP $MAINPID

  5. Validate that the nomad client logs are throwing RPC errors when communicating with the nomad servers.

  6. Schedule a job that uses the Vault integration, and validate that tasks are unable to create Vault tokens due to similar (perhaps identical) RPC EOF errors.
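For context, the /bin/kill -HUP $MAINPID line above is the standard systemd reload hook; the reproduction maps onto a unit like the following (the unit name and paths here are assumptions, not taken from this issue):

```ini
# Hypothetical /etc/systemd/system/nomad.service fragment.
# "systemctl reload nomad" delivers SIGHUP to the main process,
# which makes Nomad re-read the tls block (cert_file/key_file included).
[Service]
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
```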

Nomad Server logs

Could not find anything relevant. It appears that this issue is not present in the servers, as I have reproduced the same steps without any known issues.

Nomad Client logs

Jun 13 14:55:05 nomad-c-qa-1-181 systemd[1]: Reloading Nomad Agent.
Jun 13 14:55:05 nomad-c-qa-1-181 nomad[7171]: ==> Caught signal: hangup
Jun 13 14:55:05 nomad-c-qa-1-181 nomad[7171]: ==> Reloading configuration...
Jun 13 14:55:05 nomad-c-qa-1-181 systemd[1]: Reloaded Nomad Agent.
Jun 13 14:55:05 nomad-c-qa-1-181 nomad[7171]:     2018/06/13 14:55:05.909877 [ERR] nomad: "Node.GetClientAllocs" RPC failed to server 10.3.2.29:4647: rpc error: EOF
Jun 13 14:55:05 nomad-c-qa-1-181 nomad[7171]:     2018/06/13 14:55:05.909923 [ERR] client: failed to query for node allocations: rpc error: EOF
Jun 13 14:55:05 nomad-c-qa-1-181 nomad[7171]: nomad: "Node.GetClientAllocs" RPC failed to server 10.3.2.29:4647: rpc error: EOF
Jun 13 14:55:05 nomad-c-qa-1-181 nomad[7171]: client: failed to query for node allocations: rpc error: EOF

Task state logs (from vault-integrated jobs scheduled after the SIGHUP has been sent)

vault: failed to derive token: DeriveVaultToken RPC failed: rpc error: EOF	
@SoMuchToGrok SoMuchToGrok changed the title Nomad client RPC communication errors after TLS rotation + SIGHUP Nomad client RPC communication errors after server cert rotation + SIGHUP Jun 13, 2018
@SoMuchToGrok SoMuchToGrok changed the title Nomad client RPC communication errors after server cert rotation + SIGHUP Nomad client RPC communication errors after cert rotation + SIGHUP Jun 13, 2018
@dadgar
Contributor

dadgar commented Jun 13, 2018

@SoMuchToGrok Can you provide sample certificates that demonstrate this behavior? Also, are you seeing the EOF forever after sending the signal, or just immediately after? We do close connections after reloading the certificates, so this would be expected behavior. Sustained errors would be bad, though.

@SoMuchToGrok
Author

SoMuchToGrok commented Jun 13, 2018

I'm seeing the EOF forever (well...at least 24+ hours and counting; going to assume it will keep going on). I'm additionally seeing secrets expire on the Vault side, but nomad is continuing to use them (presumably it'll try to keep using the secret even after it fails to communicate with Vault).

I'll work on generating some sample certificates, including the full chain, and will update when I have them.

@cheeseprocedure
Contributor

cheeseprocedure commented Jun 19, 2018

Just chiming in: I believe we've encountered this issue on a Nomad 0.8.4/Ubuntu 16.04 cluster.

After updating certificates (issued via Let's Encrypt) and issuing a SIGHUP to reload the Nomad server instances, clients began reporting "rpc error: EOF". We were able to hit the HTTP API of two of the three Nomad servers during this window, but the third returned the same "rpc error: EOF". This went on for at least fifteen minutes with no sign of recovery.

I'm not yet sure if it's relevant, but our tooling issued the SIGHUP against all three Nomad servers within about a second of one another.

We will try to establish a reproduction case and report back.

Edit: restarting the Nomad server instances resolved the issue.

@cheeseprocedure
Contributor

As a reproduction test case, I created and distributed self-signed certificates for every node (3x servers, 2x clients in our case) and issued a SIGHUP against all Nomad services.

After repeatedly swapping between Let's Encrypt and self-signed key material then triggering a reload, I've been unable to reproduce the scenario above. I'll keep trying.

@dmicanzerofox

Hello, I'm working with @SoMuchToGrok; attached is a full chain of certificates:

Root

$ vault read -format=json pki/root/cert/ca | jq .data.certificate -r
-----BEGIN CERTIFICATE-----
MIIC+DCCAn6gAwIBAgIUJpGRsioeKF0pDWAsd0TcJ7CtmIowCgYIKoZIzj0EAwIw
EjEQMA4GA1UEAxMHcm9vdCBjYTAeFw0xODA2MjkxMjQ2NTlaFw0xODA3MzExMjQ3
MjlaMBIxEDAOBgNVBAMTB3Jvb3QgY2EwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAAS8
iIHzsTHzBxkrOfv1FIRHVdk9g70IZi+JkcDz3z8pWMNoaGE4FTv2o5lg60cBsmbG
sGpHQdex9d+zxcHohX83vJHE1vlsJQyka8PzlmXMVdmXwi9Xfx0ewc7UJoAhNxSj
ggGTMIIBjzAOBgNVHQ8BAf8EBAMCAQYwEgYDVR0TAQH/BAgwBgEB/wIBATAdBgNV
HQ4EFgQUeUj2Bv5D5EBQ6KA4InKlp2u9/CQwHwYDVR0jBBgwFoAUeUj2Bv5D5EBQ
6KA4InKlp2u9/CQwgZoGCCsGAQUFBwEBBIGNMIGKMEoGCCsGAQUFBzAChj5odHRw
czovL3ZhdWx0LnNlcnZpY2UuYXdzLXVzLXdlc3QtMi5jb25zdWw6ODIwMC92MS9w
a2kvcm9vdC9jYTA8BggrBgEFBQcwAoYwaHR0cHM6Ly92YXVsdC5zZXJ2aWNlLmNv
bnN1bDo4MjAwL3YxL3BraS9yb290L2NhMIGLBgNVHR8EgYMwgYAwRaBDoEGGP2h0
dHBzOi8vdmF1bHQuc2VydmljZS5hd3MtdXMtd2VzdC0yLmNvbnN1bDo4MjAwL3Yx
L3BraS9yb290L2NybDA3oDWgM4YxaHR0cHM6Ly92YXVsdC5zZXJ2aWNlLmNvbnN1
bDo4MjAwL3YxL3BraS9yb290L2NybDAKBggqhkjOPQQDAgNoADBlAjEAyj6bhInj
EFaBz1JZ2FG4vGmAAMRivoZYhPdupkBISnLqa/+BeFHp0c3WGqDVlFAcAjBU/D0X
z4FB9JGGjmPChVevL98xToGJO9wHhT/fZoEqmTvtfoV9YcSLV8+lY/F3GaI=
-----END CERTIFICATE-----

Nomad Server

$ vault write -format=json pki/aws-us-west-2/issue/nomad common_name=nomad-server.node.consul alt_names=nomad-server.node.aws-us-west-2.consul,nomad-server.service.consul,nomad-server.service.aws-us-west-2.consul,server.global.nomad,localhost ip_sans=127.0.0.1 format=pem ttl=720h
{
  "request_id": "2cfa5cf5-2406-e67d-ddb5-7c3e88ba2600",
  "lease_id": "pki/aws-us-west-2/issue/nomad/62481b8f-3e31-8e06-4e69-a493ee27e635",
  "lease_duration": 2591999,
  "renewable": false,
  "data": {
    "ca_chain": [
      "-----BEGIN CERTIFICATE-----\nMIIDAjCCAoigAwIBAgIUGtWhDkGL1G+IwUzyzwYly+kWCKswCgYIKoZIzj0EAwIw\nEjEQMA4GA1UEAxMHcm9vdCBjYTAeFw0xODA2MjkxMjQ4MTNaFw0xODA3MzExMjQ4\nNDNaMBwxGjAYBgNVBAMTEUlDQSBhd3MtdXMtd2VzdC0yMHYwEAYHKoZIzj0CAQYF\nK4EEACIDYgAEOpVvTNjk/uBz+lAFgmX1o9Xiwu8OGXuCxwBjLEyxQ2bHZfWmg8fL\nc6DXhccn4+7vDzznPvkuVY8mNxkF5IVu9n/9t5c1iDiRHwLtHPt2qkfbX/y/q/w0\ne2GSn7lJ0n54o4IBkzCCAY8wDgYDVR0PAQH/BAQDAgEGMBIGA1UdEwEB/wQIMAYB\nAf8CAQAwHQYDVR0OBBYEFPFzZlARXtd54Jrfih+KszJZik2jMB8GA1UdIwQYMBaA\nFHlI9gb+Q+RAUOigOCJypadrvfwkMIGaBggrBgEFBQcBAQSBjTCBijBKBggrBgEF\nBQcwAoY+aHR0cHM6Ly92YXVsdC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3Vs\nOjgyMDAvdjEvcGtpL3Jvb3QvY2EwPAYIKwYBBQUHMAKGMGh0dHBzOi8vdmF1bHQu\nc2VydmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jYTCBiwYDVR0fBIGDMIGA\nMEWgQ6BBhj9odHRwczovL3ZhdWx0LnNlcnZpY2UuYXdzLXVzLXdlc3QtMi5jb25z\ndWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwN6A1oDOGMWh0dHBzOi8vdmF1bHQuc2Vy\ndmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwCgYIKoZIzj0EAwIDaAAw\nZQIxAI0OJZ395FMk1WXoKrY4DpYHlvInJwzLuxUzvBBtkxPuPnAgFGGtHq0Z9kw+\nnwRUUgIwRrjtCZWSYOTqjNfEtm70WDdj7JSEOzw6T6T44lLvslMWuGGmMh1SioH8\nta4O7+wq\n-----END CERTIFICATE-----"
    ],
    "certificate": "-----BEGIN CERTIFICATE-----\nMIIDtTCCAzqgAwIBAgIUc+s3scUkr8pvfaV3yUzAXFHBiqMwCgYIKoZIzj0EAwIw\nHDEaMBgGA1UEAxMRSUNBIGF3cy11cy13ZXN0LTIwHhcNMTgwNjI5MTI0OTEzWhcN\nMTgwNzI5MTI0OTQzWjAjMSEwHwYDVQQDExhub21hZC1zZXJ2ZXIubm9kZS5jb25z\ndWwwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAAQRUB6RQBtV3pyRWgNpsVLvXOMeF7yz\ncxRuB9uwRheDK4znMVEdF3PY8YK18Dlkb6BeKVt4xhXSfZNISJXTwEqFzx+V8wWr\n/mOY1MUPH09Q+GmGFapMn8GLOebJU1Vic4KjggI0MIICMDAOBgNVHQ8BAf8EBAMC\nA6gwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsGAQUFBwMCMB0GA1UdDgQWBBQr7X3A\nBELyx9saab7136CylBRxtjAfBgNVHSMEGDAWgBTxc2ZQEV7XeeCa34ofirMyWYpN\nozCBmgYIKwYBBQUHAQEEgY0wgYowSgYIKwYBBQUHMAKGPmh0dHBzOi8vdmF1bHQu\nc2VydmljZS5hd3MtdXMtd2VzdC0yLmNvbnN1bDo4MjAwL3YxL3BraS9yb290L2Nh\nMDwGCCsGAQUFBzAChjBodHRwczovL3ZhdWx0LnNlcnZpY2UuY29uc3VsOjgyMDAv\ndjEvcGtpL3Jvb3QvY2EwgZMGA1UdEQSBizCBiIIYbm9tYWQtc2VydmVyLm5vZGUu\nY29uc3Vsgglsb2NhbGhvc3SCKW5vbWFkLXNlcnZlci5zZXJ2aWNlLmF3cy11cy13\nZXN0LTIuY29uc3Vsghtub21hZC1zZXJ2ZXIuc2VydmljZS5jb25zdWyCE3NlcnZl\nci5nbG9iYWwubm9tYWSHBH8AAAEwgYsGA1UdHwSBgzCBgDBFoEOgQYY/aHR0cHM6\nLy92YXVsdC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3VsOjgyMDAvdjEvcGtp\nL3Jvb3QvY3JsMDegNaAzhjFodHRwczovL3ZhdWx0LnNlcnZpY2UuY29uc3VsOjgy\nMDAvdjEvcGtpL3Jvb3QvY3JsMAoGCCqGSM49BAMCA2kAMGYCMQDHp+5xLJGo0kRp\nnkYtrVASlf/IB/NLLwCK3CyvCYboXM3erv286ZBW6a44AMGwtOICMQDEPNhJDq6n\nNJWXuKiQg1e68YG1Dpa63rKeiUQ+7Q4paO5Dp3aSJMF9BPDeKuz16K8=\n-----END CERTIFICATE-----",
    "issuing_ca": "-----BEGIN CERTIFICATE-----\nMIIDAjCCAoigAwIBAgIUGtWhDkGL1G+IwUzyzwYly+kWCKswCgYIKoZIzj0EAwIw\nEjEQMA4GA1UEAxMHcm9vdCBjYTAeFw0xODA2MjkxMjQ4MTNaFw0xODA3MzExMjQ4\nNDNaMBwxGjAYBgNVBAMTEUlDQSBhd3MtdXMtd2VzdC0yMHYwEAYHKoZIzj0CAQYF\nK4EEACIDYgAEOpVvTNjk/uBz+lAFgmX1o9Xiwu8OGXuCxwBjLEyxQ2bHZfWmg8fL\nc6DXhccn4+7vDzznPvkuVY8mNxkF5IVu9n/9t5c1iDiRHwLtHPt2qkfbX/y/q/w0\ne2GSn7lJ0n54o4IBkzCCAY8wDgYDVR0PAQH/BAQDAgEGMBIGA1UdEwEB/wQIMAYB\nAf8CAQAwHQYDVR0OBBYEFPFzZlARXtd54Jrfih+KszJZik2jMB8GA1UdIwQYMBaA\nFHlI9gb+Q+RAUOigOCJypadrvfwkMIGaBggrBgEFBQcBAQSBjTCBijBKBggrBgEF\nBQcwAoY+aHR0cHM6Ly92YXVsdC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3Vs\nOjgyMDAvdjEvcGtpL3Jvb3QvY2EwPAYIKwYBBQUHMAKGMGh0dHBzOi8vdmF1bHQu\nc2VydmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jYTCBiwYDVR0fBIGDMIGA\nMEWgQ6BBhj9odHRwczovL3ZhdWx0LnNlcnZpY2UuYXdzLXVzLXdlc3QtMi5jb25z\ndWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwN6A1oDOGMWh0dHBzOi8vdmF1bHQuc2Vy\ndmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwCgYIKoZIzj0EAwIDaAAw\nZQIxAI0OJZ395FMk1WXoKrY4DpYHlvInJwzLuxUzvBBtkxPuPnAgFGGtHq0Z9kw+\nnwRUUgIwRrjtCZWSYOTqjNfEtm70WDdj7JSEOzw6T6T44lLvslMWuGGmMh1SioH8\nta4O7+wq\n-----END CERTIFICATE-----",
    "private_key": "-----BEGIN EC PRIVATE KEY-----\nMIGkAgEBBDAVW+3UFVXYiGn5xpNsTh6qlJaaEylAK/RRrKNxGbEoTL2f8PraWxNB\nf52yLnChoe+gBwYFK4EEACKhZANiAAQRUB6RQBtV3pyRWgNpsVLvXOMeF7yzcxRu\nB9uwRheDK4znMVEdF3PY8YK18Dlkb6BeKVt4xhXSfZNISJXTwEqFzx+V8wWr/mOY\n1MUPH09Q+GmGFapMn8GLOebJU1Vic4I=\n-----END EC PRIVATE KEY-----",
    "private_key_type": "ec",
    "serial_number": "73:eb:37:b1:c5:24:af:ca:6f:7d:a5:77:c9:4c:c0:5c:51:c1:8a:a3"
  },
  "warnings": null
}

Nomad Client

$ vault write -format=json pki/aws-us-west-2/issue/nomad-client common_name=nomad-client.node.consul alt_names=nomad-client.node.aws-us-west-2.consul,nomad-client.service.consul,nomad-client.service.aws-us-west-2.consul,client.global.nomad,localhost ip_sans=127.0.0.1 format=pem ttl=720h
{
  "request_id": "606ba187-fa58-59b5-6f96-f12c680d9d97",
  "lease_id": "pki/aws-us-west-2/issue/nomad-client/03f67c56-1b51-7eed-b63b-3875fe3d3e49",
  "lease_duration": 2591999,
  "renewable": false,
  "data": {
    "ca_chain": [
      "-----BEGIN CERTIFICATE-----\nMIIDAjCCAoigAwIBAgIUGtWhDkGL1G+IwUzyzwYly+kWCKswCgYIKoZIzj0EAwIw\nEjEQMA4GA1UEAxMHcm9vdCBjYTAeFw0xODA2MjkxMjQ4MTNaFw0xODA3MzExMjQ4\nNDNaMBwxGjAYBgNVBAMTEUlDQSBhd3MtdXMtd2VzdC0yMHYwEAYHKoZIzj0CAQYF\nK4EEACIDYgAEOpVvTNjk/uBz+lAFgmX1o9Xiwu8OGXuCxwBjLEyxQ2bHZfWmg8fL\nc6DXhccn4+7vDzznPvkuVY8mNxkF5IVu9n/9t5c1iDiRHwLtHPt2qkfbX/y/q/w0\ne2GSn7lJ0n54o4IBkzCCAY8wDgYDVR0PAQH/BAQDAgEGMBIGA1UdEwEB/wQIMAYB\nAf8CAQAwHQYDVR0OBBYEFPFzZlARXtd54Jrfih+KszJZik2jMB8GA1UdIwQYMBaA\nFHlI9gb+Q+RAUOigOCJypadrvfwkMIGaBggrBgEFBQcBAQSBjTCBijBKBggrBgEF\nBQcwAoY+aHR0cHM6Ly92YXVsdC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3Vs\nOjgyMDAvdjEvcGtpL3Jvb3QvY2EwPAYIKwYBBQUHMAKGMGh0dHBzOi8vdmF1bHQu\nc2VydmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jYTCBiwYDVR0fBIGDMIGA\nMEWgQ6BBhj9odHRwczovL3ZhdWx0LnNlcnZpY2UuYXdzLXVzLXdlc3QtMi5jb25z\ndWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwN6A1oDOGMWh0dHBzOi8vdmF1bHQuc2Vy\ndmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwCgYIKoZIzj0EAwIDaAAw\nZQIxAI0OJZ395FMk1WXoKrY4DpYHlvInJwzLuxUzvBBtkxPuPnAgFGGtHq0Z9kw+\nnwRUUgIwRrjtCZWSYOTqjNfEtm70WDdj7JSEOzw6T6T44lLvslMWuGGmMh1SioH8\nta4O7+wq\n-----END CERTIFICATE-----"
    ],
    "certificate": "-----BEGIN CERTIFICATE-----\nMIIDtDCCAzqgAwIBAgIUI/Y+orYMNzY1fwQoXV3cY4lqn+cwCgYIKoZIzj0EAwIw\nHDEaMBgGA1UEAxMRSUNBIGF3cy11cy13ZXN0LTIwHhcNMTgwNjI5MTI1MDI1WhcN\nMTgwNzI5MTI1MDU1WjAjMSEwHwYDVQQDExhub21hZC1jbGllbnQubm9kZS5jb25z\ndWwwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAATk5GvLPCFWDPsgf3RPGwJwuZPlxvdx\nyubVtQkJ89hm7l1wBB0E45egEdjrPr1SBTtfADHaLVwoVj3M9N1SezI8jVUMd1eA\ntU5F+auyxo6DwFxdJ+F5lGScknhzrn67LCyjggI0MIICMDAOBgNVHQ8BAf8EBAMC\nA6gwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsGAQUFBwMCMB0GA1UdDgQWBBSqytOZ\nw3ucO3Y5OCUkAOjCY30KKDAfBgNVHSMEGDAWgBTxc2ZQEV7XeeCa34ofirMyWYpN\nozCBmgYIKwYBBQUHAQEEgY0wgYowSgYIKwYBBQUHMAKGPmh0dHBzOi8vdmF1bHQu\nc2VydmljZS5hd3MtdXMtd2VzdC0yLmNvbnN1bDo4MjAwL3YxL3BraS9yb290L2Nh\nMDwGCCsGAQUFBzAChjBodHRwczovL3ZhdWx0LnNlcnZpY2UuY29uc3VsOjgyMDAv\ndjEvcGtpL3Jvb3QvY2EwgZMGA1UdEQSBizCBiIIYbm9tYWQtY2xpZW50Lm5vZGUu\nY29uc3VsghNjbGllbnQuZ2xvYmFsLm5vbWFkgglsb2NhbGhvc3SCKW5vbWFkLWNs\naWVudC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3Vsghtub21hZC1jbGllbnQu\nc2VydmljZS5jb25zdWyHBH8AAAEwgYsGA1UdHwSBgzCBgDBFoEOgQYY/aHR0cHM6\nLy92YXVsdC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3VsOjgyMDAvdjEvcGtp\nL3Jvb3QvY3JsMDegNaAzhjFodHRwczovL3ZhdWx0LnNlcnZpY2UuY29uc3VsOjgy\nMDAvdjEvcGtpL3Jvb3QvY3JsMAoGCCqGSM49BAMCA2gAMGUCMAiR0BaODupT4anH\np+XmGqhaqPbfMFbqkGlPNdpsclAcZO8rF43JlzPvPP4tZJwCmAIxAIUCUvY3ajrb\nie+z13o1md7xfEs0ZuARBZvNJVcPaGFvI6+D00qbL1O5wazCR1ErlA==\n-----END CERTIFICATE-----",
    "issuing_ca": "-----BEGIN CERTIFICATE-----\nMIIDAjCCAoigAwIBAgIUGtWhDkGL1G+IwUzyzwYly+kWCKswCgYIKoZIzj0EAwIw\nEjEQMA4GA1UEAxMHcm9vdCBjYTAeFw0xODA2MjkxMjQ4MTNaFw0xODA3MzExMjQ4\nNDNaMBwxGjAYBgNVBAMTEUlDQSBhd3MtdXMtd2VzdC0yMHYwEAYHKoZIzj0CAQYF\nK4EEACIDYgAEOpVvTNjk/uBz+lAFgmX1o9Xiwu8OGXuCxwBjLEyxQ2bHZfWmg8fL\nc6DXhccn4+7vDzznPvkuVY8mNxkF5IVu9n/9t5c1iDiRHwLtHPt2qkfbX/y/q/w0\ne2GSn7lJ0n54o4IBkzCCAY8wDgYDVR0PAQH/BAQDAgEGMBIGA1UdEwEB/wQIMAYB\nAf8CAQAwHQYDVR0OBBYEFPFzZlARXtd54Jrfih+KszJZik2jMB8GA1UdIwQYMBaA\nFHlI9gb+Q+RAUOigOCJypadrvfwkMIGaBggrBgEFBQcBAQSBjTCBijBKBggrBgEF\nBQcwAoY+aHR0cHM6Ly92YXVsdC5zZXJ2aWNlLmF3cy11cy13ZXN0LTIuY29uc3Vs\nOjgyMDAvdjEvcGtpL3Jvb3QvY2EwPAYIKwYBBQUHMAKGMGh0dHBzOi8vdmF1bHQu\nc2VydmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jYTCBiwYDVR0fBIGDMIGA\nMEWgQ6BBhj9odHRwczovL3ZhdWx0LnNlcnZpY2UuYXdzLXVzLXdlc3QtMi5jb25z\ndWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwN6A1oDOGMWh0dHBzOi8vdmF1bHQuc2Vy\ndmljZS5jb25zdWw6ODIwMC92MS9wa2kvcm9vdC9jcmwwCgYIKoZIzj0EAwIDaAAw\nZQIxAI0OJZ395FMk1WXoKrY4DpYHlvInJwzLuxUzvBBtkxPuPnAgFGGtHq0Z9kw+\nnwRUUgIwRrjtCZWSYOTqjNfEtm70WDdj7JSEOzw6T6T44lLvslMWuGGmMh1SioH8\nta4O7+wq\n-----END CERTIFICATE-----",
    "private_key": "-----BEGIN EC PRIVATE KEY-----\nMIGkAgEBBDCB+6MFvnMGg9wqz5o1NMX+shLEi7eGoaDM6PnpitbPcs9/1GtW+hMu\nEIdrQo1joh6gBwYFK4EEACKhZANiAATk5GvLPCFWDPsgf3RPGwJwuZPlxvdxyubV\ntQkJ89hm7l1wBB0E45egEdjrPr1SBTtfADHaLVwoVj3M9N1SezI8jVUMd1eAtU5F\n+auyxo6DwFxdJ+F5lGScknhzrn67LCw=\n-----END EC PRIVATE KEY-----",
    "private_key_type": "ec",
    "serial_number": "23:f6:3e:a2:b6:0c:37:36:35:7f:04:28:5d:5d:dc:63:89:6a:9f:e7"
  },
  "warnings": null
}

Please let us know if there's any more information we can provide to help you with this.

@SoMuchToGrok
Author

Is there anything additional we can do to help with this issue? We're still stuck on Nomad v0.7.1 because of this bug.

@Xopherus
Contributor

Xopherus commented Aug 10, 2018

I was able to reproduce with the following:

  • 1 Consul agent on version 1.0.6
  • 1 Vault server on version 0.10.2
  • 1 Nomad client and 1 Nomad server on version 0.8.4 / 0.9.0-dev

When you re-issue a cert and send a SIGHUP to the nomad client, everything is fine. It's when you re-issue a cert for the nomad server that the problem manifests itself.

Nomad Server Logs

You can see that after the SIGHUP is sent, there are a lot of errors like "nomad.rpc: Closing server RPC connection":

==> Caught signal: hangup
==> Reloading configuration...
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
    2018/08/10 13:42:19.187842 [DEBUG] agent: starting reload of agent config
    2018/08/10 13:42:19.193892 [DEBUG] agent: starting reload of server config
    2018/08/10 13:42:19.194197 [INFO] nomad: reloading server connections due to configuration changes
    2018/08/10 13:42:19.208107 [DEBUG] nomad: finished reloading server connections
    2018/08/10 13:42:19.208131 [INFO] agent: Reloading HTTP server with new TLS configuration
    2018/08/10 13:42:19.208138 [DEBUG] http: Shutting down http server
    2018/08/10 13:42:19.460633 [DEBUG] vault: not renewing token as it is root
    2018/08/10 13:42:20.312419 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:23.616124 [DEBUG] http: Request GET /v1/agent/health?type=server (154.333µs)
    2018/08/10 13:42:25 [DEBUG] memberlist: TCP connection from=127.0.0.1:55900
    2018/08/10 13:42:26.568915 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2018/08/10 13:42:33.637483 [DEBUG] http: Request GET /v1/agent/health?type=server (139.611µs)
    2018/08/10 13:42:35 [DEBUG] memberlist: TCP connection from=127.0.0.1:55905
    2018/08/10 13:42:36.448273 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:37.799745 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:37.801752 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:39.377860 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:39.379074 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:40.879984 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:40.881732 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:42.812276 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:42.814450 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:43.658705 [DEBUG] http: Request GET /v1/agent/health?type=server (419.168µs)
    2018/08/10 13:42:44.557962 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:44.560620 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:45 [DEBUG] memberlist: TCP connection from=127.0.0.1:55912
    2018/08/10 13:42:45.579913 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:45.581623 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:46.446298 [WARN] nomad.heartbeat: node 'cb31c673-27e0-5bf1-9f88-ac352224c746' TTL expired
    2018/08/10 13:42:46.959658 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:46.961002 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:48.302956 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:48.303873 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:49.746487 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:49.747652 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:51.271261 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:51.272460 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:53.268205 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:53.270470 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:53.680464 [DEBUG] http: Request GET /v1/agent/health?type=server (505.173µs)
    2018/08/10 13:42:54.313170 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:54.314829 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:55 [DEBUG] memberlist: TCP connection from=127.0.0.1:55930
    2018/08/10 13:42:55.500682 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:55.502501 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:56.572457 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2018/08/10 13:42:57.299575 [INFO] nomad.rpc: Closing server RPC connection
    2018/08/10 13:42:57.301394 [INFO] nomad.rpc: Closing server RPC connection

Nomad Client Logs

Shortly after the nomad server starts erroring, you can see RPC errors happen in the client over and over:

    2018/08/10 13:42:36.446591 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:36.446714 [ERR] client: heartbeating failed. Retrying in 1.350090439s: failed to update status: rpc error: EOF
    2018/08/10 13:42:36.447385 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:36.448472 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:37.800052 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:37.800113 [ERR] client: heartbeating failed. Retrying in 1.573605339s: failed to update status: rpc error: EOF
    2018/08/10 13:42:37.800912 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:37.801971 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:39.377992 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:39.378033 [ERR] client: heartbeating failed. Retrying in 1.499009466s: failed to update status: rpc error: EOF
    2018/08/10 13:42:39.378474 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:39.379204 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:40.880297 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:40.880340 [ERR] client: heartbeating failed. Retrying in 1.928699781s: failed to update status: rpc error: EOF
    2018/08/10 13:42:40.880918 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:40.881891 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:42.812550 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:42.812624 [ERR] client: heartbeating failed. Retrying in 1.739761915s: failed to update status: rpc error: EOF
    2018/08/10 13:42:42.813176 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:42.814840 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:43.149744 [DEBUG] http: Request GET /v1/agent/health?type=client (388.307µs)
    2018/08/10 13:42:44.154786 http: TLS handshake error from 127.0.0.1:55911: tls: first record does not look like a TLS handshake
    2018/08/10 13:42:44.558265 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:44.558371 [ERR] client: heartbeating failed. Retrying in 1.017166125s: failed to update status: rpc error: EOF
    2018/08/10 13:42:44.559671 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:44.560783 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:45.580207 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:45.580275 [ERR] client: heartbeating failed. Retrying in 1.377125573s: failed to update status: rpc error: EOF
    2018/08/10 13:42:45.580856 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:45.581789 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:46.959888 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:46.959919 [ERR] client: heartbeating failed. Retrying in 1.33951467s: failed to update status: rpc error: EOF
    2018/08/10 13:42:46.960373 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:46.961139 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:48.303057 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:48.303082 [ERR] client: heartbeating failed. Retrying in 1.439765118s: failed to update status: rpc error: EOF
    2018/08/10 13:42:48.303428 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:48.303961 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:49.746619 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:49.746648 [ERR] client: heartbeating failed. Retrying in 1.519506656s: failed to update status: rpc error: EOF
    2018/08/10 13:42:49.747089 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:49.747776 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:51.271401 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:51.271432 [ERR] client: heartbeating failed. Retrying in 1.991679597s: failed to update status: rpc error: EOF
    2018/08/10 13:42:51.271868 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:51.272580 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:53.171400 [DEBUG] http: Request GET /v1/agent/health?type=client (134.281µs)
    2018/08/10 13:42:53.268541 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:53.268637 [ERR] client: heartbeating failed. Retrying in 1.040439126s: failed to update status: rpc error: EOF
    2018/08/10 13:42:53.269526 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:53.270664 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:54.313479 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:54.313526 [ERR] client: heartbeating failed. Retrying in 1.186429866s: failed to update status: rpc error: EOF
    2018/08/10 13:42:54.314053 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:54.314997 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:55.501162 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:55.501230 [ERR] client: heartbeating failed. Retrying in 1.79393641s: failed to update status: rpc error: EOF
    2018/08/10 13:42:55.501732 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:55.502660 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:57.299842 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:57.299965 [ERR] client: heartbeating failed. Retrying in 1.396287817s: failed to update status: rpc error: EOF
    2018/08/10 13:42:57.300590 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:57.301560 [ERR] client.consul: error discovering nomad servers: 1 error(s) occurred:

* rpc error: EOF
    2018/08/10 13:42:58.299391 [ERR] nomad: "Node.GetClientAllocs" RPC failed to server 127.0.0.1:4647: rpc error: EOF
    2018/08/10 13:42:58.299421 [ERR] client: failed to query for node allocations: rpc error: EOF
    2018/08/10 13:42:58.697862 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 127.0.0.1:4647: rpc error: failed to get conn: dial tcp 127.0.0.1:4647: connect: connection refused
    2018/08/10 13:42:58.697926 [ERR] client: heartbeating failed. Retrying in 1.222660464s: failed to update status: rpc error: failed to get conn: dial tcp 127.0.0.1:4647: connect: connection refused
    2018/08/10 13:42:58.698557 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2018/08/10 13:42:58.699032 [ERR] client.consul: error discovering nomad servers: no Nomad Servers advertising service "nomad" in Consul datacenters: ["dc1"]

Nomad CLI Logs

While this chaos is going down, if you try to run nomad job status you get an error:

$ nomad job status
Error querying jobs: Get http://127.0.0.1:4646/v1/jobs: net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
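An aside, and my own reading rather than anything stated in the thread: that "malformed HTTP response" payload is a well-formed TLS alert record, which suggests the CLI spoke plain http:// (as the URL in the error shows) to a port that was by then serving TLS. A quick decode sketch:

```go
// Decode the bytes from the CLI error above as a TLS record header
// (interpretation mine, not from the thread).
package main

import "fmt"

// parseTLSRecord splits a raw TLS record header into content type,
// record-layer protocol version, payload length, and the first payload
// byte (for an alert record, the alert level).
func parseTLSRecord(raw []byte) (typ byte, version [2]byte, length int, level byte) {
	typ = raw[0]                          // 21 == "alert" record
	version = [2]byte{raw[1], raw[2]}     // {3, 1} == TLS 1.0 record layer
	length = int(raw[3])<<8 | int(raw[4]) // payload length in bytes
	level = raw[5]                        // 2 == fatal (description byte is truncated in the log)
	return
}

func main() {
	raw := []byte{0x15, 0x03, 0x01, 0x00, 0x02, 0x02}
	typ, version, length, level := parseTLSRecord(raw)
	fmt.Println(typ, version, length, level) // prints: 21 [3 1] 2 2
}
```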

Although if I let the server/client run for a few more minutes, the CLI will return data, but I will be unable to place new jobs (which is expected, I think):

$ nomad job run example-job.hcl
==> Monitoring evaluation "659d2fff"
    Evaluation triggered by job "example"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "659d2fff" finished with status "complete" but failed to place all allocations:
    Task Group "hello-world" (failed to place 1 allocation):
      * No nodes were eligible for evaluation
      * No nodes are available in datacenter "dc1"
    Evaluation "d4012573" waiting for additional capacity to place remainder

Xopherus added a commit to Xopherus/nomad that referenced this issue Aug 10, 2018
Multiplexer continues to create rpc connections even when
the context which is passed to the underlying rpc connections
is cancelled by the server.

This was causing hashicorp#4413 - when a SIGHUP causes everything to reload,
it uses context to cancel the underlying http/rpc connections
so that they may come up with the new configuration.
The multiplexer was not being cancelled properly so it would
continue to create rpc connections and constantly fail,
causing communication issues with other nomad agents.

Fixes hashicorp#4413
@Xopherus
Contributor

I believe I found the issue! I will open a PR with a proposed fix - let me know if that will be acceptable!

@dadgar @cheeseprocedure

preetapan pushed a commit that referenced this issue Sep 10, 2018
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 28, 2022