
Nomad worker failed to start running after upgrading to 0.9.2 #5795

Closed
xbglowx opened this issue Jun 9, 2019 · 7 comments

xbglowx commented Jun 9, 2019

Nomad version

Nomad v0.9.2 (028326684b9da489e0371247a223ef3ae4755d87)

Operating system and Environment details

lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.6 LTS
Release:	16.04
Codename:	xenial

Issue

Upgrading from Nomad 0.9.1 to 0.9.2 causes client.consul errors and failure to rejoin the cluster. Rolling back to 0.9.1 fixes the problem.

Reproduction steps

  1. Upgrade Nomad from 0.9.1 to 0.9.2

Job file (if appropriate)

Nomad Client logs (if appropriate)

Jun 06 20:48:41 ip-10-137-6-3 nomad[24583]: ==> Starting Nomad agent...
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]: ==> Nomad agent configuration:
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:        Advertise Addrs: HTTP: 10.137.6.3:4646
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:             Bind Addrs: HTTP: 0.0.0.0:4646
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:                 Client: true
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:              Log Level: INFO
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:                 Region: global (DC: dev-us-east-1)
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:                 Server: false
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:                Version: 0.9.2
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]: ==> Nomad agent started! Log data will stream in below:
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:41.815Z [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/var/lib/nomad/plugins
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:41.816Z [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:41.817Z [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:41.817Z [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:41.817Z [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:41.817Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:41.817Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:41.817Z [INFO ] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:41.817Z [INFO ] client: using state directory: state_dir=/var/lib/nomad/client
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:41.817Z [INFO ] client: using alloc directory: alloc_dir=/var/lib/nomad/alloc
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:41.820Z [INFO ] client.fingerprint_mgr.cgroup: cgroups are available
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:42.003Z [INFO ] client.fingerprint_mgr.vault: Vault is available
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:42.009Z [INFO ] client.plugin: starting plugin manager: plugin-type=driver
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:42.009Z [INFO ] client.plugin: starting plugin manager: plugin-type=device
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:42.011Z [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x03\x01\x00\x02\x02""
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:42.075Z [WARN ] client.server_mgr: no servers available
Jun 06 20:48:42 ip-10-137-6-3 nomad[24583]:     2019-06-06T20:48:42.075Z [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x03\x01\x00\x02\x02""

Nomad Server logs (if appropriate)

xbglowx commented Jun 9, 2019

It looks like Nomad 0.9.2 is trying to connect to Consul over HTTP instead of HTTPS, even though my Consul config for Nomad is this:

consul {
  address    = "127.0.0.1:8500"
  ssl        = true
  verify_ssl = false
}
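For reference, "\x15\x03\x01..." is the start of a TLS alert record, which is exactly what a TLS-only listener sends back to a plaintext HTTP request. A quick way to confirm which side is at fault (a hedged sketch using the address from the config above; -k mirrors verify_ssl = false):

curl -v http://127.0.0.1:8500/v1/catalog/datacenters    # expect a garbage/malformed response if the port is TLS-only
curl -vk https://127.0.0.1:8500/v1/catalog/datacenters  # expect the datacenter list

If the HTTPS call succeeds while the plain HTTP one fails with the same bytes, Consul is serving TLS correctly and the 0.9.2 client is ignoring ssl = true.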

notnoop commented Jun 9, 2019

@xbglowx Thanks for reaching out. I just succeeded in setting up a Nomad client against a Consul configured with TLS, so it looks like we need a deeper look.

Can you try the following steps and provide more information, please:

  1. Provide the full Nomad client config and the arguments passed to the agent. Is there a second consul stanza being injected elsewhere?
  2. Change the consul address to 127.0.0.2:8501 and report the logs - while this should fail, the log lines would indicate whether the consul block is ignored as a whole or just its ssl setting.
  3. Check whether any CONSUL_HTTP_* environment variables are set (a sketch of these checks follows below).
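
A minimal sketch of checks 1 and 3, assuming the config lives under /etc/nomad/nomad.d and the agent runs under systemd (both are guesses, adjust to your layout):

grep -R "consul" /etc/nomad/nomad.d/     # any second consul stanza?
systemctl show nomad -p Environment      # env injected by the unit
systemctl cat nomad                      # full unit file, including any EnvironmentFile= lines
env | grep '^CONSUL_HTTP' || echo "no CONSUL_HTTP_* vars in this shell"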

preetapan commented

@xbglowx Could you provide your Nomad agent's config in its entirety here? That will aid us in debugging further.

xbglowx commented Jun 10, 2019

Yes, I will do that today.

xbglowx commented Jun 10, 2019

Here is the entire client config:

> sudo cat /etc/nomad/nomad.d/*

datacenter = "dev-us-east-1"

advertise {
  http = "10.137.21.213"
  rpc  = "10.137.21.213"
  serf = "10.137.21.213"
}

addresses {
  http = "0.0.0.0"
  rpc  = "0.0.0.0"
  serf = "0.0.0.0"
}

acl {
  enabled = true
}
client {
  options {
    "docker.auth.config" = "/root/.docker/config.json"
    "docker.caps.whitelist" = "CHOWN,DAC_OVERRIDE,FSETID,FOWNER,MKNOD,NET_RAW,SETGID,SETUID,SETFCAP,SETPCAP,NET_BIND_SERVICE,SYS_CHROOT,KILL,AUDIT_WRITE,NET_ADMIN"
    "docker.privileged.enabled" = "1"
    "driver.raw_exec.enable" = "1"
  }

  meta {
    "color" = "green"
    "work_load_type" = "generic"
  }
}
leave_on_interrupt = true
telemetry {
  collection_interval = "3s"
  publish_allocation_metrics = true
  publish_node_metrics = true
  prometheus_metrics = true
}
server {
    encrypt = "***REDACTED***"
}
tls {
    http = true
    rpc  = true

    ca_file   = "/etc/nomad/ca_chain.pem"
    cert_file = "/etc/nomad/client_certificate.pem"
    key_file  = "/etc/nomad/client_key.pem"

    verify_server_hostname = false
    verify_https_client    = false
    tls_cipher_suites      = "TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384"
}
consul {
  address    = "127.0.0.1:8500"
  ssl        = true
  verify_ssl = false
}

vault {
  enabled          = true
  address          = "https://***REDACTED***"
  create_from_role = "nomad-cluster"
  token            = "***REDACTED***"
}

Nomad is started like so:

/usr/local/sbin/nomad agent -client=True -config /etc/nomad/nomad.d -data-dir /var/lib/nomad

After changing the consul address to 127.0.0.1:8501 and restarting Nomad, I don't get anything new in the logs. Below are the current timestamp, the last Nomad log entry, and the process table entry:

ps -ef |rg 'nomad' | rg -v rg
root      5597     1  0 14:42 ?        00:00:01 /usr/local/sbin/nomad agent -client=True -config /etc/nomad/nomad.d -data-dir /var/lib/nomad
date; echo;  journalctl -n 1 -fu nomad
Mon Jun 10 14:44:35 UTC 2019

-- Logs begin at Sun 2019-06-09 03:11:17 UTC. --
Jun 10 05:24:27 ip-10-137-21-213 nomad[8736]:     2019/06/10 05:24:27.902013 Unsolicited response received on idle HTTP channel starting with "HTTP/1.0 408 Request Time-out\r\nCache-Control: no-cache\r\nConnection: close\r\nContent-Type: text/html\r\n\r\n<html><body><h1>408 Request Time-out</h1>\nYour browser didn't send a complete request in time.\n</body></html>\n\n"; err=<nil>

This is how other Nomad workers see this node:

Every 2.0s: nomad node status -verbose                                                                                                                       Mon Jun 10 14:49:43 2019

ID                                    DC             Name              Class   Address        Version  Drain  Eligibility  Status
b49c39db-dda2-6af2-4f5b-7e870f855b4c  dev-us-east-1  ip-10-137-6-3     <none>  10.137.6.3     0.9.1    false  eligible     ready
8c9844b0-a55f-d145-de8c-f28b30cfd694  dev-us-east-1  ip-10-137-17-235  <none>  10.137.17.235  0.9.1    false  eligible     ready
6dc2b1c0-e0cf-7300-7093-1b6a9df535cb  dev-us-east-1  ip-10-137-21-213  <none>  10.137.21.213  0.9.1    false  eligible     down
c3a62f11-f9bb-69ad-e704-dfc153fa3422  dev-us-east-1  ip-10-137-9-164   <none>  10.137.9.164   0.9.1    false  eligible     ready
94957d61-3480-dba4-d50c-07513c49bf55  dev-us-east-1  ip-10-137-5-155   <none>  10.137.5.155   0.9.1    false  eligible     ready

I don't see CONSUL_HTTP_* being set anywhere. This is our systemd service file:

cat /etc/systemd/system/nomad.service

[Unit]
Description=Nomad
Documentation=https://nomadproject.io/docs/

[Service]
EnvironmentFile=-/etc/sysconfig/nomad
Restart=on-failure
ExecStart=/usr/local/sbin/nomad agent $OPTIONS -config /etc/nomad/nomad.d -data-dir /var/lib/nomad
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
cat /etc/sysconfig/nomad

OPTIONS=" -client=True"
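
To double-check the running agent itself, the environment of the live process can be inspected (a sketch assuming a single nomad process, as in the ps output above):

sudo tr '\0' '\n' < /proc/$(pidof nomad)/environ | grep '^CONSUL' || echo "no CONSUL_* vars in the agent's environment"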

notnoop commented Jun 11, 2019

Thank you so much for reporting this issue. This was fixed by #5817.

notnoop closed this as completed Jun 11, 2019
github-actions commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022