Client node remains in down state after restarting host #5589

Closed
uzzz opened this issue Apr 22, 2019 · 5 comments
uzzz commented Apr 22, 2019

Nomad version

0.9.0

Operating system and Environment details

Ubuntu 16.04.5 LTS with systemd

Issue

I have a single client node, and after I reboot the host it never returns to the "ready" state. The only way to get it working again is to run "docker container prune" and then restart Nomad.
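The workaround described above can be sketched as a small shell helper. This is only an illustration, not an official procedure: the systemd unit name `nomad`, the `NOMAD_UNIT` override, and the `DRY_RUN` switch are assumptions introduced for this sketch. Note that `docker container prune` removes every stopped container on the host, not only those started by Nomad.

```shell
#!/bin/sh
# Sketch of the manual workaround: prune stopped containers, then restart
# the Nomad agent. NOMAD_UNIT and DRY_RUN are hypothetical knobs for this
# example; "nomad" as the unit name is an assumption.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_client() {
    # --force skips the interactive confirmation prompt.
    run docker container prune --force &&
    run systemctl restart "${NOMAD_UNIT:-nomad}"
}

# Usage (likely needs root):
#   DRY_RUN=1; recover_client   # preview the commands
#   recover_client              # actually run them
```

Previewing with `DRY_RUN=1` first is a cheap way to confirm which commands would run before deleting any containers.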

State after reboot:

$ nomad node status
ID        DC        Name           Class   Drain  Eligibility  Status
753b93a9  leaseweb  nomad_client1  <none>  false  eligible     down
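To spot this state from a script rather than by eye, the table above can be filtered for nodes whose status is "down". A minimal sketch, assuming the default table layout shown here, where Status is the last column; `down_nodes` is a hypothetical helper name.

```shell
#!/bin/sh
# Reads `nomad node status` table output on stdin and prints the ID of
# every node whose last column (Status) is "down".
# Assumes the column layout shown in this issue; other versions may differ.
down_nodes() {
    # NR > 1 skips the header row; $NF is the last whitespace-separated field.
    awk 'NR > 1 && $NF == "down" { print $1 }'
}

# Usage:
#   nomad node status | down_nodes
```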

Reproduction steps

  1. Have some Docker containers running on the client machine
  2. Reboot the host
  3. After the reboot, docker ps -a shows the container(s) in the Exited state
  4. The container(s) do not restart and the client node stays in the down state
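Step 3 above can be checked with a short filter over `docker ps -a` output. This is a sketch under assumptions: it expects the two-field format produced by `--format '{{.ID}} {{.Status}}'`, and `exited_containers` is a hypothetical helper name.

```shell
#!/bin/sh
# Reads lines of the form "<container id> <status text>" on stdin and
# prints the IDs of containers whose status starts with "Exited".
exited_containers() {
    awk '$2 == "Exited" { print $1 }'
}

# Usage:
#   docker ps -a --format '{{.ID}} {{.Status}}' | exited_containers
```

If this prints any IDs after a reboot while the node is still down, the host is in the state described in this report.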

Nomad Client logs

Apr 22 13:50:09 HQDH079.HQDH.local systemd[1]: Started nomad agent.
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/var/nomad/plugins
Apr 22 13:50:10 HQDH079.HQDH.local nomad[1683]: ==> Loaded configuration from /etc/nomad.d/base.hcl, /etc/nomad.d/client.hcl, /etc/nomad.d/docker.hcl
Apr 22 13:50:10 HQDH079.HQDH.local nomad[1683]: ==> Starting Nomad agent...
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/var/nomad/plugins
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/var/nomad/plugins
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: agent: detected plugin: name=java type=driver plugin_version=0.1.0
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client: using state directory: state_dir=/var/nomad/client
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client: using alloc directory: alloc_dir=/var/nomad/alloc
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr: built-in fingerprints: fingerprinters="[arch cgroup consul cpu host memory network nomad signal storage vault env_aws env_gce]"
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr.cgroup: cgroups are available
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup period=15s
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul period=15s
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr.cpu: detected cpu frequency: MHz=2500
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr.cpu: detected core count: cores=32
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=eth4
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr.network: unable to parse link speed: path=/sys/class/net/eth4/speed
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: mbits=1000
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr.network: detected interface IP: interface=eth4 IP=212.32.254.148
Apr 22 13:50:09 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault period=15s
Apr 22 13:50:11 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr.env_aws: error querying AWS Metadata URL, skipping
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type error="Get http://169.254.169.254/computeMetadata/v1/instance/machine-type: dial tcp 169.254.169.254:80: connect: no route to host"
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr.env_gce: error querying GCE Metadata URL, skipping
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr: detected fingerprints: node_attrs="[arch cgroup cpu host network nomad signal storage]"
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.plugin: starting plugin manager: plugin-type=driver
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.plugin: starting plugin manager: plugin-type=device
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get http://10.201.0.78:8500/v1/catalog/datacenters: dial tcp 10.201.0.78:8500: connect: connection refused"
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.plugin: finished plugin manager initial fingerprint: plugin-type=device
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.driver_mgr: initial driver fingerprint: driver=raw_exec health=undetected description=disabled
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.driver_mgr: initial driver fingerprint: driver=rkt health=undetected description="Failed to execute rkt version: exec: "rkt": executable file not found in $PATH"
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.driver_mgr: initial driver fingerprint: driver=java health=undetected description=
Apr 22 13:50:12 HQDH079.HQDH.local nomad[1683]: client.driver_mgr: initial driver fingerprint: driver=exec health=healthy description=Healthy
Apr 22 13:50:15 HQDH079.HQDH.local nomad[1683]: client.driver_mgr.docker: failed to reattach to docker logger process: driver=docker error="failed to reattach to docker logger process: Reattachment process not found"
Apr 22 13:50:15 HQDH079.HQDH.local nomad[1683]: client.driver_mgr.docker.docker_logger: starting plugin: driver=docker path=/usr/local/bin/nomad args="[/usr/local/bin/nomad docker_logger]"
Apr 22 13:50:15 HQDH079.HQDH.local nomad[1683]: client.driver_mgr.docker.docker_logger: plugin started: driver=docker path=/usr/local/bin/nomad pid=2978
Apr 22 13:50:15 HQDH079.HQDH.local nomad[1683]: client.driver_mgr.docker.docker_logger: waiting for RPC address: driver=docker path=/usr/local/bin/nomad
Apr 22 13:50:15 HQDH079.HQDH.local nomad[1683]: client.driver_mgr: initial driver fingerprint: driver=docker health=healthy description=Healthy
Apr 22 13:50:15 HQDH079.HQDH.local nomad[1683]: client.driver_mgr: detected drivers: drivers="[exec docker]"
Apr 22 13:50:15 HQDH079.HQDH.local nomad[1683]: client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
Apr 22 13:50:15 HQDH079.HQDH.local nomad[1683]: client.driver_mgr.docker.docker_logger.nomad: plugin address: driver=docker @module=docker_logger address=/tmp/plugin692568831 network=unix timestamp=2019-04-22T13:50:15.121Z
Apr 22 13:50:15 HQDH079.HQDH.local nomad[1683]: client.driver_mgr.docker.docker_logger: using plugin: driver=docker version=2
Apr 22 13:50:15 HQDH079.HQDH.local nomad[1683]: client.driver_mgr.docker.docker_logger.nomad: using client connection initialized from environment: driver=docker @module=docker_logger timestamp=2019-04-22T13:50:15.124Z
Apr 22 13:50:24 HQDH079.HQDH.local nomad[1683]: client.fingerprint_mgr.consul: consul agent is available
@preetapan
Contributor

@uzzz We think PR #5577 should fix this. We will be releasing a 0.9.1-rc today that includes it. Would you be able to try it out and let us know if the problem goes away?

@uzzz
Author

uzzz commented Apr 24, 2019

@preetapan Just checked 0.9.1-rc1, and yes, it looks like the bug is gone.

@preetapan
Contributor

Thanks for confirming @uzzz

@HimanshuJha2000

I'm facing a similar issue in a Vagrant cluster: the client went down right after restarting Nomad. I had created a host_volume in client.hcl and then restarted the agent. Can someone help? I'm new to HashiCorp tools.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2022