Clients try to rerun old allocations after machine reboot #1795

Closed
Gerrrr opened this issue Oct 6, 2016 · 4 comments · Fixed by #5669

Comments

@Gerrrr
Contributor

Gerrrr commented Oct 6, 2016

Nomad version

Nomad v0.3.2

Operating system and Environment details

Ubuntu 14.04

Issue

The Nomad client puts some of its temporary files in /tmp. When alloc_dir is not erased on reboot, the Nomad client tries to restart all the allocations that previously ran on this node and then terminates them after receiving a command from the server (note the Started, Started, Killed event sequence in the allocation status). Since the socket in /tmp is gone after the reboot, the client also produces an error log (see below).

In our setup we solved it by putting alloc_dir under /tmp.
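
For reference, a minimal sketch of that workaround, assuming the client stanza's alloc_dir option (the exact path is a placeholder): because /tmp is cleared on reboot on most distributions, the client finds nothing to restore when it comes back up.

client {
  enabled   = true
  servers   = ["10.250.18.27"]
  # placeholder path; cleared on reboot, so no stale allocations are restored
  alloc_dir = "/tmp/nomad/alloc"
}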

Reproduction steps

  • Create a Nomad cluster with 2 clients
  • Submit a job on 1 of the clients
  • Reboot the client that runs the job (a rough command sketch follows this list)
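
A rough command-line sketch of those steps, assuming a one-server, two-client cluster is already running (the job file name and allocation ID are placeholders):

nomad run infra-cluster-broccoli.nomad   # submit the job
nomad status infra-cluster-broccoli      # note which node received the allocation
sudo reboot                              # run on the client node that got the allocation
nomad alloc-status <alloc-id>            # after the node is back: Started, Started, Killed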

Nomad job status

ID          = infra-cluster-broccoli
Name        = infra-cluster-broccoli
Type        = service
Priority    = 50
Datacenters = dc1
Status      = running
Periodic    = false

==> Evaluations
ID        Priority  Triggered By  Status
194b03f7  50        node-update   complete
36b8d462  50        node-update   complete
319c009c  50        node-update   complete

==> Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status
6d9c70aa  36b8d462  3df25888  server      run      running
c130ac62  319c009c  449c9ba8  server      stop     complete

Nomad alloc-status

ID            = c130ac62
Eval ID       = 319c009c
Name          = infra-cluster-broccoli.server[0]
Node ID       = 449c9ba8
Job ID        = infra-cluster-broccoli
Client Status = complete

==> Task Resources
Task: "server"
CPU  Memory MB  Disk MB  IOPS  Addresses
500  1024       300      0     http: 10.250.18.29:9000

==> Task "server" is "dead"
Recent Events:
Time                    Type                   Description
06/10/16 14:12:30 CEST  Killed                 Task successfully killed
06/10/16 14:12:23 CEST  Started                Task started by client
04/10/16 16:59:45 CEST  Started                Task started by client
04/10/16 16:59:44 CEST  Downloading Artifacts  Client is downloading artifacts
04/10/16 16:59:44 CEST  Received               Task received by client

Nomad Client configuration

log_level = "INFO"
datacenter = "dc1"
data_dir = "/var/lib/nomad"
bind_addr = "0.0.0.0"
advertise {
  http = "10.250.18.28:4646"
  rpc = "10.250.18.28:4647"
  serf = "10.250.18.28:4648"
}
client {
  enabled = true
  servers = ["10.250.18.27"]
  options {
    "driver.raw_exec.enable" = "1"
    "driver.exec.enable" = "1"
    "driver.docker.enable" = "1"
  }
}

Nomad Client logs (if appropriate)

==> Caught signal: terminated
    2016/10/06 14:10:43 [INFO] agent: requesting shutdown
    2016/10/06 14:10:43 [INFO] client: shutting down
    2016/10/06 14:10:43 [INFO] agent: shutdown complete
    Loaded configuration from /etc/nomad.d/client/config.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: INFO
                Region: global (DC: dc1)
                Server: false

==> Nomad agent started! Log data will stream in below:

    2016/10/06 14:12:19 [INFO] client: using state directory /var/lib/nomad/client
    2016/10/06 14:12:19 [INFO] client: using alloc directory /var/lib/nomad/alloc
    2016/10/06 14:12:19 [INFO] fingerprint.cgroups: cgroups are available
    2016/10/06 14:12:23 [WARN] fingerprint.env_gce: Could not read value for attribute "machine-type"
    2016/10/06 14:12:23 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool eth0'
    2016/10/06 14:12:23 [WARN] fingerprint.network: Unable to read link speed from /sys/class/net/eth0/speed
    2016/10/06 14:12:23 [WARN] client: port not specified, using default port
    2016/10/06 14:12:23 [INFO] client: setting server address list: [10.250.18.27:4647]
    2016/10/06 14:12:23 [ERR] driver.raw_exec: error connecting to plugin so destroying plugin pid and user pid
    2016/10/06 14:12:23 [ERR] driver.raw_exec: error destroying plugin and userpid: 2 error(s) occurred:

* os: process already finished
* os: process already finished
    2016/10/06 14:12:23 [ERR] client: failed to open handle to task 'server' for alloc 'c130ac62-268f-3ae8-3aac-95d315f37b99': error connecting to plugin: error creating rpc client for executor plugin: Reattachment process not found
@dadgar
Contributor

dadgar commented Oct 6, 2016

Hey @Gerrrr,

This is actually expected behavior. What the client is doing is attempting to re-attach to anything that was already running. This can be useful if, for example, you stop the Nomad client to do an in-place upgrade, start it up again, and have it find all of its processes still running.

In your case, there is nothing to connect to anymore because the tasks are dead, so it is just cleaning up.

Let me know if that made sense!

@dadgar dadgar closed this as completed Oct 6, 2016
@Gerrrr
Contributor Author

Gerrrr commented Oct 7, 2016

Hi @dadgar,

Thanks for the explanation; it makes sense to me for in-place upgrades or when you just restart the Nomad client.

In our case the problem was that after the VM reboot, the Nomad client started allocations that had already been rescheduled, so we ended up with jobs running multiple times. The rebooted client then immediately sent SIGKILL to the jobs it had just started.

However, SIGKILL cannot be caught, so the jobs cannot do any cleanup. They should not have been started in the first place, should they? We solved it by putting alloc_dir under /tmp so that the client does not try to rerun previously allocated jobs after a reboot.

@dadgar
Contributor

dadgar commented Oct 7, 2016

Ah, thanks for the clarification. Will re-open.

@dadgar dadgar reopened this Oct 7, 2016
schmichael added a commit that referenced this issue May 8, 2019
Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
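
To make that mechanism concrete, here is a minimal, hypothetical Go sketch of the pattern the commit message describes (not Nomad's actual code): a restored task that fails to re-attach blocks on a "servers contacted" signal before it is allowed to do anything else, so an allocation that was already rescheduled elsewhere is never restarted.

package main

import (
    "fmt"
    "time"
)

// restoreTask simulates restoring one allocation's task after a reboot.
// If re-attaching to the old executor fails, it waits for the first
// server contact before proceeding.
func restoreTask(allocID string, reattached bool, serversContacted <-chan struct{}) {
    if reattached {
        fmt.Printf("alloc %s: re-attached to the running task, nothing to do\n", allocID)
        return
    }
    // Re-attach failed (the process died with the reboot): block until the
    // client has heard from a server and knows the alloc's desired status.
    <-serversContacted
    fmt.Printf("alloc %s: server contacted, desired status known, safe to proceed\n", allocID)
}

func main() {
    // serversContacted is closed exactly once, after the first successful
    // registration with a server (a hypothetical signal for this sketch).
    serversContacted := make(chan struct{})

    done := make(chan struct{})
    go func() {
        restoreTask("c130ac62", false, serversContacted)
        close(done)
    }()

    // Pretend the first server contact happens a moment after restore.
    time.Sleep(100 * time.Millisecond)
    close(serversContacted)
    <-done
}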
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 23, 2022