Clients try to rerun old allocations after machine reboot #1795

Closed
Gerrrr opened this issue Oct 6, 2016 · 4 comments · Fixed by #5669

Comments

@Gerrrr
Contributor

Gerrrr commented Oct 6, 2016

Nomad version

Nomad v0.3.2

Operating system and Environment details

Ubuntu 14.04

Issue

The Nomad client puts some of its temporary files in /tmp. When alloc_dir is not erased on reboot, the Nomad client tries to restart all the allocations that previously ran on this node and then terminates them after receiving a command from the server (note the Started, Started, Killed event sequence in the allocation status). Since the socket in /tmp is gone after the reboot, the client also produces an error log (see below).

In our setup we solved it by putting alloc_dir under /tmp.
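
For reference, a minimal sketch of that workaround, assuming the client stanza's alloc_dir option (the exact path is a placeholder): because /tmp is cleared on reboot on most distributions, the client finds nothing to restore when it comes back up.

client {
  enabled   = true
  servers   = ["10.250.18.27"]
  # placeholder path; cleared on reboot, so no stale allocations are restored
  alloc_dir = "/tmp/nomad/alloc"
}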

Reproduction steps

  • Create a Nomad cluster with 2 clients
  • Submit a job on 1 of the clients
  • Reboot the client that runs the job (a rough command sketch follows this list)
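
A rough command-line sketch of those steps, assuming a one-server, two-client cluster is already running (the job file name and allocation ID are placeholders):

nomad run infra-cluster-broccoli.nomad   # submit the job
nomad status infra-cluster-broccoli      # note which node received the allocation
sudo reboot                              # run on the client node that got the allocation
nomad alloc-status <alloc-id>            # after the node is back: Started, Started, Killed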

Nomad job status

ID          = infra-cluster-broccoli
Name        = infra-cluster-broccoli
Type        = service
Priority    = 50
Datacenters = dc1
Status      = running
Periodic    = false

==> Evaluations
ID        Priority  Triggered By  Status
194b03f7  50        node-update   complete
36b8d462  50        node-update   complete
319c009c  50        node-update   complete

==> Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status
6d9c70aa  36b8d462  3df25888  server      run      running
c130ac62  319c009c  449c9ba8  server      stop     complete

Nomad alloc-status

ID            = c130ac62
Eval ID       = 319c009c
Name          = infra-cluster-broccoli.server[0]
Node ID       = 449c9ba8
Job ID        = infra-cluster-broccoli
Client Status = complete

==> Task Resources
Task: "server"
CPU  Memory MB  Disk MB  IOPS  Addresses
500  1024       300      0     http: 10.250.18.29:9000

==> Task "server" is "dead"
Recent Events:
Time                    Type                   Description
06/10/16 14:12:30 CEST  Killed                 Task successfully killed
06/10/16 14:12:23 CEST  Started                Task started by client
04/10/16 16:59:45 CEST  Started                Task started by client
04/10/16 16:59:44 CEST  Downloading Artifacts  Client is downloading artifacts
04/10/16 16:59:44 CEST  Received               Task received by client

Nomad Client configuration

log_level = "INFO"
datacenter = "dc1"
data_dir = "/var/lib/nomad"
bind_addr = "0.0.0.0"
advertise {
  http = "10.250.18.28:4646"
  rpc = "10.250.18.28:4647"
  serf = "10.250.18.28:4648"
}
client {
  enabled = true
  servers = ["10.250.18.27"]
  options {
    "driver.raw_exec.enable" = "1"
    "driver.exec.enable" = "1"
    "driver.docker.enable" = "1"
  }
}

Nomad Client logs (if appropriate)

==> Caught signal: terminated
    2016/10/06 14:10:43 [INFO] agent: requesting shutdown
    2016/10/06 14:10:43 [INFO] client: shutting down
    2016/10/06 14:10:43 [INFO] agent: shutdown complete
    Loaded configuration from /etc/nomad.d/client/config.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: INFO
                Region: global (DC: dc1)
                Server: false

==> Nomad agent started! Log data will stream in below:

    2016/10/06 14:12:19 [INFO] client: using state directory /var/lib/nomad/client
    2016/10/06 14:12:19 [INFO] client: using alloc directory /var/lib/nomad/alloc
    2016/10/06 14:12:19 [INFO] fingerprint.cgroups: cgroups are available
    2016/10/06 14:12:23 [WARN] fingerprint.env_gce: Could not read value for attribute "machine-type"
    2016/10/06 14:12:23 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool eth0'
    2016/10/06 14:12:23 [WARN] fingerprint.network: Unable to read link speed from /sys/class/net/eth0/speed
    2016/10/06 14:12:23 [WARN] client: port not specified, using default port
    2016/10/06 14:12:23 [INFO] client: setting server address list: [10.250.18.27:4647]
    2016/10/06 14:12:23 [ERR] driver.raw_exec: error connecting to plugin so destroying plugin pid and user pid
    2016/10/06 14:12:23 [ERR] driver.raw_exec: error destroying plugin and userpid: 2 error(s) occurred:

* os: process already finished
* os: process already finished
    2016/10/06 14:12:23 [ERR] client: failed to open handle to task 'server' for alloc 'c130ac62-268f-3ae8-3aac-95d315f37b99': error connecting to plugin: error creating rpc client for executor plugin: Reattachment process not found
@dadgar
Contributor

dadgar commented Oct 6, 2016

Hey @Gerrrr,

This is actually expected behavior. What the client is doing is attempting to re-attach to anything that was already running. This can be useful if, for example, you stop the Nomad client to do an in-place upgrade, start it up again, and have it find all of its processes still running.

In your case, there is nothing to connect to anymore because the tasks are dead, so it is just cleaning up.

Let me know if that made sense!

@dadgar dadgar closed this as completed Oct 6, 2016
@Gerrrr
Contributor Author

Gerrrr commented Oct 7, 2016

Hi @dadgar,

Thanks for the explanation; it makes sense to me for in-place upgrades or when you just restart the Nomad client.

In our case the problem was that after the VM reboot, the Nomad client started allocations that had already been rescheduled, so we ended up with jobs running multiple times. The rebooted client then immediately sent SIGKILL to the jobs it had just started.

However, SIGKILL cannot be caught, so the jobs cannot do any cleanup. They should not have been started in the first place, should they? We solved it by putting alloc_dir under /tmp so that the client does not try to rerun previously allocated jobs after a reboot.

@dadgar
Contributor

dadgar commented Oct 7, 2016

Ah, thanks for the clarification. Will re-open.

@dadgar dadgar reopened this Oct 7, 2016
schmichael added a commit that referenced this issue May 8, 2019
Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
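
To make that mechanism concrete, here is a minimal, hypothetical Go sketch of the pattern the commit message describes (not Nomad's actual code): a restored task that fails to re-attach blocks on a "servers contacted" signal before it is allowed to do anything else, so an allocation that was already rescheduled elsewhere is never restarted.

package main

import (
    "fmt"
    "time"
)

// restoreTask simulates restoring one allocation's task after a reboot.
// If re-attaching to the old executor fails, it waits for the first
// server contact before proceeding.
func restoreTask(allocID string, reattached bool, serversContacted <-chan struct{}) {
    if reattached {
        fmt.Printf("alloc %s: re-attached to the running task, nothing to do\n", allocID)
        return
    }
    // Re-attach failed (the process died with the reboot): block until the
    // client has heard from a server and knows the alloc's desired status.
    <-serversContacted
    fmt.Printf("alloc %s: server contacted, desired status known, safe to proceed\n", allocID)
}

func main() {
    // serversContacted is closed exactly once, after the first successful
    // registration with a server (a hypothetical signal for this sketch).
    serversContacted := make(chan struct{})

    done := make(chan struct{})
    go func() {
        restoreTask("c130ac62", false, serversContacted)
        close(done)
    }()

    // Pretend the first server contact happens a moment after restore.
    time.Sleep(100 * time.Millisecond)
    close(serversContacted)
    <-done
}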
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 23, 2022