Nomad server process dies after submitting a batch job twice #1462

minhdanh · 2016-07-23T02:18:21Z

I'm encountering a mysterious issue that causes the Nomad server agent to exit right away. It happens the second time I submit a batch job to the Nomad server. Whenever I try to start Nomad again, it exits almost immediately.

Here's my Nomad config on the server agent:

log_level = "DEBUG"

bind_addr = "0.0.0.0"

data_dir = "/var/lib/nomad/server"

enable_debug = true
enable_syslog = true

name = "nomad-server-192.168.45.14"

advertise {
    http = "192.168.45.14:4646"
    rpc = "192.168.45.14:4647"
    serf = "192.168.45.14:4648"
}

consul {
  server_service_name = "nomad"
  auto_advertise = true
  server_auto_join = true
  address = "192.168.45.14:8500"
}

server {
    enabled = true
    bootstrap_expect = 1
}

Nomad version

Nomad v0.4.0

Operating system and Environment details

Ubuntu trusty64 (running on VirtualBox with Vagrant)

Issue

The Nomad server process stops unexpectedly. It also exits almost immediately after being restarted.

Reproduction steps

Submit the attached job (glusterfs.hcl) to Nomad by running this command on the Nomad server:
nomad run -detach glusterfs.hcl

The second time the above job is run, the Nomad process exits.

To start Nomad again I had to run this one-liner several times: /etc/init.d/nomad start && nomad stop 3wp-glusterfs
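Repeating that one-liner by hand can be wrapped in a small retry loop. The following is only a minimal sketch: the helper name retry_until_ok, the RETRY_DELAY variable, and the attempt count of 10 are hypothetical and not part of Nomad; only the two commands in the commented usage line come from this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nomad): run a command repeatedly until it
# succeeds or the given number of attempts is used up.
retry_until_ok() {
    max_attempts=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # "$@" is the command to try; stop as soon as it exits with status 0.
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # RETRY_DELAY is an assumed knob; it defaults to one second.
        sleep "${RETRY_DELAY:-1}"
    done
    return 1
}

# Usage on the affected server (commented out so the sketch is safe to source):
# retry_until_ok 10 sh -c '/etc/init.d/nomad start && nomad stop 3wp-glusterfs'
```

This only automates the workaround; it does nothing about the underlying crash.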

I tried changing the task name (create-volume to something else) or the command in the config block (say, to echo 1). When I then submitted the job again, it ran fine. But the second time I submitted that modified job, it failed again.

Nomad Server logs

Jul 23 02:01:19 vagrant-ubuntu-trusty-64 nomad[9702]: http: Request /v1/jobs?region=global (2.320808ms)
Jul 23 02:01:19 vagrant-ubuntu-trusty-64 nomad[9702]: worker: dequeued evaluation 6682de9c-9dba-2b09-51b2-a50ac7d7a751
Jul 23 02:01:19 vagrant-ubuntu-trusty-64 nomad[9702]: sched: <Eval '6682de9c-9dba-2b09-51b2-a50ac7d7a751' JobID: '3wp-glusterfs'>: allocs: (place 0) (update 16) (migrate 0) (stop 0) (ignore 0)

Nomad Client logs

Jul 23 02:06:07 vagrant-ubuntu-trusty-64 nomad[9169]: client: RPC failed to server 192.168.45.14:4647: rpc error: EOF
Jul 23 02:06:07 vagrant-ubuntu-trusty-64 nomad[9169]: client: failed to query for node allocations: rpc error: EOF

Job file

# glusterfs.hcl
job "3wp-glusterfs" {
    region = "global"

    datacenters = ["dc1"]

    type = "batch"

    group "glusterfs" {
        count = 1

        task "create-volume" {
            driver = "raw_exec"
            config {
              command = "/usr/local/bin/create_volume"
              args = [
                  "-v", "3wp",
                  "-b", "/srv/gluster/brick/3wp",
                  "-s", "192.168.45.10,192.168.45.11"
              ]
            }
            constraint {
                attribute = "${node.class}"
                value = "glusterfs"
            }
            resources {
            }
        }
    }
}

dadgar · 2016-07-25T22:52:39Z

Hey! I tried to reproduce this and could not.

Is there any way you can paste the full client and server logs?

minhdanh · 2016-07-26T01:27:19Z

Hi @dadgar,

I can reproduce this very easily. I've noticed that it happens when I follow this sequence:

Run the above job. The first time it will be OK.
Change the task name to something like create-volume1. Run the job again; it will be OK, too.
Run the job a second or third time. This time it will cause Nomad to exit.

You can change the job name and repeat the above steps.

I've uploaded my Nomad client and server logs anyway. Please refer to the gist:

https://gist.github.com/minhdanh/e00979491d80f2a8b74960bdec285bee

My job file includes another task called download-wordpress, but that doesn't seem to be the problem.

github-actions · 2022-12-20T02:15:23Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

dadgar added the stage/waiting-reply label Jul 25, 2016

dadgar added type/bug theme/core and removed stage/waiting-reply labels Jul 26, 2016

dadgar mentioned this issue Jul 27, 2016

filterCompleteAllocs filters replaced batch allocs #1471

Merged

dadgar closed this as completed in #1471 Jul 28, 2016

github-actions bot locked as resolved and limited conversation to collaborators Dec 20, 2022
