Nomad server process dies after submitting a batch job twice #1462

Closed
minhdanh opened this issue Jul 23, 2016 · 3 comments · Fixed by #1471

@minhdanh

I'm encountering a mysterious issue that causes the Nomad server agent to exit right away. It happens the second time I submit a batch job to the Nomad server. After that, whenever I try to start Nomad again, it exits almost immediately.

Here's my Nomad config for the server agent:

log_level = "DEBUG"

bind_addr = "0.0.0.0"

data_dir = "/var/lib/nomad/server"

enable_debug = true
enable_syslog = true

name = "nomad-server-192.168.45.14"

advertise {
    http = "192.168.45.14:4646"
    rpc = "192.168.45.14:4647"
    serf = "192.168.45.14:4648"
}

consul {
  server_service_name = "nomad"
  auto_advertise = true
  server_auto_join = true
  address = "192.168.45.14:8500"
}

server {
    enabled = true
    bootstrap_expect = 1
}

Nomad version

Nomad v0.4.0

Operating system and Environment details

Ubuntu trusty64 (running on VirtualBox with Vagrant)

Issue

The Nomad server process stops unexpectedly. After a restart, it exits again almost immediately.

Reproduction steps

Submit the attached job (glusterfs.hcl) to Nomad by running this command on the Nomad server:
nomad run -detach glusterfs.hcl

The second time the job is submitted, the Nomad process exits.

To get Nomad running again, I had to run this one-liner several times: /etc/init.d/nomad start && nomad stop 3wp-glusterfs
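The repeated restart can be scripted. This is a minimal sketch, assuming a POSIX shell; `retry_until_ok` and `RETRY_DELAY` are hypothetical names that are not part of Nomad, and only the one-liner in the usage comment comes from this report.

```shell
#!/bin/sh
# retry_until_ok MAX CMD...: run CMD up to MAX times and stop at the
# first success. Hypothetical helper, not part of Nomad.
retry_until_ok() {
    max="$1"
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Pause between attempts; RETRY_DELAY (seconds) is an assumed knob.
        sleep "${RETRY_DELAY:-1}"
    done
    return 1
}

# Usage with the one-liner from the report (the limit of 10 is arbitrary):
# retry_until_ok 10 sh -c '/etc/init.d/nomad start && nomad stop 3wp-glusterfs'
```

The helper only automates the retry; it does not address why the server exits.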

I tried changing the task name (from create-volume to something else) or the command in the config block (say, to echo 1). When I then submitted the job again, it ran fine. But the second submission of the modified job failed again.

Nomad Server logs

Jul 23 02:01:19 vagrant-ubuntu-trusty-64 nomad[9702]: http: Request /v1/jobs?region=global (2.320808ms)
Jul 23 02:01:19 vagrant-ubuntu-trusty-64 nomad[9702]: worker: dequeued evaluation 6682de9c-9dba-2b09-51b2-a50ac7d7a751
Jul 23 02:01:19 vagrant-ubuntu-trusty-64 nomad[9702]: sched: <Eval '6682de9c-9dba-2b09-51b2-a50ac7d7a751' JobID: '3wp-glusterfs'>: allocs: (place 0) (update 16) (migrate 0) (stop 0) (ignore 0)

Nomad Client logs

Jul 23 02:06:07 vagrant-ubuntu-trusty-64 nomad[9169]: client: RPC failed to server 192.168.45.14:4647: rpc error: EOF
Jul 23 02:06:07 vagrant-ubuntu-trusty-64 nomad[9169]: client: failed to query for node allocations: rpc error: EOF

Job file

# glusterfs.hcl
job "3wp-glusterfs" {
    region = "global"

    datacenters = ["dc1"]

    type = "batch"

    group "glusterfs" {
        count = 1

        task "create-volume" {
            driver = "raw_exec"
            config {
              command = "/usr/local/bin/create_volume"
              args = [
                  "-v", "3wp",
                  "-b", "/srv/gluster/brick/3wp",
                  "-s", "192.168.45.10,192.168.45.11"
              ]
            }
            constraint {
                attribute = "${node.class}"
                value = "glusterfs"
            }
            resources {
            }
        }
    }
}

@dadgar
Contributor

dadgar commented Jul 25, 2016

Hey! I tried to reproduce this but could not.

Is there any way you can paste the full client and server logs?

@minhdanh
Author

Hi @dadgar,

I can reproduce this very easily. I noticed that it happens when I follow this sequence:

  1. Run the above job. The first time it will be OK.
  2. Change the task name to something like create-volume1. Run this job again; it will be OK, too.
  3. Run this job a second or third time. This time it will cause Nomad to exit.

You can change the job name and repeat the above steps.
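The sequence above can be sketched as a small script. This is an illustration under stated assumptions, not a verified reproduction: `repro_sequence`, the `NOMAD` variable, and the `-renamed.hcl` file name are hypothetical, and only `nomad run -detach <file>` and the task names come from this thread.

```shell
#!/bin/sh
# NOMAD can be overridden to point at a different binary (assumed knob).
NOMAD="${NOMAD:-nomad}"

# repro_sequence JOB_FILE: submit the job, rename its task, resubmit.
repro_sequence() {
    job_file="$1"
    renamed="${job_file%.hcl}-renamed.hcl"

    # Step 1: first submission, reported to succeed.
    "$NOMAD" run -detach "$job_file" || return 1

    # Step 2: rename the task and submit again, also reported to succeed.
    sed 's/task "create-volume"/task "create-volume1"/' "$job_file" > "$renamed" || return 1
    "$NOMAD" run -detach "$renamed" || return 1

    # Step 3: submit the renamed job a second and third time; the server
    # was reported to exit at this point.
    "$NOMAD" run -detach "$renamed"
    "$NOMAD" run -detach "$renamed"
}

# Usage: repro_sequence glusterfs.hcl
```

Writing the renamed job to a separate file keeps the original glusterfs.hcl unchanged, so the sequence can be repeated from a clean starting point.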

I've uploaded my Nomad client and server logs anyway. Please refer to the gist.

https://gist.github.com/minhdanh/e00979491d80f2a8b74960bdec285bee

My job file includes another task called download-wordpress, but it doesn't seem to be the problem.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 20, 2022