Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0.2.2 borked? #573

Closed
pires opened this issue Dec 12, 2015 · 14 comments · Fixed by #591
Closed

0.2.2 borked? #573

pires opened this issue Dec 12, 2015 · 14 comments · Fixed by #591

Comments

@pires
Copy link

pires commented Dec 12, 2015

The following service runs with 0.2.1 but fails with 0.2.2 with no reason to be found:

job "web" {
  datacenters = ["dc1"]

  type = "service"

  constraint {
    distinct_hosts = true
  }

  update {
    stagger = "30s"
    max_parallel = 1
  }

  group "web-servers" {
    count = 1

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx"
      }

      service {
        tags = ["web"]
        port = "http"
        check {
          type = "tcp"
          delay = "10s"
          timeout = "4s"
        }
      }

      resources {
        cpu = 500
        memory = 256
        network {
          port "http" {
          }
        }
      }
    }
  }

}

Also, restarting 0.2.2 crashes:

(...)
Dec 12 12:52:25  sh[2896]: os.Stat(0xc8202dc100, 0x3e, 0x0, 0x0, 0x0, 0x0)
Dec 12 12:52:25  sh[2896]: /usr/local/go/src/os/file_unix.go:154 +0x66
Dec 12 12:52:25  sh[2896]: os.MkdirAll(0xc8202dc100, 0x3e, 0xc8000001c0, 0x0, 0x0)
Dec 12 12:52:25  sh[2896]: /usr/local/go/src/os/path.go:21 +0x4f
Dec 12 12:52:25  sh[2896]: github.com/hashicorp/nomad/client/allocdir.(*AllocDir).Build(0xc8202dc140, 0xc820291700, 0x1, 0x4, 0x0, 0x0)
Dec 12 12:52:25  sh[2896]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/client/allocdir/alloc_dir.go:62 +0x9b
Dec 12 12:52:25  sh[2896]: github.com/hashicorp/nomad/client.(*AllocRunner).Run(0xc8201b4790)
Dec 12 12:52:25  sh[2896]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/client/alloc_runner.go:308 +0x800
Dec 12 12:52:25  sh[2896]: created by github.com/hashicorp/nomad/client.(*Client).restoreState
Dec 12 12:52:25  sh[2896]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/client/client.go:381 +0xb69

The stack is just too long.

@diptanu
Copy link
Contributor

diptanu commented Dec 12, 2015

@pires I am not able to reproduce the issue. Can you please provide us some more details about your environment, configuration, etc?

The job definition had incorrect check defintion, which shouldn't have resulted in a crash so I will need more information to debug here.

The updated job definition is -

job "web" {
  datacenters = ["dc1"]

  type = "service"

  update {
    stagger = "30s"
    max_parallel = 1
  }

  group "web-servers" {

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx"
      }

      service {
        tags = ["web"]
        port = "http"
        check {
          type = "tcp"
          interval = "10s"
          timeout = "4s"
        }
      }

      resources {
        cpu = 500
        memory = 256
        network {
          port "http" {
          }
        }
      }
    }
  }

}

Note that when you are running a system job, there is no need to use the distinct_host constraint and the count. Nomad will run one task per node in your cluster if it's a system job.

@ranjib
Copy link
Contributor

ranjib commented Dec 12, 2015

nomad client is failing to create the task alloc dir, it will be helpful to have the whole stack trace, theres an error check https://github.com/hashicorp/nomad/blob/master/client/allocdir/alloc_dir.go#L62 , but its not clear from the stack trace why the panic occured

@diptanu
Copy link
Contributor

diptanu commented Dec 12, 2015

@ranjib Is it failing for you too or was the comment basically about the stack trace from @pires's issue?

@ranjib
Copy link
Contributor

ranjib commented Dec 12, 2015

im just tried with master + lxc driver.. its working fine for me. comments were specific to @pires issue. I dont think this issue is driver specific,

@pires
Copy link
Author

pires commented Dec 12, 2015

@diptanu first of all, I set job type as service, not system.

Now, I'm using CoreOS to provide 3 servers and 1 client. nomad and consul run as root.

server.hcl:

log_level = "INFO"
data_dir = "/var/lib/nomad/data"
bind_addr = "$private_ipv4"
server {
    enabled = true
    bootstrap_expect = 3
}

client.hcl:

log_level = "DEBUG"
data_dir = "/var/lib/nomad/data"
bind_addr = "$private_ipv4"
client {
    enabled = true
    options = {
        consul.address = "$private_ipv4"
    }
}

Servers server-join, client join and Consul are scripted and are working.

$ NOMAD_ADDR=http://10.142.0.2:4646 nomad server-members
Name                            Addr        Port  Status  Proto  Build  DC   Region
server-asia-01.internal.global  10.140.0.2  4648  alive   2      0.2.2  dc1  global
server-eu-01.internal.global    10.132.0.2  4648  alive   2      0.2.2  dc1  global
server-us-01.internal.global    10.142.0.2  4648  alive   2      0.2.2  dc1  global
$ NOMAD_ADDR=http://10.142.0.2:4646 nomad node-status
ID                                    DC   Name                   Class   Drain  Status
9b0ee15f-5f0c-9d44-be27-83968d8ca476  dc1  client-us-01.internal  <none>  false  ready
$ curl localhost:8500/v1/catalog/nodes
[
  {
    "Node":"client-us-01.internal",
    "Address":"10.142.0.3",
    "CreateIndex":34,
    "ModifyIndex":47
  },
  {
    "Node":"server-asia-01.internal",
    "Address":"10.140.0.2",
    "CreateIndex":4,
    "ModifyIndex":10
  },
  {
    "Node":"server-eu-01.internal",
    "Address":"10.132.0.2",
    "CreateIndex":3,
    "ModifyIndex":9
  },
  {
    "Node":"server-us-01.internal",
    "Address":"10.142.0.2",
    "CreateIndex":7,
    "ModifyIndex":8
  }
]

@pires
Copy link
Author

pires commented Dec 12, 2015

Client logs on first start:

/opt/bin/nomad agent -config /etc/nomad/client.hcl -servers 10.142.0.2:4647 10.132.0.2:4647 10.140.0.2:4647

Dec 12 18:12:20 client-us-01.internal sh[957]: 2015/12/12 18:12:18 [WARN] fingerprint.network: Unable to read link speed from /sys/class/net/ens4v1/speed
Dec 12 18:12:20 client-us-01.internal sh[957]: 2015/12/12 18:12:18 [DEBUG] fingerprint.network: Unable to read link speed; setting to default 100
Dec 12 18:12:20 client-us-01.internal sh[957]: 2015/12/12 18:12:18 [DEBUG] client: applied fingerprints [arch cpu env_gce host memory network storage]
Dec 12 18:12:20 client-us-01.internal sh[957]: 2015/12/12 18:12:18 [DEBUG] driver.docker: using client connection initialized from environment
Dec 12 18:12:20 client-us-01.internal sh[957]: 2015/12/12 18:12:18 [DEBUG] driver.docker: privileged containers are disabled
Dec 12 18:12:20 client-us-01.internal sh[957]: 2015/12/12 18:12:19 [DEBUG] client: available drivers [exec rkt docker]
Dec 12 18:12:20 client-us-01.internal sh[957]: 2015/12/12 18:12:19 [DEBUG] client: node registration complete
Dec 12 18:12:20 client-us-01.internal sh[957]: 2015/12/12 18:12:19 [DEBUG] client: updated allocations at index 1 (0 allocs)
Dec 12 18:12:20 client-us-01.internal sh[957]: 2015/12/12 18:12:19 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
Dec 12 18:12:22 client-us-01.internal sh[957]: 2015/12/12 18:12:22 [DEBUG] client: state updated to ready

Client logs after nomad run web.hcl:

$ NOMAD_ADDR=http://10.142.0.2:4646 nomad run web.hcl
==> Monitoring evaluation "bcac288e-ce0d-beb9-fc7f-7f82921f934a"
    Evaluation triggered by job "web"
    Allocation "89c81e1e-c91d-f997-b386-7edc6f5d5f8f" created: node "9b0ee15f-5f0c-9d44-be27-83968d8ca476", group "servers"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "bcac288e-ce0d-beb9-fc7f-7f82921f934a" finished with status "complete"

$ NOMAD_ADDR=http://10.142.0.2:4646 nomad status web
ID          = web
Name        = web
Type        = service
Priority    = 50
Datacenters = dc1
Status      = <none>

==> Evaluations
ID                                    Priority  TriggeredBy   Status
bcac288e-ce0d-beb9-fc7f-7f82921f934a  50        job-register  complete

==> Allocations
ID                                    EvalID                                NodeID                                TaskGroup  Desired  Status
89c81e1e-c91d-f997-b386-7edc6f5d5f8f  bcac288e-ce0d-beb9-fc7f-7f82921f934a  9b0ee15f-5f0c-9d44-be27-83968d8ca476  servers    run      failed

Client logs after nomad run web.hcl:

Dec 12 18:24:04 client-us-01.internal sh[1498]: 2015/12/12 18:24:04 [DEBUG] client: updated allocations at index 23 (1 allocs)
Dec 12 18:24:04 client-us-01.internal sh[1498]: 2015/12/12 18:24:04 [DEBUG] client: allocs: (added 1) (removed 0) (updated 0) (ignore 0)
Dec 12 18:24:04 client-us-01.internal sh[1498]: 2015/12/12 18:24:04 [DEBUG] client: starting runner for alloc '89c81e1e-c91d-f997-b386-7edc6f5d5f8f'
Dec 12 18:24:04 client-us-01.internal sh[1498]: 2015/12/12 18:24:04 [WARN] client: failed to build task directories: User not found in /etc/passwd
Dec 12 18:24:04 client-us-01.internal sh[1498]: 2015/12/12 18:24:04 [DEBUG] client: updated allocations at index 25 (1 allocs)
Dec 12 18:24:04 client-us-01.internal sh[1498]: 2015/12/12 18:24:04 [DEBUG] client: allocs: (added 0) (removed 0) (updated 1) (ignore 0)

Client logs after nomad restart:

Dec 12 18:26:43 client-us-01.internal sh[1657]: goroutine 28 [runnable]:
Dec 12 18:26:43 client-us-01.internal sh[1657]: github.com/hashicorp/go-checkpoint.CheckInterval.func1(0x4e94914f0000, 0xc82000bf80, 0xc820287630, 0xc8201c6fc0)
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/go-checkpoint/checkpoint.go:195
Dec 12 18:26:43 client-us-01.internal sh[1657]: created by github.com/hashicorp/go-checkpoint.CheckInterval
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/go-checkpoint/checkpoint.go:205 +0xb7
Dec 12 18:26:43 client-us-01.internal sh[1657]: goroutine 29 [runnable]:
Dec 12 18:26:43 client-us-01.internal sh[1657]: github.com/hashicorp/nomad/command/agent.(*Command).setupAgent.func1(0xc82015aaa0, 0xc82000bf80)
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/command/agent/command.go:310
Dec 12 18:26:43 client-us-01.internal sh[1657]: created by github.com/hashicorp/nomad/command/agent.(*Command).setupAgent
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/command/agent/command.go:313 +0x9c2
Dec 12 18:26:42 client-us-01.internal sh[1657]: created by net/http.(*Transport).dialConn
Dec 12 18:26:42 client-us-01.internal sh[1657]: /home/diptanuc/Src/go/src/net/http/transport.go:686 +0xc9d
Dec 12 18:26:42 client-us-01.internal sh[1657]: goroutine 39 [select]:
Dec 12 18:26:42 client-us-01.internal sh[1657]: net/http.(*persistConn).writeLoop(0xc820208160)
Dec 12 18:26:42 client-us-01.internal sh[1657]: /home/diptanuc/Src/go/src/net/http/transport.go:1009 +0x40c
Dec 12 18:26:42 client-us-01.internal sh[1657]: created by net/http.(*Transport).dialConn
Dec 12 18:26:42 client-us-01.internal sh[1657]: /home/diptanuc/Src/go/src/net/http/transport.go:686 +0xc9d
Dec 12 18:26:42 client-us-01.internal sh[1657]: goroutine 24 [runnable]:
Dec 12 18:26:42 client-us-01.internal sh[1657]: fmt.(*pp).handleMethods(0xc8201e6000, 0xc800000076, 0x0, 0xc800000001)
Dec 12 18:26:42 client-us-01.internal sh[1657]: /usr/local/go/src/fmt/print.go:724 +0x50b
Dec 12 18:26:42 client-us-01.internal sh[1657]: fmt.(*pp).printArg(0xc8201e6000, 0xc74540, 0xc8202509f0, 0x76, 0x0, 0x0)
Dec 12 18:26:42 client-us-01.internal sh[1657]: /usr/local/go/src/fmt/print.go:806 +0x4a5
Dec 12 18:26:42 client-us-01.internal sh[1657]: fmt.(*pp).doPrintf(0xc8201e6000, 0xf17620, 0x33, 0xc8201bfd68, 0x1, 0x1)
Dec 12 18:26:42 client-us-01.internal sh[1657]: /usr/local/go/src/fmt/print.go:1219 +0x1dd8
Dec 12 18:26:42 client-us-01.internal sh[1657]: fmt.Sprintf(0xf17620, 0x33, 0xc8201bfd68, 0x1, 0x1, 0x0, 0x0)
Dec 12 18:26:42 client-us-01.internal sh[1657]: /usr/local/go/src/fmt/print.go:203 +0x6f
Dec 12 18:26:42 client-us-01.internal sh[1657]: log.(*Logger).Printf(0xc820196690, 0xf17620, 0x33, 0xc8201bfd68, 0x1, 0x1)
Dec 12 18:26:42 client-us-01.internal sh[1657]: /usr/local/go/src/log/log.go:173 +0x49
Dec 12 18:26:42 client-us-01.internal sh[1657]: github.com/hashicorp/nomad/client.(*AllocRunner).Run(0xc8200bbce0)
Dec 12 18:26:42 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/client/alloc_runner.go:309 +0x8fd
Dec 12 18:26:43 client-us-01.internal sh[1657]: created by github.com/hashicorp/nomad/client.(*Client).restoreState
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/client/client.go:381 +0xb69
Dec 12 18:26:43 client-us-01.internal sh[1657]: goroutine 25 [runnable]:
Dec 12 18:26:43 client-us-01.internal sh[1657]: github.com/hashicorp/nomad/client.(*Client).run(0xc8201a0000)
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/client/client.go:576
Dec 12 18:26:43 client-us-01.internal sh[1657]: created by github.com/hashicorp/nomad/client.NewClient
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/client/client.go:145 +0xabf
Dec 12 18:26:43 client-us-01.internal sh[1657]: goroutine 26 [runnable]:
Dec 12 18:26:43 client-us-01.internal sh[1657]: github.com/hashicorp/nomad/client.(*ConsulService).SyncWithConsul(0xc820194600)
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/client/consul.go:188
Dec 12 18:26:43 client-us-01.internal sh[1657]: created by github.com/hashicorp/nomad/client.NewClient
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/client/client.go:148 +0xae5
Dec 12 18:26:43 client-us-01.internal sh[1657]: goroutine 27 [runnable]:
Dec 12 18:26:43 client-us-01.internal sh[1657]: net/http.Serve(0x7f9406a8f748, 0xc82002a410, 0x7f9406a8f780, 0xc82028e180, 0x7f9406a8f748, 0xc82002a410)
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Src/go/src/net/http/server.go:1763
Dec 12 18:26:43 client-us-01.internal sh[1657]: created by github.com/hashicorp/nomad/command/agent.NewHTTPServer
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/command/agent/http.go:59 +0x4ca
Dec 12 18:26:43 client-us-01.internal sh[1657]: goroutine 28 [runnable]:
Dec 12 18:26:43 client-us-01.internal sh[1657]: github.com/hashicorp/go-checkpoint.CheckInterval.func1(0x4e94914f0000, 0xc82000bf80, 0xc820287630, 0xc8201c6fc0)
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/go-checkpoint/checkpoint.go:195
Dec 12 18:26:43 client-us-01.internal sh[1657]: created by github.com/hashicorp/go-checkpoint.CheckInterval
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/go-checkpoint/checkpoint.go:205 +0xb7
Dec 12 18:26:43 client-us-01.internal sh[1657]: goroutine 29 [runnable]:
Dec 12 18:26:43 client-us-01.internal sh[1657]: github.com/hashicorp/nomad/command/agent.(*Command).setupAgent.func1(0xc82015aaa0, 0xc82000bf80)
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/command/agent/command.go:310
Dec 12 18:26:43 client-us-01.internal sh[1657]: created by github.com/hashicorp/nomad/command/agent.(*Command).setupAgent
Dec 12 18:26:43 client-us-01.internal sh[1657]: /home/diptanuc/Projects/gocode/src/github.com/hashicorp/nomad/command/agent/command.go:313 +0x9c2

@diptanu
Copy link
Contributor

diptanu commented Dec 12, 2015

@pires First of all sorry, somehow I read the scheduler was system and not service! And secondly, thanks for the detailed logs. I think I know what's going on here. The reason this is failing is because -

Dec 12 18:24:04 2015/12/12 18:24:04 [WARN] client: failed to build task directories: User not found in /etc/passwd

We merged a PR related to how we find users, to get rid of cgo dependencies.

@dadgar You might want to take a look at this?

@diptanu
Copy link
Contributor

diptanu commented Dec 12, 2015

@pires Looks like we were making a syscall to get the user nobody earlier, but now we are looking up information about the user from /etc/passwd. I think in your setup the user is missing from /etc/passwd and hence this is breaking. I don't have a lot of experience with CoreOS. Can you please confirm?

@pires
Copy link
Author

pires commented Dec 12, 2015

@diptanu I can confirm there is no such user.

But mind that the client starts at first and becomes ready, but crashes on a restart that happens after the allocation failed.

@diptanu
Copy link
Contributor

diptanu commented Dec 12, 2015

@pires I think the client crashes on restart because it's trying to restore some state regarding the allocdirs, but it is not finding them. This is probably an un-related bug, which needs to be fixed too. I think the main issue here is that Nomad isn't able to find the nobody user.

@dadgar Since Nomad relies on that user, we shouldn't mark that node as ready if it's can't find it. Also I am wondering if the user could be configurable by the user.

@pires
Copy link
Author

pires commented Dec 12, 2015

@diptanu I was wrong, the nobody user exists but just isn't listed in /etc/passwd.

$ sudo useradd nobody -d /nonexistent -s /usr/sbin/nologin
useradd: user 'nobody' already exists

@diptanu
Copy link
Contributor

diptanu commented Dec 12, 2015

@pires So the user nobody exists but in /usr/share/baselayout which is pulled in by nssswitch, so this was working when we were making a syscall to figure out the users.

I will work on a PR which will use getent [1] to get information about users instead of reading /etc/passwd

We should be able to release a 0.2.3-rc-1 soon, I will update this thread once we do, please let us know if that would resolve your issue.

[1] http://man7.org/linux/man-pages/man1/getent.1.html

@pires
Copy link
Author

pires commented Dec 12, 2015

I'll wait on 0.2.3-rc1 then. Adding the line to /etc/passwd seems to work but having other issues now. Will open different issues.

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants