Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nomad fails to place second allocation in basic job #14235

Closed
shoenig opened this issue Aug 23, 2022 · 8 comments · Fixed by #14248
Closed

nomad fails to place second allocation in basic job #14235

shoenig opened this issue Aug 23, 2022 · 8 comments · Fixed by #14248
Labels
Milestone

Comments

@shoenig
Copy link
Member

shoenig commented Aug 23, 2022

This job reliably fails to run one of the two allocations. Using dev mode and machine has plenty of resources. The only interesting thing I can think of is it's running CentOS 9.

➜ uname -a
Linux thinkcentre 5.14.0-142.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Aug 4 18:15:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

The job

job "py" {
  datacenters = ["dc1"]

  group "web1" {
    task "py1" {
      driver = "raw_exec"
      config {
        command = "python3"
        args    = ["-m", "http.server", "8081"]
      }
      resources {
        memory = 128
	cpu = 100
      }
    }
  }

  group "web2" {
    task "py2" {
      driver = "raw_exec"
      config {
        command = "python3"
        args    = ["-m", "http.server", "8082"]
      }
      resources {
        memory = 128
	cpu = 100
      }
    }
  }  
}

Running the job

➜ nomad job run py.nomad
==> 2022-08-23T10:52:09-05:00: Monitoring evaluation "bcf3f9a8"
    2022-08-23T10:52:09-05:00: Evaluation triggered by job "py"
    2022-08-23T10:52:10-05:00: Evaluation within deployment: "2733f7be"
    2022-08-23T10:52:10-05:00: Allocation "91ddb242" created: node "d1b86c81", group "web1"
    2022-08-23T10:52:10-05:00: Allocation "adeb4be7" created: node "d1b86c81", group "web2"
    2022-08-23T10:52:10-05:00: Evaluation status changed: "pending" -> "complete"
==> 2022-08-23T10:52:10-05:00: Evaluation "bcf3f9a8" finished with status "complete"
==> 2022-08-23T10:52:10-05:00: Monitoring deployment "2733f7be"
  ⠏ Deployment "2733f7be" in progress...
    
    2022-08-23T10:52:55-05:00
    ID          = 2733f7be
    Job ID      = py
    Job Version = 0
    Status      = running
    Description = Deployment is running
    
    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    web1        1        1       0        0          2022-08-23T11:02:09-05:00
    web2        1        1       1        0          2022-08-23T11:02:19-05:00^C

One of the allocs will never become healthy

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
91ddb242  d1b86c81  web1        0        run      pending  1m30s ago  1m30s ago
adeb4be7  d1b86c81  web2        0        run      running  1m30s ago  1m20s ago

Looking at the evals

➜ nomad eval list
ID        Priority  Triggered By        Job ID  Namespace  Node ID  Status    Placement Failures
1e369795  50        deployment-watcher  py      default    <none>   complete  false
bcf3f9a8  50        job-register        py      default    <none>   complete  false
➜ nomad job status py
ID            = py
Name          = py
Submit Date   = 2022-08-23T10:52:09-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Alloc status returns incomplete results

➜ nomad alloc status 91
ID                  = 91ddb242-dcdf-4802-5910-777d5929ef04
Eval ID             = bcf3f9a8
Name                = py.web1[0]
Node ID             = d1b86c81
Node Name           = thinkcentre
Job ID              = py
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 3m47s ago
Modified            = 3m47s ago
Deployment ID       = 2733f7be
Deployment Health   = unset

~/Config 
➜ 

There are no WARN or ERROR statements regarding the eval/deployment/alloc.

@shoenig shoenig added this to the 1.3.4 milestone Aug 23, 2022
@shoenig
Copy link
Member Author

shoenig commented Aug 23, 2022

spotted this in a stack dump while the agent was stuck - not sure I like the coincidence of being in the same function as #14217

goroutine 3365 [syscall]:
runtime.cgocall(0x1e11ab0, 0xc00086b9b0)
	runtime/cgocall.go:158 +0x5c fp=0xc00086b988 sp=0xc00086b950 pc=0x40761c
os/user._Cfunc_mygetpwnam_r(0xc001312080, 0xc000a68060, 0x7f8f94000c90, 0x400, 0xc0007ae8a0)
	_cgo_gotypes.go:159 +0x4c fp=0xc00086b9b0 sp=0xc00086b988 pc=0xaa62ec
os/user.lookupUser.func1.1({0xc001312080, 0xc000e8a190?, 0x0?}, 0xc000580400?, 0xc00086ba58?, 0x411547?)
	os/user/cgo_lookup_unix.go:69 +0xbb fp=0xc00086ba10 sp=0xc00086b9b0 pc=0xaa6fdb
os/user.lookupUser.func1()
	os/user/cgo_lookup_unix.go:69 +0x31 fp=0xc00086ba50 sp=0xc00086ba10 pc=0xaa6ef1
os/user.retryWithBuffer(0xc000e8a190, 0xc00086bb40)
	os/user/cgo_lookup_unix.go:244 +0x39 fp=0xc00086ba98 sp=0xc00086ba50 pc=0xaa8619
os/user.lookupUser({0x26ddbb4, 0x6})
	os/user/cgo_lookup_unix.go:60 +0x168 fp=0xc00086bb90 sp=0xc00086ba98 pc=0xaa6cc8
os/user.Lookup({0x26ddbb4, 0x6})
	os/user/lookup.go:36 +0x91 fp=0xc00086bbd0 sp=0xc00086bb90 pc=0xaa5c11
github.com/hashicorp/nomad/client/allocdir.dropDirPermissions({0xc0007dc0a0, 0x4a}, 0x86bd30?)
	github.com/hashicorp/nomad/client/allocdir/fs_unix.go:42 +0x125 fp=0xc00086bca8 sp=0xc00086bbd0 pc=0xe00145
github.com/hashicorp/nomad/client/allocdir.(*AllocDir).Build(0xc0005d6700)
	github.com/hashicorp/nomad/client/allocdir/alloc_dir.go:339 +0x245 fp=0xc00086bd60 sp=0xc00086bca8 pc=0xdfd4c5
github.com/hashicorp/nomad/client/allocrunner.(*allocDirHook).Prerun(0x0?)
	github.com/hashicorp/nomad/client/allocrunner/allocdir_hook.go:28 +0x1c fp=0xc00086bd78 sp=0xc00086bd60 pc=0x15beb9c
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).prerun(0xc0004e6280)
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner_hooks.go:197 +0x427 fp=0xc00086bed0 sp=0xc00086bd78 pc=0x15bbc27
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).Run(0xc0004e6280)
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:342 +0x13a fp=0xc00086bfc8 sp=0xc00086bed0 pc=0x15b283a
github.com/hashicorp/nomad/client.(*Client).addAlloc.func2()
	github.com/hashicorp/nomad/client/client.go:2548 +0x26 fp=0xc00086bfe0 sp=0xc00086bfc8 pc=0x1607fe6
runtime.goexit()
	runtime/asm_amd64.s:1594 +0x1 fp=0xc00086bfe8 sp=0xc00086bfe0 pc=0x472581
created by github.com/hashicorp/nomad/client.(*Client).addAlloc
	github.com/hashicorp/nomad/client/client.go:2548 +0x8c5

@shoenig
Copy link
Member Author

shoenig commented Aug 23, 2022

Whelp it seems we sometimes enter user.Lookup and never return.

In the logs we see both allocs being created, entering dropDirPermissions. However, once we see

[trace] alloc_dir enter lookup path=/tmp/NomadClient3252994847/56f37b30-d96d-7ee5-c1b6-86c1a25351f4/alloc

we apparently never return from the lookup, and never hear anything about alloc 56f37b30 ever again (!)

[info] alloc_dir dropDirPermissions path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc desired=2d727778727778727778
[info] alloc_dir dropDirPermissions path=/tmp/NomadClient3252994847/56f37b30-d96d-7ee5-c1b6-86c1a25351f4/alloc desired=2d727778727778727778
[info] alloc_dir Chmod ok path=/tmp/NomadClient3252994847/56f37b30-d96d-7ee5-c1b6-86c1a25351f4/alloc path=/tmp/NomadClient3252994847/56f37b30-d96d-7ee5-c1b6-86c1a25351f4/alloc
[info] alloc_dir Chmod ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc
[info] alloc_dir Geteuid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc
[info] alloc_dir Geteuid ok path=/tmp/NomadClient3252994847/56f37b30-d96d-7ee5-c1b6-86c1a25351f4/alloc
[trace] alloc_dir enter lookup path=/tmp/NomadClient3252994847/56f37b30-d96d-7ee5-c1b6-86c1a25351f4/alloc
[trace] alloc_dir enter lookup path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc
[info] alloc_dir Lookup ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc
[info] alloc_dir getUid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc
[info] alloc_dir getGid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc
[info] alloc_dir Chown ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc
[info] alloc_dir dropDirPermissions path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/logs desired=2d727778727778727778
[info] alloc_dir Chmod ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/logs path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/logs
[info] alloc_dir Geteuid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/logs
[trace] alloc_dir enter lookup path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/logs
[info] alloc_dir Lookup ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/logs
[info] alloc_dir getUid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/logs
[info] alloc_dir getGid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/logs
[info] alloc_dir Chown ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/logs
[info] alloc_dir dropDirPermissions path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/tmp desired=2d727778727778727778
[info] alloc_dir Chmod ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/tmp path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/tmp
[info] alloc_dir Geteuid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/tmp
[trace] alloc_dir enter lookup path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/tmp
[info] alloc_dir Lookup ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/tmp
[info] alloc_dir getUid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/tmp
[info] alloc_dir getGid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/tmp
[info] alloc_dir Chown ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/tmp
[info] alloc_dir dropDirPermissions path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/data desired=2d727778727778727778
[info] alloc_dir Chmod ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/data path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/data
[info] alloc_dir Geteuid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/data
[trace] alloc_dir enter lookup path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/data
[info] alloc_dir Lookup ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/data
[info] alloc_dir getUid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/data
[info] alloc_dir getGid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/data
[info] alloc_dir Chown ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/alloc/data
[info] alloc_dir dropDirPermissions path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2 desired=2d727778727778727778
[info] alloc_dir Chmod ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2 path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2
[info] alloc_dir Geteuid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2
[trace] alloc_dir enter lookup path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2
[info] alloc_dir Lookup ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2
[info] alloc_dir getUid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2
[info] alloc_dir getGid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2
[info] alloc_dir Chown ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2
[info] alloc_dir dropDirPermissions path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/local desired=2d727778727778727778
[info] alloc_dir Chmod ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/local path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/local
[info] alloc_dir Geteuid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/local
[trace] alloc_dir enter lookup path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/local
[info] alloc_dir Lookup ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/local
[info] alloc_dir getUid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/local
[info] alloc_dir getGid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/local
[info] alloc_dir Chown ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/local
[info] alloc_dir dropDirPermissions path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/tmp desired=74727778727778727778
[info] alloc_dir Chmod ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/tmp path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/tmp
[info] alloc_dir Geteuid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/tmp
[trace] alloc_dir enter lookup path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/tmp
[info] alloc_dir Lookup ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/tmp
[info] alloc_dir getUid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/tmp
[info] alloc_dir getGid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/tmp
[info] alloc_dir Chown ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/tmp
[info] alloc_dir dropDirPermissions path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/secrets desired=2d727778727778727778
[info] alloc_dir Chmod ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/secrets path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/secrets
[info] alloc_dir Geteuid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/secrets
[trace] alloc_dir enter lookup path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/secrets
[info] alloc_dir Lookup ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/secrets
[info] alloc_dir getUid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/secrets
[info] alloc_dir getGid ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/secrets
[info] alloc_dir Chown ok path=/tmp/NomadClient3252994847/2f4f2293-be3f-545d-8e24-f18ecf2534fb/py2/secrets

@tgross
Copy link
Member

tgross commented Aug 23, 2022

This lookup issue smelled familiar, so leaving a link to this closed issue #12342

@shoenig
Copy link
Member Author

shoenig commented Aug 23, 2022

Yeah we've seen trouble with looking up nobody elsewhere too.

On this install we have nobody in /etc/passwd though so the hang and panic are surprising.

➜ grep nobody /etc/passwd
nobody:x:65534:65534:Kernel Overflow User:/:/sbin/nologin

@shoenig
Copy link
Member Author

shoenig commented Aug 23, 2022

Well some random googling suggests perhaps getpwnam_r isn't thread safe, which makes sense with our symptoms I guess.

e.g.

collectd/collectd#229
https://bugs.schedmd.com/show_bug.cgi?id=3159

We could try doing a single lookup of nobody and cache the result since it should never change anyway.

@tgross
Copy link
Member

tgross commented Aug 23, 2022

Can we get away with using the osusergo build tag to force the non-CGO implementation? (ref) I know that'll block stuff like https://github.com/hashicorp/nomad-enterprise/issues/790 (internal), but it could get us into a shipping state reasonably quickly and we can circle back to fixing this "for real" if we end up tackling that?

@shoenig
Copy link
Member Author

shoenig commented Aug 23, 2022

Huh, TIL! That does actually seem to fix the problem.

➜ git diff
diff --git a/GNUmakefile b/GNUmakefile
index c506dfec9..5f1d99c6e 100644
--- a/GNUmakefile
+++ b/GNUmakefile
@@ -21,7 +21,7 @@ ifndef BIN
 BIN := $(GOPATH)/bin
 endif
 
-GO_TAGS ?=
+GO_TAGS ?= osusergo

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants