Nomad 1.4.0-rc1 user lookup fails with NSS #14737

Closed
jdoss opened this issue Sep 28, 2022 · 8 comments

Comments

jdoss commented Sep 28, 2022

Nomad version

Output from nomad version

Nomad v1.4.0-rc.1 (6aa153c)

Operating system and Environment details

Fedora CoreOS 36.20220906.3.2

Issue

Nomad 1.4 can no longer allocate tasks if the nobody user is missing from /etc/passwd. Nomad 1.3.x works just fine on Fedora CoreOS. When launching a job on 1.4 I get this error on the task:

failed to setup alloc: pre-run hook "alloc_dir" failed: user: unknown user nobody

Fedora CoreOS uses NSS with the nss-altfiles module (https://github.com/aperezdc/nss-altfiles) so that it can separate the machine state in /etc, which is owned by users, from the OS-controlled state in /usr/lib/.

# grep altfiles /etc/nsswitch.conf
passwd:     files altfiles sss systemd
group:      files altfiles sss systemd

More context is available in the FCOS issue: coreos/fedora-coreos-tracker#1197 (comment)

You can see that the nobody user in fact does exist:

# id nobody
uid=99(nobody) gid=99(nobody) groups=99(nobody)
# fgrep nobody /etc/passwd
# fgrep nobody /usr/lib/passwd
nobody:x:99:99:Kernel Overflow User:/:/usr/sbin/nologin
# fgrep nobody /etc/group
# fgrep nobody /usr/lib/group 
nobody:x:99:

I am not sure what has changed in 1.4 that would cause Nomad to not use NSS. Manually copying nobody entries from /usr/lib/passwd to /etc/passwd gets things working as expected.

Nomad should use NSS rather than just reading /etc/passwd directly when checking for users.

Job file (if appropriate)

https://juicefs.com/docs/csi/csi-in-nomad/

@jdoss jdoss added the type/bug label Sep 28, 2022
Author

jdoss commented Sep 28, 2022

It looks like in 1.3.x this issue only affected the exec driver (#13047), but in 1.4 it affects all jobs from what I can see.

Author

jdoss commented Sep 28, 2022

#13047 mentions that https://pkg.go.dev/os/user#Lookup is being used. The os/user package documentation says:

For most Unix systems, this package has two internal implementations of resolving user and group ids to names, and listing supplementary group IDs. One is written in pure Go and parses /etc/passwd and /etc/group. The other is cgo-based and relies on the standard C library (libc) routines such as getpwuid_r, getgrnam_r, and getgrouplist.

When cgo is available, and the required routines are implemented in libc for a particular platform, cgo-based (libc-backed) code is used. This can be overridden by using osusergo build tag, which enforces the pure Go implementation.
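
The difference between the two implementations can be observed with a small program. This is a minimal sketch, not part of Nomad; `lookupSummary` is a hypothetical helper around the real `user.Lookup` call.

```go
package main

import (
	"fmt"
	"os"
	"os/user"
)

// lookupSummary (hypothetical helper) resolves a user name through os/user
// and formats the fields of the result.
func lookupSummary(name string) (string, error) {
	u, err := user.Lookup(name)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s uid=%s gid=%s home=%s", u.Username, u.Uid, u.Gid, u.HomeDir), nil
}

func main() {
	name := "nobody"
	if len(os.Args) > 1 {
		name = os.Args[1]
	}
	s, err := lookupSummary(name)
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		return
	}
	fmt.Println(s)
}
```

Assuming the file is saved as main.go, `go build -o lookup-cgo main.go` would normally select the libc-backed path when cgo is available, while `go build -tags osusergo -o lookup-purego main.go` forces the pure Go path. On a host where nobody exists only in /usr/lib/passwd, the second binary would be expected to report an unknown user while the first resolves it through NSS.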

It looks like the Makefile sets the osusergo build tag, which forces the pure Go implementation of os/user, so user.Lookup("nobody") parses /etc/passwd itself instead of going through libc (getpwnam_r) and NSS. Removing the osusergo tag and building with cgo fixes things:

$ make dev
==> Formatting HCL
==> Removing old development build...
==> Building pkg/linux_amd64/nomad with tags ui osusergo  ...
$ make dev
==> Formatting HCL
==> Removing old development build...
==> Building pkg/linux_amd64/nomad with tags ui  ...
$ ./pkg/linux_amd64/nomad version
Nomad v1.4.0-dev (58e76c64d58ea3351d4a57dc3ef1ba9496d57e71+CHANGES)

$ nomad job run juicefs-csi-controller.nomad 
==> 2022-09-28T16:33:26-05:00: Monitoring evaluation "a1eded2f"
    2022-09-28T16:33:26-05:00: Evaluation triggered by job "jfs-controller"
    2022-09-28T16:33:27-05:00: Allocation "d697b69e" created: node "725e947d", group "controller"
    2022-09-28T16:33:27-05:00: Evaluation status changed: "pending" -> "complete"
==> 2022-09-28T16:33:27-05:00: Evaluation "a1eded2f" finished with status "complete"
$ nomad job status
ID              Type    Priority  Status   Submit Date
jfs-controller  system  50        running  2022-09-28T16:33:26-05:00

$ nomad server members
Name              Address    Port  Status  Leader  Raft Version  Build      Datacenter  Region
mycool.nomad      10.0.2.15  4648  alive   true    3             1.4.0-dev  home        us

Author

jdoss commented Sep 28, 2022

Looks like #14583 from 15 days ago is causing things on Linux to go sideways. @lgfa29 can this be reverted so Linux user detection works again?

@tgross tgross self-assigned this Sep 29, 2022
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Sep 29, 2022
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Sep 29, 2022
Member

tgross commented Sep 29, 2022

Hi @jdoss, the reason we switched to osusergo for the lookup wasn't the exec driver (that's the open issue #13047) but crashes we were experiencing from the call to getpwnam_r. See #14235. We do that lookup before we hand off to libcontainer to create the pivot root (on Linux, where that's available), so it all happens in the client itself, and a crash DoS's the entire client as a result.

Switching off the CGO implementation was intended to get us shipping but we never circled back to it (see my unfortunate comment here: #14235 (comment) 😀 ). We'd done some smoke-testing with some typical "production" distros and didn't run into any problems, but obviously we've missed some cases.

This leaves us with a few options:

  1. Leave osusergo in place and leave everyone using NSS broken. 😦
  2. Remove osusergo and expose everyone to possible crashes. 😦
  3. Figure out what the underlying problem is in the stdlib call and fix it.

Exposing users to crashes isn't going to be an option, but I'd rather not ship 1.4.0 with broken support for folks with NSS either. Seeing as how we're in the release candidate window for Nomad 1.4.0, I'm going to spend a bit of time today seeing how feasible option (3) is on short notice.

@tgross tgross changed the title from "Nomad 1.4 can no longer allocate tasks if the nobody user is missing from /etc/passwd" to "Nomad 1.4.0-rc1 user lookup fails with NSS" Sep 29, 2022
Member

tgross commented Sep 29, 2022

@jdoss we've just merged #14742 with what we think will fix the underlying bug, and it removes the osusergo tag. We're in the midst of discussing whether we'll cut an RC2 or whether this will go out in the GA, but I'll report back here when we know more.

Thanks for trying out the RC and catching this before we went GA!

Author

jdoss commented Sep 29, 2022

Woop! Thanks @tgross and @shoenig for the quick turnaround. 🙌🏼

I just tested things out with the changes in #14742 and things are working as expected.

@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Sep 29, 2022
@tgross tgross added this to the 1.4.0 milestone Sep 29, 2022
Member

shoenig commented Sep 29, 2022

Thanks for the very clear bug report, @jdoss!

@shoenig shoenig closed this as completed Sep 29, 2022
Nomad - Community Issues Triage automation moved this from In Progress to Done Sep 29, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 28, 2023