Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raw_exec: make raw exec driver work with cgroups v2 #12419

Merged
merged 2 commits into from
Apr 5, 2022
Merged

Conversation

shoenig
Copy link
Member

@shoenig shoenig commented Mar 31, 2022

This PR adds support for the raw_exec driver on systems with only cgroups v2.

The raw exec driver is may use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a sigkill, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2. Because the driver manages cgroups
directly and not via containerd, we must account for the v2 changes ourselves.

There is a bit of refactoring in this PR, but the fundamental design remains
the same.

Unlike the v1 implementation, the v2 implementation respects the client cgroup_parent
configuration option. This value is plumbed through to the driver via a task
environment variable, since it (and allocID / taskName) are not available otherwise
from the task config protobuf. A new environment variable NOMAD_PARENT_CGROUP
is now part of the task runtime when supported.

A few irrelevant tests have been stubbed out with SkipSlow because they are very flaky
on GHA, which skips slow tests. Circle is still configured to run slow tests.

Closes #12351 #12348
Partial #10251 (leaving open for the memory stats)

Reviewers:
Neither GHA or Circle are yet to offer machine types using only cgroupsv2. Instead of running everything manually I forked Nomad into shoenig/nomad and setup a temporary self-hosted GHA runner running ubuntu-21.10. The run based on c8d2402 (this PR) is HERE. The show-system job indicates the cgroupv2 mount point on /sys/fs/cgroup, which activates Nomad's cgroup v2 code paths and tests. Which you can't the output of unless you're me 😞

shoenig@opti3010:~$ cat /etc/os-release 
PRETTY_NAME="Ubuntu 21.10"
NAME="Ubuntu"
VERSION_ID="21.10"
VERSION="21.10 (Impish Indri)"
VERSION_CODENAME=impish
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=impish
shoenig@opti3010:~$ mount -l | grep cgroup 
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

@@ -78,7 +78,8 @@ jobs:
run: |
make bootstrap
make generate-all
make test-nomad-module
sudo sed -i 's!Defaults!#Defaults!g' /etc/sudoers
sudo -E env "PATH=$PATH" make test-nomad-module
Copy link
Member Author

@shoenig shoenig Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestFS_Log checks the alloc becomes complete, which it doesn't because it gets blocked by the missing cgroup for cpuset unless running as root.

This PR adds support for the raw_exec driver on systems with only cgroups v2.

The raw exec driver is able to use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a sigkill, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2.

There is a bit of refactoring in this PR, but the fundamental design remains
the same.

Closes #12351 #12348
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @shoenig. I've left some comments but they're mostly more like questions. 😀

Comment on lines +1867 to +1868
// CgroupParent for this node (linux only)
CgroupParent string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this show up in the api.Node struct as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! fixed

@@ -2244,7 +2244,7 @@ func TestCoreScheduler_CSIPluginGC(t *testing.T) {
}

func TestCoreScheduler_CSIVolumeClaimGC(t *testing.T) {
ci.Parallel(t)
ci.SkipSlow(t, "flaky on GHA; #12358")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a streaming test as suggested in #12358. I think it's fixed with #12439 which merged yesterday. Have you seen otherwise?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yup this is out of date, removed!

Comment on lines 5 to 7
type containment struct {
// non-linux executors currently do not create resources to be cleaned up
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a little trouble following what we're doing with the Containment interface. We don't implement it for the non-Linux case (here), so isn't there only ever 1 implementation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, the only implementation applies to Linux and the creation / usage is all gated in build-tagged files. The implementation in containment_default.go isn't even complete much less used; let's just get rid of it.

Comment on lines +178 to +179
// kill the processes in cgroup
for _, pid := range pids {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not expect that all the PIDs are going to be in the same process group? That is, could we kill via -pid here the same way we do for non-Linux use cases, or is this so we can avoid killing the executor even though we moved it into a different cgroup?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is honestly just copying the original implementation, but I believe this defends against sub-sub-sub processes that get daemonized. So like

Executor -> A -> B -> C

If B dies C gets owed by PID 1, not A and is no longer part of the process group ... right? 🤔

Copy link
Member

@tgross tgross Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Children always get the parent's process group when forked, so the chain would be preserved even in that case (I don't think it gets changed after the fact on wait). But this implementation works, so let's keep it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty in the weeds but I guess it does protect against callers of setsid or setpgid

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. 👍

Comment on lines +353 to +359
// todo(shoenig) - Setting AllocID causes test to fail - with or without
// cgroups, and with or without chroot. It has to do with MkAllocDir
// creating the directory structure, but the actual root cause is still
// TBD. The symptom is that any command you try to execute will result
// in "permission denied" coming from os/exec. This test is imperfect,
// the actual feature of running commands as another user works fine.
// AllocID: uuid.Generate()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we leave AllocID out then the AllocDir.SharedDir and AllocDir.AllocDir then both end up directly under the tempdirthen ends up directly under the tempdir (which is also task.AllocDir), instead of under the allocation ID one level down.

That is instead of /tmp/nomad_driver_harness-/alloc/:alloc_id and /tmp/nomad_driver_harness-/alloc we get /tmp/nomad_driver_harness-/alloc and /tmp/nomad_driver_harness- and maybe the latter has the wrong permissions now?

Comment on lines +96 to +100
// use the task environment variables for determining the cgroup path -
// not ideal but plumbing the values directly requires grpc protobuf changes
parent := lookup(e.commandCfg.Env, taskenv.CgroupParent)
allocID := lookup(e.commandCfg.Env, taskenv.AllocID)
task := lookup(e.commandCfg.Env, taskenv.TaskName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And those changes would be platform-specific, which would be sad to have to plumb into the protobufs if we can avoid it. It's probably worth considering whether we want to have a "platform variables" field or something in the future so we're not trying to smuggle things in thru env vars that we might not necessarily want to expose to the workload.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"platform variables" field

That's a clever idea I'm definitely going to forget, so I created #12468

@shoenig
Copy link
Member Author

shoenig commented Apr 5, 2022

@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

make getAllPidsByCgroup work with cgroups v2
3 participants