
client: enable cpuset support for cgroups.v2 #12274

Merged
merged 2 commits on Mar 24, 2022
Conversation

shoenig (Member) commented Mar 14, 2022

This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroup. Newer Linux
distros like Ubuntu 21.10 ship with cgroups v2 only, causing problems
for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Previously, Nomad made use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, so the logic for computing cpuset values
must be modified. When Nomad detects v2, it now manages cpuset values in-process,
rather than relying on cgroup hierarchy inheritance via shared/reserved
parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroup. This means that on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroup/unified (as is typical), Nomad will continue to
use the v1 logic and should operate as before. Systems that do not support
cgroups v2 are also not affected.
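
A minimal sketch of that detection, assuming the usual filesystem magic-number check (not necessarily the exact code in this PR):

```go
package cgutil

import "golang.org/x/sys/unix"

// useV2 reports whether the unified cgroups v2 hierarchy is mounted at
// /sys/fs/cgroup, by comparing the filesystem's magic number.
func useV2() bool {
	var st unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
		return false
	}
	return st.Type == unix.CGROUP2_SUPER_MAGIC
}
```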

When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client config), and create cgroups for tasks using
the naming convention <allocID>.<task>.scope. This follows the naming convention
set by systemd, which is also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, unique.cgroup.version,
which will be set to "v1" or "v2" to indicate the cgroups regime in use by
Nomad.

The new cpuset management strategy fixes #11705, where Docker tasks that
spawned processes on startup would "leak" and make use of forbidden cpu cores.
With the v2 manager, PIDs are started in the cgroup they will always live in,
so the source of the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933

Review Notes:

@shoenig shoenig changed the title from "[WIP] client: enable cpuset support for cgroups.v2" to "client: enable cpuset support for cgroups.v2" on Mar 23, 2022
@shoenig shoenig marked this pull request as ready for review March 23, 2022 18:02
tgross (Member) left a comment

This is looking great @shoenig! I've reviewed the driver and fingerprinter changes at this point and have started working through the cgutil package changes. I'll have to finish my review tomorrow, but I wanted to save my review in the meantime, so I'll mark it as a comment.

Comment on lines 99 to 106
Starting with Nomad 1.3.0, Linux systems configured to use [cgroups v2][cgroups2] are
now supported. A Nomad client will only activate its v2 control groups manager if the
system is configured with the cgroups2 controller mounted at `/sys/fs/cgroup`. This implies
Nomad will continue to fallback to the v1 control groups manager on systems
configured to run in hybrid mode, where the cgroups2 controller is typically mounted
at `/sys/fs/cgroup/unified`. Systems that do not support cgroups v2 are not affected. A
new client attribute `unique.cgroup.version` indicates which version of control groups
Nomad is using.
Member

This is 100% accurate but I think it might be missing whether the user needs to do anything in particular here, especially for users who aren't really solid on just what cgroup v1 vs v2 is. I'd even consider breaking it down into bullet points:

Suggested change
Starting with Nomad 1.3.0, Linux systems configured to use [cgroups v2][cgroups2]
are now supported. A Nomad client will only activate its v2 control groups manager
if the system is configured with the cgroups2 controller mounted at `/sys/fs/cgroup`.
* Systems that do not support cgroups v2 are not affected.
* Hosts configured in hybrid mode typically mount the cgroups2
controller at `/sys/fs/cgroup/unified`, so Nomad will continue to
use cgroups v1 for these hosts.
* Hosts configured for only cgroups v2 will now correctly support
`cpuset`.
Nomad will preserve the existing cgroup for tasks when a client is
upgraded, so there will be no disruption to tasks. A new client
attribute `unique.cgroup.version` indicates which version of control
groups Nomad is using.

These cgroups are created by Nomad before a task starts. External task drivers that support
containerization should be updated to make use of the new cgroup locations.

```
Member

Suggested change
The new cgroup file system layout will look like the following:
```shell-session
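
Purely for illustration (the PR's actual listing is not reproduced here), the layout under the default parent might look roughly like:

```shell-session
$ ls /sys/fs/cgroup/nomad.slice/
cgroup.controllers  cgroup.procs  cgroup.subtree_control
<allocID>.<task>.scope
```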


func cgroupsCompatibleV1(t *testing.T) bool {
	if runtime.GOOS != "linux" {
		return false
Member

Unreachable given we have the build flag on this file, but maybe ok to leave in anyways if you think it makes the code clearer?

Member Author

removed in favor of a comment reminding of the build tag
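
A sketch of what that reminder might look like (the exact comment wording is an assumption):

```go
//go:build linux

// Helpers in this file are compiled only on Linux (see the build
// constraint above), so no runtime GOOS check is needed.

package cgutil
```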

// GetCgroupParent returns the mount point under the root cgroup in which Nomad
// will create cgroups. If parent is not set, an appropriate name for the version
// of cgroups will be used.
func GetCgroupParent(parent string) string {
Member

Is this ever not ""?

Member Author

Good catch! May as well make use of it and clean up direct calls to getParentV1/V2
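
A sketch of the defaulting behavior the doc comment describes (the constant names and the useV2 helper are assumptions, not necessarily the PR's exact code):

```go
const (
	// assumed defaults, for illustration
	DefaultCgroupV1Parent = "/nomad"
	DefaultCgroupV2Parent = "nomad.slice"
)

// GetCgroupParent returns the parent cgroup under which Nomad creates task
// cgroups, choosing a version-appropriate default when parent is unset.
func GetCgroupParent(parent string) string {
	if parent != "" {
		return parent
	}
	if useV2() { // see the detection sketch above
		return DefaultCgroupV2Parent
	}
	return DefaultCgroupV1Parent
}
```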

	if f.lastState == cgroupAvailable {
-		f.logger.Info("cgroups are unavailable")
+		f.logger.Warn("cgroups are now unavailable")
Member

That seems like a bad situation to be in! 😀 But it looks like you've got the cpuset fixer handling that as gracefully as we can, so 👍

Comment on lines +27 to +30
// Due to Docker not allowing the configuration of the full cgroup path, we must
// manually fix the cpuset values for all docker containers continuously, as the
// values will change as tasks of any driver using reserved cores are started and
// stopped, changing the size of the remaining shared cpu pool.
Member

Just for my clarity, in the case of #11705 with cgroups v2: any child processes will be in the same cgroup as their parent that Docker starts. So although they won't be pinned to the right CPU for a very brief window at startup, they'll all get moved together to the correct cpuset because we're not moving processes between cgroups, just changing the cgroup in place.

Is my understanding right?

Member Author

Yup! And actually this is probably worth amending the RFC and reconsidering.

Although this behavior is most similar to the way V1 works, I don't see why it wouldn't be worth just setting the cpuset on the docker task config on initial startup now. The drawback is I don't think we have the plumbing to get the shared cores - the initial value would only be the reserved cores requested by the task resources.

Member

It seems like it could be easier for us to understand that way, for sure.

I was going to say that some applications fingerprint cores at startup and not again afterwards (ex. GOMAXPROCS). But they could potentially take a hit either way: from not having enough threads if we have the initial value only be reserved, or from contention if we don't. I suspect if you care about this, you probably also need to care enough to set the values right for the application as well.

Contributor

Our trading infrastructure reads the cpus in the cgroup on startup and indexes into the cpuset to pin threads to specific cores. If the cgroup gets modified after we've done this thread <-> core pinning, I'm not entirely sure what happens, but it's bad: either threads will still be pinned to CPUs outside of the allocated cgroup, or it just breaks altogether.

cf: #12374

Comment on lines +86 to +92
func (cf *cpusetFixer) fix(c coordinate) {
	source := c.NomadCgroup()
	destination := c.DockerCgroup()
	if err := cgutil.CopyCpuset(source, destination); err != nil {
		cf.logger.Debug("failed to copy cpuset", "err", err)
	}
}
Member

I really like the design here where we didn't end up plumbing any management code into the driver, but are just copying the cpu sets between the Nomad-managed cgroup and the Docker-managed one.

Member Author

Still waiting to hear back on moby/moby#43363

Which if that would work, all this goes away!
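
For reference, the copy itself can be as small as reading one cgroup interface file and writing another; a minimal sketch of a CopyCpuset-style helper (an assumption, not necessarily the PR's implementation):

```go
import (
	"os"
	"path/filepath"
)

// CopyCpuset copies the cpuset.cpus value from the source cgroup directory
// to the destination, so the destination picks up the same core assignment.
func CopyCpuset(source, destination string) error {
	cpus, err := os.ReadFile(filepath.Join(source, "cpuset.cpus"))
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(destination, "cpuset.cpus"), cpus, 0644)
}
```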

return false, fmt.Errorf("Not a member of the alloc's cgroup: expected=...:/nomad/... -- found=%q", line)
// Skip rdma subsystem; rdma was added in most recent kernels and libcontainer/docker
// don't isolate it by default.
if strings.Contains(line, ":rdma:") || strings.Contains(line, "::") {
Member

Should we skip :misc: here too?

Member Author

Yup
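
Presumably the amended check then reads something like this sketch (the helper form is an assumption; the test inlines the condition):

```go
import "strings"

// skippable reports whether a /proc/self/cgroup line belongs to a subsystem
// that libcontainer/docker don't isolate by default.
func skippable(line string) bool {
	return strings.Contains(line, ":rdma:") ||
		strings.Contains(line, ":misc:") ||
		strings.Contains(line, "::")
}
```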

Comment on lines +202 to +203
// github actions freezer cgroup
acceptable = append(acceptable, ":freezer:/actions_job")
Member

Oof, I imagine you discovered this experimentally and it wasn't documented? Probably nothing we can do about it but some day that's going to change out from under us and break a bunch of tests. So at least you've documented it! 😀

Member Author

Heh yeah, it's really got me curious how GHA works under the hood ...

tgross (Member) left a comment

This looks great @shoenig, nice work! I've left a few small comments and questions.

Comment on lines +106 to +107
create(t, source)
defer cleanup(t, source)
Member

Nitpicky: the mgr.Apply call isn't an atomic change. It can create the paths before the leaf successfully and then return an error for its final step. So if we call require.NoError in create we may never call the cleanup function. Maybe create can clean itself up on failure, and then do a require.NoError to fail the test?

ref cgroups/fs2/fs2.go#L65-L85

Member Author

good catch, fixed
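
One shape that fix could take, assuming the libcontainer fs2 API of this era, where NewManager takes (config, dirPath, rootless) and Apply(-1) creates the path without attaching a process:

```go
import (
	"os"
	"testing"

	"github.com/opencontainers/runc/libcontainer/cgroups/fs2"
	"github.com/stretchr/testify/require"
)

const creationPID = -1 // create the cgroup path without attaching a process

// create builds the cgroup at path, removing any partially created
// hierarchy before failing the test if Apply errors partway through.
func create(t *testing.T, path string) {
	mgr, err := fs2.NewManager(nil, path, false)
	require.NoError(t, err)
	if applyErr := mgr.Apply(creationPID); applyErr != nil {
		_ = os.RemoveAll(path) // clean up partial state
		require.NoError(t, applyErr)
	}
}
```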

"github.com/stretchr/testify/require"
)

// Note: these tests need to run on GitHub Actions runners with only 2 cores.
Member

Is it "at least 2 cores"? It'd be good to have a testutil for skipping this based on the number of cores, because otherwise folks might end up running on single-core Vagrant boxes and failing unexpectedly.

Member Author

good idea, added
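
Roughly along these lines (the helper name and package placement are assumptions):

```go
import (
	"runtime"
	"testing"
)

// RequireMinCores skips the test on machines with fewer than n CPU cores,
// e.g. single-core Vagrant boxes.
func RequireMinCores(t *testing.T, n int) {
	if have := runtime.NumCPU(); have < n {
		t.Skipf("test requires at least %d cores, have %d", n, have)
	}
}
```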

// CgroupScope returns the name of the scope for Nomad's managed cgroups for
// the given allocID and task.
//
// e.g. "<allocID>-<task>.scope"
Member

I think we want a dot to match what we've done in the code everywhere:

Suggested change
// e.g. "<allocID>-<task>.scope"
// e.g. "<allocID>.<task>.scope"

Member Author

fixed!
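
The helper itself presumably reduces to simple formatting; a minimal sketch matching the documented convention:

```go
import "fmt"

// CgroupScope returns the scope name of Nomad's managed cgroup for the
// given allocID and task, e.g. "<allocID>.<task>.scope".
func CgroupScope(allocID, task string) string {
	return fmt.Sprintf("%s.%s.scope", allocID, task)
}
```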

Comment on lines +66 to +68
// identity is the "<allocID>.<taskName>" string that uniquely identifies an
// individual instance of a task within the flat cgroup namespace
type identity string
Member

I ❤️ type-aliased IDs

Comment on lines +35 to +37
// rootless is (for now) always false; Nomad clients require root, so we
// assume to not need to do the extra plumbing for rootless cgroups.
rootless = false
Member

It looks like the version of libcontainer we have would return an error if we passed rootless = true anyways. And more recent versions of libcontainer/cgroups/fs2 drop this parameter entirely. Not sure whether we want to update the comment to reflect we know that it's safe to remove when we bump versions of libcontainer?

Member Author

opened and noted in #12372

Comment on lines 43 to 44
// null represents nothing
var null = nothing{}
Member

I like what you're doing here but calling it null means we do c.sharing[id] = null to add an ID to the sharing set, which reads almost the opposite of what you're going for. Maybe if it were c.sharing[id] = ok or something like that?

(What I wouldn't do for a real set type in the stdlib!)

Member Author

renamed null to present to be more clear

(What I wouldn't do for a real set type in the stdlib!)

+10000000000000000000000
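
With the rename, the map-as-set pattern reads naturally at the call sites; a compact sketch (the surrounding struct is assumed):

```go
type identity string

type nothing struct{}

// present marks membership in a map-backed set.
var present = nothing{}

type coordinator struct {
	sharing map[identity]nothing
}

func (c *coordinator) share(id identity) {
	c.sharing[id] = present // reads as: id is present in the sharing set
}

func (c *coordinator) isSharing(id identity) bool {
	_, ok := c.sharing[id]
	return ok
}
```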

Comment on lines +240 to +241
// We avoid removing a cgroup if it still contains a PID, as the cpuset manager
// may be initially empty on a Nomad client restart.
Member

I don't think we need to do it for this PR, but it might be nice if we can eventually figure out a way to clean up stray cgroup entries. The allocdir hook is the only hook that comes before the cgroup hook, but if its postrun hook throws an error I think we'll never clean up the cgroup entry for that PID?

Member Author

This is a good point; it's a similar problem for other resources like networks too, right?

Member

Yeah #11096 and #6385.
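
The guard quoted above can be sketched as a check of the cgroup.procs interface file before removal (a hypothetical helper, not the PR's exact code):

```go
import (
	"bytes"
	"os"
	"path/filepath"
)

// cgroupEmpty reports whether a cgroup directory contains no processes,
// by reading its cgroup.procs interface file.
func cgroupEmpty(path string) bool {
	b, err := os.ReadFile(filepath.Join(path, "cgroup.procs"))
	if err != nil {
		return false // be conservative: if unreadable, don't remove
	}
	return len(bytes.TrimSpace(b)) == 0
}
```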

	if taskInfo.Error != nil {
		break
	}

	timer.Reset(100 * time.Millisecond)
Member

We could move this into the timer.C case below, I think. Or do we pretty much always need the initial 100ms?

Member Author

fixed
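
Presumably the loop now re-arms the timer inside the timer.C case, along these lines (ctx and the poll closure are assumed stand-ins for the surrounding code):

```go
import (
	"context"
	"time"
)

// watch polls every 100ms until the context is cancelled or polling fails.
func watch(ctx context.Context, poll func() error) {
	timer := time.NewTimer(100 * time.Millisecond)
	defer timer.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			if err := poll(); err != nil {
				return
			}
			timer.Reset(100 * time.Millisecond) // re-arm only after each poll
		}
	}
}
```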

@github-actions github-actions bot commented Oct 24, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 24, 2022