Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DevicesSets being removed when cpusets are reloaded with cgroup v2 #17535

Merged
merged 2 commits into from
Jun 15, 2023

Conversation

TrueBrain
Copy link
Contributor

This meant that if any allocation was created or removed, all active DevicesSets were removed from all cgroups of all tasks.

This was most noticeable with "exec" and "raw_exec", as it meant they no longer had access to /dev files.

Fixes #12877, and possibly a bunch of other related but different reports in the issue tracker. See #12877 (comment) for steps how to reproduce this problem, and validate it is actually fixed with this change. Most important piece of context: https://github.com/opencontainers/runc/blob/5cf9bb229feed19a767cbfdf9702f6487341e29e/libcontainer/cgroups/devices/v2.go#L55-L57

Few warnings:

  • 4 hours ago I had no idea how cgroups worked, nor how BPF and DevicesSets interacted with cgroups
  • 1 hour ago I never touched a Go project in my life (but a long history of other languages, so "its fine")

PS: I checked if there were any e2e tests this could be part of; but it seems there is nothing there yet, neither for cgroup or cpuset. If I missed it, please let me know.

@hashicorp-cla
Copy link

hashicorp-cla commented Jun 14, 2023

CLA assistant check
All committers have signed the CLA.

@shoenig
Copy link
Member

shoenig commented Jun 15, 2023

This is great @TrueBrain!

On your branch, can you run make cl to generate a changelog entry?

I turned your repro steps into an e2e test in #17546, can confirm it seems to work. Most cgroups tests are in the drivers/ packages but just following the repro steps in the e2e environment is pretty effective.

before (main)

➜ go test -v -run TestCgroup
=== RUN   TestCgroupDevices
=== RUN   TestCgroupDevices/testDevicesUsable
    assert.go:11:
        devices_test.go:53: expected nil error
        ↪ error: command nomad [alloc exec 9f0e235a-3cb2-db10-a5ac-3450cad63cda dd if=/dev/zero of=/dev/null count=1] failed: exit status 1
        Output: dd: failed to open '/dev/zero': Operation not permitted
--- FAIL: TestCgroupDevices (1.58s)
    --- FAIL: TestCgroupDevices/testDevicesUsable (1.46s)
FAIL
exit status 1
FAIL    github.com/hashicorp/nomad/e2e/isolation        1.587s

after with (7bdde48)

➜ go test -v -run TestCgroup
=== RUN   TestCgroupDevices
=== RUN   TestCgroupDevices/testDevicesUsable
--- PASS: TestCgroupDevices (1.64s)
    --- PASS: TestCgroupDevices/testDevicesUsable (1.52s)
PASS
ok      github.com/hashicorp/nomad/e2e/isolation        1.644s

@shoenig shoenig added this to the 1.5.x milestone Jun 15, 2023
@shoenig shoenig added backport/1.3.x backport to 1.3.x release line backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line labels Jun 15, 2023
@TrueBrain
Copy link
Contributor Author

Shall I also cherry-pick your e2e in this PR?

And awesome, happy there is a place for some e2e for this :D

@shoenig
Copy link
Member

shoenig commented Jun 15, 2023

Shall I also cherry-pick your e2e in this PR?

Go for it!

TrueBrain and others added 2 commits June 15, 2023 15:38
This meant that if any allocation was created or removed, all
active DevicesSets were removed from all cgroups of all tasks.

This was most noticeable with "exec" and "raw_exec", as it meant
they no longer had access to /dev files.
Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.3.x backport to 1.3.x release line backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nomad 1.3.0-rc.1 cgroupsv2 /dev/ strangeness
3 participants