
nomad 1.3.0-rc.1 cgroupsv2 /dev/ strangeness #12877

Closed
badalex opened this issue May 5, 2022 · 23 comments · Fixed by #17535

@badalex

badalex commented May 5, 2022

Nomad version

Nomad v1.3.0-rc.1 (31b0a18)

Operating system and Environment details

Ubuntu 22.04 Jammy Jellyfish 5.15.0-27-generic #28-Ubuntu SMP Thu Apr 14 04:55:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Issue

Something funky happens when using cgroups v2 where (at the very least) the ability to read /dev/zero and write to /dev/null is lost. Using a raw_exec job and dd if=/dev/zero of=/dev/null count=1, I get dd: failed to open '/dev/zero': Operation not permitted, seemingly at random but often enough to reproduce. The original jobs are actually Python/Java, not that it seems to matter.
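For illustration only (not the real workload; the jobspec below just uses dd), the same check as a tiny Go program: read one 512-byte block from /dev/zero and write it to /dev/null, exactly the access that starts failing.

package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	// Open the devices the same way dd does; on cgroups v2 these opens
	// are what the cgroup's device filter (a BPF program) allows or denies.
	src, err := os.Open("/dev/zero")
	if err != nil {
		fmt.Fprintln(os.Stderr, "open /dev/zero:", err) // "operation not permitted" when broken
		os.Exit(1)
	}
	defer src.Close()

	dst, err := os.OpenFile("/dev/null", os.O_WRONLY, 0)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open /dev/null:", err)
		os.Exit(1)
	}
	defer dst.Close()

	// Copy one 512-byte block, mirroring dd count=1.
	if _, err := io.CopyN(dst, src, 512); err != nil {
		fmt.Fprintln(os.Stderr, "copy:", err)
		os.Exit(1)
	}
	fmt.Println("device access OK")
}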

If I disable cgroupsv2 with the following, the problem disappears.

index 1333d0cc1..5f8d8fc68 100644
--- a/client/lib/cgutil/cgutil_linux.go
+++ b/client/lib/cgutil/cgutil_linux.go
@@ -18,7 +18,7 @@ import (
 // cgroups.v1
 //
 // This is a read-only value.
-var UseV2 = cgroups.IsCgroup2UnifiedMode()
+var UseV2 = false //cgroups.IsCgroup2UnifiedMode()

 // GetCgroupParent returns the mount point under the root cgroup in which Nomad
 // will create cgroups. If parent is not set, an appropriate name for the version

Output from a failed run. I'll note it seems to always work on the "first" run; the failures only seem to happen on restarts under the same allocation ID.

1+0 records in
1+0 records out
512 bytes copied, 0.000455007 s, 1.1 MB/s
1+0 records in
1+0 records out
512 bytes copied, 0.000992872 s, 516 kB/s
1+0 records in
1+0 records out
512 bytes copied, 0.000578372 s, 885 kB/s
1+0 records in
1+0 records out
512 bytes copied, 0.000302934 s, 1.7 MB/s
1+0 records in
1+0 records out
512 bytes copied, 0.000195964 s, 2.6 MB/s
dd: failed to open '/dev/zero': Operation not permitted

File: /alloc/logs/env.stdout.0
busted
total 124K
lrwxrwxrwx   1 root root     7 Apr 23 02:08 bin -> usr/bin
lrwxrwxrwx   1 root root     8 Apr 23 02:08 sbin -> usr/sbin
lrwxrwxrwx   1 root root    10 Apr 23 02:08 libx32 -> usr/libx32
lrwxrwxrwx   1 root root     9 Apr 23 02:08 lib64 -> usr/lib64
lrwxrwxrwx   1 root root     9 Apr 23 02:08 lib32 -> usr/lib32
lrwxrwxrwx   1 root root     7 Apr 23 02:08 lib -> usr/lib
drwxr-xr-x   2 root root  4.0K Apr 23 02:08 srv
drwxr-xr-x   2 root root  4.0K Apr 23 02:08 media
drwxr-xr-x  14 root root  4.0K Apr 23 02:08 usr
drwxr-xr-x   6 root root  4.0K Apr 23 02:09 snap
drwx------   2 root root   16K Apr 23 02:09 lost+found
drwxr-xr-x   4 root root  4.0K May  5 01:44 boot
drwxr-xr-x   2 root root  4.0K May  5 01:49 mnt
drwxr-xr-x   5 root wheel 4.0K May  5 01:49 nfs
drwxr-xr-x 158 root root  4.0K May  5 01:50 home
drwxr-xr-x  14 root root  4.0K May  5 01:51 var
dr-xr-xr-x  13 root root     0 May  5 02:03 sys
dr-xr-xr-x 539 root root     0 May  5 02:03 proc
drwxr-xr-x  20 root root  4.0K May  5 02:03 ..
drwxr-xr-x  20 root root  4.0K May  5 02:03 .
drwxr-xr-x  19 root root  4.1K May  5 02:03 dev
drwxr-xr-x 141 root root   12K May  5 03:01 etc
drwxr-xr-x  48 root root  4.0K May  5 04:58 opt
drwx------   6 root root  4.0K May  5 05:14 root
drwxr-xr-x  39 root root  1.3K May  5 05:35 run
drwxrwxrwt  15 root root   40K May  5 05:36 tmp
crw-rw-rw- 1 root root 1, 5 May  5 02:03 /dev/zero
uid=0(root) gid=0(root) groups=0(root)
Current: =ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
Ambient set =
Current IAB:
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root) euid=0(root)
gid=0(root)
groups=
Guessed mode: UNCERTAIN (0)
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 3182
max locked memory           (kbytes, -l) unlimited
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) unlimited
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

Job file (if appropriate)

job "test-devnull" {
	datacenters = ["..."]
	type = "service"

	constraint {
		attribute = "${attr.unique.hostname}"
		value = "..."
	}

       reschedule {
	    delay = "5s"
	    delay_function = "constant"
	    unlimited = true
       }

	group "group" {
		restart {
			attempts = 5
			mode = "fail"
			delay = "0s"
			interval = "30s"
		}
	
               count = 10

		task "env" {
			driver = "raw_exec"
			config {
				command = "/usr/bin/bash"
				args = [ "-c", "dd if=/dev/zero of=/dev/null count=1 || (echo \"busted\"; ls -ltrha /; ls -ltrha /dev/zero; id; capsh --print; ulimit -a; sleep 1000)"]
			}
			resources {
				memory = "10"
			}
		}

	}
}```
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation May 5, 2022
@tgross
Member

tgross commented May 5, 2022

Hi @badalex, this sounds like it might be related to #12863, which @shoenig is working on in #12875. What makes me think that is that you've said it works the first time but not after a task restart.

@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage May 5, 2022
@tgross tgross added the theme/cgroups cgroups issues label May 5, 2022
@tgross tgross added this to the 1.3.0 milestone May 5, 2022
@badalex
Author

badalex commented May 5, 2022

I failed to mention this is with #12875 applied. Without it, I couldn't get far enough to create a semi-reliable example. (I was originally going to report this bug, but in trying to reproduce what I was seeing with my "real" jobs, I hit #12863.)

Sorry about that and thanks!

Another thing I didn't mention: to reproduce with the provided job file, it usually breaks pretty quickly for me. However, there has been a time or two where I've had more success stopping and then restarting the job or a random allocation. The count is not important; it's just there to tickle the bug out quicker.

FWIW, this smells like some kind of kernel bug to me. I'm curious whether anyone else will be able to reproduce it.

@badalex
Author

badalex commented May 5, 2022

I was able to reproduce this on Archlinux 5.17.5-arch1-1 #1 SMP PREEMPT Wed, 27 Apr 2022 20:56:11 +0000 x86_64 GNU/Linux.

@RickyGrassmuck
Contributor

RickyGrassmuck commented May 27, 2022

I'm not 100% sure my issue is related to this one beyond the error message, but I have run into the same "Operation not permitted" issue with the Cinder CSI driver container after updating to 1.3.0 on one of our client nodes.

Ever since updating the node, I'm seeing the error messages below when a job using a Cinder CSI Volume is scheduled on it:

I0526 06:18:16.625370       1 utils.go:88] [ID:27] GRPC call: /csi.v1.Controller/ControllerPublishVolume
I0526 06:18:16.625406       1 controllerserver.go:179] ControllerPublishVolume: called with args {"node_id":"4ee70781-9c40-4fb8-9b33-4b5f21e57e7b","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4","mount_flags":["rw"]}},"access_mode":{"mode":1}},"volume_id":"5c8c4096-8c8d-4d4a-999a-3efcd0d4c060"}
I0526 06:18:24.336459       1 utils.go:88] [ID:28] GRPC call: /csi.v1.Node/NodeStageVolume
I0526 06:18:24.336498       1 nodeserver.go:353] NodeStageVolume: called with args {"publish_context":{"DevicePath":"/dev/vdb"},"staging_target_path":"/local/csi/staging/es-master-2/rw-file-system-single-node-writer","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4","mount_flags":["rw"]}},"access_mode":{"mode":1}},"volume_id":"a7cdf508-1b77-45b2-8034-b11b70fb9d4f"}
I0526 06:18:24.424763       1 mount.go:172] Found disk attached as "virtio-a7cdf508-1b77-45b2-8"; full devicepath: /dev/disk/by-id/virtio-a7cdf508-1b77-45b2-8
I0526 06:18:24.426808       1 mount_linux.go:487] Attempting to determine if disk "/dev/disk/by-id/virtio-a7cdf508-1b77-45b2-8" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/by-id/virtio-a7cdf508-1b77-45b2-8])
I0526 06:18:24.431456       1 mount_linux.go:490] Output: "blkid: error: /dev/disk/by-id/virtio-a7cdf508-1b77-45b2-8: Operation not permitted\n"
I0526 06:18:24.431528       1 mount_linux.go:449] Disk "/dev/disk/by-id/virtio-a7cdf508-1b77-45b2-8" appears to be unformatted, attempting to format as type: "ext4" with options: [-F -m0 /dev/disk/by-id/virtio-a7cdf508-1b77-45b2-8]
E0526 06:18:24.434338       1 mount_linux.go:455] format of disk "/dev/disk/by-id/virtio-a7cdf508-1b77-45b2-8" failed: type:("ext4") target:("/local/csi/staging/es-master-2/rw-file-system-single-node-writer") options:("rw,defaults") errcode:(exit status 1) output:(mke2fs 1.44.5 (15-Dec-2018)
Could not open /dev/disk/by-id/virtio-a7cdf508-1b77-45b2-8: Operation not permitted
) 
E0526 06:18:24.434378       1 utils.go:92] [ID:28] GRPC error: rpc error: code = Internal desc = format of disk "/dev/disk/by-id/virtio-a7cdf508-1b77-45b2-8" failed: type:("ext4") target:("/local/csi/staging/es-master-2/rw-file-system-single-node-writer") options:("rw,defaults") errcode:(exit status 1) output:(mke2fs 1.44.5 (15-Dec-2018)
Could not open /dev/disk/by-id/virtio-a7cdf508-1b77-45b2-8: Operation not permitted
) 

To clarify, the only thing done between the node working and getting the error above was draining the client, stopping the service, then updating Nomad on the client and restarting it. The servers were already updated to 1.3.0.

Since this is a Docker container task, I suspect this may be a different issue, but I figured I'd reach out here first before opening a new one. If this is different enough to warrant a new issue, I'm happy to open it.

I have left that node running (it still runs non-CSI jobs just fine), so if there is any info I could gather that may help, let me know and I'll round it up.

@badalex
Author

badalex commented May 27, 2022

@RickyGrassmuck that sure smells pretty similar. Can you confirm you are also using cgroups v2?

@RickyGrassmuck
Contributor

For sure, I'll check on that when I get in the office in a few hours

@RickyGrassmuck
Contributor

RickyGrassmuck commented May 27, 2022

Hmmm.... Interesting, that node is not using cgroups v2 it seems.

unique.cgroup
--
unique.cgroup.mountpoint | /sys/fs/cgroup/systemd
unique.cgroup.version | v1

@RickyGrassmuck
Contributor

OK, sorry for the noise; I was actually able to stop that particular error from occurring after redeploying my CSI plugin. I'm having some new problems that warrant their own issue, but they don't seem related to the cgroup issues anymore.

@badalex
Author

badalex commented Jun 10, 2022

So I had a little time today and played with this a bit on 1.3.2-dev.

After a few runs, I was able to tease this out of the logs:

Jun 10 01:36:46 cd01-webdeb1 nomad[2605308]: time="2022-06-10T01:36:46Z" level=info msg="found more than one filter (2) attached to a cgroup -- removing extra filters!"
Jun 10 01:36:46 cd01-webdeb1 nomad[2605308]: time="2022-06-10T01:36:46Z" level=info msg="removing old filter 0 from cgroup" id=49060 name= run_count=0 runtime=0s tag=fb6cb1c301453333 type=CGroupDevice
Jun 10 01:36:46 cd01-webdeb1 nomad[2605308]: time="2022-06-10T01:36:46Z" level=info msg="removing old filter 1 from cgroup" id=49061 name= run_count=0 runtime=0s tag=531db05b114e9af3 type=CGroupDevice
Jun 10 01:36:46 cd01-webdeb1 nomad[2605308]: 2022-06-10T01:36:46.448Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=1a0affdc-a314-480a-d2d3-135d785524ed task=env error="failed to launch command with executor: rpc error: code = Unknown desc = failed to set v2 cgroup resources: failed to call BPF_PROG_DETACH (BPF_CGROUP_DEVICE) on old filter program: can't detach program: no such file or directory"
Jun 10 01:36:50 cd01-webdeb1 nomad[2605308]: 2022-06-10T01:36:50.481Z [WARN] client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=1a0affdc-a314-480a-d2d3-135d785524ed task=env @module=logmon timestamp=2022-06-10T01:36:50.480Z
Jun 10 01:36:50 cd01-webdeb1 nomad[2605308]: 2022-06-10T01:36:50.481Z [WARN] client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=1a0affdc-a314-480a-d2d3-135d785524ed task=env @module=logmon timestamp=2022-06-10T01:36:50.480Z

I can get that fairly consistently, though not as frequently as in the original report. To me it looks like a symptom of the same problem.

So on a whim I tried this patch:

diff --git a/drivers/shared/executor/executor_universal_linux.go b/drivers/shared/executor/executor_universal_linux.go
index ac79b518b..91a76d0e8 100644
--- a/drivers/shared/executor/executor_universal_linux.go
+++ b/drivers/shared/executor/executor_universal_linux.go
@@ -108,6 +108,7 @@ func (e *UniversalExecutor) configureResourceContainer(pid int) error {
                path := filepath.Join("/", cgutil.GetCgroupParent(parent), scope)
                cfg.Cgroups.Path = path
                e.containment = resources.Contain(e.logger, cfg.Cgroups)
+               cfg.Cgroups.Resources.SkipDevices = true
                return e.containment.Apply(pid)

        } else {

That breaks tasks consistently with a similar-looking error message, albeit much earlier:

Time | Type | Description
2022-06-10T02:00:08Z | Alloc Unhealthy | Unhealthy because of failed task
2022-06-10T02:00:08Z | Not Restarting | Error was unrecoverable
2022-06-10T02:00:08Z | Driver Failure | failed to launch command with executor: rpc error: code = Unknown desc = failed to start command path="/usr/bin/bash" args=["/usr/bin/bash" "-c" "dd if=/dev/zero of=/dev/null count=1 || (echo "busted"; ls -ltrha /; ls -ltrha /dev/zero; id; capsh --print; ulimit -a; cat /proc/self/cgroup; sleep 1000)"]: open /dev/null: operation not permitted
2022-06-10T02:00:07Z | Task Setup | Building Task Directory
2022-06-10T02:00:07Z | Received | Task received by client

Kind of all leads me back to #12875?

In random googling, it also sounds like Delegate=yes might be wanted when launching Nomad under systemd, per https://systemd.io/CGROUP_DELEGATION/#some-donts. I tried that to no avail, although the picture in my head of systemd unit cgroups and Nomad cgroup v2 vs v1 paths is very fuzzy.

This is still easily reproducible for me on a vanilla Arch Linux install running the nomad binary as root from the command line, and on Ubuntu 22.04. I'm happy to try any patches or wild ideas.

@tgross tgross modified the milestones: 1.3.0, 1.3.x Jun 17, 2022
@mr-karan
Contributor

mr-karan commented Jun 20, 2022

@tgross I've updated to Nomad 1.3.1 but am getting this error when running CSI plugin jobs with the raw_exec driver. Since this issue also talks about raw_exec and exec jobs behaving differently when accessing /dev mount points, I believe it's a related issue. Let me know if I should open a separate one.

failed to setup alloc: pre-run hook "csi_hook" failed: node plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = could not format "/dev/nvme1n1" and mount it at "/local/csi/staging/redis-aux/rw-file-system-single-node-writer": failed to get disk format of disk /dev/nvme1n1: open /dev/null: operation not permitted 

@tgross
Member

tgross commented Jun 20, 2022

Yeah, let's open a new issue for that, @mr-karan. There are far more moving parts involved, and we'll want to get all the CSI plugin logs and whatnot to diagnose it. (Although I'll note that CSI plugins generally aren't intended to be run without isolation, so the fact that they work as raw_exec jobs is mostly just happy coincidence; but let's take that up in the new issue.)

@Rumbles

Rumbles commented Aug 16, 2022

I have noticed that when I updated the machines running the Nomad client to cgroup v2, we started having issues with an application we run via raw_exec, Teleport:

task "teleport" {  
  driver = "raw_exec"  
  resources {  
    memory = 350  
    memory_max = 100  
  }  
  config {  
    command = "/opt/bin/teleport"  
    args = [  
      "start", "--config=/opt/config/teleport/teleport.yaml", "--roles=node", "--diag-addr=0.0.0.0:3000"  
    ]  
  }  
}

And when the machines are configured to run cgroupv2, teleport regularly errors after startup, with this message:

2022-08-16T09:35:19Z WARN             roles admin,default-implicit-role can not forward agent for core regular/sshserver.go:1707
2022-08-16T09:35:19Z WARN             Could not start PTY open /dev/ptmx: operation not permitted srv/term.go:153
2022-08-16T09:35:19Z ERRO             open /dev/ptmx: operation not permitted regular/sshserver.go:2026

Switching back to cgroup v1 seems to fix the issue.
Strangely, I haven't seen this behaviour on the machines that run the Nomad servers, only on the agents; the servers are also running cgroup v2, but I haven't seen the same error on those machines.
Also strangely, this doesn't always happen, and if I restart the Teleport job it might work on a machine for a few minutes before giving the same error back when trying to connect.

This looks similar to the reported issue, but if you want me to raise a new issue please let me know.

We are running Nomad 1.3.3 on flatcar linux 3227.2.1

@shoenig
Member

shoenig commented Aug 29, 2022

I'm not 100% certain, but I suspect #14372 and #14373 should fix these issues

@mmcquillan mmcquillan removed this from the 1.3.x milestone Aug 31, 2022
@mr-karan
Contributor

mr-karan commented Sep 6, 2022

Also strangely, this doesn't always happen, and if I restart the teleport job it might work on the machines for a few minutes before giving the same error back when trying to connect.

This is what I am experiencing as well here: #13538 (comment)

I'm running 1.3.1

dermetfan added a commit to input-output-hk/tullia that referenced this issue Jan 9, 2023
dermetfan added a commit to input-output-hk/tullia that referenced this issue Jan 9, 2023
dermetfan added a commit to input-output-hk/ci-world that referenced this issue Feb 2, 2023
dermetfan added a commit to input-output-hk/ci-world that referenced this issue Feb 2, 2023
@TrueBrain
Contributor

TrueBrain commented Jun 14, 2023

I have absolutely no clue if any of this is useful, as up to an hour ago I knew nothing about how cgroups worked, but this problem has hit me like a truck from the left. I was nearly finished setting up the cluster I have been working on for the last few weeks .. only to notice random outages when reacting to things like AWS ASG lifecycle hooks .. which run, as you guessed it, with the exec driver :) (although the report mentions raw_exec, I assume the underlying issue is the same).

So, the important bits for context: I have a job that runs with the exec driver; this runs on a Nomad cluster on AWS, using their latest Amazon Linux 2023 image, which has cgroups v2 enabled. Otherwise it is a very stock installation with a very empty configuration. For more details, see: nomad.hcl and install commands. It couldn't be more stock, I think :)

Initially, the jobs always start fine and can do everything as you would expect. Then, after a few minutes / hours (I haven't measured this yet), they lose access to /dev files. This breaks everything: the job can no longer execute the AWS CLI to gather information about its own instance, as /dev/urandom is no longer reachable, etc.

So diving into cgroups, you end up in BPF world, as that decides whether a file can be accessed or not. Now this is where it gets weird. When just starting a job and checking out the BPF, it looks like this (dumped with bpftool prog dump xlated id <id>):

   0: (61) r2 = *(u32 *)(r1 +0)
   1: (54) w2 &= 65535
   2: (61) r3 = *(u32 *)(r1 +0)
   3: (74) w3 >>= 16
   4: (61) r4 = *(u32 *)(r1 +4)
   5: (61) r5 = *(u32 *)(r1 +8)
   6: (55) if r2 != 0x1 goto pc+5
   7: (bc) w1 = w3
   8: (54) w1 &= 1
   9: (5d) if r1 != r3 goto pc+2
(..)
  63: (95) exit

With a lot of w0 = 1 in between 9 and 63, meaning access is granted. Now for a job that is no longer working, the BPF looks like this:

   0: (61) r2 = *(u32 *)(r1 +0)
   1: (54) w2 &= 65535
   2: (61) r3 = *(u32 *)(r1 +0)
   3: (74) w3 >>= 16
   4: (61) r4 = *(u32 *)(r1 +4)
   5: (61) r5 = *(u32 *)(r1 +8)
   6: (b4) w0 = 0
   7: (95) exit

Now I can fully understand that all device access fails with a BPF like that; it literally denies everything (w0 = 0). But now the million-dollar question: why does this BPF change after a certain amount of uptime? That I haven't discovered yet, but it is the next thing to explore.

Again, no clue if this is useful, or whether it is just "how it is supposed to work". At least I now know why it returns "not permitted". Now to find out why that happens :)

Edit:
Another observation: all the BPFs of jobs that no longer work were written at the same moment in time:

507: cgroup_device  tag fb6cb1c301453333
        loaded_at 2023-06-14T11:37:56+0000  uid 0
        xlated 64B  jited 128B  memlock 4096B
508: cgroup_device  tag fb6cb1c301453333
        loaded_at 2023-06-14T11:37:56+0000  uid 0
        xlated 64B  jited 128B  memlock 4096B
509: cgroup_device  tag fb6cb1c301453333
        loaded_at 2023-06-14T11:37:56+0000  uid 0
        xlated 64B  jited 128B  memlock 4096B
510: cgroup_device  tag fb6cb1c301453333
        loaded_at 2023-06-14T11:37:56+0000  uid 0
        xlated 64B  jited 128B  memlock 4096B

Some of these jobs have been running for 20 hours, others for 3 hours, and Nomad mentions no activity on the jobs. Yet they all have their BPF changed at the exact same time.
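Purely for illustration (neither Nomad nor anything in this thread uses this library; it just makes the dump above concrete): the short 64-byte program loads the request context and then unconditionally returns 0, i.e. it is a deny-everything CGROUP_DEVICE filter. A minimal sketch of a functionally equivalent program, built in Go with github.com/cilium/ebpf, might look like:

package main

import (
	"fmt"
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/asm"
)

func main() {
	// Functionally equivalent to the xlated dump above: return 0 for every
	// check, which tells the kernel to deny every device open/mknod in any
	// cgroup this program gets attached to.
	prog, err := ebpf.NewProgram(&ebpf.ProgramSpec{
		Type: ebpf.CGroupDevice,
		Instructions: asm.Instructions{
			asm.Mov.Imm(asm.R0, 0), // w0 = 0 -> deny
			asm.Return(),           // exit
		},
		License: "Apache-2.0",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer prog.Close()

	// Attaching a program like this to a task's cgroup (e.g. with
	// link.AttachCgroup) is what turns reads of /dev/zero into
	// "operation not permitted".
	fmt.Println("loaded deny-all cgroup_device program, fd:", prog.FD())
}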

@shoenig
Member

shoenig commented Jun 14, 2023

Thanks for the detailed update @TrueBrain - do you mind adding the exact Nomad and Linux kernel versions?

@TrueBrain
Contributor

TrueBrain commented Jun 14, 2023

6.1.29-50.88.amzn2023.aarch64

nomad --version
Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af70885c02ab921dedbdf6bc406a1e886866f80

And another discovery: whenever a job starts with the docker driver, all existing BPFs are overwritten with this one. It seems the docker driver uses the "deny everything" BPF, and whenever it becomes active on a host, all the other BPFs are changed to it too. At least, this way I have already been able to reproduce the issue a few times.

@TrueBrain
Contributor

TrueBrain commented Jun 14, 2023

So to reproduce:

Start a job with the exec driver. This gets ID f90f2f07. I run:

bpftool cgroup show /sys/fs/cgroup/nomad.slice/f90*

It tells me ID 699 is attached.

I run bpftool prog and the last few entries are now:

692: cgroup_device  tag fb6cb1c301453333
        loaded_at 2023-06-14T15:09:34+0000  uid 0
        xlated 64B  jited 128B  memlock 4096B
694: cgroup_device  tag fb6cb1c301453333
        loaded_at 2023-06-14T15:09:34+0000  uid 0
        xlated 64B  jited 128B  memlock 4096B
695: cgroup_device  tag fb6cb1c301453333
        loaded_at 2023-06-14T15:09:34+0000  uid 0
        xlated 64B  jited 128B  memlock 4096B
696: cgroup_device  tag fb6cb1c301453333
        loaded_at 2023-06-14T15:09:34+0000  uid 0
        xlated 64B  jited 128B  memlock 4096B
699: cgroup_device  tag 531db05b114e9af3
        loaded_at 2023-06-14T15:09:38+0000  uid 0
        xlated 512B  jited 472B  memlock 4096B

Notice: 699 is 512B in xlated.

I stop a job with the docker driver, and start it up again. All BPFs are changed. bpftool cgroup show /sys/fs/cgroup/nomad.slice/f90* now tells me ID 705 is active, and the last few entries are:

706: cgroup_device  tag fb6cb1c301453333
        loaded_at 2023-06-14T15:11:33+0000  uid 0
        xlated 64B  jited 128B  memlock 4096B
707: cgroup_device  tag fb6cb1c301453333
        loaded_at 2023-06-14T15:11:33+0000  uid 0
        xlated 64B  jited 128B  memlock 4096B
708: cgroup_device  tag fb6cb1c301453333
        loaded_at 2023-06-14T15:11:33+0000  uid 0
        xlated 64B  jited 128B  memlock 4096B
709: cgroup_device  name sd_devices  tag 26dcd5a87c1d07a2  gpl
        loaded_at 2023-06-14T15:11:33+0000  uid 0
        xlated 440B  jited 432B  memlock 4096B
712: cgroup_device  tag ab4bc4523b7fe6b4
        loaded_at 2023-06-14T15:11:33+0000  uid 0
        xlated 552B  jited 504B  memlock 4096B

I am not sure what is happening, but this doesn't feel right. But again, just to repeat myself too many times: I know nothing about what I am actually looking at; I'm just tracing symptoms here :) It seems that whenever a job with the docker driver is started, all existing BPFs are overwritten with the one docker is using, instead of being left alone. That explains why they are all loaded at the same time, and why it feels semi-random when it happens .. if you don't touch your cluster, it will work fine! Well, until a container crashes and restarts :)

Edit: this also holds true if you restart another job that uses the exec driver. It doesn't seem to be related to the docker driver itself; any stop/start changes all previous BPFs to a "deny everything" variant.

@shoenig
Member

shoenig commented Jun 14, 2023

Oh wow, a reproduction is a huge step forward @TrueBrain, thanks again.

No doubt docker is managing the device controller interface for the shared parent cgroup even though we don't want it to. The device controller is (currently) beyond my skill set, but perhaps a dumb solution would be to just move the docker cgroups somewhere else.

https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#device-controller

@TrueBrain
Contributor

Haha, thanks @shoenig; the only question here is: can anyone else reproduce it :D I found even simpler steps:

Start job A with exec driver.
Start job B with exec driver.
Stop job A.

Job B can no longer access any files in /dev.

@TrueBrain
Contributor

TrueBrain commented Jun 14, 2023

Okay, to finalize, here are some very simple steps to reproduce (sorry, I am always really happy when an issue can be condensed to a few simple steps :D):

  • Start nomad in dev-mode on any machine with cgroupv2 (tried on any machine I could get my hands on, from kernel 5.15 up to 6.2, with any distro; so yeah: any :P)
  • Load in the HCL as below
  • Wait for both jobs to become active
  • Run nomad alloc exec <alloc> dd if=/dev/zero of=/dev/null count=1 on both jobs; it is likely they both work fine.
  • Run nomad alloc stop <alloc> on one of the two
  • Run nomad alloc exec <alloc> dd if=/dev/zero of=/dev/null count=1 on the one you did not stop. Observe you get permission errors.
  • Run nomad alloc exec <alloc> dd if=/dev/zero of=/dev/null count=1 on the new alloc; observe it does work
  • And you can repeat this .. only the last allocation will work, the other will fail

Now that should make debugging significantly easier, as it breaks on a developer machine too :D As for the solution .. I wouldn't even know where to start looking :(

nomad.hcl
job "test" {
  type = "service"

  group "test1" {
    count = 1

    task "test1" {
      driver = "raw_exec"

      config {
        command = "sleep"
        args = [
          "100d",
        ]
      }
    }
  }

  group "test2" {
    count = 1

    task "test2" {
      driver = "raw_exec"

      config {
        command = "sleep"
        args = [
          "100d",
        ]
      }
    }
  }
}

PS: I did notice it only breaks when a new allocation is made; not when a process is restarted within its own allocation.

@TrueBrain
Contributor

TrueBrain commented Jun 14, 2023

After some poking in the code (I had never touched a Go project before .. that was interesting, but I have to say, getting up and running is very smooth), the issue seems to be caused by the cpuset manager.

From what I can tell, every time an allocation changes, the cgroup is written to for all allocations:

func (c *cpusetManagerV2) reconcile() {
	for id := range c.sharing {
		c.write(id, c.pool)
	}
	for id, set := range c.isolating {
		c.write(id, c.pool.Union(set))
	}
}

And very specifically:

if err = m.Set(&configs.Resources{
	CpusetCpus: set.String(),
}); err != nil {
	c.logger.Error("failed to set cgroup", "path", path, "error", err)
	return
}

Which ends up, I think, here:

https://github.com/opencontainers/runc/blob/5cf9bb229feed19a767cbfdf9702f6487341e29e/libcontainer/cgroups/fs2/fs2.go#L150-L204

This seems to suggest it removes all device rules when that function is called without any. But it also suggests it removes all memory restrictions etc., and I am pretty sure that is not true; in other words, I am pretty sure I am not actually following what happens, but I just hope that if I talk long enough someone goes: oh, shit, this is broken, and fixes the issue :)

Sadly, still no clue how to work around this, as that is kinda the only thing I am looking for :D

Edit: ah, okay, CPU etc aren't changed if all values are zero. Seems that isn't the case for devices .. delving deeper into the rabbit hole :)

And finally found the answer here:

https://github.com/opencontainers/runc/blob/5cf9bb229feed19a767cbfdf9702f6487341e29e/libcontainer/cgroups/devices/v2.go#L55-L57

I will make a Pull Request in a bit.
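To make that concrete, here is a minimal sketch of the kind of change this points at, mirroring the Set call above and assuming runc's configs.Resources.SkipDevices field (the same one used in the executor patch earlier in this thread); the actual pull request may well look different:

// In the cpuset manager's write path, only the cpuset is meant to change,
// so ask runc's fs2 manager to leave the device eBPF program alone instead
// of replacing it with the default deny-all program.
if err := m.Set(&configs.Resources{
	CpusetCpus:  set.String(),
	SkipDevices: true, // don't re-attach a BPF_CGROUP_DEVICE filter
}); err != nil {
	c.logger.Error("failed to set cgroup", "path", path, "error", err)
	return
}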

@TrueBrain
Contributor

TrueBrain commented Jun 14, 2023

Right, I created a Pull Request with a change that fixes it, for me at least. So there is that.

Sadly, I haven't found any way to work around this problem short of disabling the cpuset controller in cgroups, and I haven't found a good persistent way to do that either, except for changing the kernel cmdline. So I guess one can hope this PR actually fixes the problem and travels to a release real soon? :D
