
Failed to set the cpuset cgroup for container on WSL2 #10709

Closed
TGNThump opened this issue Jun 6, 2021 · 18 comments
@TGNThump

TGNThump commented Jun 6, 2021

Nomad version

Nomad v1.1.0 (2678c3604bc9530014208bc167415e167fd440fc)

Operating system and Environment details

Nomad running on Ubuntu 20.04.2 LTS on WSL2 4.19.84-microsoft-standard on Windows 10 20H2 (OS Build 19042.985)
Docker Desktop v3.3.3, Docker Engine v20.10.6

Issue

failed to set the cpuset cgroup for container: failed to write 3839 to cgroup.procs: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: no such process

Reproduction steps

Run Docker Desktop with WSL2 integration enabled, in Linux containers mode.
Run nomad job run -check-index 0 kratos.jobspec.hcl with Nomad running from WSL2.

Expected Result

The job should start and run successfully.

Actual Result

See errors: failed to set the cpuset cgroup for container: failed to write 3839 to cgroup.procs: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: no such process

nomad.log
kratos.jobspec.hcl

@TGNThump TGNThump changed the title failed to set the cpuset cgroup for container on WSL2 Failed to set the cpuset cgroup for container on WSL2 Jun 6, 2021
@tgross
Member

tgross commented Jun 7, 2021

Hi @TGNThump. Can you give a little more context as to where the Nomad client is actually running here? You're saying you've got Docker Desktop on one hand but then WSL2 on the other. Also, are you running the Nomad client as root? That's a hard requirement.

Just a heads up that, in general, the WSL2 environment is not well supported: #2633. You're going to find that the WSL2 kernel is "weird". Note that it looks like Nomad can't detect bridge networking in this environment either. From your logs:

    2021-06-06T19:52:09.824+0100 [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=[arch, bridge, cgroup, cni, consul, cpu, host, memory, network, nomad, signal, storage, vault, env_aws, env_gce, env_azure]
    2021-06-06T19:52:09.824+0100 [WARN]  client.fingerprint_mgr: failed to detect bridge kernel module, bridge network mode disabled: error="3 errors occurred:
	* module bridge not in /proc/modules
	* failed to open /lib/modules/4.19.84-microsoft-standard/modules.builtin: open /lib/modules/4.19.84-microsoft-standard/modules.builtin: no such file or directory
	* failed to open /lib/modules/4.19.84-microsoft-standard/modules.dep: open /lib/modules/4.19.84-microsoft-standard/modules.dep: no such file or directory
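
For context, those three errors correspond to the three places the fingerprinter looks for the bridge kernel module. A simplified sketch of that check (not the actual fingerprint code, and matching by substring only) would look something like this; WSL2 kernels usually ship without /lib/modules at all, so every check fails:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// bridgeModulePresent mirrors the three checks in the warning above:
// the loaded-module list in /proc/modules, plus the builtin and
// dependency lists under /lib/modules/<release>.
func bridgeModulePresent(kernelRelease string) bool {
	candidates := []string{
		"/proc/modules",
		"/lib/modules/" + kernelRelease + "/modules.builtin",
		"/lib/modules/" + kernelRelease + "/modules.dep",
	}
	for _, path := range candidates {
		data, err := os.ReadFile(path)
		if err != nil {
			continue // missing on WSL2, which is the second and third error above
		}
		if bytes.Contains(data, []byte("bridge")) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(bridgeModulePresent("4.19.84-microsoft-standard"))
}
```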

@tgross tgross self-assigned this Jun 7, 2021
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jun 7, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jun 7, 2021
@TGNThump
Author

TGNThump commented Jun 7, 2021

Hey @tgross, Nomad is running in --dev mode as root inside the ubuntu WSL2 instance.

@tgross
Member

tgross commented Jun 7, 2021

Thanks @TGNThump.

I traced this error message down to runc, specifically either here or, more likely given the error message, libcontainer/cgroups/utils.go#L398 where we're trying to write the cgroup for the PID in the container. Based on the surrounding code, it looks like the process is not leaving TASK_NEW kernel state to become TASK_RUNNING (see sched.h#L81-L98 in the kernel code) and runc has tried 5 times with a 30ms delay in between. Ordinarily I would say this is an unrealistic scenario but I don't know what the process spawn code path looks like in WSL2.
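
For illustration, the write it's retrying is shaped roughly like this (a paraphrased sketch of the behavior described above, not the upstream runc code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"time"
)

// writeCgroupProc writes a PID into a cgroup's cgroup.procs file,
// retrying a few times because the task may not have left TASK_NEW yet.
// The 5 attempts / 30ms delay mirror the behavior described above.
func writeCgroupProc(cgroupDir string, pid int) error {
	path := filepath.Join(cgroupDir, "cgroup.procs")
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if err = os.WriteFile(path, []byte(strconv.Itoa(pid)), 0644); err == nil {
			return nil
		}
		// ESRCH ("no such process") here means the kernel has no runnable
		// task for this PID, either because it hasn't finished starting
		// or because it has already exited.
		time.Sleep(30 * time.Millisecond)
	}
	return fmt.Errorf("failed to write %d to %s: %w", pid, path, err)
}

func main() {
	// Example: try to move this process into the cgroup from the error message.
	fmt.Println(writeCgroupProc("/sys/fs/cgroup/cpuset/nomad/shared", os.Getpid()))
}
```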

Unfortunately it looks like we'll need to open a bug with upstream here. We're on version rc93 of runc, but it looks like from that file's recent commits that there's been nothing helpful to us here since then either. We'll start with runc, but it's possible they'll in turn kick us up to the WSL2 project.

Before we go down that path, does this happen with any other container image?

@tgross tgross added the theme/dependencies Pull requests that update a dependency file label Jun 7, 2021
@tgross
Member

tgross commented Jun 7, 2021

My colleague @notnoop pointed out to me that we'd see the exact same error (which originates from driver.go#L353) if the process has died before we get a chance to move it into the cpuset. So the other thing we'd want to figure out here is whether that Postgres container works at all in your environment; if it immediately crashes, it could happen quickly enough that we'd see this error, and that would be misleading.
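
Illustrative only (not the actual driver code): one way to tell the two failure modes apart is to probe the PID with signal 0, which checks for existence without delivering a signal. A process that crashed immediately is already gone; one that is still stuck starting up will still exist:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// pidAlive reports whether the kernel still has a process for pid.
// EPERM means the process exists but we can't signal it, so it still
// counts as alive; ESRCH means it's already gone.
func pidAlive(pid int) bool {
	err := syscall.Kill(pid, 0)
	return err == nil || err == syscall.EPERM
}

func main() {
	fmt.Println(pidAlive(os.Getpid())) // true: this process is clearly running
}
```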

@TGNThump
Author

TGNThump commented Jun 7, 2021

Hmmm. Interestingly, the postgres container and the portainer container (which I also tried) are left running in Docker after the error in Nomad.
(screenshots attached)

@TGNThump
Author

TGNThump commented Jun 7, 2021

I also tried to run ubuntu:latest with Nomad. When I run it with docker run --rm -it ubuntu:latest, it works fine. When running with Nomad on WSL, the container dies as soon as it's created and the job fails with: client.driver_mgr.docker.docker_logger.stdio: received EOF, stopping recv loop: driver=docker err="rpc error: code = Unavailable desc = transport is closing".

ubuntu-nomad.log

Not sure if this is down to the same root cause, or something different.

@zyclonite

I am having the same issue on Fedora CoreOS, but I am running Nomad in a Podman container with access to the Docker socket, the required volumes mounted, and access to cgroups.

The error is the same: the PID cannot be bound to a cpuset, but the Docker process is created and keeps running. Nomad just loses control of the process after the error and reports it as failed.

My assumption is that CoreOS runs the Docker containers in a different cgroup namespace, so this PID cannot be assigned.
I was trying to change the cgroup_parent, but this did not work at all; Nomad was still using /nomad.

Side note: it would be great to be able to disable cpusets with a config flag.

@zyclonite

Did a quick test run.

The error is:
2021-06-08T07:44:15.044Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=c16a2df1-c2dd-232c-63f8-9c880c521758 task=rng error="failed to set the cpuset cgroup for container: failed to write 3546 to cgroup.procs: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: no such process"

but the PID exists here:
/sys/fs/cgroup/cpuset/system.slice/docker-5ed4f9b77190c267af5319b00da0d0fe3102cd01db90ed25e454d5ec0711d6df.scope/cgroup.procs
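
For anyone reproducing this, a quick way to confirm which cgroups the PID actually landed in is to read /proc/<pid>/cgroup; a small sketch with the PID from above hard-coded:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// /proc/<pid>/cgroup lists every hierarchy the PID belongs to,
	// one "id:controller:path" entry per line.
	pid := 3546 // the PID from the error above; substitute as needed
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		fmt.Fprintln(os.Stderr, "PID is gone or unreadable:", err)
		os.Exit(1)
	}
	fmt.Print(string(data))
}
```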

@tgross
Member

tgross commented Jun 8, 2021

I am having the same issue on Fedora CoreOS, but I am running Nomad in a Podman container with access to the Docker socket, the required volumes mounted, and access to cgroups.

Hi @zyclonite, we strongly discourage anyone from running Nomad clients in containers; it's not a documented, tested, or supported configuration.

@tgross
Member

tgross commented Jun 8, 2021

When running with Nomad on WSL, the container dies as soon as it's created and the job fails with: client.driver_mgr.docker.docker_logger.stdio: received EOF, stopping recv loop: driver=docker err="rpc error: code = Unavailable desc = transport is closing".

Not much in the Nomad logs there, unfortunately. Does the Docker event log have anything in this situation? It might have more error information we could use to debug.

@tgross
Member

tgross commented Jun 8, 2021

Interestingly, the postgres container and the portainer container (which I also tried) are left running in Docker after the error in Nomad.

I think I understand why that is, and that's a clear bug: when we hit an error condition on this path we should be stopping the container, just as we do when the logging setup fails at driver.go#L367. Still working on why the error happens in the first place.
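
To make the intended behavior concrete, here's a minimal sketch of that cleanup pattern (stub types for illustration, not the actual driver code):

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for a created container; the real driver holds a Docker API handle.
type container struct{ id string }

func (c *container) stop() error {
	fmt.Println("stopping container", c.id)
	return nil
}

// Simulates the post-create step that is failing in this issue.
func setCpusetCgroup(c *container) error {
	return errors.New("failed to write PID to cgroup.procs: no such process")
}

// startTask shows the shape of the fix: any failure after the container
// has been created should stop the container before returning the error,
// instead of leaving it running outside Nomad's control.
func startTask() error {
	c := &container{id: "5ed4f9b77190"}
	if err := setCpusetCgroup(c); err != nil {
		if stopErr := c.stop(); stopErr != nil {
			fmt.Println("also failed to stop container:", stopErr)
		}
		return err
	}
	return nil
}

func main() {
	fmt.Println(startTask())
}
```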

@zyclonite

zyclonite commented Jun 8, 2021

I am having the same issue on Fedora CoreOS, but I am running Nomad in a Podman container with access to the Docker socket, the required volumes mounted, and access to cgroups.

Hi @zyclonite, we strongly discourage anyone from running Nomad clients in containers; it's not a documented, tested, or supported configuration.

@tgross
Understood. I did run it on the node directly and it worked.

But the option to disable cpuset usage would still be nice ;)

@tgross
Member

tgross commented Jun 16, 2021

@TGNThump would you be willing to test this out again on Nomad 1.1.1? We landed #10416, which will at least eliminate the "early exit" case by not letting issues with the cpuset confuse us.

@TGNThump
Author

Sorry, I've been on holiday. I'll give this a try tomorrow.

@TGNThump
Author

@tgross v1.1.1 does indeed fix the cpuset issue.

@TGNThump
Author

I also tried to run ubuntu:latest with Nomad. When I run it with docker run --rm -it ubuntu:latest, it works fine. When running with Nomad on WSL, the container dies as soon as it's created and the job fails with: client.driver_mgr.docker.docker_logger.stdio: received EOF, stopping recv loop: driver=docker err="rpc error: code = Unavailable desc = transport is closing".

ubuntu-nomad.log

Not sure if this is down to the same root cause, or something different.

Happy to send over the Docker Desktop diagnostics zip if that's helpful; I don't really know what I'm looking for. Also happy to create a new issue, given that the cpuset issue has been resolved.

@tgross
Member

tgross commented Jun 17, 2021

Let's create a new issue for that. Thanks @TGNThump!

@tgross tgross closed this as completed Jun 17, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done Jun 17, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 18, 2022