
Failed to set the cpuset cgroup for container on WSL2 #10709

Closed
TGNThump opened this issue Jun 6, 2021 · 18 comments
@TGNThump

TGNThump commented Jun 6, 2021

Nomad version

Nomad v1.1.0 (2678c3604bc9530014208bc167415e167fd440fc)

Operating system and Environment details

Nomad running on Ubuntu 20.04.2 LTS on WSL2 4.19.84-microsoft-standard on Windows 10 20H2 (OS Build 19042.985)
Docker Desktop v3.3.3, Docker Engine v20.10.6

Issue

failed to set the cpuset cgroup for container: failed to write 3839 to cgroup.procs: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: no such process

Reproduction steps

Run Docker Desktop with WSL2 integration enabled, in Linux containers mode.
Run nomad job run -check-index 0 kratos.jobspec.hcl with Nomad running from WSL2.

Expected Result

The job should start and run successfully.

Actual Result

See errors: failed to set the cpuset cgroup for container: failed to write 3839 to cgroup.procs: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: no such process

nomad.log
kratos.jobspec.hcl

@TGNThump TGNThump changed the title failed to set the cpuset cgroup for container on WSL2 Failed to set the cpuset cgroup for container on WSL2 Jun 6, 2021
@tgross
Member

tgross commented Jun 7, 2021

Hi @TGNThump. Can you give a little more context as to where the Nomad client is actually running here? You're saying you've got Docker Desktop on one hand but then WSL2 on the other. Also, are you running the Nomad client as root? That's a hard requirement.

Just a heads up that, in general, the WSL2 environment is not well supported: #2633. You're going to find that the WSL2 kernel is "weird". Note that it looks like Nomad can't detect bridge networking in this environment either. From your logs:

    2021-06-06T19:52:09.824+0100 [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=[arch, bridge, cgroup, cni, consul, cpu, host, memory, network, nomad, signal, storage, vault, env_aws, env_gce, env_azure]
    2021-06-06T19:52:09.824+0100 [WARN]  client.fingerprint_mgr: failed to detect bridge kernel module, bridge network mode disabled: error="3 errors occurred:
	* module bridge not in /proc/modules
	* failed to open /lib/modules/4.19.84-microsoft-standard/modules.builtin: open /lib/modules/4.19.84-microsoft-standard/modules.builtin: no such file or directory
	* failed to open /lib/modules/4.19.84-microsoft-standard/modules.dep: open /lib/modules/4.19.84-microsoft-standard/modules.dep: no such file or directory
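
For context, those three errors correspond to the three places the fingerprinter looks for the bridge kernel module. A simplified sketch of that check (not the actual fingerprint code, and matching by substring only) would look something like this; WSL2 kernels usually ship without /lib/modules at all, so every check fails:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// bridgeModulePresent mirrors the three checks in the warning above:
// the loaded-module list in /proc/modules, plus the builtin and
// dependency lists under /lib/modules/<release>.
func bridgeModulePresent(kernelRelease string) bool {
	candidates := []string{
		"/proc/modules",
		"/lib/modules/" + kernelRelease + "/modules.builtin",
		"/lib/modules/" + kernelRelease + "/modules.dep",
	}
	for _, path := range candidates {
		data, err := os.ReadFile(path)
		if err != nil {
			continue // missing on WSL2, which is the second and third error above
		}
		if bytes.Contains(data, []byte("bridge")) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(bridgeModulePresent("4.19.84-microsoft-standard"))
}
```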

@tgross tgross self-assigned this Jun 7, 2021
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jun 7, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jun 7, 2021
@TGNThump
Author

TGNThump commented Jun 7, 2021

Hey @tgross, Nomad is running in --dev mode as root inside the ubuntu WSL2 instance.

@tgross
Member

tgross commented Jun 7, 2021

Thanks @TGNThump.

I traced this error message down to runc, specifically either here or, more likely given the error message, libcontainer/cgroups/utils.go#L398 where we're trying to write the cgroup for the PID in the container. Based on the surrounding code, it looks like the process is not leaving TASK_NEW kernel state to become TASK_RUNNING (see sched.h#L81-L98 in the kernel code) and runc has tried 5 times with a 30ms delay in between. Ordinarily I would say this is an unrealistic scenario but I don't know what the process spawn code path looks like in WSL2.
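
For illustration, the write it's retrying is shaped roughly like this (a paraphrased sketch of the behavior described above, not the upstream runc code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"time"
)

// writeCgroupProc writes a PID into a cgroup's cgroup.procs file,
// retrying a few times because the task may not have left TASK_NEW yet.
// The 5 attempts / 30ms delay mirror the behavior described above.
func writeCgroupProc(cgroupDir string, pid int) error {
	path := filepath.Join(cgroupDir, "cgroup.procs")
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if err = os.WriteFile(path, []byte(strconv.Itoa(pid)), 0644); err == nil {
			return nil
		}
		// ESRCH ("no such process") here means the kernel has no runnable
		// task for this PID, either because it hasn't finished starting
		// or because it has already exited.
		time.Sleep(30 * time.Millisecond)
	}
	return fmt.Errorf("failed to write %d to %s: %w", pid, path, err)
}

func main() {
	// Example: try to move this process into the cgroup from the error message.
	fmt.Println(writeCgroupProc("/sys/fs/cgroup/cpuset/nomad/shared", os.Getpid()))
}
```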

Unfortunately it looks like we'll need to open a bug with upstream here. We're on version rc93 of runc, but it looks like from that file's recent commits that there's been nothing helpful to us here since then either. We'll start with runc, but it's possible they'll in turn kick us up to the WSL2 project.

Before we go down that path, does this happen with any other container image?

@tgross tgross added the theme/dependencies Pull requests that update a dependency file label Jun 7, 2021
@tgross
Member

tgross commented Jun 7, 2021

My colleague @notnoop pointed out to me that we'd see the exact same error (which originates from driver.go#L353) if the process has died before we get a chance to move it into the cpuset. So the other thing we'd want to figure out here is whether that Postgres container works at all in your environment; if it immediately crashes, it could happen quickly enough that we'd see this error, and that would be misleading.
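
Illustrative only (not the actual driver code): one way to tell the two failure modes apart is to probe the PID with signal 0, which checks for existence without delivering a signal. A process that crashed immediately is already gone; one that is still stuck starting up will still exist:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// pidAlive reports whether the kernel still has a process for pid.
// EPERM means the process exists but we can't signal it, so it still
// counts as alive; ESRCH means it's already gone.
func pidAlive(pid int) bool {
	err := syscall.Kill(pid, 0)
	return err == nil || err == syscall.EPERM
}

func main() {
	fmt.Println(pidAlive(os.Getpid())) // true: this process is clearly running
}
```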

@TGNThump
Author

TGNThump commented Jun 7, 2021

Hmmm. Interestingly, the postgres container and the portainer container (which I also tried) are left running in Docker after the error in Nomad.
(screenshots attached)

@TGNThump
Author

TGNThump commented Jun 7, 2021

I also tried to run ubuntu:latest with Nomad. When I run it with docker run --rm -it ubuntu:latest, it works fine. When running with Nomad on WSL, the container dies as soon as it's created and the job fails with: client.driver_mgr.docker.docker_logger.stdio: received EOF, stopping recv loop: driver=docker err="rpc error: code = Unavailable desc = transport is closing".

ubuntu-nomad.log

Not sure if this is down to the same root cause, or something different.

@zyclonite

I am having the same issue on Fedora CoreOS, but I am running Nomad in a Podman container with access to the Docker socket, the required volumes mounted, and access to cgroups.

The error is the same: the PID cannot be bound to a cpuset, but the Docker process is created and keeps running. Nomad just loses control of the process after the error and reports it as failed.

My assumption is that CoreOS runs the Docker containers in a different cgroup namespace, so this PID cannot be assigned.
I was trying to change the cgroup_parent, but this did not work at all; Nomad was still using /nomad.

Side note: it would be great to be able to disable cpusets with a config flag.

@zyclonite

Did a quick test run.

The error is:
2021-06-08T07:44:15.044Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=c16a2df1-c2dd-232c-63f8-9c880c521758 task=rng error="failed to set the cpuset cgroup for container: failed to write 3546 to cgroup.procs: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: no such process"

but the PID exists here:
/sys/fs/cgroup/cpuset/system.slice/docker-5ed4f9b77190c267af5319b00da0d0fe3102cd01db90ed25e454d5ec0711d6df.scope/cgroup.procs
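
For anyone reproducing this, a quick way to confirm which cgroups the PID actually landed in is to read /proc/<pid>/cgroup; a small sketch with the PID from above hard-coded:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// /proc/<pid>/cgroup lists every hierarchy the PID belongs to,
	// one "id:controller:path" entry per line.
	pid := 3546 // the PID from the error above; substitute as needed
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		fmt.Fprintln(os.Stderr, "PID is gone or unreadable:", err)
		os.Exit(1)
	}
	fmt.Print(string(data))
}
```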

@tgross
Member

tgross commented Jun 8, 2021

I am having the same issue on Fedora CoreOS, but I am running Nomad in a Podman container with access to the Docker socket, the required volumes mounted, and access to cgroups.

Hi @zyclonite, we strongly discourage anyone from running Nomad clients in containers; it's not a documented, tested, or supported configuration.

@tgross
Member

tgross commented Jun 8, 2021

When running with Nomad on WSL, the container dies as soon as it's created and the job fails with: client.driver_mgr.docker.docker_logger.stdio: received EOF, stopping recv loop: driver=docker err="rpc error: code = Unavailable desc = transport is closing".

Not much in the Nomad logs there, unfortunately. Does the Docker event log have anything in this situation? It might have more error information we could use to debug.

@tgross
Member

tgross commented Jun 8, 2021

Interestingly, the postgres container and the portainer container (which I also tried) are left running in Docker after the error in Nomad.

I think I understand why that is, and that's a clear bug: when we hit an error condition on this path we should be stopping the container, just as we do when the logging setup fails at driver.go#L367. Still working on why the error happens in the first place.
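
To make the intended behavior concrete, here's a minimal sketch of that cleanup pattern (stub types for illustration, not the actual driver code):

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for a created container; the real driver holds a Docker API handle.
type container struct{ id string }

func (c *container) stop() error {
	fmt.Println("stopping container", c.id)
	return nil
}

// Simulates the post-create step that is failing in this issue.
func setCpusetCgroup(c *container) error {
	return errors.New("failed to write PID to cgroup.procs: no such process")
}

// startTask shows the shape of the fix: any failure after the container
// has been created should stop the container before returning the error,
// instead of leaving it running outside Nomad's control.
func startTask() error {
	c := &container{id: "5ed4f9b77190"}
	if err := setCpusetCgroup(c); err != nil {
		if stopErr := c.stop(); stopErr != nil {
			fmt.Println("also failed to stop container:", stopErr)
		}
		return err
	}
	return nil
}

func main() {
	fmt.Println(startTask())
}
```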

@zyclonite

zyclonite commented Jun 8, 2021

I am having the same issue on Fedora CoreOS, but I am running Nomad in a Podman container with access to the Docker socket, the required volumes mounted, and access to cgroups.

Hi @zyclonite, we strongly discourage anyone from running Nomad clients in containers; it's not a documented, tested, or supported configuration.

@tgross
Understood. I did run it on the node directly and it worked.

But the option to disable cpuset usage would still be nice ;)

@tgross
Member

tgross commented Jun 16, 2021

@TGNThump would you be willing to test this out again on Nomad 1.1.1? We landed #10416, which will at least eliminate the "early exit" case by not letting issues with the cpuset confuse us.

@TGNThump
Author

Sorry, I've been on holiday. I'll give this a try tomorrow.

@TGNThump
Author

@tgross v1.1.1 does indeed fix the cpuset issue.

@TGNThump
Author

I also tried to run ubuntu:latest with Nomad. When I run it with docker run --rm -it ubuntu:latest, it works fine. When running with Nomad on WSL, the container dies as soon as it's created and the job fails with: client.driver_mgr.docker.docker_logger.stdio: received EOF, stopping recv loop: driver=docker err="rpc error: code = Unavailable desc = transport is closing".

ubuntu-nomad.log

Not sure if this is down to the same root cause, or something different.

Happy to send over the Docker Desktop diagnostics zip if that's helpful; I don't really know what I'm looking for. Also happy to create a new issue, given that the cpuset issue has been resolved.

@tgross
Member

tgross commented Jun 17, 2021

Let's create a new issue for that. Thanks @TGNThump!

@tgross tgross closed this as completed Jun 17, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done Jun 17, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 18, 2022