Migrate to CGroupv2 #78

Closed

seirl opened this issue May 12, 2019 · 27 comments

seirl commented May 12, 2019

When running a container with systemd-nspawn, systemd remounts /sys/fs/cgroup read-only. This prevents isolate from creating its own cgroups under /sys/fs/cgroup.

Apparently, this is intended: isolate shouldn't create its own cgroup at the root of the hierarchy, but in a subgroup of the one provided by systemd: https://lists.freedesktop.org/archives/systemd-devel/2017-November/039736.html

I'm completely unfamiliar with the cgroup/Delegate API of systemd, so I'm not sure what a proper fix should look like. I'll try to investigate, but if anyone already knows what a good fix would be, don't hesitate to tell me :-P
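A minimal sketch of the symptom from inside such a container (the box-0 path is illustrative, not necessarily the exact name isolate creates):

# The controller hierarchies are mounted read-only, so creating a cgroup fails:
mkdir /sys/fs/cgroup/memory/box-0
mkdir: cannot create directory '/sys/fs/cgroup/memory/box-0': Read-only file system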

seirl commented May 14, 2019

Apparently this is a WONTFIX and can only be resolved by moving to cgroupsv2: https://lists.freedesktop.org/archives/systemd-devel/2019-May/042558.html

On Sep 28, 2019, seirl changed the title from “Cgroups in systemd-nspawn containers: /sys/fs/cgroup is mounted as read-only” to “Migrate to CGroupv2”.
edomora97 commented

I don't know where this issue fits in the roadmap of isolate, but I think it should be prioritized a bit. Arch Linux has disabled cgroups v1 by default, and isolate therefore stopped working there. They can still be enabled manually by adding systemd.unified_cgroup_hierarchy=0 to the kernel parameters (link), but that's not a long-term solution.

Could you update us on it? Thanks!

gollux commented Nov 5, 2021

This is currently the top item on my TODO list.

gollux commented Jan 25, 2022

A rudimentary implementation of the move to cgroup v2 is in the cg2 branch.

First of all, I had to solve integration with systemd. It can delegate some types of cgroups to other managers, but apparently the only way to make the cgroup persistent is to keep a process in it. I therefore wrote a simple daemon called isolate-cg-keeper, which sets up the cgroup and then sleeps forever. See systemd/isolate.service and systemd/isolate.scope for the relevant systemd configuration.
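For readers unfamiliar with delegation: the key directive is Delegate= in the service unit. A rough sketch of the idea (an illustrative approximation, not the project's actual systemd/isolate.service):

[Service]
# Ask systemd to hand this service its own cgroup subtree:
Delegate=yes
Slice=isolate.slice
ExecStart=/usr/local/sbin/isolate-cg-keeper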

Isolate's config file now contains the path to a master cgroup, under which cgroups for individual sandboxes will be created. If you are using systemd, this is the cgroup maintained by isolate.service. With other service managers, you have to create it yourself and configure isolate to use it.

The good news is that the switch to cgroup v2 simplified isolate a lot.

The bad news is that I failed to find a way to measure maximum memory usage: there is nothing like memory.max_usage_in_bytes in cgroup v2.

The code is still almost untested, has plenty of rough edges, and comes with close to no documentation. However, if you want to get your feet wet, I will be glad for any feedback.

magula commented Aug 20, 2022

> The bad news is that I failed to find a way to measure maximum memory usage: there is nothing like memory.max_usage_in_bytes in cgroup v2.

There is now memory.peak, which seems to correspond to memory.max_usage_in_bytes from v1. It was introduced with this commit.
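A quick way to check for it (the cgroup path below is illustrative; memory.peak needs a sufficiently recent kernel, 5.19 if I remember correctly):

# Peak memory usage of a cgroup, in bytes:
cat /sys/fs/cgroup/isolate.slice/box-0/memory.peak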

There's no memory.swap.peak to replace memory.memsw.max_usage_in_bytes, though, so I guess there's still no way of correctly measuring memory usage with swap enabled (EDIT, to clarify: apart from the legacy memory.memsw.max_usage_in_bytes accounting).

wil93 commented Dec 18, 2022

I'm testing the cg2 branch of isolate on Ubuntu 22.04. This version of Ubuntu only has cgroups v2 available, see:

[screenshot showing that only cgroup v2 is available]
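(A quick way to confirm this from a shell, independent of the screenshot:

stat -fc %T /sys/fs/cgroup

prints cgroup2fs on a pure v2 system and tmpfs on a v1/hybrid one.)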

The CMS test suite fails with this new version of isolate; it seems that the --init step fails. Do you know if I'm doing something wrong? @gollux

This is the work-in-progress PR over at the CMS repo: cms-dev/cms#1222

2022-12-18 15:46:29,255 - ERROR [Worker,2 88 Worker::execute_job_group] Worker failed: Failed to initialize sandbox.
Traceback (most recent call last):
  File "/home/cmsuser/cms/cms/grading/Sandbox.py", line 1416, in initialize_isolate
    subprocess.check_call(init_cmd)
  File "/usr/local/lib/python3.8/dist-packages/gevent/subprocess.py", line 316, in check_call
    raise CalledProcessError(retcode, cmd) # pylint:disable=undefined-variable
subprocess.CalledProcessError: Command '['isolate', '--cg', '--box-id=31', '--init']' returned non-zero exit status 2.

wil93 commented Dec 18, 2022

OK, I realized I have to install the systemd configuration files 😅 Now I'm getting some more interesting errors.

When I try to start the isolate.service I see this:

Dec 18 16:49:22 f8df5033d1b0 systemd[1]: Started A trivial daemon to keep Isolate's control group hierarchy.
Dec 18 16:49:22 f8df5033d1b0 systemd[1]: isolate.service: Main process exited, code=exited, status=1/FAILURE
Dec 18 16:49:22 f8df5033d1b0 isolate-cg-keeper[3780]: Cannot create subgroup /sys/fs/cgroup/isolate.slice/isolate.service/daemon: No such file or directory
Dec 18 16:49:22 f8df5033d1b0 systemd[1]: isolate.service: Failed with result 'exit-code'.

But if I check the status of isolate.slice, I see that it was created in a different folder, seemingly related to Docker:

# systemctl status isolate.slice
* isolate.slice - Slice for Isolate's sandboxes
     Loaded: loaded (/etc/systemd/system/isolate.slice; static)
     Active: active since Sun 2022-12-18 16:49:22 UTC; 7s ago
      Tasks: 0
     Memory: 0B
     CGroup: /docker/f8df5033d1b02ee218e750be331e5ebe073c46d37f9a63ca5cf78d1c96c56f5f/isolate.slice

Dec 18 16:49:22 f8df5033d1b0 systemd[1]: Created slice Slice for Isolate's sandboxes.
Dec 18 16:49:22 f8df5033d1b0 isolate-cg-keeper[3780]: Cannot create subgroup /sys/fs/cgroup/isolate.slice/isolate.service/daemon: No such file or directory

I can find the folder under /sys/fs/cgroup/systemd; the full path is /sys/fs/cgroup/systemd/docker/f8df5033d1b02ee218e750be331e5ebe073c46d37f9a63ca5cf78d1c96c56f5f/isolate.slice/

Maybe this would work without Docker?

gollux commented Dec 28, 2022

Could you please try it without Docker first?

gollux commented Feb 24, 2023

I consider the cgroup v2 code almost ready now.

Among other things, the name of the cgroup is no longer hard-coded in the configuration file. Instead, isolate-cg-keeper finds out in which cgroup it was started and passes the name to isolate via /run/isolate/cgroup. Besides simplifying configuration, this should also help with running Isolate in containers.
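(Under cgroup v2, a process can discover its own cgroup by reading /proc/self/cgroup, which holds a single line of the form 0::/path; the path below is illustrative.)

cat /proc/self/cgroup
0::/isolate.slice/isolate.service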

Also, I implemented proper locking of sandboxes, so different users cannot stomp on each other's sandboxes. It also prevents --run --cg if the sandbox was not initialized with --cg, and vice versa.

There is some support for having a system-wide daemon which manages access to sandboxes. The daemon itself is not ready yet, but some rudiments can be found in the daemon branch.

I removed the --cg-timing option. We use CG-based timing whenever --cg is active. (This was the default behavior anyway, so I expect nobody was really using the option.)

gollux commented Feb 24, 2023

In the daemon branch, you will find my first attempt to create a daemon for managing sandboxes. Local users can connect to the daemon via a UNIX socket and are given fresh sandboxes to use. This allows isolate to be used by multiple programs running in parallel, possibly belonging to different system users.

You will find a sketch of documentation at the top of daemon.py. Run the daemon as root.

I will be glad for any feedback.
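A hedged sketch of connecting from a shell (the socket path here is purely hypothetical; the actual path and protocol are the ones documented at the top of daemon.py):

# Open an interactive connection to the daemon's UNIX socket:
socat - UNIX-CONNECT:/run/isolate/daemon.socket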

BhautikChudasama commented

Hi @gollux, is it compatible with CGroup v2?

gollux commented Jul 30, 2023

The version in the cg2 branch supports only CGroup v2; the version in master supports only v1.

I plan to deprecate v1 and merge cg2 into master.

BhautikChudasama commented

Thanks @gollux for the confirmation.

Bhautik0110 commented

Hi @gollux sir,
I am running your cg2 branch with judge0. I removed the older isolate and added isolate v2. Below I have attached the log. Do you have any idea why this is happening?

isolate --cg -s -b 32 -M /var/local/lib/isolate/32/metadata.txt --stderr-to-stdout -i /dev/null -t 15.0 -x 0 -w 20.0 -k 128000 -p120 --cg-timing --cg-mem=512000 -f 4096 -E HOME=/tmp -E PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" -E LANG -E LANGUAGE -E LC_ALL -E JUDGE0_HOMEPAGE -E JUDGE0_SOURCE_CODE -E JUDGE0_MAINTAINER -E JUDGE0_VERSION -d /etc:noexec --run -- /bin/bash compile > /var/local/lib/isolate/32/compile_output.txt 
Cannot write /sys/fs/cgroup/memory/box-32/tasks: No such file or directory

maxkt commented Oct 27, 2023

Hi @gollux!

First of all, thanks for the great library that has been very useful for us in implementing infrastructure for sandboxed live coding environments.

At the moment we're trying to decide whether to migrate to cgroup v2 or not. Do you think v2 is ready for production now?

Thanks.

gollux commented Oct 27, 2023

I'm already running it in production and I plan to release it soon. The only missing thing is a bit of documentation.

jwd-dev commented Nov 10, 2023

Any update on this?

bajcmartinez commented Nov 22, 2023

Hi, has anyone been able to run this under Docker? I realize I need to start the service, I'm just not sure how to do that.

Here is the simple command I'm trying to run:

# isolate --run --cg python
Cannot open /run/isolate/cgroup: No such file or directory

and without --cg it won't work unless I'm running Docker with privileges:

# isolate --run python
Cannot run proxy, clone failed: Operation not permitted

Also running the check I get the following:

# isolate-check-environment
Checking for cgroup support for memory ... CAUTION
WARNING: the memory is not present. isolate --cg cannot be used.
Checking for cgroup support for cpuacct ... CAUTION
WARNING: the cpuacct is not present. isolate --cg cannot be used.
Checking for cgroup support for cpuset ... CAUTION
WARNING: the cpuset is not present. isolate --cg cannot be used.
Checking for swap ... FAIL
WARNING: swap is enabled, but swap accounting is not. isolate will not be able to enforce memory limits.
swapoff -a
Checking for CPU frequency scaling ... SKIPPED (not detected)
Checking for Intel frequency boost ... SKIPPED (not detected)
Checking for general frequency boost ... SKIPPED (not detected)
Checking for kernel address space randomisation ... FAIL
WARNING: address space randomisation is enabled.
echo 0 > /proc/sys/kernel/randomize_va_space
Checking for transparent hugepage support ... FAIL
WARNING: transparent hugepages are enabled.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
WARNING: transparent hugepage defrag is enabled.
echo never > /sys/kernel/mm/transparent_hugepage/defrag
WARNING: khugepaged defrag is enabled.
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag

Are those failures and warnings normal? It seems like I'm doing something wrong. I'm using the cg2 branch.


UPDATE:

If I manually run isolate-cg-keeper, this is what happens:

/usr/local/sbin/isolate-cg-keeper
Cannot create subgroup /sys/fs/cgroup//daemon: Read-only file system
# isolate --cg --run python
Control group root  does not exist

Thanks!

gollux commented Nov 22, 2023

isolate-check-environment hasn't been updated for the cg2 branch yet. I hope to do it soon -- besides some documentation, it's the only roadblock on the way to merging cg2.

You probably need a privileged container (I'm not sure as I don't use Docker myself).

You certainly need systemctl start isolate.service.
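That is, after installing the unit files shipped in the systemd/ directory, something like:

# Reload unit definitions and start the cgroup keeper:
systemctl daemon-reload
systemctl start isolate.service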

bajcmartinez commented

@gollux, thanks for the quick response. I'm ignoring the check for now, but --cg won't even work with a privileged container.

Running privileged, I can successfully run the following command:

# isolate --run -- /usr/local/bin/python

however,

# isolate --cg --run -- /usr/local/bin/python

fails with

Control group root does not exist

I think it's related to the service: I can't run the service on Docker, and running the keeper manually throws:

/usr/local/sbin/isolate-cg-keeper
Cannot write to /sys/fs/cgroup//cgroup.subtree_control: Device or resource busy

Not sure what that means. I'll keep experimenting and will write an update if I find something useful.

Thanks

gollux commented Nov 22, 2023

You need to have systemd running inside the container.

bajcmartinez commented

Thanks, that won't work in my environment. I thought there might be a way around it, and perhaps there still is, so I'll keep investigating. I'm trying to run an app that executes untrusted user code on AWS, and I thought I could spin it up as a microservice on Fargate, but I don't have much control over how Docker spins up there; technically they do support cgroups v2, I just can't run the keeper as a service.

Worst case, I can deploy it to a virtual machine, but that's painful to maintain for a one-man operation hehe.

Is there a reason why you set up a new process with the keeper and not directly as part of the isolate one?

gollux commented Nov 22, 2023

> Is there a reason why you set up a new process with the keeper and not directly as part of the isolate one?

Isolate needs its own subtree in the cgroup hierarchy. On systems with systemd, we can ask systemd to delegate such a subtree to a service (and there must be a process running in the service to keep the subtree alive ... this is what the keeper process does). If you can obtain a subtree delegation in a different way, you can let Isolate use it by putting the path to the subtree in Isolate's config file.
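If you go that route, the idea is roughly the following; the config key name below is an assumption from memory, so check the sample config in the cg2 branch for the exact syntax:

# in isolate's config file (key name is an assumption, not verified):
cg_root = /path/to/delegated/subtree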

Emru1 commented Dec 13, 2023

Can't isolate use cgroupfs instead of systemd for cgroupv2?

yahya-abdul-majeed commented

> Hi @gollux sir, I am running your cg2 branch with judge0. I removed the older isolate and added isolate v2. Below I have attached the log. Do you have any idea why this is happening?
>
> isolate --cg -s -b 32 -M /var/local/lib/isolate/32/metadata.txt --stderr-to-stdout -i /dev/null -t 15.0 -x 0 -w 20.0 -k 128000 -p120 --cg-timing --cg-mem=512000 -f 4096 -E HOME=/tmp -E PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" -E LANG -E LANGUAGE -E LC_ALL -E JUDGE0_HOMEPAGE -E JUDGE0_SOURCE_CODE -E JUDGE0_MAINTAINER -E JUDGE0_VERSION -d /etc:noexec --run -- /bin/bash compile > /var/local/lib/isolate/32/compile_output.txt
> Cannot write /sys/fs/cgroup/memory/box-32/tasks: No such file or directory

Hi, make sure you initialized your sandbox with the --cg flag before running it with --cg.
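That is, keep --cg consistent across the whole sandbox lifecycle (box id 32 taken from the log above):

isolate --cg --box-id=32 --init
isolate --cg --box-id=32 --run -- /bin/true
isolate --cg --box-id=32 --cleanup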

gollux commented Feb 28, 2024

Finally merged.

gollux closed this as completed on Feb 28, 2024