[seccomp] update defaults to current docker rules (Started as: block unshare by default) #1988
Comments
@rhatdan (or anyone else) - sorry for the direct mention, could you please just tell me if there's a policy wrt using docker rules here or if I should redevelop them from scratch? I'll do this work anyway, just cannot get started if I don't know what's acceptable/desired :) |
The most important rule is that we cannot make it more restrictive as this would be a breaking change and most likely break a lot of existing users. |
I can understand where this comes from, but I strongly disagree we should not strive to make the default profile more secure. (Note it's not all black, in the differences with docker's default profile there also are some changes towards being more permissive, like allowing the bpf syscall (currently given if CAP_SYS_ADMIN is present) also if CAP_BPF is given; but I'll be honest and say blocking unshare is my main motivation here...) If you want to prioritize compatibility, something like qemu pc models could make sense: declare a new seccomp-defaults-2024 and new installations can start with it, while systems upgrading from an older system would keep the old one unless they explicitly opt in or |
SECCOMP rules are probably the hardest for people to deal with, since they are very difficult to customize. In 10 years there has been very little movement in seccomp rules for this reason. When new syscalls get added to the kernel, the seccomp rules get modified, but other than that, nothing. When seccomp blocks something, normal users react in one of two ways: either --privileged, which is worse, or --security-opt seccomp=unconfined. In very rare cases they will create a custom seccomp.json file for the container. Docker decided to block a few syscalls like unshare and mount, which are needed to run podman and other container engines within a container, a fairly common pattern. The DIND pattern tends to be volume mounting the host's /var/run/docker.sock, which does not require unshare or mount. But if you want to do PINP you need these syscalls. I don't think allowing these syscalls by default is very dangerous, since we also have limited capabilities (no SYS_ADMIN by default) and tools like SELinux to further lock things down. Also, these syscalls are available to unprivileged users by default, so they are heavily monitored for vulnerabilities. A few years ago I gave a talk about Goldilocks and the three bears, about the level of security we can achieve when uninformed users (the majority) respond to permission denied only by running --privileged (Momma Bear). While the goal is to increase security towards lockdown (Papa Bear), we need to be very careful about changing defaults.
I am not against making the profile more secure, I am against breaking people, and denying unshare will certainly have that effect.
Well, given io_uring is rather new, I don't think many applications have a hard requirement on it, and I would assume that they all have a fallback to the older I/O models. And in this case we can argue we did the right thing by never allowing them to begin with: #1264
Yes I understand it but I don't think it justifies breaking users that depend on it, i.e. nested containers.
I think having different pre-defined profiles to choose from makes a lot of sense given most people will never edit their own profiles. |
I think the best idea would be to have multiple seccomp.json files that users could easily choose from. But who is going to maintain them, what should they be called, and who is going to define them? For example, seccomp-strict.json (drop lots of questionable seccomp rules).
Bottom line is users hitting a permission denied will have a hard time understanding what is going wrong and have little recourse to get their container running. https://www.redhat.com/sysadmin/container-permission-denied-errors |
Thank you both for the thorough answers. It's getting late here so I'll check for further replies tomorrow, but it's great to discuss this.
Unfortunately, fedora with selinux enforced or debian with apparmor enforced do not block inserting netfilter rules as a user in container with
There are two sides of this coin:
That's a very good point, I didn't think of PINP... We have a podman flag to allow systemd (and try to automatically detect if it runs), perhaps we could have a similar --podman-in-podman=auto flag that allows these in this case, or just another profile for seccomp=PINP ? I also agree we want to avoid people turning everything off (which is indeed much worse), but I really don't see so many users of unshare() that we cannot work something out.
I agree the hurdle to writing one's own profile is too high; shipping multiple profiles and letting users switch more easily would be a great first step. I'll try to finish looking at what's interesting in docker's default seccomp policy and recap the changes early next week, so we can see how much tuning would make sense; I don't think there'll be much more than mount and unshare (basically stuff required for nested podman) but it'll be good to confirm.
Even as the one guilty for CVE-2023-4004 (sigh, sorry), I'm not convinced that the alternative (blocking And while @Luap99 already mentioned some concerns related to Docker-in-Docker (or Podman-in-Podman), I wanted to add a specific one: But yes, I agree that shipping multiple profiles is probably a good idea. Side note, I'm spending some effort on-and-off on a project that should make writing/deploying those profiles a bit simpler, by the way. |
I guess it's a matter of where the container stands here -- I agree that the final application can do a much better job than we can at isolating itself, but most don't even try (mostly because seccomp is so hard to use; but even if something like https://justine.lol/pledge/ was made more standard many developers just don't care so I'm not hoping much here). If the application legitimately uses unshare to isolate themselves then blocks unshare (as in So it really depends on where podman wants to place itself, I still think it makes sense to be strict about these and have a --I'm-securing-myself tag for applications (not people!) who know what they do, but given containers have no way of specifying such a tag right now I'd rather be realistic and let's start with profiles first :) |
Could you prepare a seccomp-strict.json file? Then we could package it up and allow users to choose it. Perhaps even make it easy in containers.conf to switch to it by default. Or at least tell users to copy /usr/share/containers/seccomp-strict.json to /etc/containers/seccomp.json. But I would want a lot more than just the unshare and mount syscalls turned off for a strict mode. Once this is done, it needs to be maintained. Perhaps no new syscalls are added, since that would be stricter.
Sorry for the delay! I've gone through the differences with docker first. Given that a new profile will be hard to review, I'm recapping the differences here so we can decide what we want in the default profile and what we want in a "hardened" profile. Unopinionated list:
I'd say we can do this in two steps. 1/ most of the riff raff can probably go to the main profile without much ceremony:
2/ once that's in, add a new hardened profile with the last points (clone, mount stuff, ns stuff) denied; I'll look at other commonly allowed syscalls and see if there's more to be moved to SYS_ADMIN-only. I'm not sure about "never adding more stuff", honestly that'll probably have to be done on a case by case basis, but I agree it'll have to be maintained. A very stupid test I have in some of my repos when I need to keep two files updated is to add a check that diffs the two files, and errors if that doesn't match the expected output, meaning that anyone modifying either file needs to also either update the other file or the diff -- this is a bit of a pain but these seccomp rules aren't updated often so that might work. What do you think? I'm still swamped, but will try to take time to open the first PR next week-ish. |
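The "diff the two files and compare against the expected output" check described above could look roughly like the sketch below. It is only an illustration under assumptions: the function names (`profile_diff`, `check_in_sync`) are hypothetical, the two profiles are toy dictionaries rather than the real seccomp.json contents, and in a real repository the recorded diff would live in a checked-in file instead of being computed inline.

```python
import copy
import difflib
import json


def profile_diff(default_profile, hardened_profile):
    """Return a unified diff (list of lines) between two profiles.

    Both profiles are serialized with sorted keys so that the diff does not
    depend on key order in the source files.
    """
    a = json.dumps(default_profile, indent=2, sort_keys=True).splitlines()
    b = json.dumps(hardened_profile, indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(a, b, "default", "hardened", lineterm=""))


def check_in_sync(default_profile, hardened_profile, recorded_diff):
    """Fail if the current diff no longer matches the recorded one."""
    actual = profile_diff(default_profile, hardened_profile)
    if actual != recorded_diff:
        raise AssertionError(
            "profiles drifted: update the other profile or the recorded diff"
        )


# Toy stand-ins for the default and hardened profiles (assumed layout).
default = {"syscalls": [{"names": ["read", "unshare"], "action": "SCMP_ACT_ALLOW"}]}
hardened = {"syscalls": [{"names": ["read"], "action": "SCMP_ACT_ALLOW"}]}

# Hypothetical: in practice this would be loaded from a checked-in file.
recorded = profile_diff(default, hardened)

# Passes while both files and the recorded diff agree.
check_in_sync(default, hardened, recorded)

# Someone adds a syscall to the default profile only: the check now fails.
drifted = copy.deepcopy(default)
drifted["syscalls"][0]["names"].append("openat2")
try:
    check_in_sync(drifted, hardened, recorded)
    drift_detected = False
except AssertionError:
    drift_detected = True
```

The point of recording the whole diff rather than just a list of removed syscalls is that any change to either file forces a conscious decision about the other one, at the cost of some churn when the profiles are edited.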
SGTM |
By the way, not coming back on the current short term plan to make a hardened filter for distros to ship as easier-to-use-than-rewriting-from-scratch examples, but when working on this I've come to notice again systemd's SystemCallFilter directive and I really like the grouping work they've done (e.g. I still think that long term something like flatpaks, where the containers can request the access rights they need and a user could override this at runtime, would be great. Perhaps we could abuse labels? I'm not familiar enough with the container world here: is there something such as "well known labels" that already works kind of like this, or would it be more appropriate to add a different category of metadata? Also, where would such an idea need to be discussed to involve not just podman but other participants of the OCI container standard? Until then I'll keep working on a hardened profile as time allows, thanks for the reviews on the first PR
Since almost every public image is tested to work with docker, could you please also consider adding seccomp-docker as an option? Right now a user would have to download docker's seccomp.json from the internet and point to it in containers.conf. I'm hoping for an easier flow than that. Hopefully there's no maintenance overhead since it'll be docker maintaining the rules. |
PRs welcome. |
OK. I'll wait to see how you implement the feature "to have multiple seccomp.json files that users could easily choose from", and then attempt a PR for the docker option. Hopefully at that point it will be a trivial addition. |
Hi,
I've noticed that on my systems (fedora, debian, alpine) it's possible to get network admin privileges in a user namespace within a container:
I'd have expected this to be blocked, and looking at the git history there was some attempt at making unshare only allowed for containers with CAP_SYS_ADMIN but for some reason it was also duplicated and allowed in the general case, and that got cleaned up "the wrong way" in bf297c1
I've checked the latest docker rules (running docker), and they block unshare properly by default, so it looks like a case of it's the Right Thing to Do (it's not blocked for sys admin containers, as we originally had)
At this point I checked their seccomp rules and there are quite a few other changes -- I believe https://github.com/containers/common/blob/main/pkg/seccomp/default_linux.go was originally based on https://github.com/mody/moby/blob/main/profiles/seccomp/default_linux.go , but the docker one is considerably stricter and allows more syscalls only when certain caps are given.
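To make the "duplicated and allowed in the general case" problem concrete, here is a minimal sketch of what removing the unconditional allow would mean at the profile level. It assumes the JSON layout used by seccomp.json profiles (a `syscalls` list of rules with `names`, `action`, and an optional `includes.caps` condition); the helper names and the toy profile are hypothetical and are not taken from default_linux.go.

```python
import copy


def is_unconditional_allow(rule):
    """True if the rule allows its syscalls regardless of caps or arguments."""
    return (
        rule.get("action") == "SCMP_ACT_ALLOW"
        and not rule.get("includes")
        and not rule.get("excludes")
        and not rule.get("args")
    )


def allowed_without_caps(profile):
    """Set of syscalls allowed by unconditional rules."""
    names = set()
    for rule in profile.get("syscalls", []):
        if is_unconditional_allow(rule):
            names.update(rule.get("names", []))
    return names


def drop_unconditional_allow(profile, names):
    """Return a copy where `names` are removed from unconditional allow rules.

    Capability-gated rules are left untouched, so a syscall that is also
    listed under e.g. CAP_SYS_ADMIN stays available to such containers.
    """
    result = copy.deepcopy(profile)
    blocked = set(names)
    kept = []
    for rule in result.get("syscalls", []):
        if is_unconditional_allow(rule):
            rule["names"] = [n for n in rule.get("names", []) if n not in blocked]
            if not rule["names"]:
                continue  # drop rules that became empty
        kept.append(rule)
    result["syscalls"] = kept
    return result


# Toy profile (assumption): unshare appears both unconditionally and cap-gated.
toy_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "unshare"], "action": "SCMP_ACT_ALLOW"},
        {
            "names": ["unshare", "mount"],
            "action": "SCMP_ACT_ALLOW",
            "includes": {"caps": ["CAP_SYS_ADMIN"]},
        },
    ],
}

fixed_profile = drop_unconditional_allow(toy_profile, ["unshare"])
```

With the toy input, `unshare` ends up reachable only through the CAP_SYS_ADMIN rule, which is the behaviour described for docker's rules above; whether that is the right default here is exactly the question of this issue.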
I've started updating the file locally, but before I spend more time on this:
Thanks!