Sandbox all document processing in gVisor #590
Thanks for the contribution! We are just wrapping up a release and I'll come back to this after we're done with that. |
As noted in #589 (comment), the usage of Docker is problematic. So I have updated this PR to use gVisor with Podman rather than Docker. Let me know if this is more interesting, and I can add tests, QA instructions, and better documentation. |
Thanks for pivoting from Docker to Podman. I think we now have a clearer path for adding gVisor support in Dangerzone. So, to answer your question, yes, feel free to polish this PR with tests and instructions. Personally, gVisor support is something that I have really wanted us to tackle (#126) for a long time. The fact that we can now have this support for Linux, albeit optional, is great. But my aspiration, and it shouldn't affect this PR btw, is to include gVisor support across all platforms. That is, running This effort is currently blocked until we have more input in google/gvisor#8205. But once we tackle #443 and we no longer need mounted directories, we can make progress on this front. Until then, this PR is super welcome, and thanks for spearheading this effort 🙂 |
Thanks for the details; great to see there's already been lots of investigatory work on this. I actually had independently tried to implement it via the "in-container The only remaining problems with that approach I had were around file permissions and ownership of the files written back to the out-of-sandbox filesystem (fixable but probably requires me to dig more into where such fix-ups would need to occur), but also how Alternatively, manually creating an OCI container spec and starting it within runsc-in-Podman by using manual invocations of If that sounds like an interesting direction, then let me know and I can go that way rather than the "just make it work in Linux with Podman" approach. |
Nice to see that your exploration has crossed paths with @apyrgio's.
This won't be a problem once we fix #443 (which we probably need to do for the next release anyway, since we want the Qubes code to be shared as much as possible with the containers code). |
Hey, sorry for the delay @EtiennePerot, life got a bit in the way 😉
Hey, that's great!
I was thinking about this backwards, that the main sandbox would be the Docker/Podman runtime, and gVisor would supplement it with its kernel protection capabilities. The way you think about this makes sense though, having a very strong inner sandbox, and using the outer sandbox for cross-platform support. Before going down the full privileged container route, I'd try to figure out which capabilities are necessary to make this work, so that we minimize the blast radius in case of a
Quick note regarding this: Since #443 is a blocker for oh-so-many-things, we've already started working on this and will soon have a prototype. I'd suggest you consider this as a non-issue, i.e., that we won't mount any directories into the container.
It very much is 🙂 . If you're up for it, feel free to let us know. Else, we can try to tackle it once we're done with #443, and some Qubes fixes. In any case, thanks a lot for your persistence on this issue and sharing your gVisor knowledge. |
Another update: it seems that gVisor will soon have the ability to run within rootless Podman, which will simplify things a lot for Dangerzone. @EtiennePerot, sharing this in case you're still interested. |
Yes, I've been following that :) Still planning on getting around to this PR. |
Uploaded a new commit that uses the approach of putting gVisor inside the Dangerzone container image, and uses it as a wrapper regardless of the container runtime. It works with both Podman and Docker, although I have only tested on Linux and I am not sure how file permissions with Docker volumes work on non-Linux platforms. Also, the tests will fail because the latest gVisor release does not yet include google/gvisor@88f7bb6. I have tested it against a freshly-compiled gVisor binary and all tests pass on my end. I expect google/gvisor@88f7bb6 will be part of next week's gVisor release. Please read the long comment in Let me know what you think! |
Woah, that's exciting! We're currently in the midst of releasing Dangerzone 0.6.0 so I can't take a proper look right now, but I promise to do so as soon as possible. One quick comment I have while skimming through the comments for the entrypoint is this: I understand that the
So, in a nutshell, I believe that we can simplify things more, by targeting just the first container, and not worry about mounts. But I'll have more concrete things to comment on once we're done with 0.6.0. [1] Well, there's always the scenario where the code that converts pixels to PDFs is subject to an RCE, but that's a much narrower attack surface, and we can't effectively defend against it. In Qubes, for instance, this part runs on the user's host. |
Thanks for the details. I believe removing the need to have the Part of my thought process while building this was actually "can we just not have this mount at all and instead pass the files back and forth via something like a tarball over stdin/stdout?", but I guess simply not running the pixels-to-PDF part in a container solves that problem too. |
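As an illustration of the "tarball over stdin/stdout" idea floated above, here is a minimal sketch of packing files into an in-memory tar archive and unpacking them on the other side. It is not what this PR implements; the function names and the example file names are hypothetical, and a real implementation would also have to treat the archive coming out of the sandbox as untrusted input.

```python
import io
import tarfile
from typing import Dict


def pack_files(files: Dict[str, bytes]) -> bytes:
    """Pack {name: content} into an uncompressed tar archive held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack_files(blob: bytes) -> Dict[str, bytes]:
    """Read regular files back out of a tar archive without touching the disk."""
    out: Dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip symlinks, devices, directories, etc.
            # Reject absolute paths and parent-directory references.
            if member.name.startswith("/") or ".." in member.name.split("/"):
                raise ValueError(f"suspicious member name: {member.name!r}")
            handle = tar.extractfile(member)
            if handle is not None:
                out[member.name] = handle.read()
    return out
```

The bytes returned by `pack_files` could be written to a process's stdin (or read from its stdout), which would avoid sharing a mounted directory between the host and the sandbox.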
Alright, I looked more carefully into the PR. I have several questions, some of those are just basic gVisor questions, and some apply to Dangerzone specifically. Here goes:
|
@EtiennePerot kind ping on the above questions, so that we don't lose context. |
Hey, sorry I missed the set of questions.
But beyond the OCI differences (which I think are worth digging into), the main reason is simply supportability. (Well, technically the OCI spec only specifies In terms of practical OCI spec differences:
I think the main thing is, well, that the inner layer now has gVisor in the middle. I don't want to sound too salesman-y, but gVisor emulates a full-fledged POSIX kernel implementation. It's not just about one security measure being moved to one or the other layer; for some of them, it's effectively doubling them up; there are now two distinct implementations of the same security measure that need to simultaneously be exploited in order to breach through. For example, if the sandboxed workload is running as a non-root user inside the gVisor sandbox, and the gVisor sandbox is itself running as an unprivileged user on the host, then for the sandboxed workload to escalate to root on the host, they would need to have a user escalation exploit for both kernels (and, since those two kernels don't share code, the same exploit generally won't simultaneously work against both). Same thing for filesystem-level isolation: that security measure exists at both layers (technically there are actually three levels of filesystem-level isolation: gVisor's implementation, then the fact that the gVisor process places itself into a chroot + Linux mount namespace that only has the minimum possible from the host filesystem exposed, and then there's Docker/Podman's own filesystem isolation). Similar situation for PID namespaces: there's gVisor's own in-sandbox process tree, there's the fact that the gVisor process isolates itself in its own Linux PID namespace (as specified in the OCI spec), and there's the fact that the Docker/Podman container itself is running in a dedicated Linux PID namespace. Same kind of thing for the other types of namespaces. More generally speaking, it takes at least two unique kernel vulnerabilities (one specifically against the gVisor kernel, plus one specifically against the Linux kernel) in order to fully break out onto the host system. On Windows/OSX, it takes one gVisor kernel exploit plus a VM escape exploit to fully break out.
There are some layers that aren't "doubled up" in this manner, like the seccomp one you point out, although actually it would be possible to add a seccomp filter enforced within the gVisor sandbox (using gVisor's In terms of thinking of security responsibilities (i.e. what each layer "ought to" provide), at a high level, I think the framing in an earlier comment is pretty well formulated: the outer container's main responsibility is to act as a "platform compatibility" solution, whereas the inner container's responsibility is solely security. The outer layer's own security measures (e.g. non-privileged user, filesystem isolation, PID namespace, etc.) can be seen as just an added security bonus, so long as they don't interfere with the inner layer's ability to work properly (hence, as an example, the need to remove the Docker-level seccomp filter).
The user namespace handling is probably the most complicated part of the current implementation, because of the need to preserve UIDs on Linux so that files in the However, if, as per the above discussion, we can get rid of this, then all of the current user namespace stuff can be simplified and further locked down to have no relationship with any existing user on the host system. This means we could create a user that only exists inside the outer container, and run the gVisor process as that user in a user namespace that exposes no other user. (On the initial host user namespace, I believe it would appear as an unnamed user with a UID that isn't in Then, on top of that, since gVisor is its own kernel and thus implements its own notion of users and user namespaces, the workload within the sandbox can itself run in an in-sandbox user namespace that doesn't have a mapping to the host user (i.e. to the user that the gVisor process runs as). (The current implementation of this PR kind of does this already. It has two in-sandbox users: the |
Thanks Etienne for answering all my questions in great detail. Not only am I covered, but I think we have enough material to update the parent issue, and write down a design document. I plan to follow up on the above on Monday, and maybe offer some next steps. My guess is that our lives will be much easier once we've tackled #625, so I'll make sure to prioritize it next week. |
Sounds good. One small question: which issue do you mean by "updating the parent issue"? I agree that addressing #625 first makes sense, otherwise this PR would add temporary complexity that doesn't need to ever exist if #625 is addressed first. If you wish, I can already start simplifying this PR to what it would look like if it only needs to support PDF-to-pixels conversion. |
I was referring to this issue: #126. It doesn't have the context that this discussion has, so I'd like to move some there, for future reference.
Sure, if it's not too much of a hassle for you. I don't expect we'll have many more architectural changes in the near future, so you should be good to go. The only relevant thing I can think of is that we'll experiment with switching to a Debian image soon, but I think this should not affect this discussion. |
Quick update here. I actually prioritized implementing the on-host pixels to PDF conversion PR (#748), which is a prerequisite for vastly simplifying this one. Now that it's out, I'll follow up here soon. |
Per discussion on freedomofpress#590, the need for this volume will soon go away. This makes gVisor integration much easier, because it removes the need to preserve file access and ownership of the files in this volume from within the gVisor sandbox. The `/sandboxed_entrypoint.sh` file is no longer necessary, and the `/entrypoint.py` file is massively simplified. This also allows the use of `--userns=nomap` in Podman.
Yeah, I'd prefer if my changes were squashed into your commits, so that our iterations are not shown in our Git history. I propose the following:
This way, you can look both at the delta, and the final branch.
Nice! |
Done, I have added a few more commits that address your review comments and polish the code a bit.
Done, this branch is equivalent with the Next steps would be to:
|
See [this comment](freedomofpress/dangerzone#590 (comment)) for context. PiperOrigin-RevId: 641569843
This turns on optimizations for release builds. Detected in [this comment](freedomofpress/dangerzone#590 (comment)). PiperOrigin-RevId: 642076108
This turns on optimizations for release builds. Detected in [this comment](freedomofpress/dangerzone#590 (comment)). PiperOrigin-RevId: 642129920
Thanks, I have incorporated |
Add a design document for the gVisor integration, which is currently under review. The associated pull request has lots of architectural discussions about integrating gVisor, so in this document we collect them all in one place. Refs #590
Our logic for detecting the appropriate Tesseract data directory should also take into account the canonical envvar, if explicitly passed.
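A minimal sketch of what such detection could look like, assuming the "canonical envvar" is Tesseract's `TESSDATA_PREFIX`. The function name and the fallback directories below are hypothetical and are not taken from the Dangerzone code base.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Sequence

# Hypothetical fallback locations; real distributions may use other paths.
DEFAULT_CANDIDATES = (
    "/usr/share/tessdata",
    "/usr/share/tesseract-ocr/tessdata",
)


def find_tessdata_dir(
    env: Mapping[str, str] = os.environ,
    candidates: Sequence[str] = DEFAULT_CANDIDATES,
) -> Optional[Path]:
    """Return the Tesseract data directory, preferring an explicit envvar."""
    # Assumption: TESSDATA_PREFIX is the envvar the commit message refers to.
    explicit = env.get("TESSDATA_PREFIX")
    if explicit:
        return Path(explicit)
    # Otherwise fall back to the first candidate directory that exists.
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None
```

Passing the environment and the candidate list as parameters keeps the lookup easy to exercise in tests without modifying the real process environment.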
Get the (major, minor) parts of the Docker/Podman version, to check if some specific features can be used, or if we need a fallback. These features are related to the upcoming gVisor integration, and will be added in subsequent commits.
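As a rough sketch of the parsing step only, the helper below extracts a (major, minor) tuple from a version string. The function name is hypothetical, and how the version string is obtained from Docker or Podman is deliberately left out.

```python
import re
from typing import Tuple

# Matches the first "<digits>.<digits>" pair anywhere in the string.
_VERSION_RE = re.compile(r"(\d+)\.(\d+)")


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from strings such as '4.9.3' or '24.0.7-ce'."""
    match = _VERSION_RE.search(version)
    if match is None:
        raise ValueError(f"cannot parse a version from {version!r}")
    return int(match.group(1)), int(match.group(2))
```

Returning a tuple of integers allows ordinary comparisons such as `parse_major_minor(v) >= (4, 0)` when deciding between a feature and its fallback.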
Add Podman's default seccomp policy as of 2024-06-10 [1]. This policy will be used in subsequent commits on platforms with Podman version 3, whose seccomp policy does not allow the `ptrace()` syscall. [1] https://github.com/containers/common/blob/d3283f8401eeeb21f3c59a425b5461f069e199a7/pkg/seccomp/seccomp.json
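One way the version check could translate into command-line arguments is sketched below. It assumes the bundled policy is passed through the runtime's `--security-opt seccomp=<path>` option; the function name, the `runtime` strings, and the policy path are hypothetical, and the actual commits may structure this differently.

```python
from typing import List, Tuple


def seccomp_args(runtime: str, version: Tuple[int, int], policy_path: str) -> List[str]:
    """Return extra `run` arguments that select a custom seccomp policy.

    Only Podman 3.x gets the bundled policy, since (per the commit message)
    its default policy does not allow the ptrace() syscall.
    """
    major, _minor = version
    if runtime == "podman" and major == 3:
        return ["--security-opt", f"seccomp={policy_path}"]
    # Other runtimes and versions keep whatever policy they already use.
    return []
```

For example, `seccomp_args("podman", (3, 4), "/path/to/seccomp.json")` yields the two extra arguments, while the same call with `(4, 9)` yields an empty list.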
This wraps the existing container image inside a gVisor-based sandbox. gVisor is an open-source OCI-compliant container runtime. It is a userspace reimplementation of the Linux kernel in a memory-safe language. It works by creating a sandboxed environment in which regular Linux applications run, but their system calls are intercepted by gVisor. gVisor then redirects these system calls and reinterprets them in its own kernel. This means the host Linux kernel is isolated from the sandboxed application, thereby providing protection against Linux container escape attacks. It also uses `seccomp-bpf` to provide a secondary layer of defense against container escapes. Even if its userspace kernel gets compromised, attackers would have to additionally have a Linux container escape vector, and that exploit would have to fit within the restricted `seccomp-bpf` rules that gVisor adds on itself. Fixes freedomofpress#126 Fixes freedomofpress#224 Fixes freedomofpress#225 Fixes freedomofpress#228
The changes look good, Etienne, thanks a lot. As for the signatures, I'm having a hard time retaining the original ones prior to the merge. For instance, I had to rebase this PR on top of some recent commits on I'm waiting for the CI tests to pass one last time, and then I'll merge this PR. |
Thank you for all the work on reviewing this and the related fixes! |
See [this comment](freedomofpress/dangerzone#590 (comment)) for context. PiperOrigin-RevId: 643463161
When running on Linux, Dangerzone currently uses Podman with its default `crun`/`runc` runtime. These runtimes rely on Linux's built-in containerization primitives (namespaces). These parts of the Linux kernel have historically been the target of many container escape vulnerabilities. This is due to the fact that the Linux host kernel is fully exposed to the application running inside the container.
In particular, PDF and PostScript libraries such as Ghostscript have been notorious for having been targeted to run precisely this type of exploit. For this reason, while running these tools within Linux containers is better than running them directly on the host, it does not fully shield the host kernel from the malicious code that may be running in the container.
This pull request implements optional support for the gVisor container runtime, called `runsc` ("run sandboxed container"). gVisor is an open-source OCI-compliant container runtime. It is a userspace reimplementation of the Linux kernel in a memory-safe language.
It works by creating a sandboxed environment in which regular Linux applications run, but their system calls are intercepted by gVisor. gVisor reinterprets these system calls using the logic in its own kernel written in Go, and responds to the system call by itself, rather than passing it onto the host Linux kernel. This means the host Linux kernel is isolated from the sandboxed application, thereby providing a significant level of protection against Linux container escape attacks.
gVisor further hardens itself by using the typical container primitives (isolating its own view of the host filesystem, running in the various types of namespaces that Linux supports), and also sets a restrictive `seccomp-bpf` policy that only allows basic system calls through. This way, even if its userspace kernel were to get compromised, attackers would have to additionally have a "typical" Linux container escape vector, and that exploit would have to fit within the restricted `seccomp-bpf` rules that gVisor adds on itself.
This provides a level of protection comparable to a hardened hypervisor running workloads in a VM. However, gVisor doesn't actually use virtualization, so it is portable to all Linux environments and doesn't require virtualization support. It runs on x86 and ARM.
The initial commit of this pull request only adds support for using it inside `isolation_provider/container.py`. If there is appetite for this runtime, I'm happy to add CircleCI tests to integrate it better. Let me know what you think!
Fixes #126
Fixes #224
Fixes #225
Fixes #228