
Sandbox all document processing in gVisor #590

Merged
4 commits merged into freedomofpress:main on Jun 12, 2024

Conversation

EtiennePerot
Contributor

@EtiennePerot EtiennePerot commented Oct 8, 2023

When running on Linux, Dangerzone currently uses Podman with its default crun/runc runtime. These runtimes rely on Linux's built-in containerization primitives (namespaces). These parts of the Linux kernel have historically been the target of many container escape vulnerabilities, because the host Linux kernel is fully exposed to the application running inside the container.

In particular, PDF and PostScript libraries such as Ghostscript have been notorious targets for precisely this type of exploit. For this reason, while running these tools within Linux containers is better than running them directly on the host, it does not fully shield the host kernel from malicious code running in the container.

This pull request implements optional support for the gVisor container runtime, called runsc ("run sandboxed container"). gVisor is an open-source OCI-compliant container runtime. It is a userspace reimplementation of the Linux kernel in a memory-safe language.

It works by creating a sandboxed environment in which regular Linux applications run, but their system calls are intercepted by gVisor. gVisor reinterprets these system calls using the logic in its own kernel written in Go, and responds to each system call itself rather than passing it on to the host Linux kernel. This means the host Linux kernel is isolated from the sandboxed application, thereby providing a significant level of protection against Linux container escape attacks.

gVisor further hardens itself by using the typical container primitives (isolating its own view of the host filesystem, running in the various types of namespaces that Linux supports), and also sets a restrictive seccomp-bpf policy that only allows basic system calls through. This way, even if its userspace kernel were to get compromised, attackers would additionally need a "typical" Linux container escape vector, and that exploit would have to fit within the restricted seccomp-bpf rules that gVisor applies to itself.

This provides a level of protection comparable to a hardened hypervisor running workloads in a VM. However, gVisor doesn't actually use virtualization, so it is portable to all Linux environments and doesn't require virtualization support. It runs on x86 and ARM.

The initial commit of this pull request only adds support for using it inside isolation_provider/container.py. If there is appetite for this runtime, I'm happy to add CircleCI tests to integrate it better. Let me know what you think!
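To illustrate the general idea, here is a minimal, hypothetical sketch (not the code in this PR) of how a Podman invocation could opt into runsc when the binary is installed; the image name and extra flags are illustrative assumptions:

```python
import shutil
import subprocess

def podman_run_args(image: str) -> list[str]:
    """Build a `podman run` command, preferring gVisor's runsc when installed."""
    args = ["podman", "run", "--rm", "--network=none"]
    runsc = shutil.which("runsc")
    if runsc:
        # Podman accepts the name or path of any OCI-compliant runtime.
        args += ["--runtime", runsc]
    return args + [image]

# Illustrative image name; the real invocation carries more arguments.
subprocess.run(podman_run_args("dangerzone.rocks/dangerzone"), check=True)
```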

Fixes #126
Fixes #224
Fixes #225
Fixes #228

@deeplow
Contributor

deeplow commented Oct 11, 2023

Thanks for the contribution! We are just wrapping up a release and I'll come back to this after we're done with that.

@EtiennePerot
Contributor Author

As noted in #589 (comment), the usage of Docker is problematic. So I have updated this PR to use gVisor with Podman rather than Docker. Let me know if this is more interesting, and I can add tests, QA instructions, and better documentation.

@apyrgio
Contributor

apyrgio commented Oct 16, 2023

Thanks for pivoting from Docker to Podman. I think we now have a clearer path for adding gVisor support in Dangerzone. So, to answer your question, yes, feel free to polish this PR with tests and instructions.

Personally, gVisor support is something that I have really wanted us to tackle (#126) for a long time. The fact that we can now have this support for Linux, even if optional, is great. But my aspiration, and it shouldn't affect this PR btw, is to include gVisor support across all platforms. That is, running runsc within the container itself, so that users on macOS and Windows (which are the main platforms that journalists use) can also be protected.

This effort is currently blocked until we have more input in google/gvisor#8205. But once we tackle #443 and we no longer need mounted directories, we can make progress on this front. Until then, this PR is super welcome, and thanks for spearheading this effort 🙂

@EtiennePerot
Contributor Author

EtiennePerot commented Oct 18, 2023

Thanks for the details; great to see there's already been lots of investigatory work on this.

I actually had independently tried to implement it via the "in-container runsc do" mechanism you mention; it would indeed mean better security on Windows/Mac, because it would protect the two containers from each other (otherwise, AIUI, they run in the same VM that Docker Desktop manages). I got it to mostly work with runsc do; the caveat was that instead of using rootless runsc in an unprivileged container, it runs it as root in a privileged container, and wraps the command that runs inside the gVisor sandbox by prefixing it with sudo -u dangerzone --. This sounds scary, but I actually don't think it's such a problem, given that the security boundary would move to gVisor rather than the container boundary: with such a setup, Docker/Podman would no longer act as a security boundary, and would instead just fulfill the role of a cross-platform software portability solution of sorts.

The only remaining problems I had with that approach were around file permissions and ownership of the files written back to the out-of-sandbox filesystem (fixable, but probably requires me to dig more into where such fix-ups would need to occur), and the fact that runsc do doesn't offer good control over which directories in its sandbox are writable back to the host and which aren't. Relying on gVisor's enforcement of file permissions would work but seems sub-optimal. #443 sounds like it partially solves this, but it would also be nice to just support this better in runsc itself.

Alternatively, manually creating an OCI container spec and starting it within runsc-in-Podman via manual invocations of runsc create + runsc start would allow fine-grained control over mounted directories without needing runsc modifications. runsc do is basically just a helper command that does exactly that; it just doesn't have command-line flags to control the "volumes" part of that container spec.
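To make that concrete, here is a rough, hypothetical sketch of driving runsc with a hand-written spec; the bundle path, container ID, and spec contents are illustrative assumptions rather than a finished design:

```python
import json
import pathlib
import subprocess

bundle = pathlib.Path("/var/run/dangerzone-bundle")  # assumed bundle directory containing rootfs/

# A deliberately small OCI spec: only the command to run and the mounts we
# explicitly choose, instead of the whole host filesystem that `runsc do`
# exposes by default.
spec = {
    "ociVersion": "1.0.0",
    "process": {
        "args": ["/usr/bin/python3", "-m", "dangerzone.conversion.doc_to_pixels"],
        "cwd": "/",
        "user": {"uid": 1000, "gid": 1000},
        "env": ["PYTHONPATH=/opt/dangerzone"],
    },
    "root": {"path": "rootfs", "readonly": True},
    "mounts": [
        {"destination": "/tmp", "type": "tmpfs", "source": "tmpfs"},
    ],
}
(bundle / "config.json").write_text(json.dumps(spec))

# Standard OCI lifecycle: create the sandbox, then start the workload in it.
subprocess.run(["runsc", "--rootless", "--network=none",
                "create", "--bundle", str(bundle), "dangerzone-sandbox"], check=True)
subprocess.run(["runsc", "--rootless", "--network=none",
                "start", "dangerzone-sandbox"], check=True)
```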

If that sounds like an interesting direction, then let me know and I can go that way rather than the "just make it work in Linux with Podman" approach.

@deeplow
Contributor

deeplow commented Oct 19, 2023

Nice to see that your exploration has crossed paths with @apyrgio's.

The only remaining problems with that approach I had were around file permissions and ownership of the files written back to the out-of-sandbox filesystem (fixable but probably requires me to dig more into where such fix-ups would need to occur)

This won't be a problem once we fix #443 (which we probably need to do for the next release anyway, since we want the Qubes code to share as much as possible with the container code).

@apyrgio
Contributor

apyrgio commented Nov 10, 2023

Hey, sorry for the delay @EtiennePerot, life got a bit in the way 😉

I actually had independently tried to implement it via the "in-container runsc do" mechanism you mention

Hey, that's great!

with such a setup, the role of Docker/Podman would no longer be that of a security boundary, and instead just fulfill the role of a cross-platform software portability solution of sorts.

I was thinking about this backwards: that the main sandbox would be the Docker/Podman runtime, and gVisor would supplement it with its kernel-protection capabilities. The way you think about this makes sense though: having a very strong inner sandbox, and using the outer sandbox for cross-platform support.

Before going down the full privileged container route, I'd try to figure out which capabilities are necessary to make this work, so that we minimize the blast radius in case of a runsc escape. One would argue that the blast radius would be the same as in the case of a runc escape of course, but a little defense in depth here wouldn't hurt.

The only remaining problems with that approach I had were around file permissions and ownership of the files written back to the out-of-sandbox filesystem (fixable but probably requires me to dig more into where such fix-ups would need to occur),

Quick note regarding this: Since #443 is a blocker for oh-so-many-things, we've already started working on this and will soon have a prototype. I'd suggest you consider this as a non-issue, i.e., that we won't mount any directories into the container.

If that sounds like an interesting direction, then let me know and I can go that way rather than the "just make it work in Linux with Podman" approach.

It very much is 🙂 . If you're up for it, feel free to let us know. Else, we can try to tackle it once we're done with #443, and some Qubes fixes. In any case, thanks a lot for your persistence on this issue and sharing your gVisor knowledge.

@apyrgio
Contributor

apyrgio commented Dec 5, 2023

Quick update here: we have some PRs under way (see #622, #627) that fix #443, at least for the first stage of the conversion. We can start experimenting with gVisor on top of them, but this task will be much clearer once the PRs are merged.

@apyrgio
Contributor

apyrgio commented Dec 11, 2023

Another update: it seems that gVisor will soon have the ability to run within rootless Podman, which will simplify things a lot for Dangerzone. @EtiennePerot, sharing this in case you're still interested.

@EtiennePerot
Contributor Author

Yes, I've been following that :)

Still planning on getting around to this PR.

@EtiennePerot
Contributor Author

Uploaded a new commit that uses the approach of putting gVisor inside the Dangerzone container image, and uses it as a wrapper regardless of the container runtime. It works with both Podman and Docker, although I have only tested on Linux and I am not sure how file permissions with Docker volumes work on non-Linux platforms.

Also, the tests will fail because the latest gVisor release does not yet include google/gvisor@88f7bb6. I have tested it against a freshly-compiled gVisor binary and all tests pass on my end. I expect google/gvisor@88f7bb6 will be part of next week's gVisor release.

Please read the long comment in dangerzone/gvisor_wrapper/entrypoint.py which explains the approach and why it works the way it does... It is difficult to get all the permissions lined up properly.
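For context, the general shape of such a wrapper is to write an OCI spec for the inner sandbox and then hand the process over to runsc, so the conversion runs under gVisor regardless of whether the outer runtime is Podman or Docker. A heavily simplified, hypothetical sketch (not the actual entrypoint.py, and assuming a pre-built bundle directory):

```python
#!/usr/bin/env python3
# Hypothetical, simplified wrapper sketch (not the real entrypoint.py): the
# outer container starts this script, which replaces itself with runsc.
# stdin/stdout stay attached, so the document still flows in over stdin and
# the pixel data flows back out over stdout.
import os

BUNDLE = "/home/dangerzone/bundle"  # assumed to already contain config.json and rootfs/

os.execvp("runsc", ["runsc", "--rootless", "--network=none",
                    "run", "--bundle", BUNDLE, "dangerzone-sandbox"])
```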

Let me know what you think!

@EtiennePerot EtiennePerot changed the title Support for gVisor container runtime Sandbox all document processing in gVisor Feb 26, 2024
@apyrgio
Contributor

apyrgio commented Feb 26, 2024

Woah, that's exciting! We're currently in the midst of releasing Dangerzone 0.6.0 so I can't take a proper look right now, but I promise to do so as soon as possible.

One quick comment I have while skimming through the comments for the entrypoint is this: I understand that the /safezone mount is a factor that makes things more complicated. In practice though, it shouldn't affect the gVisor work. There are two reasons for that:

  1. In practice, we want to use gVisor for the sanitization that takes place in the first container. This container is the most sensitive one, since it's the only one [1] that is affected by RCEs. The good thing about this container is that we don't mount anything to it. It just reads the document from stdin and writes the pixels to stdout (see the sketch after this list).
  2. The /safezone mount point will soon be removed, along with the second container. We are now very close to running the conversion of pixels to PDF on the user's host, which has several important benefits (smaller container image size, faster OCR, ability to update the container image from an image repository etc.)
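As an illustration of the stdin/stdout flow in point 1, here is a hedged host-side sketch; the image name, file name, and flags are assumptions for illustration:

```python
import subprocess

cmd = [
    "podman", "run", "--rm", "-i", "--network=none",
    "dangerzone.rocks/dangerzone",  # image name assumed for illustration
    "/usr/bin/python3", "-m", "dangerzone.conversion.doc_to_pixels",
]

# Pipe the untrusted document in over stdin and read the pixel stream back
# over stdout; no host directories are mounted into the container.
with open("untrusted.pdf", "rb") as doc:
    result = subprocess.run(cmd, stdin=doc, stdout=subprocess.PIPE, check=True)

pixels = result.stdout  # consumed by the pixels-to-PDF step
```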

So, in a nutshell, I believe that we can simplify things further by targeting just the first container, and not worry about mounts. But I'll have more concrete things to comment on once we're done with 0.6.0.

[1] Well, there's always the scenario where the code that converts pixels to PDFs is subject to an RCE, but that's a much narrower attack surface, and we can't effectively defend against it. In Qubes, for instance, this part runs on the user's host.

@EtiennePerot
Contributor Author

EtiennePerot commented Feb 27, 2024

Thanks for the details. I believe removing the need to have the /safezone mount would indeed simplify things a lot; specifically, it removes the need to have precise UID:GID alignment in the unsandboxed entrypoint, and it totally removes the need for the sandboxed entrypoint, as well as the three capabilities granted inside the gVisor sandbox.

Part of my thought process while building this was actually "can we just not have this mount at all and instead pass the files back and forth via something like a tarball over stdin/stdout?", but I guess simply not running the pixels-to-PDF part in a container solves that problem too.

@apyrgio
Contributor

apyrgio commented Feb 29, 2024

Alright, I looked more carefully into the PR. I have several questions, some of those are just basic gVisor questions, and some apply to Dangerzone specifically. Here goes:

  1. Assuming that we use runsc only on the first container, what is the practical difference between using runsc do and runsc run?

    Context: The reason I'm asking is that the following invocation:

    runsc --rootless --network=none do env PYTHONPATH=/opt/dangerzone \
       python3 -m dangerzone.conversion.doc_to_pixels
    

    works when I test things locally, and in the spirit of keeping things as simple as possible, it's very appealing. However, I do read that this option is for testing only, so I'm worried there's a footgun that I'm missing.

    (I've seen your OCI config btw, and I already spot some differences, but I'm mainly asking if there's anything foundational that I'm missing)

  2. Before including gVisor support, we should clarify what security guarantees the outer container (Podman/Docker) provides, and what security guarantees the inner container (gVisor) provides. From the code and the documentation on the entrypoint script, I understand that the outer layer:

    • No longer drops capabilities. This is the responsibility of the inner layer.
    • No longer sets a seccomp filter. This is done by the inner layer, and in a stricter fashion as well.

    Anything else that's changed?

  3. We have a pending issue for supporting user namespaces (Defense in Depth - User Namespaces #228). This is something that Docker does not support, but gVisor does, so that's great! My main issue with user namespaces is that UID 0 typically maps to the owner of the namespace on the host. In Linux (Podman) that's the user who starts the Dangerzone application. In Windows / macOS, it's the root user that runs in the WSL/HyperKit VM. What's the case in gVisor?

    (My plan was to use something like userns=nomap for supported Podman versions. If gVisor can support something similar across OSes, that would be amazing)

@apyrgio
Contributor

apyrgio commented Mar 7, 2024

@EtiennePerot kind ping on the above questions, so that we don't lose context.

@harrislapiroff harrislapiroff added this to the 0.7.0 milestone Mar 7, 2024
@EtiennePerot
Contributor Author

EtiennePerot commented Mar 8, 2024

Hey, sorry I missed the set of questions.

Assuming that we use runsc only on the first container, what is the practical difference between using runsc do and runsc run?

runsc do is just a convenience helper. It is true that under the hood it's basically just doing runsc create + runsc run, but there are many knobs that runsc do doesn't expose. For example, runsc do exposes the whole host filesystem by default, which is unnecessarily wide for Dangerzone. There are also some subtleties around how runsc do operates to act as a convenience wrapper; for example, in root-ful mode with networking enabled, it'll do a whole dance with network namespace setup. It's meant to be for convenience, not so much for the tightest possible security.

But beyond the OCI differences (which I think are worth digging into), the main reason is simply supportability. runsc create and runsc run are specified as part of the OCI runtime interface spec, whereas runsc do is just a gVisor-specific helper. It's therefore a less stable API, and there is no guarantee it will keep working the way it does into the future (e.g. no guarantee that the spec it generates will stay the same and not make further security compromises). I think it's worth avoiding runsc do for that reason alone.

(Well, technically the OCI spec only specifies runsc start rather than runsc run, but the interface runsc run takes is meant to be the same as runsc start so I think the argument still applies. Though I could edit the script to use runsc start if you think it's better.)

In terms of practical OCI spec differences:

  • runsc do enables TTY emulation by default. This is a notoriously complicated part of the POSIX API, and Dangerzone doesn't need it, so better to have it disabled.
  • runsc do would mount the whole host filesystem as / by default (as per above), and mounts it read-write. (In practice, the host filesystem is still effectively read-only, because gVisor implements an overlay on top of the root filesystem. But writing the OCI spec ourselves also allows explicitly mounting the root as read-only inside the sandbox.) This also means that the in-sandbox workload can see files such as the out-of-sandbox entrypoint (entrypoint.py) and the runsc binary itself, i.e. it can trivially learn more about its own environment than it strictly needs to know. The manual spec only mounts the specific parts of the filesystem that Dangerzone actually needs.
  • runsc do doesn't allow setting mount options, whereas the manual spec can specify mount options like nosuid, noexec, nodev to further restrict what the workload can do with files exposed from /safezone (see the sketch after this list).
  • runsc do doesn't make guarantees about which host namespaces it isolates itself in, whereas the manual OCI spec causes runsc to isolate itself in a separate host PID+Network+IPC+UTS+mount namespace.
  • runsc do doesn't allow specifying an in-sandbox seccomp filter. We could add one for more defense-in-depth; see below.
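To illustrate the kind of control the manual spec gives (referenced in the mount-options bullet above), here is a hypothetical fragment of such a spec expressed as a Python dict; the paths and the exact namespace list are illustrative, not the spec shipped in this PR:

```python
# Hypothetical OCI spec fragment showing restrictions that `runsc do` does not
# expose: a read-only root, restrictive mount options, and explicit host
# namespace isolation for the runsc process.
spec_fragment = {
    "root": {"path": "rootfs", "readonly": True},
    "mounts": [
        {
            "destination": "/safezone",
            "type": "bind",
            "source": "/safezone",  # illustrative host path
            "options": ["rbind", "nosuid", "noexec", "nodev"],
        },
    ],
    "linux": {
        "namespaces": [
            {"type": "pid"},
            {"type": "network"},
            {"type": "ipc"},
            {"type": "uts"},
            {"type": "mount"},
        ],
    },
}
```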

Before including gVisor support, we should clarify what security guarantees the outer container (Podman/Docker) provides, and what security guarantees the inner container (gVisor) provides. [...] Anything else that's changed?

I think the main thing is, well, that the inner layer now has gVisor in the middle. I don't want to sound too salesman-y, but gVisor emulates a full-fledged POSIX kernel implementation. It's not just about one security measure being moved from one layer to the other; for some of them, it's effectively doubling them up: there are now two distinct implementations of the same security measure that would need to be exploited simultaneously in order to break through.

For example, if the sandboxed workload is running as a non-root user inside the gVisor sandbox, and the gVisor sandbox is itself running as an unprivileged user on the host, then for the sandboxed workload to escalate to root on the host, it would need a privilege-escalation exploit for both kernels (and, since those two kernels don't share code, the same exploit generally won't work against both simultaneously). The same goes for filesystem-level isolation: that security measure exists at both layers (technically there are actually three levels of filesystem-level isolation: gVisor's implementation; the fact that the gVisor process places itself into a chroot + Linux mount namespace that exposes only the minimum possible of the host filesystem; and Docker/Podman's own filesystem isolation). Similarly for PID namespaces: there's gVisor's own in-sandbox process tree, there's the fact that the gVisor process isolates itself in its own Linux PID namespace (as specified in the OCI spec), and there's the fact that the Docker/Podman container itself runs in a dedicated Linux PID namespace. The same kind of thing applies to the other types of namespaces.

More generally speaking, it takes at least two unique kernel vulnerabilities (one specifically against the gVisor kernel, plus one specifically against the Linux kernel) in order to fully break out onto the host system. On Windows/OSX, it takes one gVisor kernel exploit plus a VM escape exploit to fully break out.

There are some layers that aren't "doubled up" in this manner, like the seccomp one you point out, although it would actually be possible to add a seccomp filter enforced within the gVisor sandbox (using gVisor's seccomp-bpf implementation) on top of what's already here. I can add that to this PR if you wish. But if we're going to do that, I think there's potential to go beyond applying something like Docker's default seccomp filter, and come up with a fine-tuned filter that allows only what the PDF-to-pixels program needs.
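As a sketch of what that could look like, the OCI spec handed to runsc can carry a linux.seccomp section. The allow-list below is purely illustrative; a real filter would be derived from what the doc-to-pixels program actually needs:

```python
# Illustrative only: a tiny allow-list seccomp policy in OCI-spec form,
# not a vetted filter for the conversion workload.
seccomp_fragment = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
    "syscalls": [
        {
            "names": ["read", "write", "openat", "close",
                      "mmap", "munmap", "futex", "exit_group"],
            "action": "SCMP_ACT_ALLOW",
        },
    ],
}

# This would slot into the spec as spec["linux"]["seccomp"] = seccomp_fragment.
```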

In terms of thinking of security responsibilities (i.e. what each layer "ought to" provide), at a high-level, I think the framing in an earlier comment is pretty well formulated: the outer container's main responsibility is to act as a "platform compatibility" solution, whereas the inner container's responsibility is solely security. The outer layer's own security measures (e.g. non-privileged user, filesystem isolation, PID namespace, etc.) can be seen as just an added security bonus, so long as they don't interfere with the inner layer's ability to work properly (hence, as an example, the need to remove the Docker-level seccomp filter).

We have a pending issue for supporting user namespaces [...] In Linux (Podman) that's the user who starts the Dangerzone application. In Windows / macOS, it's the root user that runs in the WSL/HyperKit VM. What's the case in gVisor?

The user namespace handling is probably the most complicated part of the current implementation, because of the need to preserve UIDs on Linux so that files in the /safezone volumes are mapped to the user's UID on the host. As long as that remains necessary, then running as a user that ultimately maps to that UID on the host is unavoidable.

However, if, as per the above discussion, we can get rid of this, then all of the current user namespace stuff can be simplified and further locked down to have no relationship with any existing user on the host system. This means we could create a user that only exists inside the outer container, and run the gVisor process as that user in a user namespace that exposes no other user. (On the initial host user namespace, I believe it would appear as an unnamed user with a UID that isn't in /etc/passwd.)

Then, on top of that, since gVisor is its own kernel and thus implements its own notion of users and user namespaces, the workload within the sandbox can itself run in an in-sandbox user namespace that doesn't have a mapping to the host user (i.e. to the user that the gVisor process runs as). (The current implementation of this PR kind of does this already. It has two in-sandbox users: the python3 -m dangerzone.conversion.doc_to_pixels command runs as the "UID 1000" user which has no mapping to any user outside the sandbox, and a "UID 0" that maps to the host UID that owns /safezone so that file permissions can still line up properly.)
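Sketching that idea in OCI-spec terms (hypothetical values; the real mapping depends on the user created inside the outer container):

```python
# Hypothetical user-namespace fragment: in-sandbox UID/GID 1000 (the
# conversion process) maps to a single unprivileged ID that exists only
# inside the outer container; no other users are visible in the sandbox.
OUTER_CONTAINER_UID = 1000  # assumed UID of a user created only in the image

userns_fragment = {
    "namespaces": [{"type": "user"}, {"type": "pid"}, {"type": "mount"}],
    "uidMappings": [{"containerID": 1000, "hostID": OUTER_CONTAINER_UID, "size": 1}],
    "gidMappings": [{"containerID": 1000, "hostID": OUTER_CONTAINER_UID, "size": 1}],
}

# These keys would live under spec["linux"] in the full OCI spec.
```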

@apyrgio
Contributor

apyrgio commented Mar 8, 2024

Thanks Etienne for answering all my questions in great detail. Not only am I covered, but I think we have enough material to update the parent issue and write down a design document. I plan to follow up on the above on Monday, and maybe offer some next steps. My guess is that our lives will be much easier once we've tackled #625, so I'll make sure to prioritize it next week.

@EtiennePerot
Contributor Author

EtiennePerot commented Mar 9, 2024

Sounds good. One small question: which issue do you mean by "updating the parent issue"?

I agree that addressing #625 first makes sense, otherwise this PR would add temporary complexity that doesn't need to ever exist if #625 is addressed first. If you wish, I can already start simplifying this PR to what it would look like if it only needs to support PDF-to-pixels conversion.

@apyrgio
Contributor

apyrgio commented Mar 9, 2024

Sounds good. One small question: which issue do you mean by "updating the parent issue"?

I was referring to this issue: #126. It doesn't have the context that this discussion has, so I'd like to move some there, for future reference.

If you wish, I can already start simplifying this PR to what it would look like if it only needs to support PDF-to-pixels conversion.

Sure, if it's not too much of a hassle for you. I don't expect we'll have many more architectural changes in the near future, so you should be good to go. The only relevant thing I can think of is that we'll experiment with switching to a Debian image soon, but I think this should not affect this discussion.

@apyrgio
Contributor

apyrgio commented Mar 14, 2024

Quick update here. I actually prioritized implementing the on-host pixels to PDF conversion PR (#748), which is a prerequisite for vastly simplifying this one. Now that it's out, I'll follow up here soon.

EtiennePerot added a commit to EtiennePerot/dangerzone that referenced this pull request Apr 15, 2024
Per discussion on freedomofpress#590,
the need for this volume will soon go away.

This makes gVisor integration much easier, because it removes the need
to preserve file access and ownership of the files in this volume from
within the gVisor sandbox. The `/sandboxed_entrypoint.sh` file is no
longer necessary, and the `/entrypoint.py` file is massively simplified.

This also allows the use of `--userns=nomap` in Podman.
@apyrgio
Contributor

apyrgio commented Jun 9, 2024

How would you like this PR to proceed? Should I merge your commits on the branch for this PR? And if so, should I do so now, or once everything is ready?

Yeah, I'd prefer it if my changes were squashed into your commits, so that our iterations are not shown in our Git history. I propose the following:

  1. Add some extra commits in the wip-gvisor-2 branch, based on your comments. I'll probably do some polishing in the code as well, since it was pretty much a WIP effort.
  2. Provide a wip-gvisor-2-squashed branch, with the exact same code, but with clean Git history.

This way, you can look both at the delta, and the final branch.

I'll change the runsc release process to turn on -c opt by default to address this.

Nice!

apyrgio pushed a commit that referenced this pull request Jun 10, 2024
Per discussion on #590,
the need for this volume will soon go away.

This makes gVisor integration much easier, because it removes the need
to preserve file access and ownership of the files in this volume from
within the gVisor sandbox. The `/sandboxed_entrypoint.sh` file is no
longer necessary, and the `/entrypoint.py` file is massively simplified.

This also allows the use of `--userns=nomap` in Podman.
@apyrgio
Contributor

apyrgio commented Jun 10, 2024

Add some extra commits in the wip-gvisor-2 branch, based on your comments. Probably will do some polishing as well in the code, since it was pretty much a WIP effort.

Done, I have added a few more commits that address your review comments and polish the code a bit.

Provide a wip-gvisor-2-squashed branch, with the exact same code, but with clean Git history.

Done, this branch is equivalent to the wip-gvisor-2 branch, but I have squashed our iterations into a single commit. In order to make it as small as possible, but still functional, I have factored out some changes into separate commits.

Next steps would be to:

  • Take a look at these branches, and incorporate wip-gvisor-2-squashed into your own gvisor branch, if you agree.
  • Make any other changes that you want on top of mine.
  • In the meantime, I'll try one more time to run Dangerzone on all of our supported platforms (x86/ARM macOS, and Windows).
  • ???
  • Profit Merge!

copybara-service bot pushed a commit to google/gvisor that referenced this pull request Jun 10, 2024
copybara-service bot pushed a commit to google/gvisor that referenced this pull request Jun 11, 2024
This turns on optimizations for release builds.

Detected in [this comment](freedomofpress/dangerzone#590 (comment)).

PiperOrigin-RevId: 642076108
copybara-service bot pushed a commit to google/gvisor that referenced this pull request Jun 11, 2024
This turns on optimizations for release builds.

Detected in [this comment](freedomofpress/dangerzone#590 (comment)).

PiperOrigin-RevId: 642129920
@EtiennePerot
Contributor Author

EtiennePerot commented Jun 12, 2024

Thanks, I have incorporated wip-gvisor-2-squashed and re-signed my commit within it. I also incorporated your CHANGELOG.md fixes that were on top of the commit, since I could not otherwise have kept your signature on it. (Also fixed bug links within CHANGELOG.md).
I left some comments on your commits but as they are all trivial comment spelling/wording changes, I figured I'd just incorporate them in my commit as well.

apyrgio added a commit that referenced this pull request Jun 12, 2024
Add a design document for the gVisor integration, which is currently
under review. The associated pull request has lots of architectural
discussions about integrating gVisor, so in this document we collect
them all in one place.

Refs #590
apyrgio and others added 4 commits June 12, 2024 13:40
Our logic for detecting the appropriate Tesseract data directory should
also take into account the canonical envvar, if explicitly passed.

Get the (major, minor) parts of the Docker/Podman version, to check if
some specific features can be used, or if we need a fallback. These
features are related to the upcoming gVisor integration, and will be
added in subsequent commits.

Add Podman's default seccomp policy as of 2024-06-10 [1]. This policy
will be used in subsequent commits on platforms with Podman version 3,
whose seccomp policy does not allow the `ptrace()` syscall.

[1] https://github.com/containers/common/blob/d3283f8401eeeb21f3c59a425b5461f069e199a7/pkg/seccomp/seccomp.json

This wraps the existing container image inside a gVisor-based sandbox.

gVisor is an open-source OCI-compliant container runtime.
It is a userspace reimplementation of the Linux kernel in a
memory-safe language.

It works by creating a sandboxed environment in which regular Linux
applications run, but their system calls are intercepted by gVisor.
gVisor then redirects these system calls and reinterprets them in
its own kernel. This means the host Linux kernel is isolated
from the sandboxed application, thereby providing protection against
Linux container escape attacks.

It also uses `seccomp-bpf` to provide a secondary layer of defense
against container escapes. Even if its userspace kernel gets
compromised, attackers would have to additionally have a Linux
container escape vector, and that exploit would have to fit within
the restricted `seccomp-bpf` rules that gVisor adds on itself.

Fixes freedomofpress#126
Fixes freedomofpress#224
Fixes freedomofpress#225
Fixes freedomofpress#228
@apyrgio
Contributor

apyrgio commented Jun 12, 2024

The changes look good Etienne, thanks a lot. As for the signatures, I'm having a hard time retaining the original ones prior to the merge. For instance, I had to rebase this PR on top of some recent commits on main, so your signature has been replaced with mine. Oh well, that's just the way it goes.

I'm waiting for the CI tests to pass one last time, and then I'll merge this PR.

@apyrgio apyrgio merged commit f03bc71 into freedomofpress:main Jun 12, 2024
38 checks passed
@EtiennePerot
Contributor Author

Thank you for all the work on reviewing this and the related fixes!

copybara-service bot pushed a commit to google/gvisor that referenced this pull request Jun 14, 2024