All conversions fail with "Unspecified error", using Orbstack #908

Closed
mstgrv opened this issue Sep 4, 2024 · 13 comments · Fixed by #926

@mstgrv

mstgrv commented Sep 4, 2024

Trying to use Dangerzone 0.7.0 on some files recently, everything fails with "Unspecified error".

I'm using Dangerzone with Orbstack (https://orbstack.dev/) rather than Docker Desktop, so perhaps there might be some configuration issue or something? Not sure how to generate logs or anything but let me know if there's anything I can provide. I used to use Dangerzone & Orbstack without any issues but perhaps there's been a breaking change somewhere recently on either end.

mstgrv changed the title from 'All conversions fail with "Unspecified error"' to 'All conversions fail with "Unspecified error", using Orbstack' on Sep 4, 2024
@EtiennePerot
Contributor

Hi there. This is likely due to Orbstack vs Docker Desktop differences and how it impacts gVisor running inside. Please see issue #865 where another user is using a non-Docker container runtime and getting similar symptoms, most likely due to different permissions between the VM image used by Docker Desktop vs the one used by Colima.

See that issue for troubleshooting steps as well. The first step is to find out the container runtime command being run (docker run ..., it should show up in the console if you run Dangerzone through the command-line), then run that command directly so that you can see its output. Adding -e RUNSC_DEBUG=1 to it will help you see gVisor's output.

@mstgrv
Author

mstgrv commented Sep 4, 2024

Thanks very much for that. Running Dangerzone via the commandline, this is the output showing the error:

[INFO] > /usr/local/bin/docker run --security-opt=no-new-privileges:true --cap-drop all --cap-add SYS_CHROOT --network=none -u dangerzone --rm -i --name dangerzone-doc-to-pixels-Dem_W3 dangerzone.rocks/dangerzone /usr/bin/python3 -m dangerzone.conversion.doc_to_pixels
Usage: dangerzone [OPTIONS] [FILENAMES]...
Try 'dangerzone --help' for help.

Error: No such option: -B
[ERROR] [doc Dem_W3] 0% Unspecified error
[DEBUG] Marking doc Dem_W3 as 'failed'

@EtiennePerot
Contributor

Huh. I'm not sure which program exactly is saying the Error: No such option: -B error message. In any case, as I mentioned, please add -e RUNSC_DEBUG=1 next to the --security-opt=no-new-privileges:true argument and see if you get more output.
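A minimal sketch of that step, assuming you have copied the `docker run ...` line from Dangerzone's console output: it splits the logged command and inserts extra arguments right after the `--security-opt=no-new-privileges:true` flag. The helper name `add_debug_args` is hypothetical and not part of Dangerzone.

```python
import shlex

# The flag after which the extra arguments are inserted.
ANCHOR = "--security-opt=no-new-privileges:true"


def add_debug_args(logged_command, extra_args=("-e", "RUNSC_DEBUG=1")):
    """Return the logged command as an argv list with extra_args after ANCHOR."""
    argv = shlex.split(logged_command)
    if ANCHOR not in argv:
        raise ValueError(f"{ANCHOR!r} not found in command")
    pos = argv.index(ANCHOR) + 1
    return argv[:pos] + list(extra_args) + argv[pos:]


if __name__ == "__main__":
    logged = (
        "/usr/local/bin/docker run --security-opt=no-new-privileges:true "
        "--cap-drop all --cap-add SYS_CHROOT --network=none -u dangerzone "
        "--rm -i --name dangerzone-doc-to-pixels-Dem_W3 "
        "dangerzone.rocks/dangerzone /usr/bin/python3 "
        "-m dangerzone.conversion.doc_to_pixels"
    )
    # Print a shell-ready command line that can be pasted into a terminal.
    print(shlex.join(add_debug_args(logged)))
```

The same helper can insert other flags by passing a different `extra_args` tuple.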

@mstgrv
Author

mstgrv commented Sep 4, 2024

Yeah it's weird, because it doesn't seem like a -B is even being passed anywhere in that output.

Running with that extra flag, I get the below:

I0904 04:20:21.595886 7 main.go:197] **************** gVisor ****************
I0904 04:20:21.596042 7 namespace.go:251] *** Re-running as root in new user namespace ***
W0904 04:20:21.596088 7 util.go:64] FATAL ERROR: Error executing inside namespace: re-executing self: fork/exec /proc/self/exe: operation not permitted
Error executing inside namespace: re-executing self: fork/exec /proc/self/exe: operation not permitted
W0904 04:20:21.596199 7 main.go:227] Failure to execute command, err: 1
gVisor quit with exit code: 128

@EtiennePerot
Contributor

EtiennePerot commented Sep 4, 2024

Can you try adding --security-opt=seccomp=unconfined? I suspect you are hitting whatever the default syscall filter of Orbstack is. If that doesn't fix it, please attach the full log output.

@mstgrv
Author

mstgrv commented Sep 4, 2024

That did seem to fix it, that error is no longer happening and it seems to be waiting to be given a file to process. Is there a way to run Dangerzone normally via the GUI with that flag included?

@EtiennePerot
Contributor

EtiennePerot commented Sep 5, 2024

> it seems to be waiting to be given a file to process

Yes, the behavior of waiting for a file to process is expected when running this command outside of Dangerzone.

> Is there a way to run Dangerzone normally via the GUI with that flag included?

Try to replace line 142 here:

# Older Docker Desktop versions may have a seccomp policy that does not
# allow `ptrace(2)`. In these cases, we specify our own. See:
# https://github.com/freedomofpress/dangerzone/issues/846
if Container.get_runtime_version() < (25, 0):
    security_args += custom_seccomp_policy_arg

with if True: (i.e. unconditionally adding a --security-opt seccomp=path/to/seccomp.gvisor.json flag). This will make it use the same syscall filter that Dangerzone uses when it detects Docker version 24 or earlier. Orbstack probably uses Docker's old syscall filter as its default, which is missing some syscalls that gVisor needs.
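A rough sketch of what this change amounts to, written as a standalone function rather than the actual Dangerzone code. The names `security_args_for` and `always_custom` are hypothetical; the only assumption about Docker is its `--security-opt seccomp=<profile path>` syntax.

```python
def security_args_for(runtime_version, profile_path, always_custom=False):
    """Build the security-related `docker run` arguments.

    runtime_version: tuple such as (24, 0) or (27, 1).
    profile_path: path to a custom seccomp profile (JSON file).
    always_custom: when True, mimic the suggested `if True:` edit and pass
        the custom profile regardless of the runtime version.
    """
    args = ["--security-opt=no-new-privileges:true"]
    if always_custom or runtime_version < (25, 0):
        args += ["--security-opt", f"seccomp={profile_path}"]
    return args


if __name__ == "__main__":
    # A newer runtime only gets the custom profile when it is forced.
    print(security_args_for((27, 1), "seccomp.gvisor.json"))
    print(security_args_for((27, 1), "seccomp.gvisor.json", always_custom=True))
```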

@apyrgio Perhaps we could just set this seccomp profile unconditionally for all runtimes in order to harmonize these differences. WDYT?

@apyrgio
Contributor

apyrgio commented Sep 5, 2024

Thanks for debugging this Etienne. I actually wanted to run Orbstack before weighing in on this issue. It could be that our custom seccomp profile does not work for whatever reason, and we need to dig deeper.

If it does work though, we have to be very careful with setting this seccomp filter unconditionally. My fear is that we may mask updates to that filter from the upstream, e.g., in case of a Linux kernel vulnerability. I'll try to see if Orbstack (and Colima #865) report themselves through docker info somehow, and use that information to set the seccomp filter accordingly.

Unfortunately, container runtimes don't offer a way to show the default seccomp filter that will be used for the container invocation. Moreover, the Linux kernel does not provide anything helpful there either (my understanding is that actions_avail doesn't include this info).

Finally, regarding the -B argument (#908 (comment)), we actually have an issue for that already: #873. As stated there, it's benign.

@mstgrv: In order to make the change that Etienne suggests, you need to build Dangerzone from source (check out https://github.com/freedomofpress/dangerzone/blob/main/BUILD.md) or edit the installed Python module directly (nasty). Are you able to do that?

@almet
Contributor

almet commented Sep 5, 2024

In addition (or as a replacement) to opting-in in our code for a number of container technologies, we could have an option in the interface to enable this custom seccomp profile.

Or, if we don't want to bother end users about this (it's fairly technical), we could enable this via a setting that we don't expose in the interface.

Because these technologies aren't currently supported, that would offer a way for advanced users to tweak the installation.

Note that I am also concerned about the security breaches that it could open...

@EtiennePerot
Contributor

EtiennePerot commented Sep 5, 2024

> If it does work though, we have to be very careful with setting this seccomp filter unconditionally. My fear is that we may mask updates to that filter from the upstream, e.g., in case of a Linux kernel vulnerability. I'll try to see if Orbstack (and Colima #865) report themselves through docker info somehow, and use that information to set the seccomp filter accordingly.

I don't think that's a likely scenario, but even if it happened, I don't think blocking the system call at the level of the container runtime syscall filter would make a practical difference, in my opinion.

The longer version: Container runtimes' default seccomp profiles have to cater to generic workloads and thus need to allow most syscalls through by design. Their decision to block additional ones always carries the risk of breaking currently-working container workloads in ways that are hard for container users to detect. But even if this were the case, I'd still point out that gVisor "seccomps itself" much tighter than the container runtime's default seccomp filter, and this will remain true by design. These two layers of seccomp filters stack, i.e. every system call to the host needs to clear both filters in order to be executed. Now, if a Linux kernel vulnerability were to show up in such a way that it would be easy to block through seccomp filters without impacting legitimate workloads (not often the case; most vulnerabilities usually require a specific sequence of syscalls which seccomp cannot selectively prevent), then perhaps a container runtime's default seccomp filter would add it. But it's difficult to think of a scenario where that would happen while the gVisor-side seccomp filter would not also be updated to block (or already be blocking) this same system call.
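The stacking behavior can be pictured with a toy model. This is only a sketch: the two allowlists below are made up for illustration and are far smaller than any real filter, and the function names are hypothetical.

```python
# Illustrative allowlists only; real seccomp filters are far larger.
RUNTIME_DEFAULT = {"read", "write", "openat", "socket"}
GVISOR_SELF_FILTER = {"read", "write", "ptrace"}


def reaches_host(syscall, outer=RUNTIME_DEFAULT, inner=GVISOR_SELF_FILTER):
    """A host syscall executes only if every stacked filter allows it."""
    return syscall in outer and syscall in inner


def effective_allowlist(outer=RUNTIME_DEFAULT, inner=GVISOR_SELF_FILTER):
    """The set of syscalls that clears both filters is their intersection."""
    return outer & inner


if __name__ == "__main__":
    print(sorted(effective_allowlist()))
    # Allowed by the inner filter but blocked by the outer one: the kind of
    # mismatch that breaks gVisor under a stricter runtime default.
    print(reaches_host("ptrace"))
```

In this model a looser outer filter never widens what reaches the host, and a stricter one can only break things the inner filter expected to work.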

I'd also point out that by using the default container runtime's seccomp profile, the opposite risk exists: system calls are added to the set of allowed system calls that aren't required by Dangerzone. The fact that older versions of Docker didn't allow ptrace(2) but later ones did, for example, suggests that the default profile tends to get more permissive over time, not less. By explicitly setting its own container seccomp profile, Dangerzone would avoid having this set unnecessarily expand.

> Unfortunately, container runtimes don't offer a way to show the default seccomp filter that will be used for the container invocation. Moreover, the Linux kernel does not provide anything helpful there either (my understanding is that actions_avail doesn't include this info).

Yes, seccomp filters are fairly inscrutable, mostly to prevent applications from figuring out which syscalls they can run at all without trying to run them. This is because seccomp filters can be, and often are, configured to kill the application rather than just return a "permission denied" error (side note: this is another difference between gVisor's seccomp filter and container runtimes', btw; gVisor's filter can afford to kill the entire container, but container runtime filters can't because that would be too disruptive to workloads). Even files like /proc/$pid/seccomp_cache are gated behind off-by-default Linux kernel compilation options.
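To illustrate how little is visible from inside a process: on Linux, `/proc/<pid>/status` has a `Seccomp:` field that reports only the seccomp mode (0 disabled, 1 strict, 2 filter), not the contents of the filter. A small sketch that parses it follows; the function name `seccomp_mode` is made up for this example.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Extract the seccomp mode from the text of /proc/<pid>/status.

    Returns None when the field is absent (e.g. on non-Linux systems).
    """
    for line in status_text.splitlines():
        # "Seccomp_filters:" is a different field and is skipped on purpose.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, "unknown")
    return None


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            print(seccomp_mode(f.read()))
    except OSError:
        print("no /proc/self/status on this platform")
```

Knowing that the mode is "filter" says nothing about which syscalls are blocked, which is why a probe is still needed.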

I think the only runtime-agnostic way to find out may be to have Dangerzone run a dummy "probe" Python script that executes a trusted command inside runsc (e.g. just run /bin/true), just to find out if that succeeds, then reports back up to Dangerzone whether it worked. This could run as part of gvisor_wrapper/entrypoint.py just before starting the untrusted sandbox workload (when the data reported by the container can still be considered trustworthy), or could be done as a separate one-off container executed only once on Dangerzone startup (can be done asynchronously while the user is picking files to avoid increasing startup latency).
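A minimal sketch of such a probe, under the assumption that the caller supplies the full trusted command line. The actual runsc invocation is deliberately not shown here, and `probe` is a hypothetical helper, not an existing Dangerzone function.

```python
import subprocess
import sys


def probe(command, timeout=30.0):
    """Run a trusted command and report whether it succeeded.

    Returns a (ok, detail) tuple so the caller can log why the probe failed.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"could not run probe: {exc}"
    if result.returncode == 0:
        return True, "probe succeeded"
    return False, f"exit code {result.returncode}: {result.stderr.strip()}"


if __name__ == "__main__":
    # Stand-in for the sandboxed /bin/true: a Python process that exits 0.
    ok, detail = probe([sys.executable, "-c", "pass"])
    print(ok, detail)
```

Because the probe only runs trusted code, its result can be reported back before any untrusted document is processed.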

@apyrgio
Contributor

apyrgio commented Sep 9, 2024

Thanks a lot for this comment Etienne, it makes lots of sense.

> But it's difficult to think of a scenario where that would happen while the gVisor-side seccomp filter would not also be updated to block (or already be blocking) this same system call.

To this point, would it make sense then to offer a seccomp filter for the outer container, that holds the allowed syscalls for Sentry and Gofer? Actually, is there such a list somewhere that we can take a look at?

The only drawbacks I see in this approach are:

  • System calls that the Python interpreter (i.e., our entrypoint script) requires for some reason, but gVisor does not allow. Probably exec(2) is one of them.
  • Updating this filter whenever gVisor adds an extra required syscall for Sentry/Gofer (I guess this will happen rarely).

We can live with those, I believe.

> Yes, seccomp filters are fairly inscrutable, mostly to prevent applications from figuring out which syscalls they can run at all without trying to run them. This is because seccomp filters can be, and often are, configured to kill the application rather than just return a "permission denied" error (side note: this is another difference between gVisor's seccomp filter and container runtimes', btw; gVisor's filter can afford to kill the entire container, but container runtime filters can't because that would be too disruptive to workloads). Even files like /proc/$pid/seccomp_cache are gated behind off-by-default Linux kernel compilation options.

Nice, cool to know that.

> I think the only runtime-agnostic way to find out may be to have Dangerzone run a dummy "probe" Python script that executes a trusted command inside runsc (e.g. just run /bin/true), just to find out if that succeeds, then reports back up to Dangerzone whether it worked.

Exactly, I was thinking about something like this, as well. Not for detecting seccomp config issues specifically, but mostly as a Dangerzone health-check.

@EtiennePerot
Contributor

EtiennePerot commented Sep 10, 2024

> Thanks a lot for this comment Etienne, it makes lots of sense.
>
> > But it's difficult to think of a scenario where that would happen while the gVisor-side seccomp filter would not also be updated to block (or already be blocking) this same system call.
>
> To this point, would it make sense then to offer a seccomp filter for the outer container, that holds the allowed syscalls for Sentry and Gofer? Actually, is there such a list somewhere that we can take a look at?
>
> The only drawbacks I see in this approach are:
>
> * System calls that the Python interpreter (i.e., our entrypoint script) requires for some reason, but gVisor does not allow. Probably `exec(2)` is one of them.
> * Updating this filter whenever gVisor adds an extra required syscall for Sentry/Gofer (I guess this will happen rarely).
>
> We can live with those, I believe.

gVisor's self-imposed syscall filters depend a lot on gVisor's configuration. The code for generating these filters is here for the Sentry (with the main part being in config_main.go), and here for the Gofer (with the main part being in config.go). The other files in these directories are there to handle things like architecture-specific differences (the Go runtime calls different syscalls depending on whether it's running on x86_64 vs ARM64) or other optional compilation-time choices (e.g. enabling race detection).

At runtime, there are also factors that affect the filters. For example, if #898 (turning DirectFS off) is merged, it will cause the filters to start blocking the openat system call. Other configuration options include the platform (KVM vs Systrap use different system calls in order to work), network stack (no syscalls when disabled, a few syscalls when using the userspace network stack, a bunch more syscalls when using the host kernel stack), GPU proxying support, and so on.

I'm not convinced that having a tight seccomp filter at the container runtime level would really be meaningful from a security perspective, for two reasons.

The first is simplicity. As the gVisor integration design doc states, the goal of the inner sandboxing layer is to act as the security layer, while the outer one is to act mostly as a compatibility layer (with any extra lockdown settings being a bonus). As this bug and #865 demonstrate, seccomp-bpf enforcement at the host container runtime level already means different things depending on the container runtime in use, and there are likely further subtleties depending on machine architecture too that I suspect we haven't seen yet. gVisor already puts a tight seccomp filter on itself, and does so at a time when it is still trusted to do so. Having another filter doesn't really add defense-in-depth, because that just gets stacked in the same host kernel codepath as the one that gVisor's host syscalls go through. By definition, it cannot be stricter than gVisor's own filter, so all it may protect against would be either the scenario above (the container runtime's default filter being updated to ban a specific syscall ahead of Dangerzone/gVisor doing so), or a logic bug in gVisor's syscall filter generation that accidentally allows bad syscalls through, which is IMO quite unlikely. But I'm obviously biased here because I wrote a large part of gVisor's syscall filter generation logic (and I'll be the first to admit that it is quite complex... I wrote about it here). There is an automated test that verifies that the logic produces a filter that does end up causing a process to be killed if it calls a blocked syscall. Still, it's true that it's possible a bug of this kind may still slip through.

The second reason is that the outer container's seccomp filter needs to cover the entire lifetime of the outer container. This is unlike gVisor's system call filters, which are only applied after gVisor sandbox initialization, i.e. when the sandbox is fully initialized but just before any untrusted code starts running, not at gVisor startup time. This is why gVisor can afford to e.g. block exec. But during initialization, gVisor needs to call sensitive system calls like exec, clone, unshare, setns, pivot_root, and so on in order to isolate itself. The outer container's seccomp filter would need to allow these system calls for the entire duration of the outer container, because it starts applying right from the start of the outer container's lifetime, and cannot be updated mid-way.

Therefore, while I'd be happy to provide a programmatic way (like say, a runsc generate-seccomp-profile-json subcommand that generates a JSON seccomp profile of the union of the Sentry+Gofer filters) that could run as part of the Dangerzone release pipeline (in order to lower the maintenance burden of keeping up with this filter across gVisor releases), there would still need to be a bunch of testing and post-processing on the Dangerzone side to further merge in all the syscalls that the gVisor initialization sequence needs, and (as you've pointed out) those needed by the outer container's Python interpreter and gVisor entrypoint Python script.
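To make the "union plus post-processing" idea concrete, here is a sketch that merges several syscall sets into one allowlist and emits it in the JSON shape used by Docker-style seccomp profiles (`defaultAction` plus a `syscalls` list of `names` and `action`). All four input sets are toy placeholders, not real gVisor or Python data, and `build_outer_profile` is a hypothetical helper.

```python
import json


def build_outer_profile(*syscall_sets):
    """Merge syscall sets into one allowlist-style seccomp profile dict."""
    allowed = sorted(set().union(*syscall_sets))
    return {
        # Anything not explicitly listed fails with an error.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [{"names": allowed, "action": "SCMP_ACT_ALLOW"}],
    }


if __name__ == "__main__":
    # Toy placeholder sets; the real ones would come from gVisor and testing.
    sentry = {"read", "write", "futex"}
    gofer = {"read", "openat"}
    gvisor_init = {"execve", "clone", "unshare", "setns", "pivot_root"}
    python_entrypoint = {"read", "mmap"}
    profile = build_outer_profile(sentry, gofer, gvisor_init, python_entrypoint)
    print(json.dumps(profile, indent=2))
```

Even with toy inputs, the merged list is wider than any single input set, which is the concern about how loose the resulting outer filter would be.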

But at that point, the resulting set of syscall filters would be quite wide, hence why I'm not sure whether the incremental security it provides is worth the maintenance cost; at least not when approaching the problem from this angle.

All that being said, I am in full support of removing per-container-runtime differences, because it would avoid issues like this one from occurring. And since Dangerzone already bears the burden of maintaining an explicit seccomp profile for some container runtimes (for Docker <=24), and gVisor is confirmed to work under it without issues, then perhaps the incremental cost to tighten this filter is worth it. Then it seems to me like it would be easy to simply apply this seccomp profile under all container runtimes (since there's no reason why the same image and the same command-line would call different syscalls under different container runtimes).

Another approach that may be lower-maintenance in order to arrive at a "locked down as much as possible" seccomp filter at the outer container level (even if still relatively loose, as per above) may be the following: write a syscall filter that allows every single system call that exists (explicitly listed one by one, no wildcard), and then whittle it down as needed while ensuring that all tests still work. This process can be done automatically: Have a script try to remove each syscall in order from the "allowed" set, and see if any test breaks. Then remove all the syscalls for which the tests passed from the final generated profile. That would be another good way of arriving at a good profile that can be used at the outer container runtime across all runtimes, and is independent of gVisor implementation details.
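A sketch of that whittling loop, with the test suite abstracted as a callable. Note that this is a greedy variant: a removal is kept as soon as the tests pass, so later trials run against the already-reduced list instead of the full one. The names `minimize_allowlist` and `tests_pass` are hypothetical.

```python
def minimize_allowlist(syscalls, tests_pass):
    """Drop every syscall whose removal keeps the test suite passing.

    syscalls: iterable of syscall names, all initially allowed.
    tests_pass: callable taking the candidate allowlist and returning True
        when the full test suite still passes under that allowlist.
    """
    allowed = list(syscalls)
    for name in list(allowed):
        candidate = [s for s in allowed if s != name]
        if tests_pass(candidate):
            # Nothing broke, so this syscall is not needed.
            allowed = candidate
    return allowed


if __name__ == "__main__":
    # Fake oracle standing in for "generate profile, run all tests".
    needed = {"read", "write", "execve"}
    fake_suite = lambda allowlist: needed <= set(allowlist)
    start = ["read", "mount", "write", "ptrace", "execve", "reboot"]
    print(minimize_allowlist(start, fake_suite))
```

In practice `tests_pass` would be slow, since each call means a full test run, so the loop is better suited to a release pipeline than to everyday development.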

almet added a commit that referenced this issue Sep 19, 2024
As per Etienne Perot's comment on #908:

> Then it seems to me like it would be easy to simply apply this seccomp
profile under all container runtimes (since there's no reason why the
same image and the same command-line would call different syscalls under
different container runtimes).
@almet
Contributor

almet commented Sep 19, 2024

I confirm that changing the default seccomp policy works in this case: I've been able to run a conversion on Orbstack with the changes listed in #926

almet added a commit that referenced this issue Sep 20, 2024
apyrgio pushed a commit that referenced this issue Sep 24, 2024
almet added a commit that referenced this issue Oct 2, 2024
@almet almet closed this as completed in #926 Oct 2, 2024