-
Notifications
You must be signed in to change notification settings - Fork 709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PVF worker: Add seccomp restrictions (restrict networking) #2009
Conversation
We're already working on sandboxing by [blocking all unneeded syscalls](#882). However, due to the wide scope it will take a while longer. This PR starts with a much smaller scope, only blocking network-related syscalls until the above is ready. For security we block the following with `seccomp`: - creation of new sockets - these are unneeded in PVF jobs, and we can safely block them without affecting consensus. - `io_uring` - as discussed [here](paritytech/polkadot#7334 (comment)), io_uring allows for networking and needs to be blocked. See below for a discussion on the safety of doing this. - `connect`ing to sockets - the above two points are enough for networking and is what birdcage does (or [used to do](phylum-dev/birdcage#47)) to restrict networking. However, it is possible to [connect to abstract unix sockets](https://lore.kernel.org/landlock/20231023.ahphah4Wii4v@digikod.net/T/#u) to do some kinds of sandbox escapes, so we also block the `connect` syscall. (Intentionally left out of implementer's guide because it felt like too much detail.) `io_uring` is just a way of issuing system calls in an async manner, and there is nothing stopping wasmtime from legitimately using it. Fortunately, at the moment it does not. Generally, not many applications use `io_uring` in production yet, because of the numerous kernel CVEs discovered. It's still under a lot of development. Android outright banned `io_uring` for these reasons. Considering `io_uring`'s status, and that it very likely would get detected either by our [recently-added static analysis](#1663) or by testing, I think it is fairly safe to block it. If execution hits an edge case code path unique to a given machine, it's already taken a non-deterministic branch anyway. After all, we just care that the majority of validators reach the same result and preserve consensus. So worst-case scenario, there's a dispute, and we can always admit fault and refund the wrong validator. On the other hand, if all validators take the code path that results in a seccomp violation, then they would all vote against the current candidate, which is also fine. The violation would get logged (in big scary letters) and hopefully some validator reports it to us. Actually, a worst-worse-case scenario is that 50% of validators vote against, so that there is no consensus. But so many things would have to go wrong for that to happen: 1. An update to `wasmtime` is introduced that uses io_uring (unlikely as io_uring is mainly for IO-heavy applications) 2. The new syscall is not detected by our static analysis 3. It is never triggered in any of our tests 4. It then gets triggered on some super edge case in production on 50% of validators causing a stall (bad but very unlikely) 5. Or, it triggers on only a few validators causing a dispute (more likely but not as bad?) Considering how many things would have to go wrong here, we believe it's safe to block `io_uring`. Closes #619 Original PR in Polkadot repo: paritytech/polkadot#7334
Starting the tokio runtime was calling `socketpair` and triggering the new seccomp filter. Removed tokio since we wanted to do it soon anyway as part of #649.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
//! # Action on caught syscall | ||
//! | ||
//! TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of killing the process on violations, we can have seccomp notify the parent, having it kill the process and log what happened. We anyway want logging to ensure that the syscall detection script is really sound, before enabling a seccomp whitelist based on it in production. I've done some experiments and performance does not suffer too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do you plan to have the parent notified? using ptrace or SECCOMP_RET_USER_NOTIF
?
If we only want logging, we can just rely on the kernel's Audit logging of seccomp actions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ptrace. With SECCOMP_RET_USER_NOTIF
we'd need to check if it's supported, as you previously pointed out. Maybe not a big deal as it was introduced in 5.0.
I considered audit logging, but AFAICT there are some issues with it. We'd need to manually parse the log which seems hacky (there is a Linux utility for it but not installed by default). Also, it's possible for operators to have disabled seccomp logging. ptrace seemed like the right way to do it, albeit more complex; but I'd really appreciate your input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it's possible for operators to have disabled seccomp logging.
well, users can disable ptrace too. But is this a concern? we only have access to the logs of nodes if they're run by us, right? we can configure the system whichever way we see fit.
I don't have a very strong opinion honestly. whichever option works, doesn't impact performance too much and isn't too complicated to implement.
I am guessing the most performant one is audit logging, because it's all happening in the kernel. But it's indeed more complicated to parse the logs and build the right plubming to feed them in loki
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with parsing the audit logs. Audit events are not guaranteed to be observable, but we don't necessarily need them: the logs we emit are merely informative, so that operators know what happened. Note that if we were to handle seccomp violations differently from regular worker death, then an attacker could perhaps abuse that non-determinism somehow.
Using ptrace
instead would be more reliable and (I think) the performance would be fine, but it seemed more convoluted to implement that strategy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good
//! # Action on caught syscall | ||
//! | ||
//! TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do you plan to have the parent notified? using ptrace or SECCOMP_RET_USER_NOTIF
?
If we only want logging, we can just rely on the kernel's Audit logging of seccomp actions
NOTE: Log, but don't change the outcome. Not all validators may have auditing enabled, so we don't want attackers to abuse a non-deterministic outcome. TESTING: Some manual testing where seccomp events were triggered confirmed that the logs are parsed correctly: Prepare worker: ``` Oct 27 09:15:42.725 WARN parachain::pvf: failed to recv a prepare response: Custom { kind: UnexpectedEof, error: "early eof" } worker_pid=2691819 Oct 27 09:15:42.726 DEBUG parachain::pvf: checking audit log for seccomp violations worker_pid=2691819 audit_log_path="/var/log/syslog" Oct 27 09:15:42.727 ERROR parachain::pvf: A forbidden syscall was attempted! This is a violation of our seccomp security policy. Report an issue ASAP! worker_pid=2691819 syscall=41 pvf=Pvf { code, code_hash: 0xdc4dc649bdebd4a529b6c3eb17da74b36afa9728eb7a0d834443510382490a35, executor_params: ExecutorParams([]), prep_timeout: 3s } ``` Execute worker: ``` Oct 27 09:17:28.006 WARN parachain::pvf: failed to recv an execute response worker_pid=2692712 validation_code_hash=0xdc4dc649bdebd4a529b6c3eb17da74b36afa9728eb7a0d834443510382490a35 error=Custom { kind: UnexpectedEof, error: "early eof" } Oct 27 09:17:28.006 DEBUG parachain::pvf: checking audit log for seccomp violations worker_pid=2692712 audit_log_path="/var/log/syslog" Oct 27 09:17:28.007 ERROR parachain::pvf: A forbidden syscall was attempted! This is a violation of our seccomp security policy. Report an issue ASAP! worker_pid=2692712 syscall=41 validation_code_hash=0xdc4dc649bdebd4a529b6c3eb17da74b36afa9728eb7a0d834443510382490a35 artifact_path="/tmp/.tmp4p6bN0/wasmtime_polkadot_v1.1.0_0xdc4dc649bdebd4a529b6c3eb17da74b36afa9728eb7a0d834443510382490a35_0x03170a2e7597b7b7e3d84c05391d139a62b157e78786d8c082f29dcf4c111314" ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable. Another way of getting a log would be to use the Trap return action and install a signal handler for SIGSYS that logs and then kills the process.
But that get's tricky because you can only use async-signal-safe functions in signal handlers (and we'd have to audit all of that code constantly).
Perhaps we should only log for now, instead of killing the process. Just for one release. We haven't tested with all parachains, after all, and the I say we log, just to be safe. The jobs really shouldn't actually do any networking. (And creating a Unix socket would have already failed with our FS restrictions.) |
Sounds good to me. We'll need some script that parses the audit logs from the kernel and notifies us if there's a logged violation |
I agree that signal handlers are better avoided where possible. :P And, according to man:
IIUC, with this scheme, an attacker could overwrite our signal handlers with his own.
I believe we can just parse the logs after every job, similar to how we do it now. We don't need a script running continuously (if I understood you correctly). I've confirmed that we have time to log for one release before enabling the protections. |
You could add
Sure. We still need some kind of rough notification mechanism if a fault occurred. Or just remember to scan the logs regularly :D |
I'm not following the whole discussion so a possibly dumb question... Why do we need to parse audit logs instead of connecting to audit via netlink and observing audit events directly? |
@s0me0ne-unkn0wn I don't know what netlink is. :P Parsing logs is never ideal, but this is the best way I could find since encountering the problem back in April. If your solution is quick to implement I could do it, otherwise I've already implemented and tested the parsing way. |
@mrcnski see man 3 audit_open and further links. It's the interface to the kernel's audit subsys. I think I sent you a usage example to Matrix PM once. |
Thanks @s0me0ne-unkn0wn, it's a good idea if I can figure out how to do it. Raised a follow-up at #2080. |
Use a combination of rusty-fork (separate processes in rust tests) and new sessions to safely kill child workers in tests.
Overview
We're already working on sandboxing by blocking all unneeded syscalls. However, due to the wide scope it will take a while longer. This PR starts with a much smaller scope, only blocking network-related syscalls until the above is ready.
For security we block the following with
seccomp
:creation of new sockets - these are unneeded in PVF jobs, and we can safely block them without affecting consensus.
io_uring
- as discussed here, io_uring allows for networking and needs to be blocked. See below for a discussion on the safety of doing this.connect
ing to sockets - the above two points are enough for networking and is what birdcage does (or used to do) to restrict networking. However, it is possible to connect to abstract unix sockets to do some kinds of sandbox escapes, so we also block theconnect
syscall.Safety of blocking io_uring
(Intentionally left out of implementer's guide because it felt like too much detail.)
io_uring
is just a way of issuing system calls in an async manner, and there is nothing stopping wasmtime from legitimately using it. Fortunately, at the moment it does not. Generally, not many applications useio_uring
in production yet, because of the numerous kernel CVEs discovered. It's still under a lot of development. Android outright bannedio_uring
for these reasons.Considering
io_uring
's status, and that it very likely would get detected either by our recently-added static analysis or by testing, I think it is fairly safe to block it.Consensus analysis
If execution hits an edge case code path unique to a given machine, it's already taken a non-deterministic branch anyway. After all, we just care that the majority of validators reach the same result and preserve consensus. So worst-case scenario, there's a dispute, and we can always admit fault and refund the wrong validator. On the other hand, if all validators take the code path that results in a seccomp violation, then they would all vote against the current candidate, which is also fine. The violation would get logged (in big scary letters) and hopefully some validator reports it to us.
Actually, a worst-worse-case scenario is that 50% of validators vote against, so that there is no consensus. But so many things would have to go wrong for that to happen:
wasmtime
is introduced that uses io_uring (unlikely as io_uring is mainly for IO-heavy applications)Considering how many things would have to go wrong here, we believe it's safe to block
io_uring
.Related
Closes #619
Original PR in Polkadot repo: paritytech/polkadot#7334
Birdcage io_uring PR: phylum-dev/birdcage#34