RFE: support "maximum kernel version" #11

Open
cgwalters opened this issue Aug 21, 2015 · 23 comments

@cgwalters
Contributor

As system calls are added to the kernel, I feel there is not enough discussion by default of the wide variety of applications that will suddenly gain access to a new attack surface.

The canonical example here is perf_event_open(), the source of numerous CVEs. While perf is awesome, my (e.g.) web server should not (by default) be able to use it.

It's possible to use seccomp today to blacklist syscalls. Whitelists can get very difficult to manage.

One thing that might be useful is a filter for any system calls newer than a particular kernel version, say 3.10. That way, each new system call would have to be verified for use in e.g. containers before it's added. Upgrading the kernel wouldn't suddenly expose containers to new attack surface.

In a discussion with @pcmoore he indicated this could be another annotation in the struct in e.g. arch-x86-syscalls.c.
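A minimal sketch of what such an annotation could look like, assuming a simplified table entry. The struct layout, the `KVER` packing macro, and the helper names are hypothetical and are not the actual libseccomp definitions; the version numbers shown should be checked against syscalls(2) before being relied on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pack "major.minor" into one comparable integer; 0 means "unknown/ancient". */
#define KVER(major, minor) (((unsigned int)(major) << 16) | (unsigned int)(minor))
#define KVER_UNDEF 0u

/* Hypothetical table entry: the usual name/number pair plus a new
 * "first appeared in" annotation. Not the real libseccomp struct. */
struct syscall_entry {
    const char *name;
    int num;               /* x86_64 syscall number */
    unsigned int kver_min; /* kernel version that introduced the syscall */
};

/* Illustrative data only; verify versions against syscalls(2). */
static const struct syscall_entry demo_table[] = {
    { "read",      0,   KVER_UNDEF },
    { "getrandom", 318, KVER(3, 17) },
    { "bpf",       321, KVER(3, 18) },
};

/* Return 1 if the syscall is newer than kver_max and should be filtered. */
static int syscall_is_too_new(const struct syscall_entry *e, unsigned int kver_max)
{
    assert(e != NULL);
    return e->kver_min > kver_max;
}

/* Look up an entry by name; returns NULL if it is not in the demo table. */
static const struct syscall_entry *demo_lookup(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++) {
        if (strcmp(demo_table[i].name, name) == 0)
            return &demo_table[i];
    }
    return NULL;
}
```

With a maximum version of 3.10, `syscall_is_too_new(demo_lookup("bpf"), KVER(3, 10))` reports that the syscall should be filtered, while `read` passes because an undefined version sorts before every real release.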

@nmav

nmav commented Jan 20, 2016

+1
That would help make blacklists usable for mitigation of security issues.

@pcmoore
Member

pcmoore commented Jan 20, 2016

@nmav to be clear, this RFE is for adding information to the internal syscall tables about when the syscall was first introduced to the Linux kernel, not for adding logic to determine if the current running kernel supports a given syscall. However, if you are trying to block a syscall, you can do so with libseccomp regardless of whether it is supported on a particular arch/ABI and kernel version; libseccomp will do the right thing for you.

@pcmoore
Member

pcmoore commented Jul 21, 2020

This RFE is almost five years old, and outside of a single discussion with @cgwalters I haven't seen or heard of much other interest in such a feature. With plenty of other open issues, most with higher priority, it is not clear when we would work on this, or even if such a thing would be a useful addition.

@cgwalters and @drakenclimber what do you think of this issue in 2020? I'm tempted to close this as WONTFIX, but I would like to get some comments and feedback before we take that step.

@drakenclimber
Member

@cgwalters and @drakenclimber what do you think of this issue in 2020? I'm tempted to close this as WONTFIX, but I would like to get some comments and feedback before we take that step.

Honestly I think this is a really cool idea. Several of my in-house customers are using allowlists because of this exact reason. If they were to use a denylist and a new syscall is added to the kernel, then that syscall would be another avenue of attack.

Let's leave it open for a bit longer. I'll ask around within Oracle and see if any customers are interested enough in this feature for me to pick it up. But @cgwalters (or anyone else for that matter) is totally welcome to own it if they have the time and interest :).

@pcmoore
Member

pcmoore commented Jul 23, 2020

Okay, as long as there is interest, I've got no problem in keeping this one open.

@cgwalters
Contributor Author

I do still think it'd be useful!

@pcmoore
Member

pcmoore commented Aug 19, 2020

It looks like issue #286 is the concrete issue to help drive this work forward ... even if it has been almost five years ;)

I think the first step towards this is to add a new field to the syscalls.csv file that indicates when the syscall was first introduced. That is going to be a good chunk of work as we currently have ~469 syscalls defined (!). However, we could amortize this work for the existing syscalls with an "undefined" value that we would treat simply as the syscall being created at the dawn of time. Of course all new additions to the syscalls.csv table would need to be added with the kernel version.

Some more quick thoughts:

  • syscalls.csv format
#syscall (v5.8.0-rc5 2020-07-14),kver_min,x86,x86_64,...
accept,<version>,PNR,43,...

... where <version> could be something like "5_8", "UNDEF", or similar.

  • version tokens
enum kernel_version {
    KV_UNDEF = 0,
    KV_1_0,
    KV_1_1,
    KV_1_3,
    KV_2_0,
    ...
    KV_5_8,
    _KV_MAX,
};
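The "5_8" / "UNDEF" tokens proposed above could be parsed along these lines. This is only a sketch under the assumption that the token is either the literal `UNDEF` or `<major>_<minor>`; the function name `kver_parse` and the choice to report an undefined version as 0/0 are hypothetical and not part of libseccomp.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse a hypothetical kver_min token: "UNDEF" or "<major>_<minor>".
 * On success store major/minor (0/0 for UNDEF) and return 0;
 * return -1 on malformed input. */
static int kver_parse(const char *tok, unsigned int *major, unsigned int *minor)
{
    char *end;
    unsigned long maj, min;

    assert(tok != NULL && major != NULL && minor != NULL);

    if (strcmp(tok, "UNDEF") == 0) {
        *major = 0;
        *minor = 0;
        return 0;
    }

    /* strtoul() would accept whitespace and signs, so require a digit first */
    if (tok[0] < '0' || tok[0] > '9')
        return -1;
    maj = strtoul(tok, &end, 10);
    if (*end != '_')
        return -1;

    tok = end + 1;
    if (tok[0] < '0' || tok[0] > '9')
        return -1;
    min = strtoul(tok, &end, 10);
    if (*end != '\0')
        return -1;

    /* keep the values small enough to pack into a single integer later */
    if (maj > 0xffff || min > 0xffff)
        return -1;

    *major = (unsigned int)maj;
    *minor = (unsigned int)min;
    return 0;
}
```

Treating `UNDEF` as 0/0 matches the "dawn of time" idea: it compares lower than any real release, so such syscalls are never considered too new.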

@jethrogb

Is kernel version the right thing to track? Is it guaranteed that newer syscalls are not backported to e.g. stable kernel branches with a lower version number?

@mathstuf

Red Hat will backport all kinds of things to their kernels, so no.

@cyphar

cyphar commented Nov 30, 2020

If Red Hat backports a syscall to an older kernel version, they can also patch their version of libseccomp to match. To be fair, this might matter more for certain use cases, but as an approach to fixing #286 I think it's fairly workable. The other problem is that I'm not sure there's any better approach: syscalls can be added in non-consecutive order (for instance, openat2 was added before close_range, though this example is kind of my fault).
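The non-consecutive ordering can be made concrete with a small sketch. The table below uses the x86_64 numbers and release versions as generally documented (worth rechecking against the kernel's syscall table); the struct and helper names are hypothetical. It counts how many syscalls a naive "allow every number up to the highest one known in release X" check would let through by mistake.

```c
#include <assert.h>
#include <stddef.h>

struct demo_syscall {
    const char *name;
    int num;                   /* x86_64 syscall number */
    unsigned int major, minor; /* release that added the syscall */
};

/* Sorted by syscall number; note that the versions are not monotonic. */
static const struct demo_syscall recent[] = {
    { "clone3",      435, 5, 3 },
    { "close_range", 436, 5, 9 },
    { "openat2",     437, 5, 6 },
    { "pidfd_getfd", 438, 5, 6 },
    { "faccessat2",  439, 5, 8 },
};
#define RECENT_LEN (sizeof(recent) / sizeof(recent[0]))

/* Return 1 if the syscall already existed in release major.minor. */
static int existed_in(const struct demo_syscall *s, unsigned int major, unsigned int minor)
{
    assert(s != NULL);
    return s->major < major || (s->major == major && s->minor <= minor);
}

/* Highest syscall number present in release major.minor, or -1 if none. */
static int highest_num(unsigned int major, unsigned int minor)
{
    size_t i;
    int best = -1;

    for (i = 0; i < RECENT_LEN; i++) {
        if (existed_in(&recent[i], major, minor) && recent[i].num > best)
            best = recent[i].num;
    }
    return best;
}

/* Count syscalls a naive "num <= highest_num" filter would allow
 * even though they are newer than the requested release. */
static size_t threshold_leaks(unsigned int major, unsigned int minor)
{
    size_t i, leaks = 0;
    int limit = highest_num(major, minor);

    for (i = 0; i < RECENT_LEN; i++) {
        if (recent[i].num <= limit && !existed_in(&recent[i], major, minor))
            leaks++;
    }
    return leaks;
}
```

For a 5.6 cut-off the highest known number is 438, so a pure number threshold would also admit close_range (436), which is why a per-syscall version annotation is needed rather than a single upper bound.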

@pcmoore
Member

pcmoore commented Dec 2, 2020

If Red Hat backports a syscall to an older kernel version, they can also patch their version of libseccomp to match.

Yes, exactly. The upstream libseccomp project has no control over the various enterprise Linux distributions and if those distributions decide to deviate from the upstream projects (either the Linux Kernel or libseccomp) they are on their own for support. While we will do our best to help, we can't sacrifice the upstream project in favor of these enterprise distributions with their own support and engineering staff.

@pcmoore
Member

pcmoore commented Jan 12, 2021

As a point of reference, the syscalls(2) manpage has some historical information regarding when various syscalls were introduced into the kernel:

@cyphar

cyphar commented Jan 14, 2021

I can do the syscall spelunking to figure out a version number for each syscall -- the only question is whether we should have the version number be per-architecture since I'm pretty sure certain syscalls were added to different architectures in different releases.

@pcmoore
Member

pcmoore commented Jan 15, 2021

... the only question is whether we should have the version number be per-architecture since I'm pretty sure certain syscalls were added to different architectures in different releases.

They most definitely were, and still are, as far as I can see. While it is going to be slightly annoying, and will definitely explode the CSV, tracking the syscall's first appearance for each arch/ABI is probably the right thing to do.

Any help you can provide on this @cyphar would be greatly appreciated.
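One way to picture per-arch/ABI tracking is a row that carries a number and a first-seen version for every ABI. This is a sketch only: the types, `DEMO_PNR`, and `demo_allowed` are hypothetical names, the `accept` numbers come from the CSV example earlier in this thread, and the `accept4` figures are given as commonly documented and should be verified against kernel history.

```c
#include <assert.h>
#include <stddef.h>

/* Pack major.minor.patch into one comparable integer; 0 means "undefined". */
#define KVER(a, b, c) \
    (((unsigned int)(a) << 16) | ((unsigned int)(b) << 8) | (unsigned int)(c))
#define KVER_UNDEF 0u
#define DEMO_PNR (-1) /* stand-in for "syscall does not exist on this ABI" */

enum demo_abi { DEMO_X86 = 0, DEMO_X86_64, DEMO_ABI_MAX };

struct demo_abi_info {
    int num;               /* syscall number on this ABI, or DEMO_PNR */
    unsigned int kver_min; /* first kernel version with it on this ABI */
};

struct demo_row {
    const char *name;
    struct demo_abi_info abi[DEMO_ABI_MAX];
};

static const struct demo_row demo_rows[] = {
    /* accept: numbers from the CSV example above, versions not yet known */
    { "accept",  { { DEMO_PNR, KVER_UNDEF }, { 43, KVER_UNDEF } } },
    /* accept4: believed to date from 2.6.28 on x86_64, but to have become
     * a direct syscall on 32-bit x86 only in 4.3 (verify before use) */
    { "accept4", { { 364, KVER(4, 3, 0) }, { 288, KVER(2, 6, 28) } } },
};

/* Return 1 if the syscall exists on the ABI and is not newer than kver_max. */
static int demo_allowed(const struct demo_row *row, enum demo_abi abi,
                        unsigned int kver_max)
{
    assert(row != NULL && abi < DEMO_ABI_MAX);
    if (row->abi[abi].num == DEMO_PNR)
        return 0;
    return row->abi[abi].kver_min <= kver_max;
}
```

The same syscall name can therefore give different answers for the same maximum version depending on the ABI, which is the cost of the larger CSV.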

giuseppe added a commit to giuseppe/libseccomp that referenced this issue Mar 18, 2021
it was reported by clang with the option -fsanitize=memory:

Uninitialized bytes in MemcmpInterceptorCommon at offset 0 inside [0x7070000002a0, 56)
==3791089==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x482a2c in memcmp (fuzzer+0x482a2c)
    #1 0x7fed2f120ebb in _hsh_add src/libseccomp/src/gen_bpf.c:598:9
    #2 0x7fed2f121715 in _gen_bpf_action_hsh src/libseccomp/src/gen_bpf.c:796:6
    #3 0x7fed2f121a53 in _gen_bpf_node src/libseccomp/src/gen_bpf.c:831:11
    #4 0x7fed2f121a53 in _gen_bpf_chain.isra.0 src/libseccomp/src/gen_bpf.c:1072:13
    #5 0x7fed2f121f16 in _gen_bpf_chain_lvl_res src/libseccomp/src/gen_bpf.c:977:12
    #6 0x7fed2f121c74 in _gen_bpf_chain.isra.0 src/libseccomp/src/gen_bpf.c:1124:12
    #7 0x7fed2f12253c in _gen_bpf_syscall src/libseccomp/src/gen_bpf.c:1520:10
    #8 0x7fed2f12253c in _gen_bpf_syscalls src/libseccomp/src/gen_bpf.c:1615:18
    #9 0x7fed2f12253c in _gen_bpf_arch src/libseccomp/src/gen_bpf.c:1683:7
    #10 0x7fed2f12253c in _gen_bpf_build_bpf src/libseccomp/src/gen_bpf.c:2056:11
    #11 0x7fed2f12253c in gen_bpf_generate src/libseccomp/src/gen_bpf.c:2321:7
    #12 0x7fed2f11f41c in seccomp_export_bpf src/libseccomp/src/api.c:724:7

  Uninitialized value was created by a heap allocation
    #0 0x4547ef in realloc (fuzzer+0x4547ef)
    #1 0x7fed2f121244 in _blk_resize src/libseccomp/src/gen_bpf.c:362:8
    #2 0x7fed2f121244 in _blk_append src/libseccomp/src/gen_bpf.c:394:6

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@pcmoore
Member

pcmoore commented Apr 28, 2022

As an FYI, I'm starting to look closer at our work queue for v2.6.0 and this jumped out as one of the larger items, so I spent some time on it this afternoon while avoiding other work :)

I've got "arch-syscall-validate" so that it creates CSV files with minimum kernel versions for each syscall/ABI pair. It also imports an existing "syscalls.csv" file to obtain any existing version information so we don't have to have a separate file with version information lying around. If that sounds confusing, it will make more sense when I submit the PR.

Speaking of the PR, I need to update the rest of the code/tooling to understand the expanded CSV format; once that is done I'll submit the PR for review/merge. This initial effort will be pretty hollow (no actual version information, and nothing to make use of it), but it will allow us to start collecting the version information in our syscall table and pave the way for additional work using the syscall version information.

@drakenclimber
Member

drakenclimber commented Apr 28, 2022

Ironically I recently started looking at it as well :) [1]. I have everything working except for the BPF creation.

Feel free to use/discard any of my work. Or if you want, we could have a quick chat to decide what to keep/throw away.

  • I created an introduced.csv that contains the version a syscall was added for each architecture. For prototyping ease, I have only been looking at x86_64. Automating the creation of this would be really cool
  • Similar to arch-syscall-validate, I created a script, arch-introduced-validate.py, that will verify that the kernel versions in the introduced CSV match the versions in the kernel source.
  • arch-syscall-ranges.py generates a *.h file that contains the valid syscall ranges for each architecture for each kernel version. Here's its current output. Note that the data is intentionally incorrect to make hacking around easier
  • I added a filter attribute, SCMP_FLTATR_CTL_KRNL_VRSN, that allows the user to specify the kernel version they understand
  • I added another filter attribute, SCMP_FLTATR_ACT_UNKNOWN, to allow the user to specify the action to take when a new (to the userspace app) syscall is called
  • Todo - I need to write the BPF to understand all of the above. I hacked around at it a week or two ago, but ran out of time and focus and it's a spaghetti mess :/

With all of that said, I would love to see what you've put together, @pcmoore. I think what I've outlined above will work, but it may not be the optimal way to do it.

[1] https://github.com/drakenclimber/libseccomp/blob/wip/issue11

@cgwalters
Contributor Author

Is kernel version the right thing to track? Is it guaranteed that newer syscalls are not backported to e.g. stable kernel branches with a lower version number?

But conceptually that's not actually different than upgrading the kernel to a newer version.
We don't want to somehow start using new system calls just because the host kernel actually started supporting them! Right?

@pcmoore
Member

pcmoore commented Apr 29, 2022

Ironically I recently started looking at it as well :) [1]. I have everything working except for the BPF creation.

Feel free to use/discard any of my work. Or if you want, we could have a quick chat to decide what to keep/throw away.

I actually started on this issue yesterday because I knew you were thinking about this topic and figured I could jump start it by getting some of the basic infrastructure in place, it wasn't my intention to duplicate efforts ... oh well, the best laid plans of mice and men ;)

Regardless, I've probably only got a couple more hours of work before my basic PR is ready, so I'll go ahead and submit that so you can take a look; at that point we can merge it, or drop it as a "lessons learned" sort of thing. I don't get too attached to any code I write, so feel free to reject the PR in favor of what you've got.

  • I created an introduced.csv that contains the version a syscall was added for each architecture. For prototyping ease, I have only been looking at x86_64. Automating the creation of this would be really cool

I personally think it would be good to see all of the syscall information in one table/csv. Yes, it is going to start getting a bit big, but I believe fairly strongly that having all of our syscall information in one file/database is going to be a better choice in the long run (easier updates, less worries about synchronizing the tables, etc.).

The creation/updating and automation of maintaining this file/database is a slightly different topic, but one change I've made to the "arch-syscall-validate" script is that when it is asked to generate a new syscall CSV table it optionally loads an existing CSV table and pulls the kernel version from that. This allows us to preserve the kernel version information in our CSV file and only worry about adding new entries by hand after the new CSV is generated.

I think it's important to remember that the addition of new syscalls is a relatively rare event and optimizing the process for that is not something I would worry too much about. Similarly, initially populating the syscall table with kernel versions is a one-time event that I don't think we need to worry about making repeatable; if we can hack together something to initially populate the values - correctly! - I think that's okay.
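The "load the existing table and pull the kernel version from it" step can be sketched as a simple merge by syscall name. This is an illustration in C rather than the actual script, and `struct demo_row` and `carry_over_versions` are hypothetical names; it only shows the idea that versions survive regeneration and that new syscalls are the ones left for manual entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_row {
    const char *name;
    const char *kver; /* e.g. "5_8" or "UNDEF"; NULL when not known yet */
};

/* For every row of the freshly generated table, reuse the version recorded
 * in the old table (matched by name). Returns the number of fresh rows that
 * still lack a version and need to be filled in by hand. */
static size_t carry_over_versions(struct demo_row *fresh, size_t n_fresh,
                                  const struct demo_row *old, size_t n_old)
{
    size_t i, j, missing = 0;

    assert(fresh != NULL && old != NULL);
    for (i = 0; i < n_fresh; i++) {
        fresh[i].kver = NULL;
        for (j = 0; j < n_old; j++) {
            if (strcmp(fresh[i].name, old[j].name) == 0) {
                fresh[i].kver = old[j].kver;
                break;
            }
        }
        if (fresh[i].kver == NULL)
            missing++;
    }
    return missing;
}
```

The quadratic lookup is deliberate for brevity; with a few hundred syscalls and a rarely run generation step, it is unlikely to matter.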

  • Similar to arch-syscall-validate, I created a script, arch-introduced-validate.py, that will verify that the kernel versions in the introduced CSV match the versions in the kernel source.

Validating the syscall table information used to be a lot more important when it was hand created, now that the table is generated from the kernel source itself the validation step isn't really necessary. The fact that our generation script has "validate" in the name is really just vestigial naming and not something to worry too much about IMO.

Even with the kernel versions being added to the table I'm not sure validation will provide much benefit, although using this script to initially generate and add the kernel versions to the syscall table would be nice.

  • arch-syscall-ranges.py generates a *.h file that contains the valid syscall ranges for each architecture for each kernel version. Here's its current output. Note that the data is intentionally incorrect to make hacking around easier

I'm not sure how I feel about pre-computing the valid syscall ranges for each version/ABI. I understand the performance advantage, but that feels very wrong to me; I think I'd rather see the library calculate that as needed right now.
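Calculating the ranges as needed, rather than shipping a pre-computed header, could look roughly like the sketch below. It assumes a table sorted by syscall number with a packed minimum version per entry; `allowed_ranges`, the structs, and the `KVER` macro are hypothetical and are not taken from either prototype.

```c
#include <assert.h>
#include <stddef.h>

/* Pack "major.minor" into one comparable integer; 0 means "undefined". */
#define KVER(major, minor) (((unsigned int)(major) << 16) | (unsigned int)(minor))

struct demo_syscall {
    int num;               /* syscall number */
    unsigned int kver_min; /* kernel version that introduced the syscall */
};

struct demo_range {
    int lo, hi; /* inclusive range of allowed syscall numbers */
};

/* Collapse all syscalls with kver_min <= kver_max into contiguous ranges.
 * tbl must be sorted by ascending, unique num. At most max_out ranges are
 * written; the number of ranges written is returned, so the result is
 * truncated if out is too small. */
static size_t allowed_ranges(const struct demo_syscall *tbl, size_t n,
                             unsigned int kver_max,
                             struct demo_range *out, size_t max_out)
{
    size_t i, count = 0;

    assert(tbl != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (tbl[i].kver_min > kver_max)
            continue; /* too new: leaves a hole in the range */
        if (count > 0 && tbl[i].num == out[count - 1].hi + 1) {
            out[count - 1].hi = tbl[i].num; /* extend the current range */
            continue;
        }
        if (count == max_out)
            return count; /* no room left for another range */
        out[count].lo = tbl[i].num;
        out[count].hi = tbl[i].num;
        count++;
    }
    return count;
}
```

Each resulting range maps naturally onto a pair of comparisons in the generated filter, and a single pass over the table is cheap enough that caching per version may not be necessary.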

  • I added a filter attribute, SCMP_FLTATR_CTL_KRNL_VRSN, that allows the user to specify the kernel version they understand
  • I added another filter attribute, SCMP_FLTATR_ACT_UNKNOWN, to allow the user to specify the action to take when a new (to the userspace app) syscall is called

For whatever reason, I've always thought of this more as a proper rule API instead of a filter attribute. I know filter attributes are seductive in the sense that they are easy and malleable to fit a wide range of uses, but in my mind restricting the filter to a specific set of syscalls available in a given kernel version seems much more like a filter rule than an attribute.

Like I said earlier, this is getting way ahead of what I was attempting to do with my initial little syscall table infrastructure PR, but something along these lines is what I was thinking of for the API:

enum scmp_kver {
  __SCMP_KV_NULL = 0,
  SCMP_KV_UNDEF = 1,
  ...
  SCMP_KV_5_17,
  __SCMP_KV_MAX,
};

#define SCMP_KVLE(V)  SCMP_CMP64(100, SCMP_CMP_LE, (V), 0)

int seccomp_rule_add_kver(scmp_filter_ctx ctx, uint32_t action, scmp_arg_cmp cmp);

... with an example usage being:

rc = seccomp_rule_add_kver(ctx, SCMP_ACT_ALLOW, SCMP_KVLE(SCMP_KV_5_17));

We could also just scrap the scmp_arg_cmp parameter and pass the scmp_kver enum value directly to the API function, but part of me likes keeping some flexibility in the API for future use. However, like most things on this topic, I'm not sure I have a strong feeling about this so discussion is very welcome!

  • Todo - I need to write the BPF to understand all of the above. I hacked around at it a week or two ago, but ran out of time and focus and it's a spaghetti mess :/

I'm happy to help put together the db/BPF code when we get to that point, I've tossed it around in my head a bit over the past few years and I think I have some ideas on how to make it work, but we'll have to see how well those ideas translate into proper code ;)

With all of that said, I would love to see what you've put together, @pcmoore. I think what I've outlined above will work, but it may not be the optimal way to do it.

I really should be able to get the PR out this afternoon, but if something comes up I'll post it over the weekend; we can talk a bit more about it then. Although like I said earlier, it's a far cry from being a complete solution, it is really just intended to help pave the way for a lot of the stuff you're working on.

@drakenclimber
Member

I don't get too attached to any code I write, so feel free to reject the PR in favor of what you've got.

I was going to say the exact same thing. :)

I really should be able to get the PR out this afternoon, but if something comes up I'll post it over the weekend;

I'm excited to see what you come up with. I think we could solve this in a variety of ways and it may take a few iterations.

Thanks so much for the help!

@pcmoore
Member

pcmoore commented Apr 29, 2022

This ended up taking a bit more time than I thought today, but my initial infrastructure PR is up at #381.

pcmoore changed the title from RFE: Support "maximum kernel version" to RFE: support "maximum kernel version" on Mar 31, 2023
@tiborvass

Hi all! I'm just curious to know if someone is already working on finishing this.

@drakenclimber
Member

drakenclimber commented Apr 11, 2023

Hi all! I'm just curious to know if someone is already working on finishing this.

A year ago I put together a prototype that was most of the way there [1]. I believe it works (or nearly so), but it needs a lot of work to clean it up, make the commits sensible, add tests, etc. Unfortunately I've since been pulled onto other issues, and I'm not sure when I'll get back to it. I am open to others picking up the task - either by continuing my work or starting from scratch.

[1] https://github.com/drakenclimber/libseccomp/tree/wip/issue11

@pcmoore
Member

pcmoore commented Apr 11, 2023

I haven't had an opportunity to do any further work on this, so if you are interested in working on this please let us know!
