str() to support large strings #305
This whole issue needs some serious attention as it is a significant limitation. Nice initial exploration @Birch-san. I have hit this limitation on a number of occasions, and for me it is one of bpftrace's biggest annoyances at the minute.
@tyroguru out of interest: what kind of script do you have that needs big strings? Clear use-cases are valuable for framing the conversation.

I had another thought:

probe_read_str() -> BPF stack -> BPF map: has a chance of working, but is high-effort.
probe_read_str() -> BPF map: direct, but this is the route that seems to be disallowed.

Additionally: it'd be worth asking around to assess whether I've characterised the problem correctly. Maybe there's a completely different way to approach big strings.
Hey @Birch-san - sorry, but I was being lazy :-). Take this simple example I did a couple of weeks ago for reading /dev/null:

Obviously, strings much larger than 200 bytes are written to /dev/null. I need to understand the problem better to make any form of judgement on this, but it seems like a fundamental problem that needs addressing in the engine room of BPF - to work around it in userland would probably be high effort, as you say.
@tyroguru thanks for the example. Agreed; we need way bigger strings to attack that use-case.

I've had a look at how probe_read and probe_read_str are implemented in the kernel. I think what we need is a variant of those helpers that can write into a BPF map value rather than onto the stack.

Of course, anything is possible if you blow holes in the security. So I'd like to understand whether there's any difference in the security model of BPF stack vs BPF map. For example: BPF maps can be shared between processes — though I imagine this is acceptable if the map owner has full control of whether to share the file descriptor.

It definitely feels right to add kernel support, since the proposed workaround is pretty shocking (and I'm pretty sure it'd be hard to write for loops that satisfy the verifier).

Getting the string into a map solves the hardest problem but creates a new one: it makes it difficult for us to compose our perf_event_output buffer (i.e. for printf). I'm not sure how to concatenate a map value with other values and exfiltrate all of those in one buffer. Well, maybe the key is to fire several perf_event_output()s -- one per printf argument. If it's performance-sensitive, then maybe a batching mechanism would be required. Not sure what a high-quality kernel method for that would look like.
Oh, apparently it used to be possible to probe_read directly into a BPF map value?
Indeed, his kernel commit from 2017 claims to enable a very relatable example.

I wonder if the "probe_read may only write to a destination on the stack" restriction is a newer validation constraint that was added at the API surface, whereas the actual verifier algorithm is more forgiving. This would indicate a regression (the API becoming stricter than necessary).
I had a stab at implementing some new kernel helpers (variants of probe_read/probe_read_str which can write into a map value). Branched from:
feature commit: commit 25f076f737be964bcd2c7ab2c277f5c306074dab (HEAD -> bpf-probe-big-strings)
Author: Alex Birch <Birch-san@users.noreply.github.com>
Date: Sun Feb 10 12:26:06 2019 +0000
bpf: helpers to probe_read into BPF map values
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 04b2f613dd06..98f0c62c231d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -677,6 +677,22 @@ union bpf_attr {
* @buf: buf to fill
* @buf_size: size of the buf
* Return : 0 on success or negative error code
+ *
+ * int bpf_probe_read_to_map(void *dst, int size, void *src)
+ * @dst: pointer to map value
+ * Return: 0 on success or negative error
+ *
+ * int bpf_probe_read_str_to_map(void *dst, int size, const void *unsafe_ptr)
+ * Copy a NUL terminated string from unsafe address. In case the string
+ * length is smaller than size, the target is not padded with further NUL
+ * bytes. In case the string length is larger than size, just count-1
+ * bytes are copied and the last byte is set to NUL.
+ * @dst: pointer to map value
+ * @size: maximum number of bytes to copy, including the trailing NUL
+ * @unsafe_ptr: unsafe address
+ * Return:
+ * > 0 length of the string including the trailing NUL on success
+ * < 0 error
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -736,7 +752,9 @@ union bpf_attr {
FN(xdp_adjust_meta), \
FN(perf_event_read_value), \
FN(perf_prog_read_value), \
- FN(getsockopt),
+ FN(getsockopt), \
+ FN(probe_read_to_map), \
+ FN(probe_read_str_to_map),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index fe2429b382ca..ac0fffdc7f90 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,6 +96,26 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
.arg3_type = ARG_ANYTHING,
};
+BPF_CALL_3(bpf_probe_read_to_map, void *, dst, u32, size, const void *, unsafe_ptr)
+{
+ int ret;
+
+ ret = probe_kernel_read(dst, unsafe_ptr, size);
+ if (unlikely(ret < 0))
+ memset(dst, 0, size);
+
+ return ret;
+}
+
+static const struct bpf_func_proto bpf_probe_read_to_map_proto = {
+ .func = bpf_probe_read_to_map,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_UNINIT_MAP_VALUE,
+ .arg2_type = ARG_CONST_SIZE_OR_ZERO,
+ .arg3_type = ARG_ANYTHING,
+};
+
BPF_CALL_3(bpf_probe_write_user, void *, unsafe_ptr, const void *, src,
u32, size)
{
@@ -503,6 +523,36 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
.arg3_type = ARG_ANYTHING,
};
+BPF_CALL_3(bpf_probe_read_str_to_map, void *, dst, u32, size,
+ const void *, unsafe_ptr)
+{
+ int ret;
+
+ /*
+ * The strncpy_from_unsafe() call will likely not fill the entire
+ * buffer, but that's okay in this circumstance as we're probing
+ * arbitrary memory anyway similar to bpf_probe_read() and might
+ * as well probe the map value. Thus, memory is explicitly cleared
+ * only in error case, so that improper users ignoring return
+ * code altogether don't copy garbage; otherwise length of string
+ * is returned that can be used for bpf_perf_event_output() et al.
+ */
+ ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
+ if (unlikely(ret < 0))
+ memset(dst, 0, size);
+
+ return ret;
+}
+
+static const struct bpf_func_proto bpf_probe_read_str_to_map_proto = {
+ .func = bpf_probe_read_str_to_map,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_UNINIT_MAP_VALUE,
+ .arg2_type = ARG_CONST_SIZE_OR_ZERO,
+ .arg3_type = ARG_ANYTHING,
+};
+
static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
@@ -514,6 +564,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
return &bpf_map_delete_elem_proto;
case BPF_FUNC_probe_read:
return &bpf_probe_read_proto;
+ case BPF_FUNC_probe_read_to_map:
+ return &bpf_probe_read_to_map_proto;
case BPF_FUNC_ktime_get_ns:
return &bpf_ktime_get_ns_proto;
case BPF_FUNC_tail_call:
@@ -542,6 +594,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
return &bpf_get_prandom_u32_proto;
case BPF_FUNC_probe_read_str:
return &bpf_probe_read_str_proto;
+ case BPF_FUNC_probe_read_str_to_map:
+ return &bpf_probe_read_str_to_map_proto;
default:
return NULL;
}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4c223ab30293..8740e7374698 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -736,7 +736,9 @@ union bpf_attr {
FN(xdp_adjust_meta), \
FN(perf_event_read_value), \
FN(perf_prog_read_value), \
- FN(getsockopt),
+ FN(getsockopt), \
+ FN(probe_read_to_map), \
+ FN(probe_read_str_to_map),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
 * function eBPF program intends to call

I got this compiled and booted into it, built libbpf and bcc against it, added new IRBuilder definitions, added some calling code and successfully compiled bpftrace… but I guess I've missed some step. The verifier says that my new helper functions don't exist.
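For anyone retracing this: a helper only "exists" to the verifier if the program type's func_proto callback returns its proto, and the caller must invoke it by the exact integer that the kernel's __BPF_FUNC_MAPPER assigns. In bcc-style C, the userspace-side declaration looks roughly like this (a sketch using the helper names from the patch above, not code from this thread):

static int (*bpf_probe_read_to_map)(void *dst, int size, const void *src) =
	(void *) BPF_FUNC_probe_read_to_map;

If the userspace headers and the running kernel disagree on that enum value, the call site resolves to a different (or nonexistent) helper number, and the verifier reports an unknown function.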
@Birch-san can you please bring this up on the netdev mailing list? http://vger.kernel.org/vger-lists.html#netdev This is where kernel BPF development is done. I think you'll get asked to show why bpf_probe_read_to_map() is necessary, and why we can't just use bpf_probe_read() to read into a map. If that's broken, they may want to fix it, and add a test to test_verifier.c to make sure it stays fixed.
@brendangregg thanks for pointing me to the right people. Mailing lists are a new world to me, so I'm lurking for a bit first. Then I'll ask whether probe_read_* was intended to be stack-only. I may need to prepare a minimal example for this in bcc.
@Birch-san I opened #750 which should solve this, you might be interested :)
I would guess it's more likely that the desired behavior changed and the API documentation was not updated. Which docs were you looking at? I don't see any reference to "stack only" for these helpers. For example, this verifier selftest reads into a map value and is accepted:

// tools/testing/selftests/bpf/test_verifier.c
{
"helper access to map: full range",
.insns = {
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
BPF_MOV64_IMM(BPF_REG_2, sizeof(struct test_val)),
BPF_MOV64_IMM(BPF_REG_3, 0),
BPF_EMIT_CALL(BPF_FUNC_probe_read),
BPF_EXIT_INSN(),
},
.fixup_map2 = { 3 },
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
},
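For reference, here's roughly what that pattern looks like from BPF C (a minimal sketch, not code from this thread; the map name, sizes, and attachment point are all assumptions):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* one per-cpu scratch slot, large enough for a "big" string */
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, char[4096]);
} scratch SEC(".maps");

SEC("kprobe/ksys_write")
int big_read(struct pt_regs *ctx)
{
    __u32 zero = 0;
    char *dst = bpf_map_lookup_elem(&scratch, &zero);

    if (!dst)   /* the check turns map_value_or_null into map_value */
        return 0;

    /* the destination is a map value, not stack memory */
    bpf_probe_read(dst, 4096, (void *)PT_REGS_PARM2(ctx));
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

Note the NULL check after the lookup: the verifier tracks the lookup result as map_value_or_null, and only the checked pointer is acceptable as a helper destination.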
@mmarchini I'm indeed interested; thanks for progressing this.

====

The reason I believe that probe_read_str() refuses to write into a BPF map value (and will only write onto the BPF stack) is that I attempted it before:

bd038f9...Birch-san:c5088c9a5d731165e1005a6a800e611bb63f3dc3

My attempt looks very similar to what's in your pull request. Maybe there's some significance in your using probe_read instead of probe_read_str. Or maybe your fixed-size read gives the verifier extra guarantees.

The error I got at the time (the commit message is in commit 52677d0 of this diff) was:
My understanding of that error was as follows:
====

I found another thread where Yonghong-song explained the error:
====

Here's the prototype for probe_read_str, which demonstrates the constraints on it:

static const struct bpf_func_proto bpf_probe_read_str_proto = {
.func = bpf_probe_read_str,
.gpl_only = true,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_UNINIT_MEM,
.arg2_type = ARG_CONST_SIZE_OR_ZERO,
.arg3_type = ARG_ANYTHING,
};

Its arg1 (*dst) has type ARG_PTR_TO_UNINIT_MEM, which include/linux/bpf.h describes as: "pointer to memory does not need to be initialized, helper function must fill all bytes or clear them in error case".
This suggests to me that BPF expected my destination pointer to be on the stack.

====

Recall the error I received. Here is the string representation of enum bpf_reg_type, from the verifier:

/* string representation of 'enum bpf_reg_type' */
static const char * const reg_type_str[] = {
// …
[PTR_TO_MAP_VALUE_OR_NULL] = "map_value_or_null",
[PTR_TO_STACK] = "fp",
// …
This again suggests to me that BPF required me to point at the stack, when in reality I pointed at a map value.

====
The error message I received in my attempt, and the corroboration above, were the only reasons for my believing this impossible. Maybe there was no change in behaviour, and the original premise (#305 (comment)) is still possible.

====

It's heartening that you've got probe reads working with a map value as the destination. I don't understand (given the above) why it works, but that's good news. If it turns out that we have all the kernel support necessary to do this, then I'm tempted to have another stab at it.
I tried to raise BPFTRACE_STRLEN beyond 200 and got the error message with a link to this issue. For the next person who follows this path, I'll leave my workaround here, with which I was able to print out a max of ~300 KB of ASCII from:

This was pretty much the maximum I was able to print out after some amount of fiddling. Any larger than this and the BPF program became too large. The number of instructions inside the while loop seemed to limit how many iterations we could have, so if you want to increase the logic inside the while loop, you may need to reduce the iteration count.
Hey @mathias-nyman, your proposal seems to work for me, thx a lot. I'm wondering about
I did some tests related to
Please note that I used BPFTRACE_STRLEN=64 (and +=63). I guess this shouldn't change much... Thx for your help, Uli
@uli-heller you're absolutely right about the lack of bounds checking. I remember being confused as to why it seemed to work for the program I was instrumenting at the time, but it did 🤷♂️ I was on a mission of extracting data and didn't care much about correctness at the time, sorry. Lemme know if you figure it out. If you only have medium-sized strings to print out without the chance of overflow, you can still use this trick by setting BPFTRACE_STRLEN=2 (and +=1) but then you also have to decrease the iteration count (the number in the while loop).
@mathias-nyman : Thx a lot. I followed a different approach using two nested loops - one to simulate something like "strlen()". Your approach is much easier. I modified it a little bit. Now I have a solution which doesn't need any BPFTRACE variables:
There is just one potential caveat: What about multibyte characters?
Previously, all strings needed to be stored on the stack. Given that BPF has a 512B stack limit, this put serious limits on strings. In practice, it was limited to about 200B.

This commit raises the limit much higher, to around 1024. The new limit we hit is from the LLVM memset() builtin, which only supports up to 1K. To go above this, we need to create our own memset() routine in BPF. This is quite easy to do and will be attempted in a later commit.

The way big strings work is by assigning each str() call in the script a unique ID. Then we create a percpu array map such that for each CPU, there is a percpu entry that can accommodate each str() callsite. This makes it so scripts can keep as many strings "on the stack" as they'd like, with the tradeoff being the amount of memory we need to preallocate. As long as the kernel never allows BPF programs to nest (e.g. execute prog A while prog A is already on the call stack), this approach is robust to self-corruption.

Note we do not yet remove or change the default value of BPFTRACE_MAX_STRLEN. That is b/c a few other helpers (buf(), path(), maybe more) also read the value and allocate space on the stack accordingly. We need to decouple or convert those helpers to scratch maps as well.

This closes bpftrace#305.
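A rough sketch of the scratch-map layout this commit message describes (the sizes, names, and BTF-style map syntax are my assumptions, not bpftrace's actual codegen):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define NUM_STR_CALLS 4      /* one slot per str() callsite, assigned at codegen */
#define MAX_STRLEN    1024

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, char[NUM_STR_CALLS][MAX_STRLEN]);
} str_scratch SEC(".maps");

/* codegen for str() callsite n would then do, roughly: */
static __always_inline char *str_slot(int n)
{
    __u32 zero = 0;
    char (*slots)[MAX_STRLEN] = bpf_map_lookup_elem(&str_scratch, &zero);

    if (!slots)
        return 0;
    /* each callsite writes only to its own per-cpu slot, so probes running
     * concurrently on different CPUs cannot corrupt each other's strings */
    return slots[n];
}

Because the map is per-cpu and each callsite owns a distinct slot, the only remaining hazard is the same program nesting on one CPU, which is exactly the caveat the commit message calls out.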
Follow-ups to the initial support are being tracked here: #3229
If you cannot use a newer version of bpftrace, you can resort to auxiliary methods: I used the C preprocessor to preprocess my bpftrace scripts and defined these macros:
And created python script, which concatenates my string properly: import sys;
import itertools as it;
for line in sys.stdin.buffer:
if rb"\0\0" in line:
string_parts = line.split(rb"\0\0");
full_out = list(it.takewhile(lambda _s: len(_s) == 63, string_parts))
remainder = list(it.dropwhile(lambda _s: len(_s) == 63, string_parts))
sys.stdout.buffer.write(b"".join(full_out + [remainder[0]]) + b"\n")
else:
sys.stdout.buffer.write(line) And then you can launch it like this:
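(Presumably by piping bpftrace's output through the script, e.g. bpftrace yourscript.bt | python3 join_strings.py, where both filenames are placeholders for whatever you named things.)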
P.S. There are edge cases (for example, if the string length is a multiple of 63), but for my task this approach suffices.
(Discussion extracted from #299)
This issue is to track "how could str() support kilobyte/megabyte/arbitrary string sizes?"
str() allocates strings on the BPF stack, so we reach the 512-byte limit very quickly (especially since we pay the toll again when printf() composes a perf event output buffer).
In fact, we're currently limited to tracing ~240-byte strings.
For an idea of "when would I want to trace a string larger than 240 bytes?":
My favourite dtrace command was one that was frequently guilty of maxing out the default string buffer: tracing what SQL queries were submitted to MySQL. I'd usually grow the buffer to 1024 bytes for good measure.
=====
I explored the possibility of storing the strings off-stack. The main problem there is that probe_read_str() will only write to on-stack memory. I have ideas for workarounds (and some code progress towards those workarounds), but they're too big to be included in the scope of this pull request.
=====
Here's some code that I built to explore the possibility of storing strings off-stack:
bd038f9...Birch-san:c5088c9a5d731165e1005a6a800e611bb63f3dc3
It introduces (in codegen) a "StrCall" specialization of "Call".
StrCall has a PERCPU_ARRAY BPF map, to hold large data off-stack. We cannot use probe_read_str() to output directly into the BPF map, but maybe we could allocate a small on-stack buffer, and ferry bytes (one buffer at a time) from BPF stack into BPF map.
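A sketch of that ferrying idea in BPF C (the chunk size, loop bound, and map shape are assumptions; the loop must be unrolled and bounded to satisfy the verifier):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define CHUNK  64            /* small on-stack staging buffer */
#define CHUNKS 16            /* bounded: at most CHUNKS * (CHUNK - 1) chars */

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, char[CHUNKS * CHUNK]);
} big_str SEC(".maps");

static __always_inline void ferry_str(const char *src)
{
    char stack_buf[CHUNK];
    __u32 zero = 0;
    char *dst = bpf_map_lookup_elem(&big_str, &zero);

    if (!dst)
        return;

#pragma unroll
    for (int i = 0; i < CHUNKS; i++) {
        /* read up to CHUNK-1 chars plus a NUL onto the stack... */
        int n = bpf_probe_read_str(stack_buf, CHUNK, src + i * (CHUNK - 1));
        if (n <= 0)
            return;
        /* ...then copy the staging buffer into the map value */
        __builtin_memcpy(dst + i * (CHUNK - 1), stack_buf, CHUNK);
        if (n < CHUNK)      /* short read: we reached the terminator */
            return;
    }
}

Each full chunk carries CHUNK-1 characters (the last byte is the NUL that bpf_probe_read_str writes), hence the stride of CHUNK - 1: the next chunk overwrites the previous chunk's NUL and the string reassembles contiguously in the map value.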
Next, printf() poses a few problems:
I think PerfEventOutput can transfer data from BPF map to userspace (that's what "join" does), so we could compose the perf event output struct inside a BPF map.
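For instance (a minimal sketch; the events map and the fixed 1024-byte size are assumptions, not bpftrace's actual codegen):

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(int));
    __uint(value_size, sizeof(int));
} events SEC(".maps");

/* inside a probe, with buf pointing at a map value composed earlier: */
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, buf, 1024);

The data argument only has to be memory the program can read, so a map value works as well as stack memory here.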
====
We may find that we want a dynamically-sized perf event output structure. Like, imagine we see that args->count is just 3 bytes. We wouldn't want to send MAX_STRLEN bytes to userspace, if we know at runtime a smaller, more specific amount.
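One known trick for a runtime-sized output is to clamp the length register so the verifier can prove an upper bound before the call (a sketch, assuming MAX_STRLEN is a power of two and buf holds at least MAX_STRLEN bytes):

u64 len = runtime_strlen;       /* e.g. the return value of probe_read_str */
len &= MAX_STRLEN - 1;          /* bounds the register for the verifier */
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, buf, len + 1);

This sends only what was actually captured, instead of the full buffer.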
Here's some (horrifying) code I wrote to make print decide the size of its MEMCPY based on runtime information rather than semantic analysis:
master...Birch-san:6c1f19a3a4c3cf253aa23db79f5f637b6823e5b2
Notice how expr2_ is used to pass down a strlen at runtime, instead of using the strlen provided during semantic analysis.
And notice how the str call uses a parameterised AllocaInst: instead of isArrayTy(), its expr_ type has isPointerTy() (it's a pointer to dynamically-allocated memory).

I reached a dead-end here because printf() allocates a buffer based on an LLVM StructType. It can store strings in ArrayType members, but you need to declare the size of the ArrayType up front; you cannot use an LLVM Value to dynamically size it.
The workaround here would be to stop using StructType and instead use a basic AllocaInst, which supports dynamic sizes based on an LLVM Value. Or even better: we can use off-stack memory (a BPF map) instead of doing an allocation inside printf().
====
In conclusion: there may be a path to supporting arbitrarily-large strings. Bonus points if we can also make the perf event output buffer dynamically-sized, so we can transfer the smallest message possible (instead of our maximum buffer size). A big advantage of dynamic sizing is that the user wouldn't have to tune their maximum buffer size for the sake of performance.