-
Notifications
You must be signed in to change notification settings - Fork 715
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
small speedups during elf loading #1207
Conversation
e463916
to
4acb87b
Compare
4acb87b
to
0e174cf
Compare
linker.go
Outdated
} | ||
if onlyOneSymbol { | ||
return map[string]asm.Instructions{ | ||
firstSymbol: insns, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently splitSymbols always returns a copy of insns. The new code will sometimes return a copy, sometimes not, which is a great bug magnet IMO.
Have you tried instead to optimize the loop below? I can see two big inefficiencies:
- append will do extra copies
- map lookup and assigns are also expensive
Getting rid of the two should make this a lot better without changing the alloc behaviour. You can probably track last name + index and do an exact allocation + copy whenever name changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, I checked that we were not modifying the result of this function but this can change in the future.
I think the main point is that we split into a map and then iterate over the map to create another array (in the only place this function is called). Would you be open to removing the map completely and replacing it with an iterator so that we skip the intermediate allocation completely ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow, sorry. You want to inline splitSymbols into loadFunctions? loadFunctions into loadProgramSections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically splitSymbols
returns a map, but we never really take advantage of the fact that it's a map (except in tests), we directly iterate on it and create something else from it. What I was suggesting is that splitSymbols
returns a lazy iterator type that would allow iteration without creating a big map with all the symbols and instructions.
I don't think it matters that much, in the latest version of the commit I improved the algorithm a lot in both the fast and slow paths.
06cbbf4
to
dd7b16d
Compare
dd7b16d
to
b815e6a
Compare
Reuse the existing manyprogs.c test to exercise the ELF loader. Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
The elf section can provide a good approximation of the amount of instruction we will unmarshal. goos: linux goarch: amd64 pkg: github.com/cilium/ebpf cpu: 12th Gen Intel(R) Core(TM) i7-1260P │ base.txt │ new.txt │ │ sec/op │ sec/op vs base │ LoadCollectionManyProgs 1.301m ± 2% 1.263m ± 0% -2.89% (p=0.000 n=10) │ base.txt │ new.txt │ │ B/op │ B/op vs base │ LoadCollectionManyProgs 826.5Ki ± 0% 767.2Ki ± 0% -7.18% (p=0.000 n=10) │ base.txt │ new.txt │ │ allocs/op │ allocs/op vs base │ LoadCollectionManyProgs 6.932k ± 0% 6.752k ± 0% -2.60% (p=0.000 n=10) Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>
Make the case of a single program per section faster by removing map accesses and repeated copying via append. goos: linux goarch: amd64 pkg: github.com/cilium/ebpf cpu: 12th Gen Intel(R) Core(TM) i7-1260P │ base.txt │ new.txt │ │ sec/op │ sec/op vs base │ LoadCollectionManyProgs 1.263m ± 0% 1.173m ± 0% -7.13% (p=0.000 n=10) │ base.txt │ new.txt │ │ B/op │ B/op vs base │ LoadCollectionManyProgs 767.2Ki ± 0% 707.9Ki ± 0% -7.73% (p=0.000 n=10) │ base.txt │ new.txt │ │ allocs/op │ allocs/op vs base │ LoadCollectionManyProgs 6.752k ± 0% 6.572k ± 0% -2.67% (p=0.000 n=10) Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com> Co-developed-by: Lorenz Bauer <lmb@isovalent.com>
goos: linux goarch: amd64 pkg: github.com/cilium/ebpf cpu: 12th Gen Intel(R) Core(TM) i7-1260P │ base.txt │ new.txt │ │ sec/op │ sec/op vs base │ LoadCollectionManyProgs 1.173m ± 0% 1.166m ± 2% ~ (p=0.481 n=10) │ base.txt │ new.txt │ │ B/op │ B/op vs base │ LoadCollectionManyProgs 707.9Ki ± 0% 699.0Ki ± 0% -1.25% (p=0.000 n=10) │ base.txt │ new.txt │ │ allocs/op │ allocs/op vs base │ LoadCollectionManyProgs 6.572k ± 0% 6.452k ± 0% -1.83% (p=0.000 n=10) Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>
goos: linux goarch: amd64 pkg: github.com/cilium/ebpf cpu: 12th Gen Intel(R) Core(TM) i7-1260P │ base.txt │ new.txt │ │ sec/op │ sec/op vs base │ LoadCollectionManyProgs 1.166m ± 2% 1.145m ± 1% -1.84% (p=0.000 n=10) │ base.txt │ new.txt │ │ B/op │ B/op vs base │ LoadCollectionManyProgs 699.0Ki ± 0% 693.3Ki ± 0% -0.81% (p=0.000 n=10) │ base.txt │ new.txt │ │ allocs/op │ allocs/op vs base │ LoadCollectionManyProgs 6.452k ± 0% 6.092k ± 0% -5.58% (p=0.000 n=10) Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>
b815e6a
to
d511049
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran the new benchmark and added the results to the commit messages.
Thanks! |
This PR implements two speedups/memory usage improvements in the elf loading path.
The first one is about preallocating the instructions buffer, based on the size provided by the elf section header. Due to double width load instruction we might over allocate by a bit, but from testing on our ebpf code this over allocation is negligible compared to the pre-allocation done by
slice.append
anyway, and the win is significant in case of large instructions buffer.The second one implements a fast path for
splitSymbols
in the case where the instructions are composed of only one symbol, which is the dominant use case from my understanding (happy to be challenged on that).I implemented a small benchmark depicting some instructions on both ends of the spectrum (one symbol, and 2 symbols at both ends of the instructions stream).
Please let me know if you would like separate PRs for each of those.
Thanks