small speedups during elf loading #1207

Merged (5 commits) on Nov 8, 2023

Conversation

@paulcacheux (Contributor) commented Nov 7, 2023

This PR implements two speed and memory-usage improvements in the ELF loading path.

The first one preallocates the instruction buffer based on the size reported by the ELF section header. Because of double-width load instructions we might over-allocate slightly, but from testing on our eBPF code this over-allocation is negligible compared to what slice append already over-allocates, and the win is significant for large instruction buffers.
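
As a hedged illustration of the idea (not the PR's actual diff; preallocInstructions is a made-up name): every encoded eBPF instruction occupies at least asm.InstructionSize (8) bytes, so the section size divided by 8 bounds the instruction count from above, and double-width lddw loads (16 bytes each, decoding to a single asm.Instruction) are what cause the slight over-allocation.

    package sketch

    import (
        "debug/elf"

        "github.com/cilium/ebpf/asm"
    )

    // preallocInstructions is a hypothetical helper, not part of the library:
    // it returns an empty instruction slice whose capacity is an upper bound on
    // the number of instructions encoded in sec. Each encoded instruction takes
    // at least asm.InstructionSize (8) bytes; double-width lddw loads take 16
    // bytes but decode to a single asm.Instruction, hence the possible slight
    // over-allocation.
    func preallocInstructions(sec *elf.Section) asm.Instructions {
        return make(asm.Instructions, 0, sec.Size/uint64(asm.InstructionSize))
    }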

The second one implements a fast path for splitSymbols for the case where the instructions contain only a single symbol, which is the dominant use case as far as I understand (happy to be challenged on that).
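
For illustration only, such a fast path could look roughly like the sketch below (splitSymbolsFast is a made-up name, and this mirrors the initial approach rather than what was finally merged): if every named instruction carries the same symbol, the whole stream belongs to a single program and can be returned under that one name. Note that it hands back the original slice rather than a copy, which is exactly the concern raised in the review below.

    package sketch

    import "github.com/cilium/ebpf/asm"

    // splitSymbolsFast is a hypothetical fast path: if all named instructions
    // in insns share one symbol, return the whole stream under that name and
    // report success; otherwise report false so the caller can fall back to the
    // general splitting loop. The returned map aliases insns (no copy).
    func splitSymbolsFast(insns asm.Instructions) (map[string]asm.Instructions, bool) {
        if len(insns) == 0 {
            return nil, false
        }
        first := insns[0].Symbol()
        if first == "" {
            return nil, false
        }
        for _, ins := range insns[1:] {
            if sym := ins.Symbol(); sym != "" && sym != first {
                return nil, false
            }
        }
        return map[string]asm.Instructions{first: insns}, true
    }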

I implemented a small benchmark covering instructions at both ends of the spectrum (a single symbol, and two symbols at opposite ends of the instruction stream).

Please let me know if you would like separate PRs for each of those.

Thanks

@paulcacheux force-pushed the paulcacheux/elf-read-speedup branch from e463916 to 4acb87b on November 7, 2023 10:02
@paulcacheux changed the title from "small speedups when during elf loading" to "small speedups during elf loading" on November 7, 2023
@paulcacheux force-pushed the paulcacheux/elf-read-speedup branch from 4acb87b to 0e174cf on November 7, 2023 10:25
@paulcacheux marked this pull request as ready for review on November 7, 2023 10:29
linker.go (outdated):

    if onlyOneSymbol {
        return map[string]asm.Instructions{
            firstSymbol: insns,
        }, nil
    }
lmb (Collaborator) commented:

Currently splitSymbols always returns a copy of insns. The new code will sometimes return a copy, sometimes not, which is a great bug magnet IMO.

Have you tried instead to optimize the loop below? I can see two big inefficiencies:

  • append will do extra copies
  • map lookup and assigns are also expensive

Getting rid of both should make this a lot better without changing the allocation behaviour. You can probably track the last name and its start index, and do an exact allocation plus copy whenever the name changes.
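
A hedged sketch of that suggestion (not necessarily the code that was eventually merged; splitSymbolsSketch is a made-up name): walk the stream once, remember where the current symbol's run starts, and emit an exactly sized copy into the map only when the symbol changes, so there is no append growth and no per-instruction map write.

    package sketch

    import (
        "errors"
        "fmt"

        "github.com/cilium/ebpf/asm"
    )

    // splitSymbolsSketch illustrates the suggested single pass with exact
    // allocations; error handling and duplicate detection are simplified
    // compared to the real splitSymbols.
    func splitSymbolsSketch(insns asm.Instructions) (map[string]asm.Instructions, error) {
        if len(insns) == 0 {
            return nil, errors.New("insns is empty")
        }
        name := insns[0].Symbol()
        if name == "" {
            return nil, errors.New("insns must start with a symbol")
        }

        progs := make(map[string]asm.Instructions)
        start := 0
        flush := func(end int) error {
            if _, ok := progs[name]; ok {
                return fmt.Errorf("duplicate symbol %s", name)
            }
            // Exactly sized allocation and a single copy per symbol run.
            prog := make(asm.Instructions, end-start)
            copy(prog, insns[start:end])
            progs[name] = prog
            return nil
        }

        for i := 1; i < len(insns); i++ {
            sym := insns[i].Symbol()
            if sym == "" || sym == name {
                continue
            }
            if err := flush(i); err != nil {
                return nil, err
            }
            name, start = sym, i
        }
        if err := flush(len(insns)); err != nil {
            return nil, err
        }
        return progs, nil
    }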

paulcacheux (Contributor, Author) commented:

That's a good point. I checked that we are not modifying the result of this function, but that could change in the future.

I think the main point is that we split into a map and then iterate over the map to create another slice (in the only place this function is called). Would you be open to removing the map completely and replacing it with an iterator, so that we skip the intermediate allocation entirely?

lmb (Collaborator) commented:

I don't follow, sorry. You want to inline splitSymbols into loadFunctions? loadFunctions into loadProgramSections?

paulcacheux (Contributor, Author) commented:

Basically, splitSymbols returns a map, but we never really take advantage of the fact that it's a map (except in tests); we directly iterate over it and build something else from it. What I was suggesting is that splitSymbols return a lazy iterator type that allows iteration without creating a big map of all the symbols and instructions.

I don't think it matters that much; in the latest version of the commit I improved the algorithm a lot in both the fast and slow paths.
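
For illustration only, the iterator proposal (which was not adopted) could look roughly like a callback-driven walk over the stream; forEachSymbol is a made-up name, and the sub-slices it hands to the callback alias the original instructions rather than copies, which is the same aliasing trade-off discussed above.

    package sketch

    import (
        "errors"

        "github.com/cilium/ebpf/asm"
    )

    // forEachSymbol is a hypothetical iterator-style alternative to returning a
    // map: it calls fn once per symbol run, in order, without building any
    // intermediate map. The sub-slices passed to fn alias insns.
    func forEachSymbol(insns asm.Instructions, fn func(name string, prog asm.Instructions) error) error {
        if len(insns) == 0 {
            return errors.New("insns is empty")
        }
        name := insns[0].Symbol()
        if name == "" {
            return errors.New("insns must start with a symbol")
        }
        start := 0
        for i := 1; i < len(insns); i++ {
            sym := insns[i].Symbol()
            if sym == "" || sym == name {
                continue
            }
            if err := fn(name, insns[start:i]); err != nil {
                return err
            }
            name, start = sym, i
        }
        return fn(name, insns[start:])
    }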

@paulcacheux force-pushed the paulcacheux/elf-read-speedup branch from 06cbbf4 to dd7b16d on November 7, 2023 16:42
@paulcacheux force-pushed the paulcacheux/elf-read-speedup branch from dd7b16d to b815e6a on November 8, 2023 12:26
lmb and others added 5 commits November 8, 2023 16:23
Reuse the existing manyprogs.c test to exercise the ELF loader.

Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
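
For context, the benchmark driving the numbers below might look roughly like this; the benchmark name is taken from the benchstat output, but the testdata path and exact setup are assumptions rather than the repository's actual test code.

    package sketch

    import (
        "testing"

        "github.com/cilium/ebpf"
    )

    // BenchmarkLoadCollectionManyProgs measures a full ELF load of the manyprogs
    // fixture. The testdata path is assumed for illustration.
    func BenchmarkLoadCollectionManyProgs(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            if _, err := ebpf.LoadCollectionSpec("testdata/manyprogs.elf"); err != nil {
                b.Fatal(err)
            }
        }
    }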
The ELF section size provides a good approximation of the number of instructions we will unmarshal.

    goos: linux
    goarch: amd64
    pkg: github.com/cilium/ebpf
    cpu: 12th Gen Intel(R) Core(TM) i7-1260P
                            │  base.txt   │              new.txt               │
                            │   sec/op    │   sec/op     vs base               │
    LoadCollectionManyProgs   1.301m ± 2%   1.263m ± 0%  -2.89% (p=0.000 n=10)

                            │   base.txt   │               new.txt               │
                            │     B/op     │     B/op      vs base               │
    LoadCollectionManyProgs   826.5Ki ± 0%   767.2Ki ± 0%  -7.18% (p=0.000 n=10)

                            │  base.txt   │              new.txt               │
                            │  allocs/op  │  allocs/op   vs base               │
    LoadCollectionManyProgs   6.932k ± 0%   6.752k ± 0%  -2.60% (p=0.000 n=10)

Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>
Make the case of a single program per section faster by removing
map accesses and repeated copying via append.

    goos: linux
    goarch: amd64
    pkg: github.com/cilium/ebpf
    cpu: 12th Gen Intel(R) Core(TM) i7-1260P
                            │  base.txt   │              new.txt               │
                            │   sec/op    │   sec/op     vs base               │
    LoadCollectionManyProgs   1.263m ± 0%   1.173m ± 0%  -7.13% (p=0.000 n=10)

                            │   base.txt   │               new.txt               │
                            │     B/op     │     B/op      vs base               │
    LoadCollectionManyProgs   767.2Ki ± 0%   707.9Ki ± 0%  -7.73% (p=0.000 n=10)

                            │  base.txt   │              new.txt               │
                            │  allocs/op  │  allocs/op   vs base               │
    LoadCollectionManyProgs   6.752k ± 0%   6.572k ± 0%  -2.67% (p=0.000 n=10)

Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>
Co-developed-by: Lorenz Bauer <lmb@isovalent.com>
    goos: linux
    goarch: amd64
    pkg: github.com/cilium/ebpf
    cpu: 12th Gen Intel(R) Core(TM) i7-1260P
                            │  base.txt   │            new.txt            │
                            │   sec/op    │   sec/op     vs base          │
    LoadCollectionManyProgs   1.173m ± 0%   1.166m ± 2%  ~ (p=0.481 n=10)

                            │   base.txt   │               new.txt               │
                            │     B/op     │     B/op      vs base               │
    LoadCollectionManyProgs   707.9Ki ± 0%   699.0Ki ± 0%  -1.25% (p=0.000 n=10)

                            │  base.txt   │              new.txt               │
                            │  allocs/op  │  allocs/op   vs base               │
    LoadCollectionManyProgs   6.572k ± 0%   6.452k ± 0%  -1.83% (p=0.000 n=10)

Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>
    goos: linux
    goarch: amd64
    pkg: github.com/cilium/ebpf
    cpu: 12th Gen Intel(R) Core(TM) i7-1260P
                            │  base.txt   │              new.txt               │
                            │   sec/op    │   sec/op     vs base               │
    LoadCollectionManyProgs   1.166m ± 2%   1.145m ± 1%  -1.84% (p=0.000 n=10)

                            │   base.txt   │               new.txt               │
                            │     B/op     │     B/op      vs base               │
    LoadCollectionManyProgs   699.0Ki ± 0%   693.3Ki ± 0%  -0.81% (p=0.000 n=10)

                            │  base.txt   │              new.txt               │
                            │  allocs/op  │  allocs/op   vs base               │
    LoadCollectionManyProgs   6.452k ± 0%   6.092k ± 0%  -5.58% (p=0.000 n=10)

Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>
@lmb force-pushed the paulcacheux/elf-read-speedup branch from b815e6a to d511049 on November 8, 2023 17:17
@lmb (Collaborator) left a comment:

Ran the new benchmark and added the results to the commit messages.

@lmb merged commit 1027911 into cilium:main on Nov 8, 2023
@lmb (Collaborator) commented Nov 8, 2023

Thanks!
