Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String concatenation is slow in later versions of fmt #3133

Closed
iwubcode opened this issue Oct 11, 2022 · 9 comments
Closed

String concatenation is slow in later versions of fmt #3133

iwubcode opened this issue Oct 11, 2022 · 9 comments

Comments

@iwubcode
Copy link

iwubcode commented Oct 11, 2022

Running the concat-benchmark both on godbolt and on a local machine, the fmt string concatenation tests are slower than the naive case (tested with 9.1.0 and master). With godbolt I had trouble compiling with anything but trunk, however, on a local machine I tested with gcc 11.2.0 and 12.1.0. On godbolt the runtime test is very slow whereas on my machine the runtime test is slower but at least close to the naive implementation.

I found the results surprising. At least at one time I know fmt was close to on par with the append test.

@vitaut
Copy link
Contributor

vitaut commented Oct 11, 2022

The reason why godbolt results are slow is because the {fmt} library deployed there is compiled with optimizations disabled. Here are the results on macOS with optimizations enabled:

Running ./concat-benchmark
Run on (8 X 2300 MHz CPU s)
CPU Caches:
  L1 Data 49K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 524K (x4)
  L3 Unified 8388K (x1)
Load Average: 4.32, 3.79, 3.51
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
naive                    136 ns          136 ns      5136333
append                   101 ns          100 ns      6375750
appendWithReserve        102 ns          102 ns      6240472
format_compile           185 ns          184 ns      3614284
format_runtime           136 ns          136 ns      4835289
format_to                104 ns          102 ns      6588111
nullop                 0.262 ns        0.261 ns   1000000000

Runtime format is comparable with manual concatenation which is pretty reasonable considering that it has to do much more work. Format string compilation is slower in this particular case, there are probably some simple optimization opportunities.

@vitaut
Copy link
Contributor

vitaut commented Oct 11, 2022

there are probably some simple optimization opportunities.

Specifically we should inline copying in https://godbolt.org/z/bxabqfv3G:

        call    char* fmt::v9::detail::copy_str_noinline<char, char const*, char*>(char const*, char const*, char*) [clone .isra.0]

It was made noinline to reduce code bloat elsewhere but this doesn't make sense for compilation.

@iwubcode
Copy link
Author

Thanks for looking into it @vitaut .

Maybe a dumb question but is there a reason we don't know the size of the string (at least in specific cases, like when all parameters to format are strings) where we could do the same reserve logic that the benchmark is doing to get similar numbers for format_runtime ?

@vitaut
Copy link
Contributor

vitaut commented Oct 12, 2022

In principle it should be possible to do reserve once in cases like this but it's not implemented yet.

@vitaut
Copy link
Contributor

vitaut commented Jun 15, 2023

Format string compilation has been optimized for this and similar cases in 1daae55.

2023-06-15 10:36:12
Running ./concat-benchmark
Run on (8 X 2300 MHz CPU s)
CPU Caches:
  L1 Data 49K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 524K (x4)
  L3 Unified 8388K (x1)
Load Average: 1.98, 2.56, 2.74
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
naive                    140 ns          138 ns      4698964
append                   100 ns         98.8 ns      7192175
appendWithReserve        103 ns          101 ns      6914605
format_compile           114 ns          112 ns      6342820
format_runtime           136 ns          134 ns      5160643
format_to                105 ns          103 ns      6837206
nullop                 0.288 ns        0.283 ns   1000000000

@vitaut vitaut closed this as completed Jun 15, 2023
@iwubcode
Copy link
Author

Thank you for working on this @vitaut ! Curious to see append and appendWithReserve, are they all within a margin of error from the fmt results do you think? Regardless, it's great to see these numbers now better than naive. Really excited to play with this!

@vitaut
Copy link
Contributor

vitaut commented Jun 16, 2023

The remaining difference (which is minor in this case) may come from the fact that we still don't reserve in advance. It is possible to do but the RoI is low.

@iwubcode
Copy link
Author

iwubcode commented Jun 16, 2023

Oh I see. Yeah, it'd be roughly ~10% I guess. Worth opening an issue for it, maybe someone someday will do it?

@vitaut
Copy link
Contributor

vitaut commented Jun 17, 2023

We don't such things as issues but a PR would be welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants