why does fmt::format_to_n perform so much worse than fmt::format_to #3484

Closed
mentalmap opened this issue Jun 12, 2023 · 4 comments
@mentalmap

fmt Version:

10.0.0

Benchmark:

#include <benchmark/benchmark.h>
#include <fmt/chrono.h>
#include <fmt/compile.h>
#include <fmt/core.h>
#include <fmt/format.h>

static void BM_format_to(benchmark::State &state) {
  char out[1024] = {0};
  auto format = FMT_COMPILE("{} - {} - {} - {} - {} - {} - {} - {} - {} - {}");

  for (auto _ : state) {
    fmt::format_to(out, format, "abcdef", 12345, "abcdef", 12345, "abcdef", 12345, "abcdef", 12345,
                   "abcdef", 12345);
  }
}
BENCHMARK(BM_format_to);

static void BM_format_to_n(benchmark::State &state) {
  char out[1024] = {0};
  auto format = FMT_COMPILE("{} - {} - {} - {} - {} - {} - {} - {} - {} - {}");

  for (auto _ : state) {
    fmt::format_to_n(out, sizeof(out), format, "abcdef", 12345, "abcdef", 12345, "abcdef", 12345,
                     "abcdef", 12345, "abcdef", 12345);
  }
}
BENCHMARK(BM_format_to_n);

BENCHMARK_MAIN();

Result:

Running ./benchmark_format_to.fmt10
Run on (16 X 2595.12 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 4096 KiB (x8)
  L3 Unified 16384 KiB (x2)
Load Average: 1.17, 1.10, 1.59
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
BM_format_to         78.3 ns         78.3 ns      8849844
BM_format_to_n        564 ns          564 ns      1242982
@vitaut
Contributor

vitaut commented Jun 12, 2023

format_to_n hasn't been optimized for format string compilation yet. The default (non-compiled) API is actually much faster in this case:

Run on (8 X 2300 MHz CPU s)
CPU Caches:
  L1 Data 49K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 524K (x4)
  L3 Unified 8388K (x1)
Load Average: 3.99, 3.63, 3.01
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
BM_format_to         97.2 ns         96.9 ns      6949063
BM_format_to_n        227 ns          227 ns      2990584

An easy way to optimize format_to_n would be by applying the same buffering to the compiled API:

fmt/include/fmt/core.h

Lines 2794 to 2802 in de0757b

template <typename OutputIt, typename... T,
          FMT_ENABLE_IF(detail::is_output_iterator<OutputIt, char>::value)>
auto vformat_to_n(OutputIt out, size_t n, string_view fmt, format_args args)
    -> format_to_n_result<OutputIt> {
  using traits = detail::fixed_buffer_traits;
  auto buf = detail::iterator_buffer<OutputIt, char, traits>(out, n);
  detail::vformat_to(buf, fmt, args, {});
  return {buf.out(), buf.count()};
}

A PR would be welcome.

@vitaut
Contributor

vitaut commented Jul 20, 2023

Applied the optimization in 436c131, which gave a ~2x speedup on the given benchmark (tested on macOS on an M1 with clang). Compared to your original timing, the improvement is even larger, possibly due to other changes.

Before:

Run on (8 X 24.1212 MHz CPU s)
CPU Caches:
  L1 Data 65K (x8)
  L1 Instruction 131K (x8)
  L2 Unified 4194K (x4)
Load Average: 1.70, 2.40, 2.43
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
BM_format_to         75.5 ns         75.5 ns      7828927
BM_format_to_n        317 ns          317 ns      2210356

After:

Run on (8 X 24.1211 MHz CPU s)
CPU Caches:
  L1 Data 65K (x8)
  L1 Instruction 131K (x8)
  L2 Unified 4194K (x4)
Load Average: 6.92, 4.27, 3.17
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
BM_format_to         75.5 ns         75.5 ns      8163741
BM_format_to_n        165 ns          165 ns      4229658

@vitaut vitaut closed this as completed Jul 20, 2023
@mentalmap
Author

fmt 10.0.0:

2023-07-27T14:55:22+08:00
Running ./benchmark.fmt10
Run on (16 X 2595.12 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 4096 KiB (x8)
  L3 Unified 16384 KiB (x2)
Load Average: 0.57, 0.69, 0.30
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
BM_format_to         78.9 ns         78.9 ns      8881746
BM_format_to_n        568 ns          568 ns      1232089

fmt master:

2023-07-27T14:55:28+08:00
Running ./benchmark.fmt_master
Run on (16 X 2595.12 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 4096 KiB (x8)
  L3 Unified 16384 KiB (x2)
Load Average: 0.52, 0.67, 0.30
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
BM_format_to         54.9 ns         54.9 ns     12727944
BM_format_to_n        133 ns          133 ns      5257795

👍

@vitaut
Contributor

vitaut commented Jul 28, 2023

Thanks for testing.
