Performance regression since 10.0.0 #3674

Closed
breyerml opened this issue Oct 9, 2023 · 7 comments

breyerml commented Oct 9, 2023

We saw a performance regression of a factor of 2-3 when switching from fmt version 8.1.0 to 10.0.0 (also happens with 10.1.1).

I extracted an MWE from our code.

CMakeLists.txt:

```cmake
cmake_minimum_required (VERSION 3.20)
project (Test)

set(CMAKE_CXX_STANDARD 17)

add_executable(prog main.cpp)

include(FetchContent)
# configure {fmt} options before making the content available
set(FMT_TEST OFF CACHE BOOL "" FORCE)
set(FMT_DOC OFF CACHE BOOL "" FORCE)
set(FMT_INSTALL OFF CACHE BOOL "" FORCE)
set(FMT_SYSTEM_HEADERS ON CACHE BOOL "" FORCE)
FetchContent_Declare(fmt
        GIT_REPOSITORY https://github.com/fmtlib/fmt.git
        GIT_TAG 8.1.0
        GIT_SHALLOW TRUE)
FetchContent_MakeAvailable(fmt)
set_property(TARGET fmt PROPERTY POSITION_INDEPENDENT_CODE ON)
add_dependencies(prog fmt)
target_include_directories(prog PUBLIC ${fmt_SOURCE_DIR}/include)
target_link_libraries(prog PUBLIC fmt::fmt)

find_package(OpenMP REQUIRED)
target_compile_options(prog PUBLIC -fopenmp)
target_link_libraries(prog PUBLIC -fopenmp)
```

main.cpp:

```cpp
#include "fmt/compile.h"
#include "fmt/format.h"
#include "fmt/chrono.h"
#include "fmt/os.h"

#include <omp.h>

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>
#include <memory>
#include <random>
#include <string>
#include <vector>

using real_type = double;

template <typename T>
using matrix = std::vector<std::vector<T>>;

template <typename label_type>
inline void write(const std::string &filename, const std::vector<label_type> &labels, const matrix<real_type> &support_vectors) {
    
    const std::size_t num_features = support_vectors.front().size();
    const std::size_t num_classes = 2;
    const std::size_t num_alpha_per_point = num_classes;
    const std::vector<label_type> label_order = { 0, 1 };

    // create file
    fmt::ostream out = fmt::output_file(filename);
    // write timestamp for current date time
    // note: commented out since the resulting model file cannot be read by LIBSVM
    //    out.print("# This model file has been created at {}\n", detail::current_date_time());

    // the maximum size of one formatted LIBSVM entry, e.g., 1234:1.365363e+10
    // biggest number representable as std::size_t: 18446744073709551615 -> 20 chars
    // scientific notation: 3 chars (number in front of decimal separator including a sign + decimal separator) + 10 chars (part after the decimal separator, specified during formatting) +
    //                      5 chars exponent (e + sign + maximum potential exponent (308 -> 3 digits))
    // separators: 2 chars (: between index and feature + whitespace after feature value)
    // -> 40 chars in total
    // -> increased to 48 chars to be on the safe side
    static constexpr std::size_t CHARS_PER_BLOCK = 48;
    // results in 48 B * 128 = 6 KiB of stack buffer per thread
    static constexpr std::size_t BLOCK_SIZE = 128;
    // use 1 MiB as buffer per thread
    constexpr std::size_t STRING_BUFFER_SIZE = 1024 * 1024;

    // format one output-line
    auto format_libsvm_line = [](std::string &output, const std::vector<real_type> &a, const matrix<real_type> &d, const std::size_t point) {
        static constexpr std::size_t STACK_BUFFER_SIZE = BLOCK_SIZE * CHARS_PER_BLOCK;
        static char buffer[STACK_BUFFER_SIZE];
        #pragma omp threadprivate(buffer)

        output.append(fmt::format("{:.10e} ", fmt::join(a, " ")));
        for (typename std::vector<real_type>::size_type j = 0; j < d.front().size(); j += BLOCK_SIZE) {
            char *ptr = buffer;
            for (std::size_t i = 0; i < std::min<std::size_t>(BLOCK_SIZE, d.front().size() - j); ++i) {
                if (d[point][j + i] != real_type{ 0.0 }) {
                    // add 1 to the index since LIBSVM assumes 1-based feature indexing
                    ptr = fmt::format_to(ptr, FMT_COMPILE("{}:{:.10e} "), j + i + 1, d[point][j + i]);
                }
            }
            output.append(buffer, ptr - buffer);
        }
        output.push_back('\n');
    };

    // initialize volatile array
    auto counts = std::make_unique<volatile int[]>(label_order.size() + 1);
    counts[0] = std::numeric_limits<int>::max();
    #pragma omp parallel default(none) shared(counts, format_libsvm_line, label_order, labels, support_vectors, out) firstprivate(BLOCK_SIZE, CHARS_PER_BLOCK, num_features, num_classes, num_alpha_per_point)
    {
        // preallocate string buffer, only ONE allocation
        std::string out_string;
        out_string.reserve(STRING_BUFFER_SIZE + (num_features + num_alpha_per_point) * CHARS_PER_BLOCK);  // oversize the buffer so that at least one additional line fits into it
        std::vector<real_type> alpha_per_point(num_alpha_per_point);

        // loop over all classes, since they must be sorted
        for (typename std::vector<label_type>::size_type l = 0; l < label_order.size(); ++l) {
            // the support vectors with the l-th class
            #pragma omp for nowait
            for (typename std::vector<real_type>::size_type i = 0; i < support_vectors.size(); ++i) {
                if (labels[i] == label_order[l]) {
                    format_libsvm_line(out_string, alpha_per_point, support_vectors, i);

                    // if the buffer is full, write it to the file
                    if (out_string.size() > STRING_BUFFER_SIZE) {
                        // wait for all threads to write support vectors for previous class
#ifdef _OPENMP
                        while (counts[l] < omp_get_num_threads()) {
                        }
#endif
                        #pragma omp critical
                        {
                            out.print("{}", out_string);
                        }
                        // clear buffer
                        out_string.clear();
                    }
                }
            }
            // wait for all threads to write support vectors for previous class
#ifdef _OPENMP
            while (counts[l] < omp_get_num_threads()) {
            }
#endif

            #pragma omp critical
            {
                if (!out_string.empty()) {
                    out.print("{}", out_string);
                    out_string.clear();
                }
                counts[l + 1] = counts[l + 1] + 1;
            }
        }
    }
}



int main() {

  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_real_distribution<double> dist(-1.0, 1.0);

  std::vector<int> labels(73257);
  matrix<real_type> support_vectors(73257, std::vector<real_type>(3072));

  for (std::size_t i = 0; i < labels.size(); ++i) {
    labels[i] = i % 2;
    for (std::size_t j = 0; j < support_vectors.front().size(); ++j) {
      support_vectors[i][j] = dist(mt);
    }
  }


  auto start = std::chrono::steady_clock::now();
  write("test.libsvm", labels, support_vectors);
  auto end = std::chrono::steady_clock::now();
  fmt::print("{}\n", std::chrono::duration_cast<std::chrono::milliseconds>(end - start));

  return 0;
}
```

I compiled the above code with g++ 11.4.0 using `cmake -DCMAKE_BUILD_TYPE=Release .. && make`.
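For reference, the full build-and-run sequence from a fresh checkout of the two files above is roughly:

```sh
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
./prog   # writes test.libsvm and prints the elapsed time in ms
```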

On our machine (AMD EPYC 7763 64-Core Processor) I get the following runtimes for the MWE writing to an SSD:
8.1.0: 4067ms
9.1.0: 4093ms
10.0.0: 12954ms
10.1.1: 13009ms

Am I missing something, or shouldn't there be such a performance loss simply from using a newer {fmt} version?

vitaut commented Oct 10, 2023

Could be related to FP changes. Have you tried bisecting?

breyerml (Author) commented:

No, I haven't. I didn't even know this git command existed, but it looks extremely nice.
I will try bisecting as soon as I have time.
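For anyone following along, the bisect workflow here boils down to marking the known-fast and known-slow release tags and timing the MWE at each step:

```sh
git bisect start
git bisect bad 10.0.0     # known slow
git bisect good 8.1.0     # known fast
# at each checkout: rebuild the MWE, time it, then mark the commit with
git bisect good           # or `git bisect bad`, until the first bad commit is reported
git bisect reset
```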

breyerml (Author) commented:

```text
8.1.0                                      3.385s
10.0.0                                    11.434s
9beddd08f9708ced1f154d399c40f1be3f23eef1   3.981s
a585571e90e7f5645bcf6548cef42078db940f49  11.647s
61844b6b67166690adeb297c585062aa5bebf390   3.585s
fd0d0ec8df94be190a555aa4b0c77a144595b396  11.298s
0ccaed3a6c745ed2a625b8d6531111ba668e4fea  13.481s
3178bb9a265e36c5fca244f1e36a7d0628d7bbde  11.302s
6b8144a5acde60d97920d99399ccd0fba2675c80  10.956s
192859c2b5dda3a015a7b1bf023f0c2475f0350c  12.904s
e2f6d7665b134c20bc704f0c3d1d0d9305b4f564   3.880s

-> first bad commit
192859c2b5dda3a015a7b1bf023f0c2475f0350c  13.265s
```

So according to git bisect, the offending commit is 192859c.
But at first glance, I see nothing that could warrant such a performance loss.

sunmy2019 commented Oct 15, 2023

I simplified it. The regression mostly comes from this:

```cpp
std::string buf(100000, '1');
fmt::ostream out = fmt::output_file("123");
out.print("{}", buf);
```

On my computer:

e2f6d76 : 301µs
192859c : 5387µs
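A self-contained harness for this micro-benchmark (the file name and measurement points are illustrative, not necessarily the exact setup behind the numbers above) might look like:

```cpp
#include <chrono>
#include <string>

#include "fmt/chrono.h"
#include "fmt/os.h"

int main() {
  std::string buf(100000, '1');

  const auto start = std::chrono::steady_clock::now();
  {
    fmt::ostream out = fmt::output_file("123");
    out.print("{}", buf);
  }  // fmt::ostream flushes and closes the file on destruction
  const auto end = std::chrono::steady_clock::now();

  fmt::print("{}\n",
             std::chrono::duration_cast<std::chrono::microseconds>(end - start));
  return 0;
}
```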

Also on Godbolt:
https://godbolt.org/z/YbfnYcdbK

vitaut commented Oct 15, 2023

Fixed in d9063ba. Thanks @breyerml for reporting and bisecting, and @sunmy2019 for reducing the repro!

vitaut closed this as completed Oct 15, 2023
sunmy2019 (Contributor) commented:

Oh, I had just fixed it myself.

I proudly opened this page to report it here, then saw it was already fixed, and sadly went away.

vitaut commented Oct 15, 2023

A race condition =)
