Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

charconv improvements #77

Merged
merged 29 commits into from
Mar 16, 2022
Merged

charconv improvements #77

merged 29 commits into from
Mar 16, 2022

Conversation

biojppm
Copy link
Owner

@biojppm biojppm commented Mar 3, 2022

No description provided.

@biojppm biojppm marked this pull request as draft March 3, 2022 02:38
@biojppm biojppm force-pushed the charconv branch 2 times, most recently from 6f9b843 to ff61ad9 Compare March 3, 2022 02:49
@biojppm
Copy link
Owner Author

biojppm commented Mar 4, 2022

[NOTE: always open the benchmark links in a new tab, or github will not go to the linked line, HTH].

@fargies tagging you here because you already know a bit about this part of the code and suggested some optimizations. I removed the branching when appending characters in write_dec() et al, by requiring the output buffer (before entering the loop) to have a size big enough for the largest value of the type. This resulted in a 10-20% improvement overall.

But while looking closer at the benchmark results, I noticed that atoi() and atou() are much slower than read_dec() et al. And not just a bit slower -- a lot slower.

For example int32_t MSVC has read_dec() 10% faster than std::from_chars() (nice for the simplicity, yay!) but then c4::atoi() is 20x slower. The same for uint32_t MSVC: 20x slower.

In GCC the relationship is much better but still basically the same: uint32_t GCC has atou() 3-4x slower than read_dec() and int32_t GCC is 3-4x slower.

I haven't profiled the code yet as my time is extremely limited, but looking at the code for atoi() or for atou() I'm not seeing anything egregious to cause this sort of performance degradation from the main worker functions read_dec() et all. Also, the code is as simple as it can be given what it has to do.

Since you had some ideas in #75 , I'm asking here if you see anything that may be contributing to this cost.

@fargies
Copy link

fargies commented Mar 4, 2022

I also saw that whilst working on overflow, and tried to mitigate by using switch/case having a single return in the functions (helps the compiler in pipelining stuff), but enhancements were not that big.

I've ran the code through cachegrind and callgrind, but due to a lot of inlining it's quite hard to identify where time is spent (also I was focused on read_x implementation).

The xtoa_digits_ code could eventually be used in overflow detection.

How did the check on buffer length made a 10/20% enhancement ? is the benchmark being fed with incompatible numbers (too large for buffer) ? meaning when you benchmark you do only with nominal cases, not erroneous ones right ?

I'll have a closer look, thanks for involving me 👍

src/c4/charconv.hpp Outdated Show resolved Hide resolved
@biojppm
Copy link
Owner Author

biojppm commented Mar 4, 2022

Also, before this changeset, the benchmarks had some inconsistencies and were testing different number sets; the figures I linked above are now consistent across benchmarks.

@fargies
Copy link

fargies commented Mar 4, 2022

Found it 🎉

Simply because atou is not inlined, the jumps are expensive.

Before modification:

atox_c4_read_dec<uint32_t>       8.19 ns         8.18 ns     85412143 bytes_per_second=466.207M/s items_per_second=122.213M/s
atox_c4_atou<uint32_t>           15.6 ns         15.5 ns     45232819 bytes_per_second=245.397M/s items_per_second=64.3293M/s

After modification (see patch, inlining revealed some issues in overflow detection, value is left untouched in error cases):

atox_c4_read_dec<uint32_t>       8.11 ns         8.10 ns     85329788 bytes_per_second=470.769M/s items_per_second=123.409M/s
atox_c4_atou<uint32_t>           8.11 ns         8.11 ns     86150068 bytes_per_second=470.514M/s items_per_second=123.342M/s
diff --git a/src/c4/charconv.hpp b/src/c4/charconv.hpp
index 8af02a3..297abc3 100644
--- a/src/c4/charconv.hpp
+++ b/src/c4/charconv.hpp
@@ -880,6 +880,7 @@ inline size_t atoi_first(csubstr str, T * C4_RESTRICT v)
  *
  * @see atou_first() if the string is not trimmed to the value to read. */
 template<class T>
+C4_ALWAYS_INLINE
 bool atou(csubstr str, T * C4_RESTRICT v)
 {
     C4_STATIC_ASSERT(std::is_integral<T>::value);
diff --git a/test/test_charconv.cpp b/test/test_charconv.cpp
index 423ca6b..c98d713 100644
--- a/test/test_charconv.cpp
+++ b/test/test_charconv.cpp
@@ -1216,7 +1216,7 @@ void test_no_overflows(std::initializer_list<const char *> args)
 template<class T>
 void test_overflows_hex()
 {
-    T x;
+    T x = 0;
     std::string str;
     if (std::is_unsigned<T>::value)
     {
@@ -1259,7 +1259,7 @@ void test_overflows_hex()
 template<class T>
 void test_overflows_bin()
 {
-    T x;
+    T x = 0;
     std::string str;
     if (std::is_unsigned<T>::value)
     {
@@ -1383,7 +1383,7 @@ TEST_CASE("overflows.u64")
     
     { /* with leading zeroes */
         std::string str;
-        uint64_t x;
+        uint64_t x = 1;
         str = "0o01" + std::string(21, '7');
         CHECK_MESSAGE(!overflows<uint64_t>(to_csubstr(str)), "num=" << str);
         str = "0o02" + std::string(21, '0');

@biojppm
Copy link
Owner Author

biojppm commented Mar 4, 2022

Amazing! Thanks! I'm going to try that.

@biojppm
Copy link
Owner Author

biojppm commented Mar 4, 2022

I have tried it and now everything is consistent: read_dec(), atoi(), atox() and from_chars() all have the same throughput. 🎉

And the results are really nice: eg, for u32/i32 I am getting a speedup of 3x over std::from_chars(), 163x over scanf() and amazingly (but predictably) 650x over sstream (see the commit above for the data).

🥳

EDIT @fargies turns out these results were inflated because the benchmarks were optimized out. See below for up-to-date results.

@fargies
Copy link

fargies commented Mar 4, 2022

How would you explain that regarding sstream ?

@biojppm
Copy link
Owner Author

biojppm commented Mar 4, 2022

How would you explain that regarding sstream ?

I guess it's because of how much stuff is in there... every single character results in a virtual call, and it uses the locale which must be acquired from the OS, resulting in a mutex. scanf only pays for the locale, and not for abstracting buffer characters as streams do.

@biojppm
Copy link
Owner Author

biojppm commented Mar 4, 2022

@fargies
Copy link

fargies commented Mar 4, 2022

Really interesting 👀 , thanks

@biojppm biojppm force-pushed the charconv branch 3 times, most recently from d2fd9f5 to cb78b54 Compare March 4, 2022 23:40
test/test_charconv.cpp Outdated Show resolved Hide resolved
@biojppm
Copy link
Owner Author

biojppm commented Mar 5, 2022

@fargies While looking closer at the results, all the c4 atox functions were suspiciously consistent (even with overflow check!). It turns out that the benchmark was optimized out of existence for those functions. I refactored the benchmark to create some side-effects. The results are still very good, but no longer out-of-this-world as they seemed above; c4::from_chars<int32_t>() is ...

  • 40% better than std::from_chars()
  • 10x better than scanf()
  • 40x better than sstream.

Eg, see this CI benchmark run (but open the link in another browser tab).

image

@fargies
Copy link

fargies commented Mar 5, 2022

@fargies While looking closer at the results, all the c4 atox functions were suspiciously consistent (even with overflow check!). It turns out that the benchmark was optimized out of existence for those functions. I refactored the benchmark to create some side-effects. The results are still very good, but no longer out-of-this-world as they seemed above; c4::from_chars<int32_t>() is ...

* 40% better than `std::from_chars()`

* 10x better than `scanf()`

* 40x better than `sstream`.

That's still really decent results

@biojppm biojppm marked this pull request as ready for review March 5, 2022 19:34
@biojppm biojppm changed the title [wip] charconv improvements charconv improvements Mar 5, 2022
@biojppm
Copy link
Owner Author

biojppm commented Mar 7, 2022

@fargies I'm revisiting the write_dec() et al functions. One thing that occurred to me (and maybe should have dawned on me sooner) is this: using prior knowledge of the number of digits saves the need to reverse the string. That is, we can just write backwards from the end of the string (if we know where the end is), and we're done.

This is very likely to be a significant speedup, and therefore a very compelling reason to compute the digits as you were suggesting. Plus, it comes with the added advantage that the buffer size is no longer overestimated as you were objecting to before.

But of course I'm going to try it first (with intrinsics and everything) and see how faster or slower it is.

So I'm writing code to compute the number of digits, and I came back to the notes you made above:

Or there may be compromise solutions (maybe first comparing max then using a more accurate computation):
* use msb for binary format (by the way maybe using compilers builtins in memory_util.hpp could help)
* use (msb / 8 + 1) *2 for hex (mainly bit shifts)
* use (msb / 3 + 1) for octal
* would have to scratch my head for dec XD

I've tagged you again because I'm trying to figure out the octal expression you came to. Can you explain the reasoning?

@biojppm
Copy link
Owner Author

biojppm commented Mar 7, 2022

BTW, I also had to scratch my head for decimal; this is what I came up with, and seems to be working in preliminary testing:

template<class T>
C4_CONSTEXPR14 unsigned digits_dec(T v) noexcept
{
    C4_STATIC_ASSERT(std::is_integral<T>::value);
    C4_ASSERT(v >= 0);
    if(v)
    {
        // https://stackoverflow.com/a/18054857/5875572
        unsigned t = ((unsigned)msb<T>(v) + 1u) * 1233u >> 12;
        C4_ASSERT(t < detail::powers_of_10<T>::values_size);
        return 1u + t - (v < detail::powers_of_10<T>::values[t]);
    }
    return 1u;
}

@fargies
Copy link

fargies commented Mar 7, 2022

@fargies I'm revisiting the write_dec() et al functions. One thing that occurred to me (and maybe should have dawned on me sooner) is this: using prior knowledge of the number of digits saves the need to reverse the string. That is, we can just write backwards from the end of the string (if we know where the end is), and we're done.

This is very likely to be a significant speedup, and therefore a very compelling reason to compute the digits as you were suggesting. Plus, it comes with the added advantage that the buffer size is no longer overestimated as you were objecting to before.

That would remove the reverse_range, do you see other eventual performance gain ?

But of course I'm going to try it first (with intrinsics and everything) and see how faster or slower it is.

So I'm writing code to compute the number of digits, and I came back to the notes you made above:

Or there may be compromise solutions (maybe first comparing max then using a more accurate computation):

  • use msb for binary format (by the way maybe using compilers builtins in memory_util.hpp could help)
  • use (msb / 8 + 1) *2 for hex (mainly bit shifts)
  • use (msb / 3 + 1) for octal
  • would have to scratch my head for dec XD

I've tagged you again because I'm trying to figure out the octal expression you came to. Can you explain the reasoning?

For octal my assumptions were that there's simply 3 bits per symbol ? but now you made me doubting.

For the decimals I had in mind doing the same computation than for hex which gives the most significant byte, and then bucket this for the 8 possible combinations, having max 2 comparisons per bucket:

switch (msb<T>(v) / 8)
{
case 0:
  if (v >= 100)
    return 3;
  return (v >= 10) ? 2 : 1;
case 1:
  if (v >= 10000)
    return 5;
  return (v >= 1000) ? 4 : 3;
...

But that may be a bit naïve.

Regarding the code you've referenced, I'm not sure on how the obvious way compares to this (article is not so clear about it).

@biojppm biojppm force-pushed the charconv branch 13 times, most recently from 78877e4 to ec485bb Compare March 16, 2022 13:33
@codecov
Copy link

codecov bot commented Mar 16, 2022

Codecov Report

Merging #77 (d4345f9) into master (3100b70) will increase coverage by 0.24%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #77      +/-   ##
==========================================
+ Coverage   94.09%   94.34%   +0.24%     
==========================================
  Files          53       54       +1     
  Lines       11699    12251     +552     
==========================================
+ Hits        11008    11558     +550     
- Misses        691      693       +2     
Impacted Files Coverage Δ
src/c4/format.hpp 96.26% <ø> (-0.08%) ⬇️
test/test_charconv.cpp 99.93% <ø> (-0.07%) ⬇️
test/test_numbers.hpp 0.00% <ø> (ø)
src/c4/charconv.hpp 99.18% <100.00%> (+1.36%) ⬆️
src/c4/memory_util.hpp 100.00% <100.00%> (ø)
src/c4/utf.cpp 97.05% <100.00%> (+0.08%) ⬆️
test/test_dump.cpp 100.00% <100.00%> (ø)
test/test_format.cpp 100.00% <100.00%> (ø)
test/test_memory_util.cpp 100.00% <100.00%> (ø)
test/test_utf.cpp 100.00% <100.00%> (ø)
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3100b70...d4345f9. Read the comment docs.

@biojppm biojppm force-pushed the charconv branch 7 times, most recently from 8fd9527 to 2fd5583 Compare March 16, 2022 20:33
@biojppm
Copy link
Owner Author

biojppm commented Mar 16, 2022

This last part was super hard. Glad to see finally see all tests green.

@biojppm biojppm merged commit d6fa036 into master Mar 16, 2022
@biojppm biojppm deleted the charconv branch March 16, 2022 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants