charconv improvements #77

biojppm · 2022-03-03T02:38:23Z

No description provided.

biojppm · 2022-03-04T10:06:05Z

[NOTE: always open the benchmark links in a new tab, or github will not go to the linked line, HTH].

@fargies tagging you here because you already know a bit about this part of the code and suggested some optimizations. I removed the branching when appending characters in write_dec() et al, by requiring the output buffer (before entering the loop) to have a size big enough for the largest value of the type. This resulted in a 10-20% improvement overall.

But while looking closer at the benchmark results, I noticed that atoi() and atou() are much slower than read_dec() et al. And not just a bit slower -- a lot slower.

For example int32_t MSVC has read_dec() 10% faster than std::from_chars() (nice for the simplicity, yay!) but then c4::atoi() is 20x slower. The same for uint32_t MSVC: 20x slower.

In GCC the relationship is much better but still basically the same: uint32_t GCC has atou() 3-4x slower than read_dec() and int32_t GCC is 3-4x slower.

I haven't profiled the code yet as my time is extremely limited, but looking at the code for atoi() or for atou() I'm not seeing anything egregious to cause this sort of performance degradation from the main worker functions read_dec() et all. Also, the code is as simple as it can be given what it has to do.

Since you had some ideas in #75 , I'm asking here if you see anything that may be contributing to this cost.

fargies · 2022-03-04T10:26:01Z

I also saw that whilst working on overflow, and tried to mitigate by using switch/case having a single return in the functions (helps the compiler in pipelining stuff), but enhancements were not that big.

I've ran the code through cachegrind and callgrind, but due to a lot of inlining it's quite hard to identify where time is spent (also I was focused on read_x implementation).

The xtoa_digits_ code could eventually be used in overflow detection.

How did the check on buffer length made a 10/20% enhancement ? is the benchmark being fed with incompatible numbers (too large for buffer) ? meaning when you benchmark you do only with nominal cases, not erroneous ones right ?

I'll have a closer look, thanks for involving me 👍

src/c4/charconv.hpp

biojppm · 2022-03-04T10:59:33Z

Also, before this changeset, the benchmarks had some inconsistencies and were testing different number sets; the figures I linked above are now consistent across benchmarks.

fargies · 2022-03-04T11:30:45Z

Found it 🎉

Simply because atou is not inlined, the jumps are expensive.

Before modification:

atox_c4_read_dec<uint32_t>       8.19 ns         8.18 ns     85412143 bytes_per_second=466.207M/s items_per_second=122.213M/s
atox_c4_atou<uint32_t>           15.6 ns         15.5 ns     45232819 bytes_per_second=245.397M/s items_per_second=64.3293M/s

After modification (see patch, inlining revealed some issues in overflow detection, value is left untouched in error cases):

atox_c4_read_dec<uint32_t>       8.11 ns         8.10 ns     85329788 bytes_per_second=470.769M/s items_per_second=123.409M/s
atox_c4_atou<uint32_t>           8.11 ns         8.11 ns     86150068 bytes_per_second=470.514M/s items_per_second=123.342M/s

diff --git a/src/c4/charconv.hpp b/src/c4/charconv.hpp
index 8af02a3..297abc3 100644
--- a/src/c4/charconv.hpp
+++ b/src/c4/charconv.hpp
@@ -880,6 +880,7 @@ inline size_t atoi_first(csubstr str, T * C4_RESTRICT v)
  *
  * @see atou_first() if the string is not trimmed to the value to read. */
 template<class T>
+C4_ALWAYS_INLINE
 bool atou(csubstr str, T * C4_RESTRICT v)
 {
     C4_STATIC_ASSERT(std::is_integral<T>::value);
diff --git a/test/test_charconv.cpp b/test/test_charconv.cpp
index 423ca6b..c98d713 100644
--- a/test/test_charconv.cpp
+++ b/test/test_charconv.cpp
@@ -1216,7 +1216,7 @@ void test_no_overflows(std::initializer_list<const char *> args)
 template<class T>
 void test_overflows_hex()
 {
-    T x;
+    T x = 0;
     std::string str;
     if (std::is_unsigned<T>::value)
     {
@@ -1259,7 +1259,7 @@ void test_overflows_hex()
 template<class T>
 void test_overflows_bin()
 {
-    T x;
+    T x = 0;
     std::string str;
     if (std::is_unsigned<T>::value)
     {
@@ -1383,7 +1383,7 @@ TEST_CASE("overflows.u64")
     
     { /* with leading zeroes */
         std::string str;
-        uint64_t x;
+        uint64_t x = 1;
         str = "0o01" + std::string(21, '7');
         CHECK_MESSAGE(!overflows<uint64_t>(to_csubstr(str)), "num=" << str);
         str = "0o02" + std::string(21, '0');

biojppm · 2022-03-04T11:39:58Z

Amazing! Thanks! I'm going to try that.

biojppm · 2022-03-04T12:32:35Z

I have tried it and now everything is consistent: read_dec(), atoi(), atox() and from_chars() all have the same throughput. 🎉

And the results are really nice: eg, for u32/i32 I am getting a speedup of 3x over std::from_chars(), 163x over scanf() and amazingly (but predictably) 650x over sstream (see the commit above for the data).

🥳

EDIT @fargies turns out these results were inflated because the benchmarks were optimized out. See below for up-to-date results.

fargies · 2022-03-04T12:38:09Z

How would you explain that regarding sstream ?

biojppm · 2022-03-04T12:55:10Z

How would you explain that regarding sstream ?

I guess it's because of how much stuff is in there... every single character results in a virtual call, and it uses the locale which must be acquired from the OS, resulting in a mutex. scanf only pays for the locale, and not for abstracting buffer characters as streams do.

biojppm · 2022-03-04T13:05:15Z

Eg: https://aras-p.info/blog/2022/02/25/Curious-lack-of-sprintf-scaling/

fargies · 2022-03-04T14:31:17Z

Really interesting 👀 , thanks

test/test_charconv.cpp

biojppm · 2022-03-05T11:10:52Z

@fargies While looking closer at the results, all the c4 atox functions were suspiciously consistent (even with overflow check!). It turns out that the benchmark was optimized out of existence for those functions. I refactored the benchmark to create some side-effects. The results are still very good, but no longer out-of-this-world as they seemed above; c4::from_chars<int32_t>() is ...

40% better than std::from_chars()
10x better than scanf()
40x better than sstream.

Eg, see this CI benchmark run (but open the link in another browser tab).

fargies · 2022-03-05T16:56:19Z

@fargies While looking closer at the results, all the c4 atox functions were suspiciously consistent (even with overflow check!). It turns out that the benchmark was optimized out of existence for those functions. I refactored the benchmark to create some side-effects. The results are still very good, but no longer out-of-this-world as they seemed above; c4::from_chars<int32_t>() is ...
* 40% better than `std::from_chars()`

* 10x better than `scanf()`

* 40x better than `sstream`.

That's still really decent results

biojppm · 2022-03-07T19:46:10Z

@fargies I'm revisiting the write_dec() et al functions. One thing that occurred to me (and maybe should have dawned on me sooner) is this: using prior knowledge of the number of digits saves the need to reverse the string. That is, we can just write backwards from the end of the string (if we know where the end is), and we're done.

This is very likely to be a significant speedup, and therefore a very compelling reason to compute the digits as you were suggesting. Plus, it comes with the added advantage that the buffer size is no longer overestimated as you were objecting to before.

But of course I'm going to try it first (with intrinsics and everything) and see how faster or slower it is.

So I'm writing code to compute the number of digits, and I came back to the notes you made above:

Or there may be compromise solutions (maybe first comparing max then using a more accurate computation):
* use msb for binary format (by the way maybe using compilers builtins in memory_util.hpp could help)
* use (msb / 8 + 1) *2 for hex (mainly bit shifts)
* use (msb / 3 + 1) for octal
* would have to scratch my head for dec XD

I've tagged you again because I'm trying to figure out the octal expression you came to. Can you explain the reasoning?

biojppm · 2022-03-07T19:47:54Z

BTW, I also had to scratch my head for decimal; this is what I came up with, and seems to be working in preliminary testing:

template<class T>
C4_CONSTEXPR14 unsigned digits_dec(T v) noexcept
{
    C4_STATIC_ASSERT(std::is_integral<T>::value);
    C4_ASSERT(v >= 0);
    if(v)
    {
        // https://stackoverflow.com/a/18054857/5875572
        unsigned t = ((unsigned)msb<T>(v) + 1u) * 1233u >> 12;
        C4_ASSERT(t < detail::powers_of_10<T>::values_size);
        return 1u + t - (v < detail::powers_of_10<T>::values[t]);
    }
    return 1u;
}

fargies · 2022-03-07T23:58:35Z

@fargies I'm revisiting the write_dec() et al functions. One thing that occurred to me (and maybe should have dawned on me sooner) is this: using prior knowledge of the number of digits saves the need to reverse the string. That is, we can just write backwards from the end of the string (if we know where the end is), and we're done.

This is very likely to be a significant speedup, and therefore a very compelling reason to compute the digits as you were suggesting. Plus, it comes with the added advantage that the buffer size is no longer overestimated as you were objecting to before.

That would remove the reverse_range, do you see other eventual performance gain ?

But of course I'm going to try it first (with intrinsics and everything) and see how faster or slower it is.

So I'm writing code to compute the number of digits, and I came back to the notes you made above:

Or there may be compromise solutions (maybe first comparing max then using a more accurate computation):

use msb for binary format (by the way maybe using compilers builtins in memory_util.hpp could help)

use (msb / 8 + 1) *2 for hex (mainly bit shifts)

use (msb / 3 + 1) for octal

would have to scratch my head for dec XD

I've tagged you again because I'm trying to figure out the octal expression you came to. Can you explain the reasoning?

For octal my assumptions were that there's simply 3 bits per symbol ? but now you made me doubting.

For the decimals I had in mind doing the same computation than for hex which gives the most significant byte, and then bucket this for the 8 possible combinations, having max 2 comparisons per bucket:

switch (msb<T>(v) / 8)
{
case 0:
  if (v >= 100)
    return 3;
  return (v >= 10) ? 2 : 1;
case 1:
  if (v >= 10000)
    return 5;
  return (v >= 1000) ? 4 : 3;
...

But that may be a bit naïve.

Regarding the code you've referenced, I'm not sure on how the obvious way compares to this (article is not so clear about it).

codecov · 2022-03-16T14:00:16Z

Codecov Report

Merging #77 (d4345f9) into master (3100b70) will increase coverage by 0.24%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #77      +/-   ##
==========================================
+ Coverage   94.09%   94.34%   +0.24%     
==========================================
  Files          53       54       +1     
  Lines       11699    12251     +552     
==========================================
+ Hits        11008    11558     +550     
- Misses        691      693       +2

Impacted Files	Coverage Δ
src/c4/format.hpp	`96.26% <ø> (-0.08%)`	⬇️
test/test_charconv.cpp	`99.93% <ø> (-0.07%)`	⬇️
test/test_numbers.hpp	`0.00% <ø> (ø)`
src/c4/charconv.hpp	`99.18% <100.00%> (+1.36%)`	⬆️
src/c4/memory_util.hpp	`100.00% <100.00%> (ø)`
src/c4/utf.cpp	`97.05% <100.00%> (+0.08%)`	⬆️
test/test_dump.cpp	`100.00% <100.00%> (ø)`
test/test_format.cpp	`100.00% <100.00%> (ø)`
test/test_memory_util.cpp	`100.00% <100.00%> (ø)`
test/test_utf.cpp	`100.00% <100.00%> (ø)`
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3100b70...d4345f9. Read the comment docs.

biojppm · 2022-03-16T23:13:18Z

This last part was super hard. Glad to see finally see all tests green.

biojppm marked this pull request as draft March 3, 2022 02:38

biojppm force-pushed the charconv branch 2 times, most recently from 6f9b843 to ff61ad9 Compare March 3, 2022 02:49

biojppm commented Mar 4, 2022

View reviewed changes

src/c4/charconv.hpp Outdated Show resolved Hide resolved

biojppm force-pushed the charconv branch from f9ae7cc to 8069137 Compare March 4, 2022 19:27

biojppm added 3 commits March 4, 2022 19:27

[chore] update cmake

5b5c211

[impl] rng: add methods to set the seed

b6792af

[bm] charconv: ensure consistency for the benchmarks

eb556c8

biojppm force-pushed the charconv branch 3 times, most recently from d2fd9f5 to cb78b54 Compare March 4, 2022 23:40

fargies reviewed Mar 5, 2022

View reviewed changes

test/test_charconv.cpp Outdated Show resolved Hide resolved

biojppm marked this pull request as ready for review March 5, 2022 19:34

biojppm changed the title ~~[wip] charconv improvements~~ charconv improvements Mar 5, 2022

biojppm force-pushed the charconv branch from 2e494c4 to 19c8354 Compare March 9, 2022 00:36

biojppm force-pushed the charconv branch 13 times, most recently from 78877e4 to ec485bb Compare March 16, 2022 13:33

biojppm force-pushed the charconv branch 7 times, most recently from 8fd9527 to 2fd5583 Compare March 16, 2022 20:33

biojppm added 5 commits March 16, 2022 22:15

[ci] disable coveralls reporting

096b876

[fix] read_dec() et al no longer accepts an empty input buffer

f99cedb

[fix] msb()/lsb() now use intrinsics

6148707

[test] improve unit tests for charconv

9375b19

[bm] improve charconv benchmarks

d4345f9

biojppm force-pushed the charconv branch from 2fd5583 to d4345f9 Compare March 16, 2022 22:15

biojppm merged commit d6fa036 into master Mar 16, 2022

biojppm deleted the charconv branch March 16, 2022 23:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

charconv improvements #77

charconv improvements #77

biojppm commented Mar 3, 2022

biojppm commented Mar 4, 2022 •

edited

Loading

fargies commented Mar 4, 2022

biojppm commented Mar 4, 2022 •

edited

Loading

fargies commented Mar 4, 2022

biojppm commented Mar 4, 2022

biojppm commented Mar 4, 2022 •

edited

Loading

fargies commented Mar 4, 2022

biojppm commented Mar 4, 2022

biojppm commented Mar 4, 2022

fargies commented Mar 4, 2022

biojppm commented Mar 5, 2022

fargies commented Mar 5, 2022

biojppm commented Mar 7, 2022

biojppm commented Mar 7, 2022

fargies commented Mar 7, 2022

codecov bot commented Mar 16, 2022 •

edited

Loading

biojppm commented Mar 16, 2022

charconv improvements #77

charconv improvements #77

Conversation

biojppm commented Mar 3, 2022

biojppm commented Mar 4, 2022 • edited Loading

fargies commented Mar 4, 2022

biojppm commented Mar 4, 2022 • edited Loading

fargies commented Mar 4, 2022

biojppm commented Mar 4, 2022

biojppm commented Mar 4, 2022 • edited Loading

fargies commented Mar 4, 2022

biojppm commented Mar 4, 2022

biojppm commented Mar 4, 2022

fargies commented Mar 4, 2022

biojppm commented Mar 5, 2022

fargies commented Mar 5, 2022

biojppm commented Mar 7, 2022

biojppm commented Mar 7, 2022

fargies commented Mar 7, 2022

codecov bot commented Mar 16, 2022 • edited Loading

Codecov Report

biojppm commented Mar 16, 2022

biojppm commented Mar 4, 2022 •

edited

Loading

biojppm commented Mar 4, 2022 •

edited

Loading

biojppm commented Mar 4, 2022 •

edited

Loading

codecov bot commented Mar 16, 2022 •

edited

Loading