String API additions and SIMD optimizations #129

mosra · 2022-03-31T09:26:16Z

A meta-issue tracking various ideas for SIMD-optimized string algorithms. A reason why we're making our own string APIs is because the C string library is made for null-terminated strings, which is quite useless when the major use case is working on slices of larger strings, such as in parsers. And (of course) the C++ counterparts are too bloated and either impose an implicit allocation or require a too new C++ standard.

SIMD in general

Needs Compile-time and runtime CPU feature (SIMD) detection and dispatch #115 to be merged
- ~~Figure out how to dispatch~~ handled in Compile-time and runtime CPU feature (SIMD) detection and dispatch #115 as well

Construction

Add a way to automatically mark a view as Global if it's a C string literal (using __constant_string_p() compiler builtin, details in constexpr StringView without literals #123)
Add a way to automatically mark an arbitrary const char* as Global by checking if the pointer is in a read-only memory page -- https://github.com/RobloxResearch/SIMDString/blob/main/SIMDString.cpp#L54-L75

Searching

Comparison

Check pointer equality before calling into memcmp() in StringView::operator==(), could save a lot especially when comparing literals that the compiler might have deduplicated
- Don't do that in String tho
Would we gain anything by implementing memcmp() ourselves?
- especially for SSO strings that have a fixed size, which could be a single (masked) instruction?
- by not having to explicitly test for nullptr when size == 0 just to not hit an UB because the standard is stupid and generally disallows passing nullptr to any string/memory function even if the size is zero?
Case-insensitive comparison -- http://www.phoronix.com/scan.php?page=news_item&px=Glibc-strcasecmp-AVX2-EVEX
- Possibly useful for extension comparison in Any* plugins, OTOH there it's probably faster to normalize the extension first and then do 100 memcmp()s

Unicode

UTF-16 surrogates to UTF-8 and back, for JSON \u parsing
SIMD-ified UTF-8 codepoint counting and validation -- https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
UTF-8 character counting including counting decomposed characters (such as a and ¨ as one), needs some Unicode categorization table
- does https://github.com/ridiculousfish/widecharwidth work like that or not?

Number-to-string

For Utility::Debug, Utility::format() etc. The core should be a direct overhead-less API working on builtin types (writing into a statically-sized char[], e.g.), with convenience wrappers above.

integer-to-string conversion instead of stupidly using printf() -- even a dumb while loop would be better
- https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/
- https://github.com/jeaiii/itoa (and explanation from https://jk-jeon.github.io/posts/2022/02/jeaiii-algorithm/), doesn't seem to use any explicit SIMD?
separate optimized hex, bin and oct variant
- https://lemire.me/blog/2021/05/17/converting-binary-integers-to-ascii-characters-apple-m1-vs-amd-zen2/
- http://0x80.pl/articles/convert-to-hex.html
float-to-string conversion instead of abusing printf()
- Ryu is outdated, latest and greatest is https://github.com/jk-jeon/dragonbox but it's C++17, quite large with a lot of options we don't need, do a clean-room implementation from the paper alone?
  - MSVC to_chars used Ryu, not sure if it's still: https://twitter.com/StephanTLavavej/status/992881364142702593
  - Older benchmarks here: https://github.com/miloyip/dtoa-benchmark
- Explicit option to write or fail on NaN and Inf (for JSON compat)
- Uppercase or lowercase exponent
- Option to write hex floats (or a separate function that doesn't need the complex algo?)
Any easy gains with a vectorized variant? "Convert 8 numbers to 8 strings at once"
- Useful for JSON, OBJ, ... export

String-to-number

Because strto*() has insane usability issues.

string-to-integer conversion instead of stupidly using strtol() -- even a dumb while loop could be better
- have 8-, 16-, 32- bit variants that check the range on their own
- remove the stupid handling of - for unsigned types (who thought that was a good idea??)
- https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/
- https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html and the lengthy post linked from there
separate optimized hex, bin and oct variant
- http://0x80.pl/notesen/2014-10-09-pext-convert-ascii-hex-to-num.html
string-to-float conversion instead of using the shitty & slow strtof() / strtod()
- What's the algorithm here? Some pointers could be in comments at https://lemire.me/blog/2020/09/10/parsing-floats-in-c-benchmarking-strtod-vs-from_chars/
- Explicit option to parse or fail on NaN and Inf (for JSON compat)
- Explicit option to write or ignore hex floats (or a separate function that doesn't need the complex algo?)
- Also watch out for accumulated rounding errors when going to float through a double: https://www.exploringbinary.com/double-rounding-errors-in-decimal-to-double-to-float-conversions/
Any easy gains with a vectorized variant? "Convert 8 strings to 8 numbers at once"
- Useful for JSON, OBJ, ... parsing

General printing

Is some printf compatibility desired? like https://github.com/tfc/pprintpp
Optimization ideas from Rust: https://internals.rust-lang.org/t/changing-core-fmt-for-speed/5154

The text was updated successfully, but these errors were encountered:

mosra added this to the 2022.0a milestone Mar 31, 2022

mosra mentioned this issue Apr 3, 2022

Compile-time and runtime CPU feature (SIMD) detection and dispatch #115

Merged

74 tasks

mosra added this to Corrade / Utility Jul 24, 2024

mosra moved this to TODO in Corrade / Utility Jul 24, 2024

mosra added this to Corrade / Containers Jul 24, 2024

mosra moved this to TODO in Corrade / Containers Jul 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String API additions and SIMD optimizations #129

String API additions and SIMD optimizations #129

mosra commented Mar 31, 2022 •

edited

Loading

String API additions and SIMD optimizations #129

String API additions and SIMD optimizations #129

Comments

mosra commented Mar 31, 2022 • edited Loading

SIMD in general

Construction

Searching

Comparison

Unicode

Number-to-string

String-to-number

General printing

mosra commented Mar 31, 2022 •

edited

Loading