
SIMD-enabled utf-8 validation #68455

Open
yoshuawuyts opened this issue Jan 22, 2020 · 57 comments
Labels
A-simd Area: SIMD (Single Instruction Multiple Data) A-target-feature Area: Enabling/disabling target features like AVX, Neon, etc. A-unicode Area: Unicode C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs Relevant to the library team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@yoshuawuyts
Member

Introduction

The "Parsing Gigabytes of JSON per second" paper (arXiv: Langdale, Lemire) proposes a novel approach to parsing JSON that is fast enough that, on many systems, it moves the bottleneck from the parser to the disk and network. This is done through the clever use of SIMD instructions.

Something that stood out to me from the paper is that JSON is required to be valid utf-8, and the authors came up with new SIMD-based algorithms to validate utf-8 that are much faster than conventional approaches.

Since rustc does a lot of utf-8 validation (each .rs source file needs to be valid utf-8), I got curious about what rustc currently does. Validation seems to be done by the following routine:

rust/src/libcore/str/mod.rs

Lines 1500 to 1618 in 2f688ac

/// Walks through `v` checking that it's a valid UTF-8 sequence,
/// returning `Ok(())` in that case, or, if it is invalid, `Err(err)`.
#[inline]
fn run_utf8_validation(v: &[u8]) -> Result<(), Utf8Error> {
    let mut index = 0;
    let len = v.len();

    let usize_bytes = mem::size_of::<usize>();
    let ascii_block_size = 2 * usize_bytes;
    let blocks_end = if len >= ascii_block_size { len - ascii_block_size + 1 } else { 0 };
    let align = v.as_ptr().align_offset(usize_bytes);

    while index < len {
        let old_offset = index;
        macro_rules! err {
            ($error_len: expr) => {
                return Err(Utf8Error { valid_up_to: old_offset, error_len: $error_len });
            };
        }

        macro_rules! next {
            () => {{
                index += 1;
                // we needed data, but there was none: error!
                if index >= len {
                    err!(None)
                }
                v[index]
            }};
        }

        let first = v[index];
        if first >= 128 {
            let w = UTF8_CHAR_WIDTH[first as usize];
            // 2-byte encoding is for codepoints \u{0080} to \u{07ff}
            //     first C2 80       last DF BF
            // 3-byte encoding is for codepoints \u{0800} to \u{ffff}
            //     first E0 A0 80    last EF BF BF
            //     excluding surrogates codepoints \u{d800} to \u{dfff}
            //         ED A0 80 to ED BF BF
            // 4-byte encoding is for codepoints \u{10000} to \u{10ffff}
            //     first F0 90 80 80 last F4 8F BF BF
            //
            // Use the UTF-8 syntax from the RFC
            //
            // https://tools.ietf.org/html/rfc3629
            // UTF8-1 = %x00-7F
            // UTF8-2 = %xC2-DF UTF8-tail
            // UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
            //          %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
            // UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
            //          %xF4 %x80-8F 2( UTF8-tail )
            match w {
                2 => {
                    if next!() & !CONT_MASK != TAG_CONT_U8 {
                        err!(Some(1))
                    }
                }
                3 => {
                    match (first, next!()) {
                        (0xE0, 0xA0..=0xBF)
                        | (0xE1..=0xEC, 0x80..=0xBF)
                        | (0xED, 0x80..=0x9F)
                        | (0xEE..=0xEF, 0x80..=0xBF) => {}
                        _ => err!(Some(1)),
                    }
                    if next!() & !CONT_MASK != TAG_CONT_U8 {
                        err!(Some(2))
                    }
                }
                4 => {
                    match (first, next!()) {
                        (0xF0, 0x90..=0xBF) | (0xF1..=0xF3, 0x80..=0xBF) | (0xF4, 0x80..=0x8F) => {}
                        _ => err!(Some(1)),
                    }
                    if next!() & !CONT_MASK != TAG_CONT_U8 {
                        err!(Some(2))
                    }
                    if next!() & !CONT_MASK != TAG_CONT_U8 {
                        err!(Some(3))
                    }
                }
                _ => err!(Some(1)),
            }
            index += 1;
        } else {
            // Ascii case, try to skip forward quickly.
            // When the pointer is aligned, read 2 words of data per iteration
            // until we find a word containing a non-ascii byte.
            if align != usize::max_value() && align.wrapping_sub(index) % usize_bytes == 0 {
                let ptr = v.as_ptr();
                while index < blocks_end {
                    // SAFETY: since `align - index` and `ascii_block_size` are
                    // multiples of `usize_bytes`, `block = ptr.add(index)` is
                    // always aligned with a `usize` so it's safe to dereference
                    // both `block` and `block.offset(1)`.
                    unsafe {
                        let block = ptr.add(index) as *const usize;
                        // break if there is a nonascii byte
                        let zu = contains_nonascii(*block);
                        let zv = contains_nonascii(*block.offset(1));
                        if zu | zv {
                            break;
                        }
                    }
                    index += ascii_block_size;
                }
                // step from the point where the wordwise loop stopped
                while index < len && v[index] < 128 {
                    index += 1;
                }
            } else {
                index += 1;
            }
        }
    }

    Ok(())
}

This doesn't appear to use SIMD anywhere, not even conditionally. But this code runs a lot, so it seems like a good candidate for a more efficient algorithm.
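To illustrate the kind of SIMD fast path being discussed, here is a minimal sketch (not the standard library's code; `ascii_prefix_len` is a name invented for this example) that finds the length of the leading ASCII run 16 bytes at a time using SSE2, with a scalar fallback for the tail and for non-x86_64 targets:

```rust
/// Illustrative sketch: length of the leading all-ASCII prefix of `bytes`,
/// scanning 16 bytes per step with SSE2 where available.
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    let mut i = 0;
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};
            while i + 16 <= bytes.len() {
                // SAFETY: SSE2 was detected at runtime and the unaligned
                // 16-byte load stays within `bytes`.
                let mask = unsafe {
                    let chunk = _mm_loadu_si128(bytes.as_ptr().add(i) as *const __m128i);
                    _mm_movemask_epi8(chunk)
                };
                if mask != 0 {
                    // A byte >= 0x80 sets its sign bit, so the lowest set
                    // bit of the mask marks the first non-ASCII byte.
                    return i + mask.trailing_zeros() as usize;
                }
                i += 16;
            }
        }
    }
    // Scalar tail (and the whole input on non-x86_64 targets).
    while i < bytes.len() && bytes[i] < 0x80 {
        i += 1;
    }
    i
}
```

A full validator would hand off to the multi-byte state machine at the returned index; this sketch only shows the ASCII skip-ahead that dominates typical source text.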

Performance improvements

The post "Validating UTF-8 strings using as little as 0.7 cycles per byte" shows about an order of magnitude performance improvement on validating utf-8, going from 8 cycles per byte parsed to 0.7 cycles per byte parsed.

When passing Rust's validation code through the Godbolt compiler explorer, from_utf8_unchecked outputs 7 instructions, and from_utf8 outputs 57 instructions. In the case of from_utf8, most instructions seem to occur inside a loop, which makes it likely we'd observe a performance improvement by using a SIMD-enabled utf-8 validation algorithm. Economies of scale apply here as well: it's not uncommon for the compiler to parse several million bytes of input in a run, so any improvement would quickly add up.

All examples linked have been compiled with -O -C target-cpu=native.

Also ecosystem libraries such as serde_json perform utf-8 validation in several locations, so would likely also benefit from performance improvements to Rust's utf-8 validation routines.

Implementation

Two Rust implementations of Lemire's algorithm are available today:

The latter even includes benchmarks against the compiler's algorithm (which makes it probable I'm not the first person to think of this). But I haven't been able to successfully compile the benches, so I don't know how they stack up against the current implementation.

I'm not overly familiar with rustc's internals, but it seems we would likely want to keep the current algorithm as a fallback and enable SIMD algorithms through feature detection. The simdjson library has different algorithms for different architectures, but we could probably start with instructions that are widely available and supported on tier-1 targets (such as AVX2).

These changes wouldn't require an RFC because no APIs would change. The only outcome would be a performance improvement.

Future work

Lemire's post also covers validating ASCII in as little as 0.1 cycles per byte. Rust's current ASCII validation algorithm checks bytes one at a time, and could likely benefit from similar optimizations.

rust/src/libcore/str/mod.rs

Lines 4136 to 4141 in 2f688ac

pub fn is_ascii(&self) -> bool {
    // We can treat each byte as character here: all multibyte characters
    // start with a byte that is not in the ascii range, so we will stop
    // there already.
    self.bytes().all(|b| b.is_ascii())
}

Speeding this up would have ecosystem implications as well. For example, HTTP headers must be valid ASCII and are often performance sensitive, so faster ASCII validation in the stdlib would likely benefit the wider ecosystem too.
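As a sketch of the kind of optimization meant here, the byte-at-a-time loop can be replaced with a word-at-a-time (SWAR) check that ORs the high bit of eight bytes per step. This is illustrative only (`is_ascii_swar` is an invented name, not std's implementation):

```rust
use std::convert::TryInto;

/// Illustrative SWAR sketch: a slice is ASCII iff no byte has its top bit
/// set, which can be tested eight bytes at a time against a high-bit mask.
fn is_ascii_swar(bytes: &[u8]) -> bool {
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;
    let mut chunks = bytes.chunks_exact(8);
    for chunk in chunks.by_ref() {
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        // Any byte >= 0x80 contributes a set bit under the mask.
        if word & HIGH_BITS != 0 {
            return false;
        }
    }
    // Fewer than eight bytes remain: byte-at-a-time check for the tail.
    chunks.remainder().iter().all(|b| b.is_ascii())
}
```

The same idea extends naturally to 16- or 32-byte SIMD registers, which is where the larger gains the post describes come from.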

Conclusion

In this issue I propose to use a SIMD-enabled algorithm for utf-8 validation in rustc. This seems like an interesting avenue to explore since there's a reasonable chance it might yield a performance improvement for many rust programs.

I'm somewhat excited to have stumbled upon this, but was also surprised no issue had been filed for it yet. I'm a bit self-conscious posting this since I'm not a rustc compiler engineer, but I hope it proves useful!

cc/ @jonas-schievink @nnethercote

References

@jonas-schievink jonas-schievink added A-unicode Area: Unicode C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jan 22, 2020
@CryZe
Contributor

CryZe commented Jan 22, 2020

assembly for str::from_utf8 (godbolt) - 57 lines

This does not do any validation; it just calls core::str::from_utf8, which is not part of the assembly shown.

@yoshuawuyts
Member Author

@CryZe ah, my bad. Yeah I was confused about the exact output, so for good measure I also copied over std's algorithm into godbolt (third link) to see what would happen. Thanks for clarifying what's actually going on!

@Mark-Simulacrum
Member

UTF-8 validation is hopefully not where the compiler spends its time :)

However, I could imagine this having some impact on "smallest possible" compile times (e.g., UI tests, hello world).

My recommendation is to replace the algorithm in core::str::from_utf8 (or wherever it is in core) with direct use of AVX2 or some similar instruction set, and we can then run that by perf.rust-lang.org as a loose benchmark. That's likely not tenable in reality (we would need to gate the use of SIMD instructions conditionally, likely at runtime), but it would, I believe, give the best possible performance wins (since it would apply to all uses of from_utf8).

@jonas-schievink jonas-schievink added the A-simd Area: SIMD (Single Instruction Multiple Data) label Jan 22, 2020
@CryZe
Contributor

CryZe commented Jan 22, 2020

Here's pretty much a 1:1 port:
https://godbolt.org/z/NSop8w

@Licenser

Going to add the benchmarks from isutf8 (run on my machine with RUSTFLAGS='-C target-cpu=native' cargo +nightly bench):

     Running target/release/deps/validate_utf8-6efda746d5fb1b3a
Gnuplot not found, disabling plotting
random_bytes/libcore    time:   [4.9977 ns 5.0048 ns 5.0117 ns]                                  
                        thrpt:  [ 97428 GiB/s  97562 GiB/s  97702 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
random_bytes/lemire_sse time:   [69.970 us 69.985 us 70.000 us]                                    
                        thrpt:  [6.9755 GiB/s 6.9769 GiB/s 6.9784 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
random_bytes/lemire_avx time:   [62.055 us 62.084 us 62.109 us]                                    
                        thrpt:  [7.8617 GiB/s 7.8649 GiB/s 7.8685 GiB/s]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild
random_bytes/lemire_avx_ascii_path                                                                            
                        time:   [62.549 us 62.582 us 62.615 us]
                        thrpt:  [7.7982 GiB/s 7.8023 GiB/s 7.8064 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
random_bytes/range_sse  time:   [62.763 us 62.772 us 62.782 us]                                   
                        thrpt:  [7.7774 GiB/s 7.7787 GiB/s 7.7798 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low severe
  2 (2.00%) high mild
random_bytes/range_avx  time:   [45.737 us 45.746 us 45.755 us]                                    
                        thrpt:  [10.672 GiB/s 10.674 GiB/s 10.676 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

mostly_ascii/libcore    time:   [166.94 ns 167.23 ns 167.54 ns]                                 
                        thrpt:  [17.988 GiB/s 18.022 GiB/s 18.053 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
mostly_ascii/lemire_sse time:   [430.20 ns 430.45 ns 430.68 ns]                                    
                        thrpt:  [6.9977 GiB/s 7.0014 GiB/s 7.0054 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
mostly_ascii/lemire_avx time:   [382.89 ns 383.07 ns 383.25 ns]                                    
                        thrpt:  [7.8637 GiB/s 7.8673 GiB/s 7.8711 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
mostly_ascii/lemire_avx_ascii_path                                                                            
                        time:   [65.208 ns 65.255 ns 65.311 ns]
                        thrpt:  [46.145 GiB/s 46.184 GiB/s 46.218 GiB/s]
Found 24 outliers among 100 measurements (24.00%)
  2 (2.00%) low severe
  12 (12.00%) low mild
  2 (2.00%) high mild
  8 (8.00%) high severe
mostly_ascii/range_sse  time:   [384.43 ns 384.64 ns 384.87 ns]                                   
                        thrpt:  [7.8307 GiB/s 7.8352 GiB/s 7.8395 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
mostly_ascii/range_avx  time:   [286.05 ns 286.26 ns 286.47 ns]                                   
                        thrpt:  [10.521 GiB/s 10.528 GiB/s 10.536 GiB/s]

ascii/libcore           time:   [72.975 ns 73.076 ns 73.189 ns]                          
                        thrpt:  [39.307 GiB/s 39.368 GiB/s 39.422 GiB/s]
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe
ascii/lemire_sse        time:   [423.11 ns 423.35 ns 423.62 ns]                             
                        thrpt:  [6.7912 GiB/s 6.7954 GiB/s 6.7993 GiB/s]
ascii/lemire_avx        time:   [373.82 ns 374.45 ns 375.43 ns]                             
                        thrpt:  [7.6628 GiB/s 7.6830 GiB/s 7.6958 GiB/s]
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) low mild
  1 (1.00%) high severe
ascii/lemire_avx_ascii_path                                                                            
                        time:   [50.353 ns 50.588 ns 50.925 ns]
                        thrpt:  [56.492 GiB/s 56.869 GiB/s 57.133 GiB/s]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
ascii/range_sse         time:   [375.11 ns 375.87 ns 376.96 ns]                            
                        thrpt:  [7.6318 GiB/s 7.6538 GiB/s 7.6695 GiB/s]
Found 35 outliers among 100 measurements (35.00%)
  23 (23.00%) low severe
  8 (8.00%) high mild
  4 (4.00%) high severe
ascii/range_avx         time:   [272.39 ns 272.59 ns 272.82 ns]                            
                        thrpt:  [10.545 GiB/s 10.554 GiB/s 10.562 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

utf8/libcore            time:   [9.0154 us 9.0263 us 9.0389 us]                          
                        thrpt:  [1.7096 GiB/s 1.7119 GiB/s 1.7140 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
utf8/lemire_sse         time:   [2.1554 us 2.1568 us 2.1581 us]                             
                        thrpt:  [7.1601 GiB/s 7.1645 GiB/s 7.1693 GiB/s]
Found 15 outliers among 100 measurements (15.00%)
  11 (11.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
utf8/lemire_avx         time:   [1.9184 us 1.9188 us 1.9192 us]                             
                        thrpt:  [8.0515 GiB/s 8.0530 GiB/s 8.0547 GiB/s]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
utf8/lemire_avx_ascii_path                                                                             
                        time:   [1.4670 us 1.4679 us 1.4691 us]
                        thrpt:  [10.518 GiB/s 10.527 GiB/s 10.534 GiB/s]
utf8/range_sse          time:   [1.9426 us 1.9452 us 1.9491 us]                            
                        thrpt:  [7.9280 GiB/s 7.9439 GiB/s 7.9544 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
utf8/range_avx          time:   [1.4656 us 1.4733 us 1.4833 us]                            
                        thrpt:  [10.418 GiB/s 10.489 GiB/s 10.544 GiB/s]
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  9 (9.00%) high severe

Benchmarking all_utf8/libcore: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.7s or reduce sample count to 50
all_utf8/libcore        time:   [1.8757 ms 1.8790 ms 1.8832 ms]                              
                        thrpt:  [2.1672 GiB/s 2.1721 GiB/s 2.1759 GiB/s]
all_utf8/lemire_sse     time:   [586.47 us 586.93 us 587.47 us]                                
                        thrpt:  [6.9474 GiB/s 6.9538 GiB/s 6.9593 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
all_utf8/lemire_avx     time:   [517.80 us 518.97 us 521.39 us]                                
                        thrpt:  [7.8279 GiB/s 7.8644 GiB/s 7.8822 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
all_utf8/lemire_avx_ascii_path                                                                            
                        time:   [523.97 us 524.27 us 524.63 us]
                        thrpt:  [7.7796 GiB/s 7.7849 GiB/s 7.7894 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
all_utf8/range_sse      time:   [525.94 us 527.21 us 528.57 us]                               
                        thrpt:  [7.7216 GiB/s 7.7415 GiB/s 7.7601 GiB/s]
Found 21 outliers among 100 measurements (21.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  8 (8.00%) high mild
  11 (11.00%) high severe
all_utf8/range_avx      time:   [392.25 us 392.91 us 393.80 us]                               
                        thrpt:  [10.364 GiB/s 10.388 GiB/s 10.405 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high severe

all_utf8_with_garbage/libcore                                                                             
                        time:   [3.6752 ns 3.7353 ns 3.8034 ns]
                        thrpt:  [1137275 GiB/s 1158024 GiB/s 1176952 GiB/s]
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  8 (8.00%) high severe
all_utf8_with_garbage/lemire_sse                                                                            
                        time:   [616.73 us 616.89 us 617.03 us]
                        thrpt:  [7.0103 GiB/s 7.0119 GiB/s 7.0136 GiB/s]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high severe
all_utf8_with_garbage/lemire_avx                                                                            
                        time:   [551.11 us 552.09 us 554.06 us]
                        thrpt:  [7.8070 GiB/s 7.8349 GiB/s 7.8488 GiB/s]
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) low severe
  9 (9.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
all_utf8_with_garbage/lemire_avx_ascii_path                                                                            
                        time:   [557.08 us 557.28 us 557.51 us]
                        thrpt:  [7.7587 GiB/s 7.7618 GiB/s 7.7646 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
all_utf8_with_garbage/range_sse                                                                            
                        time:   [554.79 us 555.10 us 555.42 us]
                        thrpt:  [7.7879 GiB/s 7.7924 GiB/s 7.7967 GiB/s]
all_utf8_with_garbage/range_avx                                                                            
                        time:   [417.05 us 417.38 us 417.74 us]
                        thrpt:  [10.355 GiB/s 10.364 GiB/s 10.372 GiB/s]
Found 13 outliers among 100 measurements (13.00%)
  13 (13.00%) low mild

@Licenser

I re-read some of the Lemire algorithm and there are some key differences which might make it less suitable for general string validation.

The key points are:

  1. It's built to just validate, not to locate the error, so there is no tracking of where in the string an error occurs.
  2. To make the first point worse, it defers testing for errors until it has checked the entire input. This makes a lot of sense in its context, since a JSON parser can safely assume that most input will be valid JSON and errors are an edge case. This also explains why, on the benchmarks with invalid data, the stdlib implementation is a lot faster.
  3. It requires the input length to be divisible by 256 bytes; if it is not, it copies the last part into a buffer.
  4. It allocates a struct for each check.

I think ™️ that might make them slower for small payloads.

@ArniDagur

ArniDagur commented Jan 22, 2020

I'm at work now, so I will read this thread later in more detail, but here are some things I want to say:

The latter even includes benchmarks against the compiler's algorithm (which makes it probable I'm not the first person to think of this).

Indeed. I originally created the crate with inclusion in Rust's internals in mind. I still feel that some of the algorithms need a little refactoring to make them more idiomatic before being included. I also want to add @zwegner's algorithm (https://github.com/zwegner/faster-utf8-validator), which seems to be the fastest such algorithm to date.

But I haven't been able to successfully compile the benches, so I don't know how they stack up against the current implementation.

If the build fails, can you open an issue? :) I would appreciate that.

argnidagur/rust-isutf8

Also: You misspelled my name ;)

@CryZe
Contributor

CryZe commented Jan 22, 2020

It requires the input length to be divisible by 256 bytes; if it is not, it copies the last part into a buffer.

Is that the AVX version or something? The SSE version just allocates 16 bytes on the stack for the remaining <16 bytes.

It allocates a struct for each check.

I don't even know what this means. Are we talking about this one?

struct processed_utf_bytes {
  __m128i rawbytes;
  __m128i high_nibbles;
  __m128i carried_continuations;
};

This one gets completely optimized away.
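For readers unfamiliar with the stack-buffer trick mentioned above, here is a sketch of what such tail handling looks like (`padded_tail` is an invented name): the final partial chunk is copied into a zeroed 16-byte stack buffer so the vector loop never reads out of bounds. Zero bytes are plain ASCII, so the padding cannot hide an error; a multi-byte sequence truncated at the end of the input still fails validation, because its expected continuation bytes become 0x00.

```rust
/// Illustrative sketch: copy the trailing `len % 16` bytes into a
/// zero-padded stack buffer so a 16-byte SIMD loop can process the tail
/// as one full vector without reading past the end of the input.
fn padded_tail(bytes: &[u8]) -> [u8; 16] {
    let tail = &bytes[bytes.len() - bytes.len() % 16..];
    let mut buf = [0u8; 16];
    buf[..tail.len()].copy_from_slice(tail);
    buf
}
```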

@Licenser

Is that the AVX version or something? The SSE version just allocates 16 bytes.

Sorry, a bit tired, bit not byte :)

This one gets completely optimized away.

Good point

@CryZe
Contributor

CryZe commented Jan 22, 2020

Since this would have to go into core, can core even use runtime checks for target features yet? (We seem to need at least SSE 4.1.)

@ArniDagur

ArniDagur commented Jan 22, 2020

In reply to @Licenser:

its build to just validate not find the error, so there is no error tracking of where an error is in the string

I have thought about this and we have a few options:

  1. Have, as a compile-time feature, functions which check for errors on every iteration of the loop, so the error can be narrowed down. This should cost around 5% performance. If we wrapped the check in an unlikely! macro (maybe behind a compile-time feature called optimize_for_correct), this number would probably decrease. The 5% figure comes from Zwegner's findings, IIRC.
  2. First go through all the text to see if it's valid; we can plausibly assume that most text is valid UTF-8. Then, in case of errors, search for the error using scalar code or some variation of option 1.

We would need to think more about this.
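Option 2 above can be sketched as follows. This is only a control-flow illustration: std::str::from_utf8 stands in for both the fast position-free validator and the scalar rescan, and the names are invented.

```rust
/// Stand-in for a SIMD validator that only answers valid / not valid,
/// without tracking where an error occurred.
fn bulk_is_valid(v: &[u8]) -> bool {
    std::str::from_utf8(v).is_ok()
}

/// Two-pass strategy: run the position-free bulk check first; only when it
/// fails, rescan to recover `valid_up_to`. Since invalid input is assumed
/// to be the rare case, the slower rescan almost never runs.
fn validate_with_position(v: &[u8]) -> Result<(), usize> {
    if bulk_is_valid(v) {
        return Ok(());
    }
    // Slow path: a scalar pass pinpoints the first invalid byte.
    Err(std::str::from_utf8(v).unwrap_err().valid_up_to())
}
```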

@ArniDagur

Since this would have to go into core, can core even use runtime checks for target features yet (we need at least SSE 4.1 it seems)?

It can be compile-time only for now. Those who self-compile (such as myself) would see an improvement.

@zwegner

zwegner commented Jan 22, 2020

Hey, thanks for the shout out. I want to note that I have made several improvements to my code that aren't present in the master branch (but are in the wip branch), and it is significantly faster than when it was published (about 40%). Most importantly, I was able to completely elide the separate check for the first continuation byte.

When not memory-bandwidth bound, my algorithm is now over 2x faster (in my tests, on AVX2) than the one in Daniel Lemire's original repo (the one described in the blog post), and my SSE4 path is faster than the AVX2 path of that repo. The algorithm used in simdjson has made some improvements, but last I checked I think my algorithm is still faster.

I still need to finish writing up documentation for the new algorithm. The code's definitely more hairy, from dealing with more architectures (AVX-512 and NEON), handling pointer alignment, generating lookup tables with a Python script, etc... But I'm happy to help out if anyone wants to use it/port it.

@pickfire
Contributor

@Licenser Are you still working on this?

@Licenser

I ported the validator (https://github.com/simd-lite/faster-utf8-validator-rs), but from what I understand this all falls apart because the stdlib can't use native CPU features under the hood?

@hanna-kruppe
Contributor

It would need to be gated behind runtime CPUID checks, or left disabled by default behind a cfg(target_feature=...) that is only true if users rebuild libstd with those target features enabled (currently an unstable Cargo feature, but one that should become more accessible in the future). The latter is easy but only helps software that can afford to run exclusively on newer processors; the former faces the challenge of not regressing performance, but would help more users if it worked.

ifuncs or similar mechanisms might also be an option on certain platforms, but they're not very portable and have obscure and hard-to-satisfy constraints. Manually emulating them with function pointers initialized on first call might have more overhead, not sure.

@bprosnitz

My understanding is that benchmarks can be misleading for AVX (and possibly SSE) in general purpose code, because:

  • AVX has a startup time that won't be measured when repeatedly looping in a microbenchmark
  • AVX draws a lot of power and can end up slowing down other cores due to power draw and increased heat

Curious what others think about the suitability? Does it make sense only for sufficiently large strings?

@milkey-mouse

This article goes into detail on when it makes sense to use AVX (AVX-512 especially). The most relevant parts:

Downclocking, when it happens, is per core and for a short time after you have used particular instructions (e.g., ~2ms).

The bar for light AVX-512 is lower. Even if the work is spread across all cores, you may only see a 15% frequency reduction on some chips like a Xeon Gold. So you only have to check that AVX-512 gives you a greater than 15% gain for your overall application on a per-cycle basis.

Some older CPUs downclock all cores when any of them are using AVX2 instructions, but on newer ones, downclocking mostly affects only the core running them. Also, the SIMD instructions used for string validation would fall in the "light" category, as they don't involve the floating-point unit.

As @bprosnitz mentioned, take microbenchmarks with a grain of salt, but they certainly imply there is something to be gained from an accelerated validator.

@lemire

lemire commented Jul 14, 2020

Note that we have an updated validator called lookup 4...

simdjson/simdjson#993

It is going to be really hard to beat.

@bprosnitz

 My understanding is that benchmarks can be misleading for AVX (and possibly SSE) in general purpose code, because: AVX has a startup time that won't be measured when repeatedly looping in a microbenchmark; AVX draws a lot of power and can end up slowing down other cores due to power draw and increased heat. Curious what others think about the suitability? Does it make sense only for sufficiently large strings?

Firstly, let us set aside AVX-512 and "heavy" (numerical) AVX2 instructions. They have their uses (e.g., in machine learning, simulation). But that's probably not what you have in mind.

This being said...

Regarding power usage, it is generally true that the faster code is the code that uses less energy. So if you can multiply your speed using NEON, SSE, or AVX, go ahead: you'll come out on top. It is a bit like being concerned with climate change and observing that buses and trains use more energy than cars. They use more energy in total, but less energy per unit of work done. So you have to hold the work constant if you are going to make comparisons. Does it take more energy to do 4 additions, or to use one instruction that does 4 additions at once?

So SIMD instructions are the public transportation of computing. They are green and should be used as much as possible. (Again, I am setting aside AVX-512 and numerical AVX2 instructions that are something more controversial.)

Regarding the fear that SIMD instructions are somehow exotic and rare, and that if you ever use it, you will trigger a chain reaction of slowness... You are using AVX all the time... Read this commit where the committer identified that the hot function in his benchmark was __memmove_avx_unaligned_erms. You can bet that this function is AVX-based. The Golang runtime uses AVX, Glibc uses AVX, LLVM uses AVX, Java, and so forth. Even PHP uses SIMD instructions for some string algorithms. And yes, Rust programs use AVX or other SIMD instructions.

@lemire

lemire commented Jul 14, 2020

@hanna-kruppe

Manually emulating them with function pointers initialized on first call might have more overhead

I don't think so. We have a well-tested approach. It is as simple as this...

internal::atomic_ptr<const implementation> active_implementation{&detect_best_supported_implementation_on_first_use_singleton};


const implementation *detect_best_supported_implementation_on_first_use::set_best() const noexcept {
  return active_implementation = available_implementations.detect_best_supported();
}

So you just need an atomic function pointer.

Obviously, you do not get inlining, but that's about the only cost. Loading an atomic pointer is no more expensive, really, than loading an ordinary pointer. So this is pretty much free... except for the first time. You pay a price of first use, but that's not a fundamental limitation: you could set the best function any time, including at startup.
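A Rust analogue of this pattern can be sketched with OnceLock rather than a raw atomic pointer (the helper names here are invented, and the "AVX2" function is just a forwarding stub so the sketch stays self-contained; a real implementation would plug in actual SIMD validators):

```rust
use std::sync::OnceLock;

type ValidateFn = fn(&[u8]) -> bool;

fn validate_scalar(v: &[u8]) -> bool {
    std::str::from_utf8(v).is_ok()
}

// Stand-in for an AVX2 implementation; it forwards to the scalar one so
// this sketch works everywhere.
fn validate_avx2_stub(v: &[u8]) -> bool {
    validate_scalar(v)
}

static ACTIVE: OnceLock<ValidateFn> = OnceLock::new();

/// On the first call, detect the best supported implementation and cache
/// the function pointer; later calls pay only an atomic load plus an
/// indirect call (losing inlining, as noted above).
pub fn validate_utf8(v: &[u8]) -> bool {
    let f = ACTIVE.get_or_init(|| {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            if is_x86_feature_detected!("avx2") {
                return validate_avx2_stub as ValidateFn;
            }
        }
        validate_scalar as ValidateFn
    });
    f(v)
}
```

As in the C++ version, detection could also happen eagerly at startup instead of on first use.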

@zwegner

zwegner commented Jul 14, 2020

Note that we have an updated validator called lookup 4...

simdjson/simdjson#993

It is going to be really hard to beat.

I mentioned this off-hand on Twitter, but to clarify, in my benchmarks of pure UTF-8 validation, my code with scalar continuation byte checks and unaligned loads is still faster than lookup4 (or at least my translation of lookup4 to C). The difference depends on compiler (on my HSW, testing 1M of UTF-8, GCC gives .227 (me) vs .319 (L4) cycles/byte, while LLVM has .240 (me) vs .266 (L4)).

The picture gets more complicated in the context of simdjson, when plenty of other code is competing for the same execution ports (and the best algorithm is less clear), but I think in the case of Rust, the pure UTF-8 microbenchmarks are probably more representative.

@milkey-mouse

@lemire to clarify, my benchmarks are already using the lookup4 implementation in simdjson (from the pull request's branch).

@lemire

lemire commented Jul 15, 2020

@milkey-mouse Fantastic.

cc @jkeiser

@lemire

lemire commented Jul 17, 2020

Blog post: The cost of runtime dispatch.

@pickfire
Contributor

Thanks for explaining, but this all seems backwards to me. -Ctarget-cpu=native should be the default and if people want to distribute a binary then additional options should be specified to specify the target platform. The fact that it's even an unstable option to rebuild the standard library also seems backwards. The standard library at the very least should get built every time you build a project (with cross-project caching of the native version, and local caching if a non-native version is used).

I think you should understand that, from a packaging standpoint, we would not want to ship many binaries, each usable only by the particular processors certain people have. Instead of creating one binary we would have to create hundreds. This is also why Linux distributions build with only a very general SIMD baseline that almost everyone has. With runtime detection, a single binary can be kept, at the cost of a runtime check when the application starts, and a faster version can be used without many drawbacks. I believe the performance cost is insignificant compared to the cost of building for very specific CPUs and keeping all the binaries for each of them. This is also how ripgrep does SIMD: it uses runtime detection. Try discussing it with a package distributor and you will understand better why this is the case.

@BurntSushi
Member

That's how everything I know of uses SIMD unless it's doing some kind of JIT, but there's no need for a JIT here.

To follow up with what @pickfire said:

This is also how ripgrep does SIMD, which is using runtime detection.

And ripgrep is in good company. glibc does the same thing.

ripgrep does detection via its dependencies.

@mlindner

mlindner commented Apr 21, 2021

Perhaps my view is just too narrow, as I'm used to dealing only with fixed server platforms (targeting one specific CPU), where we build everything destined for the system, and my local macOS machine, which has a constrained CPU instruction set (and fat binaries for ARM vs. x86_64). Either way, I think compile-time selection should never be excluded as an option for people who want it. Thanks for taking the time to explain.

@BurntSushi
Member

@mlindner Yeah indeed. If I only had to deploy to a fixed set of server platforms where I knew exactly what ISA extensions were needed, then I'd totally skip the complexity of runtime CPU feature detection and just rely on compile time.

@pickfire
Contributor

And ripgrep is in good company. glibc does the same thing.

So I guess feature detection will be done multiple times? Wouldn't it be better to only have the cost of feature detection once across all crates?

@hkratz
Contributor

hkratz commented Apr 21, 2021

So I guess feature detection will be done multiple times? Wouldn't it be better to only have the cost of feature detection once across all crates?

Feature detection is cached in std. Also, most crates use an atomic function pointer, so the decision about which implementation to use is made only once.

@bjorn3
Member

bjorn3 commented Apr 21, 2021

If you use -Ctarget-cpu or -Ctarget-feature to enable target features, "runtime" detection through is_x86_feature_detected!() for those target features known to be enabled at compile time will compile to true without any runtime detection taking place at all.
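For reference, the runtime detection pattern under discussion typically looks like the sketch below: a safe wrapper checks the feature with `is_x86_feature_detected!` and only then calls a `#[target_feature]` function (the `sum`/`sum_avx2` names are invented for illustration). This is the code that, per the comment above, should collapse to the fast path when the feature is enabled at compile time.

```rust
// Runtime dispatch: check the CPU feature once per call site, then
// call an implementation compiled with that feature enabled.
#[cfg(target_arch = "x86_64")]
fn sum(v: &[u32]) -> u32 {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: we just verified AVX2 is available on this CPU.
        unsafe { sum_avx2(v) }
    } else {
        v.iter().sum()
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(v: &[u32]) -> u32 {
    // The autovectorizer may emit AVX2 here; explicit intrinsics
    // from core::arch could be used as well.
    v.iter().sum()
}

// Portable fallback for non-x86-64 targets.
#[cfg(not(target_arch = "x86_64"))]
fn sum(v: &[u32]) -> u32 {
    v.iter().sum()
}

fn main() {
    assert_eq!(sum(&[1, 2, 3, 4]), 10);
    println!("ok");
}
```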

@hkratz
Contributor

hkratz commented Apr 21, 2021

If you use -Ctarget-cpu or -Ctarget-feature to enable target features, "runtime" detection through is_x86_feature_detected!() for those target features known to be enabled at compile time will compile to true without any runtime detection taking place at all.

That never worked for me (Godbolt).

@bjorn3
Member

bjorn3 commented Apr 21, 2021

Looks like the cfg!() that is supposed to do this is inside libstd instead of inside the is_x86_feature_detected!() macro: https://github.com/rust-lang/stdarch/blob/d385078ee7cd8656c4a43bcc321578d0f65ba691/crates/std_detect/src/detect/macros.rs#L97

I opened rust-lang/stdarch#1135 for this.

@hkratz
Contributor

hkratz commented Apr 26, 2021

The compatible API in simdutf8 v0.1.1 is now as fast on small valid inputs and up to 22 times faster on large valid non-ASCII strings (three times as fast on ASCII) on x86-64 with AVX 2 support.

[benchmark chart]

If there is a way forward to get this into std or core I would be happy to work on it.

@pickfire
Contributor

@hkratz What about ARM and CPUs without SIMD? Will there be a regression?

@hkratz
Contributor

hkratz commented Apr 28, 2021

@hkratz What about ARM and CPUs without SIMD? Will there be a regression?

On arm we would just use the current standard library from_utf8() implementation until the required SIMD intrinsics are implemented. On older x86 CPUs without SSE 4.2 or AVX 2 we fall back to the scalar implementation at runtime after the initial detection. The overhead is just one additional fn pointer call on subsequent from_utf8() calls.

@Voultapher
Contributor

@m-ou-se having read the entire conversation in this thread, it's not entirely clear to me what is blocking this from progressing. Would this set a precedent for runtime feature detection, and/or is it something else? I'd appreciate it if you, or someone else from the libs team, could help me and others like @hkratz understand what would be needed to get this into libcore.

@bjorn3
Member

bjorn3 commented Jul 18, 2021

@Voultapher The main problem is how to detect which target features are supported from inside libcore. Except on x86, all platforms require asking the OS which target features the CPU supports. libcore by definition doesn't know anything about the OS and as such can't know which target features are available. It is not safe to use target features before checking that they are supported: the results of doing so range from crashes to unexpected instructions executing, and in no case will it produce correct results.

@lemire

lemire commented Jul 18, 2021

The Rust runtime does rely on runtime feature detection, but, as far as I can tell, only indirectly (e.g., by calling memcpy).

Languages like Go, Java, C/C++... do use (quite often) runtime feature detection as part of the core libraries.

A sensible way forward would be to add runtime CPU feature detection directly in the Rust core.

@Voultapher
Contributor

Voultapher commented Jul 18, 2021

I agree, there are enough examples that demonstrate it can be done within the larger constraints of the Rust language. A potentially nice side effect would be efficient and correct feature detection in the standard library; there are enough libraries and tools (ripgrep, memchr, etc.) that have to hand-roll it today. And honestly, intrinsics support feels incomplete without feature detection.

@bjorn3
Member

bjorn3 commented Jul 18, 2021

CPU feature detection is supported as part of libstd, as that is the part of the standard library that is allowed to interface with the OS, but libcore is not. How should CPU feature detection work when compiling a kernel against libcore? That would require some way to pass the supported features to libcore. What if a kernel works right now with stable rustc and doesn't provide the supported features to libcore, as there is currently no such mechanism? How can target feature detection be added to libcore without breaking a kernel that doesn't pass the supported target features to it?

@Voultapher
Contributor

Voultapher commented Jul 18, 2021

To clarify: I worded my earlier comment poorly. I'm thinking about both feature detection and the corresponding dispatch, for example implemented with an atomic function-pointer overwrite.

I'm not deep enough into the Rust standard library to fully understand the implications of what you said. I'm looking at this from a user perspective. Many Rust applications use the standard library, not only libcore. Parts of the standard library interface directly and transitively perform UTF-8 validation. Somewhere in this stack of abstractions, feature detection and 'dispatch' could happen so that more efficient versions are used.

Naively, if it can't easily be added to libcore, why not add it to the stdlib only, as a first step?

@hkratz
Contributor

hkratz commented Jul 18, 2021

CPU feature detection is supported as part of libstd, as that is the part of the standard library that is allowed to interface with the OS, but libcore is not. How should CPU feature detection work when compiling a kernel against libcore? That would require some way to pass the supported features to libcore. What if a kernel works right now with stable rustc and doesn't provide the supported features to libcore, as there is currently no such mechanism? How can target feature detection be added to libcore without breaking a kernel that doesn't pass the supported target features to it?

One way would be to tell the core library that SIMD implementations are supported (core::somewhere::set_cpu_features_enabled(...)), or to set a callback which returns whether they are supported (core::somewhere::set_cpu_features_check_fn(...)), from the std startup code.
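A rough sketch of what that hand-off might look like. The module path `core::somewhere` and all names here are hypothetical (taken from the comment above), and the sketch uses std types so it runs standalone; the real thing would live in core with the setter called from libstd's startup code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical: a flag inside "core" that startup code flips once it
// has performed OS-level CPU feature detection.
static SIMD_ENABLED: AtomicBool = AtomicBool::new(false);

// What `core::somewhere::set_cpu_features_enabled(...)` might amount to.
pub fn set_cpu_features_enabled(enabled: bool) {
    SIMD_ENABLED.store(enabled, Ordering::Release);
}

// A core routine consults the flag to pick an implementation. In this
// sketch both paths defer to the scalar validator.
pub fn from_utf8_is_valid(v: &[u8]) -> bool {
    if SIMD_ENABLED.load(Ordering::Acquire) {
        // A real implementation would dispatch to the SIMD validator here.
    }
    std::str::from_utf8(v).is_ok()
}

fn main() {
    // libstd startup would do real detection before setting this.
    set_cpu_features_enabled(true);
    assert!(from_utf8_is_valid("héllo".as_bytes()));
    println!("ok");
}
```

The default of `false` matters: a kernel compiled against libcore that never calls the setter would keep getting the scalar path, which addresses the backward-compatibility concern raised above.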

For aarch64, detection would not even be needed, because ASIMD/NEON is enabled by default on nightly Rust, so LLVM already autovectorizes to NEON code.

@hkratz
Contributor

hkratz commented Jul 18, 2021

Naively, if it can't easily be added to libcore, why not add it to the stdlib only, as a first step?

std::str::from_utf8() is just an alias (via pub use) for core::str::from_utf8(). Changing that would be a breaking change.

@CryZe
Contributor

CryZe commented Jul 18, 2021

With LLVM/GCC/System V having recently introduced the concept of x86-64-v1/2/3/4 microarchitecture levels, there's a reasonable chance we'll get actual targets for those. There's also always the possibility of using build-std with a specific set of features / target CPU. So I'd say for an initial implementation we wouldn't need any runtime detection at all.
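As a sketch of that compile-time route, the commands below show how build-std can be combined with a microarchitecture level or the native CPU today (nightly-only; the target triple is an example and `-Zbuild-std` requires the `rust-src` rustup component):

```shell
# Rebuild the standard library for the x86-64-v3 level
# (SSE4.2, AVX, AVX2, BMI2, ...):
RUSTFLAGS="-C target-cpu=x86-64-v3" \
    cargo +nightly build -Z build-std --target x86_64-unknown-linux-gnu --release

# Or pin to the build machine's own CPU instead of a portability level:
RUSTFLAGS="-C target-cpu=native" \
    cargo +nightly build -Z build-std --target x86_64-unknown-linux-gnu --release
```

With either invocation, `is_x86_feature_detected!` checks for features implied by the chosen CPU can in principle be folded away at compile time, which is the point being made in this thread.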

@Voultapher
Contributor

Voultapher commented Jul 18, 2021

It seems this issue is suffering a bit from https://xkcd.com/2368/. How can we get to an MVP, even one as small as: you only get this feature if you compile std with a feature flag, or only on x86-64, etc.?

Given that it's clear that:

  • Runtime-feature-detection-based dispatch can be done (glibc, Go, etc.)
  • It's desirable

All that remains is breaking the problem into smaller chunks and doing them.

@hkratz
Contributor

hkratz commented Jul 18, 2021

I am currently working on speeding up SIMD UTF-8 validation for strings < 128 bytes. Afterwards I plan to work on a PR adding SIMD UTF-8 support in core for aarch64, as that target is compiled with SIMD support anyway. Following that we can look at the tradeoffs and performance in real-world projects and then maybe implement x86-64 support with runtime detection in core (checking CPUID and CR4.OSXSAVE should work on all operating systems; if not, it could be enabled with a flag std sets when compiling core as part of std).
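For context, the CPUID/OSXSAVE check mentioned here is OS-independent because the OS advertises its SIMD state support through CPUID and XCR0 rather than through a syscall. A sketch (the function name is invented; the bit positions are the architectural ones for AVX):

```rust
// AVX may only be used if the CPU reports it (CPUID leaf 1, ECX bit 28)
// AND the OS has enabled extended state saving (OSXSAVE, ECX bit 27)
// AND XCR0 shows XMM+YMM state enabled.
#[cfg(target_arch = "x86_64")]
fn os_supports_avx() -> bool {
    use std::arch::x86_64::{__cpuid, _xgetbv};
    // SAFETY: CPUID exists on all x86-64 CPUs. (Environments such as
    // SGX enclaves, discussed below, would need an extra guard.)
    let leaf1 = unsafe { __cpuid(1) };
    let osxsave = leaf1.ecx & (1 << 27) != 0;
    let avx = leaf1.ecx & (1 << 28) != 0;
    if !(osxsave && avx) {
        return false;
    }
    // SAFETY: XGETBV is only executed after confirming OSXSAVE.
    let xcr0 = unsafe { _xgetbv(0) };
    // Bits 1 and 2 of XCR0: SSE (XMM) and AVX (YMM) state enabled.
    xcr0 & 0b110 == 0b110
}

#[cfg(not(target_arch = "x86_64"))]
fn os_supports_avx() -> bool {
    false
}

fn main() {
    println!("AVX usable: {}", os_supports_avx());
}
```

This is essentially what std_detect does for the `"avx"` feature, just written out without any OS dependency, which is why it is a plausible candidate for core.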

@bjorn3
Member

bjorn3 commented Jul 18, 2021

checking CPUID and CR4.OSXSAVE should work on all operating systems; if not, it could be enabled with a flag std sets when compiling core as part of std

It doesn't work inside an SGX secure enclave: https://github.com/rust-lang/stdarch/blob/ec989330959b31348d182767a0d40291519ba9d2/crates/core_arch/src/x86/cpuid.rs#L101-L104

@hkratz
Contributor

hkratz commented Jul 18, 2021

checking CPUID and CR4.OSXSAVE should work on all operating systems; if not, it could be enabled with a flag std sets when compiling core as part of std

It doesn't work inside an SGX secure enclave: https://github.com/rust-lang/stdarch/blob/ec989330959b31348d182767a0d40291519ba9d2/crates/core_arch/src/x86/cpuid.rs#L101-L104

But we can just check for target_env = "sgx" at compile-time and fall back to scalar UTF-8 validation, right?

@bjorn3
Member

bjorn3 commented Jul 18, 2021

Yes, or check has_cpuid().

@dralley
Contributor

dralley commented Jul 12, 2022

I am currently working on speeding up SIMD UTF-8 validation for strings < 128 bytes. Afterwards I plan to work on a PR adding SIMD UTF-8 support in core for aarch64, as that target is compiled with SIMD support anyway.

@hkratz How did that go?

@workingjubilee workingjubilee added the A-target-feature Area: Enabling/disabling target features like AVX, Neon, etc. label Mar 3, 2023