str: Implement a faster Chars iterator for &str #15638

Merged · merged 6 commits into rust-lang:master on Jul 19, 2014
Conversation

@bluss (Member) commented Jul 13, 2014

Reimplement the string slice's Iterator<char> by wrapping the already efficient
slice iterator.

The iterator uses our guarantee that the string contains valid UTF-8; its only unsafe
code is transmuting the decoded u32 into char.

Benchmarks suggest that the runtime of the Chars benchmarks is reduced by up to 30%,
and the runtime of reversed Chars by up to 60%.

BEFORE
test str::bench::char_indicesator                          ... bench:       124 ns/iter (+/- 1)
test str::bench::char_indicesator_rev                      ... bench:       188 ns/iter (+/- 9)
test str::bench::char_iterator                             ... bench:       122 ns/iter (+/- 2)
test str::bench::char_iterator_ascii                       ... bench:       302 ns/iter (+/- 41)
test str::bench::char_iterator_for                         ... bench:       123 ns/iter (+/- 4)
test str::bench::char_iterator_rev                         ... bench:       189 ns/iter (+/- 14)
test str::bench::char_iterator_rev_for                     ... bench:       177 ns/iter (+/- 4)

AFTER
test str::bench::char_indicesator                          ... bench:        85 ns/iter (+/- 3)
test str::bench::char_indicesator_rev                      ... bench:        82 ns/iter (+/- 2)
test str::bench::char_iterator                             ... bench:       100 ns/iter (+/- 3)
test str::bench::char_iterator_ascii                       ... bench:       317 ns/iter (+/- 3)
test str::bench::char_iterator_for                         ... bench:        86 ns/iter (+/- 2)
test str::bench::char_iterator_rev                         ... bench:        80 ns/iter (+/- 6)
test str::bench::char_iterator_rev_for                     ... bench:        68 ns/iter (+/- 0)

Note: Branch name is no longer indicative of the implementation.
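For readers unfamiliar with the approach, here is a minimal sketch in today's Rust (not the PR's actual code; the name SketchChars is illustrative) of what "wrapping the slice iterator" means: the char iterator owns a byte-slice iterator and decodes UTF-8 from it, and the only unsafe step is converting the decoded u32 to char, justified by the &str validity guarantee.

```rust
use std::slice;

pub struct SketchChars<'a> {
    iter: slice::Iter<'a, u8>,
}

const CONT_MASK: u8 = 0b0011_1111; // value bits of a continuation byte

impl<'a> Iterator for SketchChars<'a> {
    type Item = char;

    #[inline]
    fn next(&mut self) -> Option<char> {
        let x = *self.iter.next()?;
        if x < 128 {
            return Some(x as char); // ASCII fast path
        }
        // Multibyte: start from the value bits of the leading byte and shift in
        // six value bits per continuation byte.
        let init = (x & 0b0001_1111) as u32;
        let y = *self.iter.next().unwrap_or(&0);
        let mut ch = init << 6 | (y & CONT_MASK) as u32;
        if x >= 0xE0 {
            let z = *self.iter.next().unwrap_or(&0);
            ch = (init & 0x0F) << 12
                | ((y & CONT_MASK) as u32) << 6
                | (z & CONT_MASK) as u32;
            if x >= 0xF0 {
                let w = *self.iter.next().unwrap_or(&0);
                ch = (init & 0x07) << 18
                    | ((y & CONT_MASK) as u32) << 12
                    | ((z & CONT_MASK) as u32) << 6
                    | (w & CONT_MASK) as u32;
            }
        }
        // Safe: the wrapped bytes come from a &str, i.e. valid UTF-8.
        Some(unsafe { char::from_u32_unchecked(ch) })
    }
}

fn sketch_chars(s: &str) -> SketchChars<'_> {
    SketchChars { iter: s.as_bytes().iter() }
}

fn main() {
    let s = "héllo, 世界 🎈";
    assert!(sketch_chars(s).eq(s.chars()));
}
```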

@errordeveloper (Contributor) commented
Wow, 20-30% 👍

@huonw (Member) commented Jul 13, 2014

Just to be clear: there's no way to get similar speed in safe(r) code using self.as_bytes().iter() rather than raw pointers directly, because that does bounds checking against the end of the slice when decoding multibyte characters, even though the UTF-8 validity requirement makes it unnecessary?

@bluss (Member, author) commented Jul 13, 2014

It may well be possible, @huonw. An initial prototype doesn't beat the current libstd implementation, though; I'll come back to that tomorrow.

Here we have to ask -- what's the cheapest way to do something like it.next().unwrap()?

pub struct Citems<'a> {
    iter: std::slice::Items<'a, u8>
}

impl<'a> Iterator<char> for Citems<'a>
{
    #[inline]
    fn next(&mut self) -> Option<char>
    {
        // Slow path: kept out of line so the ASCII fast path stays small.
        #[inline(never)]
        fn decode_multibyte<'a>(fst: u8, it: &mut std::slice::Items<'a, u8>)
            -> Option<char>
        {
            // Width in bytes of the whole sequence, looked up from the first byte.
            let w = UTF8_CHAR_WIDTH[fst as uint];
            // Start with the value bits of the first byte, then shift in six
            // value bits per continuation byte.
            let mut ch = utf8_first_byte!(fst, w as uint);
            ch = utf8_acc_cont_byte!(ch, it.next().map_or(0, |x| *x));
            if w > 2 {
                ch = utf8_acc_cont_byte!(ch, it.next().map_or(0, |x| *x));
                if w > 3 {
                    ch = utf8_acc_cont_byte!(ch, it.next().map_or(0, |x| *x));
                }
            }
            unsafe {
                Some(transmute(ch))
            }
        }

        match self.iter.next() {
            Some(&next_byte) => {
                if next_byte < 128 {
                    Some(next_byte as char)
                } else {
                    decode_multibyte(next_byte, &mut self.iter)
                }
            }
            None => None
        }
    }
}

@alexcrichton (Member) commented
This is awesome, thanks for optimizing this! I, like @huonw, am also interested if this could be done with a bytes iterator, it seems like it should be able to compile down to the same thing.

@bluss (Member, author) commented Jul 13, 2014

I agree -- but I haven't been able to write an implementation based on the slice iterator that performs as well. My understanding is simplified, but I tried putting the multibyte case in the slow path and making the ASCII-only case fast; even so, the slice iterator doesn't perform as well as the pointer-arithmetic-based chars iterator, especially on ASCII-dominated text.

Is this related to #11751 ?

My playground code for comparing the three iterator implementations is here; the slice-iterator-based implementation is on this line.

Benchmarks from the playground code (wikitext decodes the English Wikipedia front page source):

test bench::char_iterator               ... bench:       112 ns/iter (+/- 0)
test bench::char_iterator_ascii         ... bench:       301 ns/iter (+/- 1)
test bench::char_iterator_rev           ... bench:       181 ns/iter (+/- 2)
test bench::char_iterator_wikitext      ... bench:     77445 ns/iter (+/- 1578)
test bench::citems_iterator             ... bench:       111 ns/iter (+/- 1)
test bench::citems_iterator_ascii       ... bench:       292 ns/iter (+/- 4)
test bench::citems_iterator_wikitext    ... bench:     77120 ns/iter (+/- 341)
test bench::new_chars_iterator          ... bench:        84 ns/iter (+/- 11)
test bench::new_chars_iterator_ascii    ... bench:       225 ns/iter (+/- 1)
test bench::new_chars_iterator_rev      ... bench:       109 ns/iter (+/- 3)
test bench::new_chars_iterator_wikitext ... bench:     54683 ns/iter (+/- 211)

@huonw (Member) commented Jul 13, 2014

Yes, #11751 will get in the way; theoretically rust-lang/llvm#14 will improve it. (If you felt like it, you could manipulate your src/llvm submodule to include that and test; I'm not really sure how easy that is to do.)

@bluss (Member, author) commented Jul 13, 2014

Updated comment because of major thinko -- nothing seems to be working the way I thought it was.

@bluss (Member, author) commented Jul 13, 2014

Now I've managed to compile a rustc that knows about -C null-check-elimination (pull in the llvm submodule; configure to disable submodule management; make)

The result: no improvement (citems is the slice-iterator-based impl).

test bench::citems_iterator             ... bench:       115 ns/iter (+/- 15)
test bench::citems_iterator_ascii       ... bench:       311 ns/iter (+/- 44)
test bench::citems_iterator_wikitext    ... bench:     77793 ns/iter (+/- 630)
test bench::new_chars_iterator          ... bench:       104 ns/iter (+/- 12)
test bench::new_chars_iterator_ascii    ... bench:       223 ns/iter (+/- 4)
test bench::new_chars_iterator_rev      ... bench:       121 ns/iter (+/- 16)
test bench::new_chars_iterator_wikitext ... bench:     55178 ns/iter (+/- 323)

@alexcrichton (Member) commented
I'm a little uneasy adding more unsafe iteration when it already exists for vectors (duplication of logic). Is there a performance win for this rewrite using safe iteration over the previous code, or is it only a win using unsafe code?

@bluss (Member, author) commented Jul 13, 2014

Duplication of logic is a good argument against, but in the end we need to provide a great Chars iterator. I don't feel that this code is particularly risky; Rust's string handling is mature enough now that there aren't many holes left in libstd APIs through which invalidly encoded data can slip.

By itself, the implementation using the slice iterator is anywhere from no improvement to a few percent faster; we could wait and see whether further code-generation improvements make up the rest. It's not important to merge that part now, but it would be good to know from @zwarich whether his improvements are going to help us here.

@alexcrichton (Member) commented
Perhaps we could land the other optimizations and hold off on the unsafe pointers until rust-lang/llvm#14 is merged into master?

@bluss (Member, author) commented Jul 14, 2014

I did test those changes already, but if there are more, it would be very interesting to see; the suggestion is not unreasonable. I should complete the implementation that uses just the slice iterator (the chars().rev() impl is missing), and then we can see whether it's worth adding on its own. That's fine by me.

I don't feel a reimplementation of the slice iterator is that scary; it's a good model of a useful pattern in itself, and it will come back when users want to implement, for example, "slices with stride > 1".

@bluss (Member, author) commented Jul 16, 2014

A ton of elbow grease has produced a slice-iterator-based decoder that's faster than both the present Chars iterator and the pointer-arithmetic Chars iterator in all my test cases. It's not pretty, though. Will update the PR tomorrow.

#[deriving(Clone)]
pub struct CharItems<'a> {
    iter: slice::Items<'a, u8>
}

#[inline]
fn unwrap_or_0(opt: Option<&u8>) -> u8 {
    match opt {
        Some(&byte) => byte,
        None => 0,
    }
}

impl<'a> Iterator<char> for CharItems<'a> {
    #[inline]
    fn next(&mut self) -> Option<char> {
        fn decode_multibyte<'a>(x: u8, it: &mut slice::Items<'a, u8>) -> char {
            // NOTE: Performance is very sensitive to the exact formulation here
            // Decode from a byte combination out of: [[[x y] z] w]
            let CONT_MASK = 0x3F; // continuation byte mask
            let init = utf8_first_byte!(x, 2); // value bits of x, treated as a 2-byte lead for now
            let mut ch;
            let y = unwrap_or_0(it.next());
            ch = init << 6 | (y & CONT_MASK) as u32;
            if x >= 0xE0 { // leading byte of a 3- or 4-byte sequence
                let z = unwrap_or_0(it.next());

                let y_z = (((y & CONT_MASK) as u32) << 6) | (z & CONT_MASK) as u32;
                ch = init << 12 | y_z;
                if x >= 0xF0 { // leading byte of a 4-byte sequence
                    let w = unwrap_or_0(it.next());
                    ch = (init & 7) << 18 | y_z << 6 | (w & CONT_MASK) as u32;
                }
            }
            unsafe {
                mem::transmute(ch)
            }
        }

        match self.iter.next() {
            None => None,
            Some(&next_byte) => {
                if next_byte < 128 {
                    Some(next_byte as char)
                } else {
                    Some(decode_multibyte(next_byte, &mut self.iter))
                }
            }
        }
    }
}
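To make the bit arithmetic concrete, here is a small worked example in today's Rust (illustrative, not PR code) decoding the two-byte sequence 0xC3 0xA9 with the same masks; utf8_first_byte!(x, 2) amounts to keeping the low five value bits of the leading byte.

```rust
fn main() {
    const CONT_MASK: u8 = 0b0011_1111;           // value bits of a continuation byte
    let (x, y) = (0xC3u8, 0xA9u8);               // UTF-8 encoding of 'é'
    let init = (x & 0b0001_1111) as u32;         // utf8_first_byte!(x, 2) => 0x03
    let ch = init << 6 | (y & CONT_MASK) as u32; // 0x03 << 6 | 0x29 = 0xE9
    assert_eq!(char::from_u32(ch), Some('é'));   // U+00E9
}
```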

@bluss (Member, author) commented Jul 17, 2014

There is no conclusive evidence that the new impl (wrapping slice::Items<'a, u8>) is faster. How well it compiles depends heavily on the context: even though it's faster in a standalone test program, for some reason it is slower in the libcore bench tests. :(

What worries me here is the result for the char_iterator_for test that I added, which runs for ch in s.chars() { black_box(ch) }. Without that regression, I'd love to merge this.
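For reference, the loop in question looks essentially like this; shown here with today's std::hint::black_box and an arbitrary mixed ASCII/UTF-8 sample string rather than the benchmark's actual input:

```rust
use std::hint::black_box;

fn main() {
    let s = "Mary had a little lamb, löwenzahn, 中华, Việt Nam";
    // Force the iterator to decode every code point without letting the
    // optimizer discard the work.
    for ch in s.chars() {
        black_box(ch);
    }
}
```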

Before

test str::bench::char_indicesator                          ... bench:       125 ns/iter (+/- 293)
test str::bench::char_indicesator_rev                      ... bench:       189 ns/iter (+/- 26)
test str::bench::char_iterator                             ... bench:       152 ns/iter (+/- 296)
test str::bench::char_iterator_ascii                       ... bench:       302 ns/iter (+/- 485)
test str::bench::char_iterator_for                         ... bench:       121 ns/iter (+/- 2)
test str::bench::char_iterator_rev                         ... bench:       418 ns/iter (+/- 357)
test str::bench::char_iterator_rev_for                     ... bench:       195 ns/iter (+/- 413)

after

test str::bench::char_indicesator                          ... bench:       126 ns/iter (+/- 2)
test str::bench::char_indicesator_rev                      ... bench:       137 ns/iter (+/- 15)
test str::bench::char_iterator                             ... bench:       113 ns/iter (+/- 11)
test str::bench::char_iterator_ascii                       ... bench:       310 ns/iter (+/- 14)
test str::bench::char_iterator_for                         ... bench:       179 ns/iter (+/- 3)
test str::bench::char_iterator_rev                         ... bench:       107 ns/iter (+/- 8)
test str::bench::char_iterator_rev_for                     ... bench:       167 ns/iter (+/- 11)

@bluss (Member, author) commented Jul 17, 2014

It was the inlining decisions: inlining even the multibyte case makes the difference. How does that look with the update?

Benchmarking on a laptop always leaves you wondering whether something got faster only because of turbo-boost effects you can't control. Either way, with the inline hint the for-loop test case sped up to the level of the others.

test str::bench::char_indicesator                          ... bench:        85 ns/iter (+/- 3)
test str::bench::char_indicesator_rev                      ... bench:        82 ns/iter (+/- 6)
test str::bench::char_iterator                             ... bench:        85 ns/iter (+/- 2)
test str::bench::char_iterator_ascii                       ... bench:       317 ns/iter (+/- 2)
test str::bench::char_iterator_for                         ... bench:        87 ns/iter (+/- 3)
test str::bench::char_iterator_rev                         ... bench:        79 ns/iter (+/- 1)
test str::bench::char_iterator_rev_for                     ... bench:        68 ns/iter (+/- 2)

@bluss (Member, author) commented Jul 17, 2014

Updating the main PR text in case this is merged.

root added 2 commits July 17, 2014 20:21
Test using for ch in s.chars() { black_box(ch) } to have a test that should force the iterator to run its full decoding computations.
Re-use the vector iterator to implement the chars iterator.

The iterator uses our guarantee that the string contains valid UTF-8,
but its only unsafe code is transmuting the decoded u32 into char.
fn decode_multibyte<'a>(x: u8, it: &mut slice::Items<'a, u8>) -> char {
// NOTE: Performance is very sensitive to the exact formulation here
// Decode from a byte combination out of: [[[x y] z] w]
let cont_mask = 0x3F; // continuation byte mask
Review comment (Member):
Could this be a static?

Test iterating (decoding) every codepoint.
match opt {
Some(&byte) => byte,
None => 0,
}
}
Review comment (Member):

Was this more performant than foo.unwrap_or(0)? (It's basically the same width in characters.)

@bluss (author) replied:

Thank you for the comments! I didn't like the look of *foo.unwrap_or(&0); I think this is better.
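For what it's worth, the spellings under discussion behave identically; in today's Rust there is also Option::copied. A sketch, not code from the PR:

```rust
fn unwrap_or_0(opt: Option<&u8>) -> u8 {
    match opt {
        Some(&byte) => byte,
        None => 0,
    }
}

fn main() {
    let opt = Some(&0xA9u8);
    assert_eq!(unwrap_or_0(opt), 0xA9);          // the helper used in the PR
    assert_eq!(*opt.unwrap_or(&0), 0xA9);        // the form discussed above
    assert_eq!(opt.copied().unwrap_or(0), 0xA9); // a modern equivalent
}
```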

@alexcrichton (Member) commented
Awesome work @blake2-ppc! It's always nice to hear that unsafe code isn't necessary when safe code suffices.


static TAG_CONT_U8: u8 = 128u8;
/// Mask of the value bits of a continuation byte
static CONT_MASK: u8 = 0x3Fu8;
Review comment (Member):
These could be binary constants too, 0b0011_1111 and 0b1000_0000.

@bluss (author) replied:

Fixed.
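In today's Rust these would typically be const items; the binary spellings suggested above look like this (values taken from the thread):

```rust
/// Tag bits of a UTF-8 continuation byte.
const TAG_CONT_U8: u8 = 0b1000_0000;
/// Mask of the value bits of a continuation byte.
const CONT_MASK: u8 = 0b0011_1111;

fn main() {
    // A continuation byte carries the tag plus six value bits: 0b10xx_xxxx.
    assert_eq!(TAG_CONT_U8, 0x80);
    assert_eq!(CONT_MASK, 0x3F);
    assert_eq!(0xA9u8 & CONT_MASK, 0x29);
}
```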

@huonw (Member) commented Jul 18, 2014

Could you add the benchmarks (or at least mention rough percentages) to a commit message? (So that someone diving through the history can understand why this code is designed the way it is.)

root added 2 commits July 19, 2014 00:28
Thanks to comments from @huonw, clarify decoding details and use
statics for important constants for UTF-8 decoding. Convert some magic
numbers scattered in the same file to use the statics too.
Thanks to comments from @alexcrichton, write the next/next_back function
bodies without nested functions in a more top-to-bottom flow style.

Also improve comment style and motivate the unsafe blocks with comments.
@bluss (Member, author) commented Jul 19, 2014

Commits pushed to incorporate the excellent suggestions. Benchmarking results added to the PR message (they will be in the merge commit log).

@bluss (Member, author) commented Jul 19, 2014

In-tree benchmarking is very time consuming, but I felt it was necessary to justify the change.

The development benchmark used the HTML source of en.wikipedia.org/wiki/Main_Page, a typical almost-ASCII-only document with small chunks of UTF-8. Those benchmarks are even more convincing, for some reason. (Maybe code size? The size of the .next() method may pay off in a long-running simple loop like this.)

new Chars vs old — s.chars().count()

NEW test bench::char_items_wikitext     ... bench:    126095 ns/iter (+/- 8568)
NEW test bench::char_items_wikitext_rev ... bench:    120053 ns/iter (+/- 2175)
OLD test bench::chars_wikitext          ... bench:    169647 ns/iter (+/- 4876)
OLD test bench::chars_wikitext_rev      ... bench:    228570 ns/iter (+/- 2267)

new Chars vs old — for ch in s.chars() { black_box(ch) }

NEW test bench::char_items_wikitext     ... bench:    120153 ns/iter (+/- 1482)
NEW test bench::char_items_wikitext_rev ... bench:     83574 ns/iter (+/- 1181)
OLD test bench::chars_wikitext          ... bench:    278724 ns/iter (+/- 2909)
OLD test bench::chars_wikitext_rev      ... bench:    202086 ns/iter (+/- 1790)

fn unwrap_or_0(opt: Option<&u8>) -> u8 {
match opt {
Some(&byte) => byte,
None => 0,
Review comment (Member):

I idly wonder if this is a place where we could use a (hypothetical) unreachable intrinsic and possibly gain a little more speed, since the str invariant guarantees this code will never be hit.

@bluss (author) replied:

Nice thought. I didn't want to put fail!() or similar in there; it would add a failure path to an otherwise failure-free loop.

@bluss (author) replied:

I forgot that I did try

    unsafe {
        let ptr: *const u8 = mem::transmute(opt);
        *ptr
    }

as well, but it didn't improve anything much. @thestinger is a wizard and would know why, but I don't.

Review comment (Member):
Well, the assembly/IR for that is certainly really nice: http://is.gd/MLxXeK

; Function Attrs: nounwind readonly uwtable
define i8 @deref(i8* nocapture readonly) unnamed_addr #0 {
entry-block:
  %1 = load i8* %0, align 1
  ret i8 %1
}

This branch will be really easy to predict, since the None arm is never taken, so maybe that is making it essentially costless?
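Aside: the hypothetical intrinsic discussed here exists in today's Rust as std::hint::unreachable_unchecked. A sketch of how it could replace the 0 fallback (not what this PR did; the caller must uphold the no-truncated-sequence invariant or the call is undefined behaviour):

```rust
use std::hint::unreachable_unchecked;

/// Sketch only: rely on the &str invariant (no truncated multibyte sequences)
/// instead of substituting 0 for a missing continuation byte.
///
/// Safety: `opt` must never be `None`.
unsafe fn unwrap_cont(opt: Option<&u8>) -> u8 {
    match opt {
        Some(&byte) => byte,
        None => unsafe { unreachable_unchecked() },
    }
}

fn main() {
    let bytes = [0xC3u8, 0xA9]; // 'é'
    let mut it = bytes.iter();
    let _lead = it.next();
    // Safe here: a continuation byte is known to follow the 0xC3 lead byte.
    let cont = unsafe { unwrap_cont(it.next()) };
    assert_eq!(cont, 0xA9);
}
```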

Only one uint is needed to keep track of the offset from the original
full string.
bors added a commit that referenced this pull request Jul 19, 2014
@bors closed this on Jul 19, 2014
@bors merged commit c5e0736 into rust-lang:master on Jul 19, 2014
@bluss deleted the ptr-arithmetic-chars branch on July 19, 2014 15:50
@japaric (Member) commented Jul 19, 2014

I repeated the last benchmark suite (iterating over the Wikipedia main page) using my criterion library and got a consistent 20%+ improvement in all the benchmarks.

Benchmarking chars/count
(...)
Comparing with previous sample
> Bootstrapping with 100489 resamples
  > mean    -20.243% ± 0.0606% [-20.355% -20.117%] 95% CI
  > median  -20.346% ± 0.0103% [-20.349% -20.322%] 95% CI
  > mean has improved by 20.24%
  > median has improved by 20.35%

Benchmarking chars/for
(...)
Comparing with previous sample
> Bootstrapping with 100489 resamples
  > mean    -26.500% ± 0.0571% [-26.608% -26.388%] 95% CI
  > median  -26.493% ± 0.0154% [-26.515% -26.461%] 95% CI
  > mean has improved by 26.50%
  > median has improved by 26.49%

Benchmarking chars/rev_count
(...)
Comparing with previous sample
> Bootstrapping with 100489 resamples
  > mean    -25.287% ± 0.0624% [-25.418% -25.174%] 95% CI
  > median  -25.298% ± 0.0318% [-25.380% -25.252%] 95% CI
  > mean has improved by 25.29%
  > median has improved by 25.30%

Benchmarking chars/rev_for
(...)
Comparing with previous sample
> Bootstrapping with 100489 resamples
  > mean    -22.016% ± 0.1673% [-22.364% -21.711%] 95% CI
  > median  -21.800% ± 0.0492% [-21.885% -21.698%] 95% CI
  > mean has improved by 22.02%
  > median has improved by 21.80%

@blake2-ppc Really nice speed-ups! 👍

@bluss (Member, author) commented Jul 19, 2014

Thanks for running the benchmark, that is cool!

@huonw mentioned this pull request on Aug 8, 2014