str: Implement a faster Chars iterator for &str #15638

Merged · merged 6 commits into rust-lang:master on Jul 19, 2014
Conversation

@bluss (Member) commented Jul 13, 2014

Reimplement the string slice's Iterator<char> by wrapping the already efficient
slice iterator.

The iterator uses our guarantee that the string contains valid UTF-8; its only unsafe
code is transmuting the decoded u32 into char.

Benchmarks suggest that the runtime of the Chars benchmarks is reduced by up to 30%,
and the runtime of reversed Chars by up to 60%.

BEFORE
test str::bench::char_indicesator                          ... bench:       124 ns/iter (+/- 1)
test str::bench::char_indicesator_rev                      ... bench:       188 ns/iter (+/- 9)
test str::bench::char_iterator                             ... bench:       122 ns/iter (+/- 2)
test str::bench::char_iterator_ascii                       ... bench:       302 ns/iter (+/- 41)
test str::bench::char_iterator_for                         ... bench:       123 ns/iter (+/- 4)
test str::bench::char_iterator_rev                         ... bench:       189 ns/iter (+/- 14)
test str::bench::char_iterator_rev_for                     ... bench:       177 ns/iter (+/- 4)

AFTER
test str::bench::char_indicesator                          ... bench:        85 ns/iter (+/- 3)
test str::bench::char_indicesator_rev                      ... bench:        82 ns/iter (+/- 2)
test str::bench::char_iterator                             ... bench:       100 ns/iter (+/- 3)
test str::bench::char_iterator_ascii                       ... bench:       317 ns/iter (+/- 3)
test str::bench::char_iterator_for                         ... bench:        86 ns/iter (+/- 2)
test str::bench::char_iterator_rev                         ... bench:        80 ns/iter (+/- 6)
test str::bench::char_iterator_rev_for                     ... bench:        68 ns/iter (+/- 0)

Note: Branch name is no longer indicative of the implementation.
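For readers unfamiliar with the approach, here is a minimal sketch in today's Rust (not the PR's actual code; the name SketchChars is illustrative) of what "wrapping the slice iterator" means: the char iterator owns a byte-slice iterator and decodes UTF-8 from it, and the only unsafe step is converting the decoded u32 to char, justified by the &str validity guarantee.

```rust
use std::slice;

pub struct SketchChars<'a> {
    iter: slice::Iter<'a, u8>,
}

const CONT_MASK: u8 = 0b0011_1111; // value bits of a continuation byte

impl<'a> Iterator for SketchChars<'a> {
    type Item = char;

    #[inline]
    fn next(&mut self) -> Option<char> {
        let x = *self.iter.next()?;
        if x < 128 {
            return Some(x as char); // ASCII fast path
        }
        // Multibyte: start from the value bits of the leading byte and shift in
        // six value bits per continuation byte.
        let init = (x & 0b0001_1111) as u32;
        let y = *self.iter.next().unwrap_or(&0);
        let mut ch = init << 6 | (y & CONT_MASK) as u32;
        if x >= 0xE0 {
            let z = *self.iter.next().unwrap_or(&0);
            ch = (init & 0x0F) << 12
                | ((y & CONT_MASK) as u32) << 6
                | (z & CONT_MASK) as u32;
            if x >= 0xF0 {
                let w = *self.iter.next().unwrap_or(&0);
                ch = (init & 0x07) << 18
                    | ((y & CONT_MASK) as u32) << 12
                    | ((z & CONT_MASK) as u32) << 6
                    | (w & CONT_MASK) as u32;
            }
        }
        // Safe: the wrapped bytes come from a &str, i.e. valid UTF-8.
        Some(unsafe { char::from_u32_unchecked(ch) })
    }
}

fn sketch_chars(s: &str) -> SketchChars<'_> {
    SketchChars { iter: s.as_bytes().iter() }
}

fn main() {
    let s = "héllo, 世界 🎈";
    assert!(sketch_chars(s).eq(s.chars()));
}
```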

@errordeveloper (Contributor) commented
Wow, 20-30% 👍

@huonw (Member) commented Jul 13, 2014

Just to be clear: there's no way to get similar speed in safe(r) code using self.as_bytes().iter() rather than raw pointers directly, because that does bounds checking against the end of the slice when decoding multibyte characters, even though the UTF-8 validity requirement makes it unnecessary?

@bluss (Member, author) commented Jul 13, 2014

It may well be possible, @huonw. An initial prototype doesn't beat the current libstd implementation, though; I'll come back to that tomorrow.

Here we have to ask -- what's the cheapest way to do something like it.next().unwrap()?

pub struct Citems<'a> {
    iter: std::slice::Items<'a, u8>
}

impl<'a> Iterator<char> for Citems<'a>
{
    #[inline]
    fn next(&mut self) -> Option<char>
    {
        // Slow path: kept out of line so the ASCII fast path stays small.
        #[inline(never)]
        fn decode_multibyte<'a>(fst: u8, it: &mut std::slice::Items<'a, u8>)
            -> Option<char>
        {
            // Width in bytes of the whole sequence, looked up from the first byte.
            let w = UTF8_CHAR_WIDTH[fst as uint];
            // Start with the value bits of the first byte, then shift in six
            // value bits per continuation byte.
            let mut ch = utf8_first_byte!(fst, w as uint);
            ch = utf8_acc_cont_byte!(ch, it.next().map_or(0, |x| *x));
            if w > 2 {
                ch = utf8_acc_cont_byte!(ch, it.next().map_or(0, |x| *x));
                if w > 3 {
                    ch = utf8_acc_cont_byte!(ch, it.next().map_or(0, |x| *x));
                }
            }
            unsafe {
                Some(transmute(ch))
            }
        }

        match self.iter.next() {
            Some(&next_byte) => {
                if next_byte < 128 {
                    Some(next_byte as char)
                } else {
                    decode_multibyte(next_byte, &mut self.iter)
                }
            }
            None => None
        }
    }
}

@alexcrichton (Member) commented
This is awesome, thanks for optimizing this! I, like @huonw, am also interested if this could be done with a bytes iterator, it seems like it should be able to compile down to the same thing.

@bluss (Member, author) commented Jul 13, 2014

I agree -- but I haven't been able to write an implementation based on the slice iterator that performs as well. My understanding is simplified, but I tried putting the multibyte case in the slow path and making the ASCII-only case fast; even so, the slice iterator doesn't perform as well as the pointer-arithmetic-based chars iterator, especially on ASCII-dominated text.

Is this related to #11751 ?

My playground code for comparing the three iterator implementations is here; the slice-iterator-based implementation is on this line.

Benchmarks from the playground code (wikitext decodes the English Wikipedia front page source):

test bench::char_iterator               ... bench:       112 ns/iter (+/- 0)
test bench::char_iterator_ascii         ... bench:       301 ns/iter (+/- 1)
test bench::char_iterator_rev           ... bench:       181 ns/iter (+/- 2)
test bench::char_iterator_wikitext      ... bench:     77445 ns/iter (+/- 1578)
test bench::citems_iterator             ... bench:       111 ns/iter (+/- 1)
test bench::citems_iterator_ascii       ... bench:       292 ns/iter (+/- 4)
test bench::citems_iterator_wikitext    ... bench:     77120 ns/iter (+/- 341)
test bench::new_chars_iterator          ... bench:        84 ns/iter (+/- 11)
test bench::new_chars_iterator_ascii    ... bench:       225 ns/iter (+/- 1)
test bench::new_chars_iterator_rev      ... bench:       109 ns/iter (+/- 3)
test bench::new_chars_iterator_wikitext ... bench:     54683 ns/iter (+/- 211)

@huonw (Member) commented Jul 13, 2014

Yes, #11751 will get in the way; theoretically rust-lang/llvm#14 will improve it. (If you felt like it, you could manipulate your src/llvm submodule to include that and test; I'm not really sure how easy that is to do.)

@bluss (Member, author) commented Jul 13, 2014

Updated comment because of major thinko -- nothing seems to be working the way I thought it was.

@bluss (Member, author) commented Jul 13, 2014

Now I've managed to compile a rustc that knows about -C null-check-elimination (pull in the llvm submodule; configure to disable submodule management; make)

The result: no improvement (citems is the slice-iterator-based impl).

test bench::citems_iterator             ... bench:       115 ns/iter (+/- 15)
test bench::citems_iterator_ascii       ... bench:       311 ns/iter (+/- 44)
test bench::citems_iterator_wikitext    ... bench:     77793 ns/iter (+/- 630)
test bench::new_chars_iterator          ... bench:       104 ns/iter (+/- 12)
test bench::new_chars_iterator_ascii    ... bench:       223 ns/iter (+/- 4)
test bench::new_chars_iterator_rev      ... bench:       121 ns/iter (+/- 16)
test bench::new_chars_iterator_wikitext ... bench:     55178 ns/iter (+/- 323)

@alexcrichton (Member) commented
I'm a little uneasy adding more unsafe iteration when it already exists for vectors (duplication of logic). Is there a performance win for this rewrite using safe iteration over the previous code, or is it only a win using unsafe code?

@bluss (Member, author) commented Jul 13, 2014

Duplication of logic is a good argument against, but in the end we need to provide a great Chars iterator. I don't feel that this code is particularly risky; Rust's string handling is mature enough now that there aren't many holes left in libstd APIs through which invalidly encoded data can slip.

By itself, the implementation using the slice iterator is anywhere from no improvement to a few percent faster; we could wait and see whether further code-generation improvements make up the rest. It's not important to merge that part now, but it would be good to know from @zwarich whether his improvements are going to help us here.

@alexcrichton (Member) commented
Perhaps we could land the other optimizations and hold off on the unsafe pointers until rust-lang/llvm#14 is merged into master?

@bluss (Member, author) commented Jul 14, 2014

I did test those changes already, but if there are more, it would be very interesting to see; the suggestion is not unreasonable. I should complete the implementation that uses just the slice iterator (the chars().rev() impl is missing), and then we can see whether it's worth adding on its own. That's fine by me.

I don't feel a reimplementation of the slice iterator is that scary; it's a good model of a useful pattern in itself, and it will come back when users want to implement, for example, "slices with stride > 1".

@bluss (Member, author) commented Jul 16, 2014

A ton of elbow grease has produced a slice-iterator-based decoder that's faster than both the present Chars iterator and the pointer-arithmetic Chars iterator in all my test cases. It's not pretty, though. Will update the PR tomorrow.

#[deriving(Clone)]
pub struct CharItems<'a> {
    iter: slice::Items<'a, u8>
}

#[inline]
fn unwrap_or_0(opt: Option<&u8>) -> u8 {
    match opt {
        Some(&byte) => byte,
        None => 0,
    }
}

impl<'a> Iterator<char> for CharItems<'a> {
    #[inline]
    fn next(&mut self) -> Option<char> {
        fn decode_multibyte<'a>(x: u8, it: &mut slice::Items<'a, u8>) -> char {
            // NOTE: Performance is very sensitive to the exact formulation here
            // Decode from a byte combination out of: [[[x y] z] w]
            let CONT_MASK = 0x3F; // continuation byte mask
            let init = utf8_first_byte!(x, 2); // value bits of x, treated as a 2-byte lead for now
            let mut ch;
            let y = unwrap_or_0(it.next());
            ch = init << 6 | (y & CONT_MASK) as u32;
            if x >= 0xE0 { // leading byte of a 3- or 4-byte sequence
                let z = unwrap_or_0(it.next());

                let y_z = (((y & CONT_MASK) as u32) << 6) | (z & CONT_MASK) as u32;
                ch = init << 12 | y_z;
                if x >= 0xF0 { // leading byte of a 4-byte sequence
                    let w = unwrap_or_0(it.next());
                    ch = (init & 7) << 18 | y_z << 6 | (w & CONT_MASK) as u32;
                }
            }
            unsafe {
                mem::transmute(ch)
            }
        }

        match self.iter.next() {
            None => None,
            Some(&next_byte) => {
                if next_byte < 128 {
                    Some(next_byte as char)
                } else {
                    Some(decode_multibyte(next_byte, &mut self.iter))
                }
            }
        }
    }
}
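To make the bit arithmetic concrete, here is a small worked example in today's Rust (illustrative, not PR code) decoding the two-byte sequence 0xC3 0xA9 with the same masks; utf8_first_byte!(x, 2) amounts to keeping the low five value bits of the leading byte.

```rust
fn main() {
    const CONT_MASK: u8 = 0b0011_1111;           // value bits of a continuation byte
    let (x, y) = (0xC3u8, 0xA9u8);               // UTF-8 encoding of 'é'
    let init = (x & 0b0001_1111) as u32;         // utf8_first_byte!(x, 2) => 0x03
    let ch = init << 6 | (y & CONT_MASK) as u32; // 0x03 << 6 | 0x29 = 0xE9
    assert_eq!(char::from_u32(ch), Some('é'));   // U+00E9
}
```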

@bluss (Member, author) commented Jul 17, 2014

There is no conclusive evidence that the new impl (wrapping slice::Items<'a, u8>) is faster. How well it compiles depends heavily on the context: even though it's faster in a standalone test program, for some reason it is slower in the libcore bench tests. :(

What worries me here is the result for the char_iterator_for test that I added, which runs for ch in s.chars() { black_box(ch) }. Without that regression, I'd love to merge this.
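For reference, the loop in question looks essentially like this; shown here with today's std::hint::black_box and an arbitrary mixed ASCII/UTF-8 sample string rather than the benchmark's actual input:

```rust
use std::hint::black_box;

fn main() {
    let s = "Mary had a little lamb, löwenzahn, 中华, Việt Nam";
    // Force the iterator to decode every code point without letting the
    // optimizer discard the work.
    for ch in s.chars() {
        black_box(ch);
    }
}
```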

Before

test str::bench::char_indicesator                          ... bench:       125 ns/iter (+/- 293)
test str::bench::char_indicesator_rev                      ... bench:       189 ns/iter (+/- 26)
test str::bench::char_iterator                             ... bench:       152 ns/iter (+/- 296)
test str::bench::char_iterator_ascii                       ... bench:       302 ns/iter (+/- 485)
test str::bench::char_iterator_for                         ... bench:       121 ns/iter (+/- 2)
test str::bench::char_iterator_rev                         ... bench:       418 ns/iter (+/- 357)
test str::bench::char_iterator_rev_for                     ... bench:       195 ns/iter (+/- 413)

after

test str::bench::char_indicesator                          ... bench:       126 ns/iter (+/- 2)
test str::bench::char_indicesator_rev                      ... bench:       137 ns/iter (+/- 15)
test str::bench::char_iterator                             ... bench:       113 ns/iter (+/- 11)
test str::bench::char_iterator_ascii                       ... bench:       310 ns/iter (+/- 14)
test str::bench::char_iterator_for                         ... bench:       179 ns/iter (+/- 3)
test str::bench::char_iterator_rev                         ... bench:       107 ns/iter (+/- 8)
test str::bench::char_iterator_rev_for                     ... bench:       167 ns/iter (+/- 11)

@bluss (Member, author) commented Jul 17, 2014

It was the inlining decisions: inlining even the multibyte case makes the difference. How does that look with the update?

Benchmarking on a laptop always leaves you wondering whether something got faster only because of turbo-boost effects you can't control. Either way, with the inline hint the for-loop test case sped up to the level of the others.

test str::bench::char_indicesator                          ... bench:        85 ns/iter (+/- 3)
test str::bench::char_indicesator_rev                      ... bench:        82 ns/iter (+/- 6)
test str::bench::char_iterator                             ... bench:        85 ns/iter (+/- 2)
test str::bench::char_iterator_ascii                       ... bench:       317 ns/iter (+/- 2)
test str::bench::char_iterator_for                         ... bench:        87 ns/iter (+/- 3)
test str::bench::char_iterator_rev                         ... bench:        79 ns/iter (+/- 1)
test str::bench::char_iterator_rev_for                     ... bench:        68 ns/iter (+/- 2)

@bluss (Member, author) commented Jul 17, 2014

Updating the main PR text in case this is merged.

root added 2 commits July 17, 2014 20:21
Test using for ch in s.chars() { black_box(ch) } to have a test that should force the iterator to run its full decoding computations.
Re-use the vector iterator to implement the chars iterator.

The iterator uses our guarantee that the string contains valid UTF-8,
but its only unsafe code is transmuting the decoded u32 into char.
fn decode_multibyte<'a>(x: u8, it: &mut slice::Items<'a, u8>) -> char {
// NOTE: Performance is very sensitive to the exact formulation here
// Decode from a byte combination out of: [[[x y] z] w]
let cont_mask = 0x3F; // continuation byte mask
Review comment (Member):
Could this be a static?

Test iterating (decoding) every codepoint.
match opt {
Some(&byte) => byte,
None => 0,
}
}
Review comment (Member):

Was this more performant than foo.unwrap_or(0)? (It's basically the same width in characters.)

@bluss (author) replied:

Thank you for the comments! I didn't like the look of *foo.unwrap_or(&0); I think this is better.
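For what it's worth, the spellings under discussion behave identically; in today's Rust there is also Option::copied. A sketch, not code from the PR:

```rust
fn unwrap_or_0(opt: Option<&u8>) -> u8 {
    match opt {
        Some(&byte) => byte,
        None => 0,
    }
}

fn main() {
    let opt = Some(&0xA9u8);
    assert_eq!(unwrap_or_0(opt), 0xA9);          // the helper used in the PR
    assert_eq!(*opt.unwrap_or(&0), 0xA9);        // the form discussed above
    assert_eq!(opt.copied().unwrap_or(0), 0xA9); // a modern equivalent
}
```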

@alexcrichton (Member) commented
Awesome work @blake2-ppc! It's always nice to hear that unsafe code isn't necessary when safe code suffices.


static TAG_CONT_U8: u8 = 128u8;
/// Mask of the value bits of a continuation byte
static CONT_MASK: u8 = 0x3Fu8;
Review comment (Member):
These could be binary constants too, 0b0011_1111 and 0b1000_0000.

@bluss (author) replied:

Fixed.
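In today's Rust these would typically be const items; the binary spellings suggested above look like this (values taken from the thread):

```rust
/// Tag bits of a UTF-8 continuation byte.
const TAG_CONT_U8: u8 = 0b1000_0000;
/// Mask of the value bits of a continuation byte.
const CONT_MASK: u8 = 0b0011_1111;

fn main() {
    // A continuation byte carries the tag plus six value bits: 0b10xx_xxxx.
    assert_eq!(TAG_CONT_U8, 0x80);
    assert_eq!(CONT_MASK, 0x3F);
    assert_eq!(0xA9u8 & CONT_MASK, 0x29);
}
```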

@huonw (Member) commented Jul 18, 2014

Could you add the benchmarks (or at least mention rough percentages) to a commit message? (So that someone diving through the history can understand why this code is designed the way it is.)

root added 2 commits July 19, 2014 00:28
Thanks to comments from @huonw, clarify decoding details and use
statics for important constants for UTF-8 decoding. Convert some magic
numbers scattered in the same file to use the statics too.
Thanks to comments from @alexcrichton, write the next/next_back function
bodies without nested functions in a more top-to-bottom flow style.

Also improve comment style and motivate the unsafe blocks with comments.
@bluss (Member, author) commented Jul 19, 2014

Commits pushed to incorporate the excellent suggestions. Benchmarking results added to the PR message (they will be in the merge commit log).

@bluss (Member, author) commented Jul 19, 2014

In-tree benchmarking is very time consuming, but I felt it was necessary to justify the change.

The development benchmark used the HTML source of en.wikipedia.org/wiki/Main_Page, a typical almost-ASCII-only document with small chunks of UTF-8. Those benchmarks are even more convincing, for some reason. (Maybe code size? The size of the .next() method may pay off in a long-running simple loop like this.)

new Chars vs old — s.chars().count()

NEW test bench::char_items_wikitext     ... bench:    126095 ns/iter (+/- 8568)
NEW test bench::char_items_wikitext_rev ... bench:    120053 ns/iter (+/- 2175)
OLD test bench::chars_wikitext          ... bench:    169647 ns/iter (+/- 4876)
OLD test bench::chars_wikitext_rev      ... bench:    228570 ns/iter (+/- 2267)

new Chars vs old — for ch in s.chars() { black_box(ch) }

NEW test bench::char_items_wikitext     ... bench:    120153 ns/iter (+/- 1482)
NEW test bench::char_items_wikitext_rev ... bench:     83574 ns/iter (+/- 1181)
OLD test bench::chars_wikitext          ... bench:    278724 ns/iter (+/- 2909)
OLD test bench::chars_wikitext_rev      ... bench:    202086 ns/iter (+/- 1790)

fn unwrap_or_0(opt: Option<&u8>) -> u8 {
match opt {
Some(&byte) => byte,
None => 0,
Review comment (Member):

I idly wonder if this is a place where we could use a (hypothetical) unreachable intrinsic and possibly gain a little more speed, since the str invariant guarantees this code will never be hit.

@bluss (author) replied:

Nice thought. I didn't want to put fail!() or similar in there; it would add a failure path to an otherwise failure-free loop.

@bluss (author) replied:

I forgot that I did try

    unsafe {
        let ptr: *const u8 = mem::transmute(opt);
        *ptr
    }

as well, but it didn't improve anything much. @thestinger is a wizard and would know why, but I don't.

Review comment (Member):
Well, the assembly/IR for that is certainly really nice: http://is.gd/MLxXeK

; Function Attrs: nounwind readonly uwtable
define i8 @deref(i8* nocapture readonly) unnamed_addr #0 {
entry-block:
  %1 = load i8* %0, align 1
  ret i8 %1
}

This branch will be really easy to predict, since the None arm is never taken, so maybe that is making it essentially costless?
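Aside: the hypothetical intrinsic discussed here exists in today's Rust as std::hint::unreachable_unchecked. A sketch of how it could replace the 0 fallback (not what this PR did; the caller must uphold the no-truncated-sequence invariant or the call is undefined behaviour):

```rust
use std::hint::unreachable_unchecked;

/// Sketch only: rely on the &str invariant (no truncated multibyte sequences)
/// instead of substituting 0 for a missing continuation byte.
///
/// Safety: `opt` must never be `None`.
unsafe fn unwrap_cont(opt: Option<&u8>) -> u8 {
    match opt {
        Some(&byte) => byte,
        None => unsafe { unreachable_unchecked() },
    }
}

fn main() {
    let bytes = [0xC3u8, 0xA9]; // 'é'
    let mut it = bytes.iter();
    let _lead = it.next();
    // Safe here: a continuation byte is known to follow the 0xC3 lead byte.
    let cont = unsafe { unwrap_cont(it.next()) };
    assert_eq!(cont, 0xA9);
}
```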

Only one uint is needed to keep track of the offset from the original
full string.
bors added a commit that referenced this pull request Jul 19, 2014
@bors closed this on Jul 19, 2014
@bors merged commit c5e0736 into rust-lang:master on Jul 19, 2014
@bluss deleted the ptr-arithmetic-chars branch on July 19, 2014 15:50
@japaric (Member) commented Jul 19, 2014

I repeated the last benchmark suite (iterating over the Wikipedia main page) using my criterion library and got a consistent 20%+ improvement in all the benchmarks.

Benchmarking chars/count
(...)
Comparing with previous sample
> Bootstrapping with 100489 resamples
  > mean    -20.243% ± 0.0606% [-20.355% -20.117%] 95% CI
  > median  -20.346% ± 0.0103% [-20.349% -20.322%] 95% CI
  > mean has improved by 20.24%
  > median has improved by 20.35%

Benchmarking chars/for
(...)
Comparing with previous sample
> Bootstrapping with 100489 resamples
  > mean    -26.500% ± 0.0571% [-26.608% -26.388%] 95% CI
  > median  -26.493% ± 0.0154% [-26.515% -26.461%] 95% CI
  > mean has improved by 26.50%
  > median has improved by 26.49%

Benchmarking chars/rev_count
(...)
Comparing with previous sample
> Bootstrapping with 100489 resamples
  > mean    -25.287% ± 0.0624% [-25.418% -25.174%] 95% CI
  > median  -25.298% ± 0.0318% [-25.380% -25.252%] 95% CI
  > mean has improved by 25.29%
  > median has improved by 25.30%

Benchmarking chars/rev_for
(...)
Comparing with previous sample
> Bootstrapping with 100489 resamples
  > mean    -22.016% ± 0.1673% [-22.364% -21.711%] 95% CI
  > median  -21.800% ± 0.0492% [-21.885% -21.698%] 95% CI
  > mean has improved by 22.02%
  > median has improved by 21.80%

@blake2-ppc Really nice speed-ups! 👍

@bluss (Member, author) commented Jul 19, 2014

Thanks for running the benchmark, that is cool!

@huonw mentioned this pull request on Aug 8, 2014