Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str: Implement a faster Chars iterator for &str #15638

Merged
merged 6 commits into from
Jul 19, 2014
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 23 additions & 6 deletions src/libcollections/str.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2240,16 +2240,26 @@ mod tests {
#[cfg(test)]
mod bench {
use test::Bencher;
use test::black_box;
use super::*;
use std::option::{None, Some};
use std::iter::{Iterator, DoubleEndedIterator};
use std::collections::Collection;

#[bench]
fn char_iterator(b: &mut Bencher) {
let s = "ศไทย中华Việt Nam; Mary had a little lamb, Little lamb";
let len = s.char_len();

b.iter(|| assert_eq!(s.chars().count(), len));
b.iter(|| s.chars().count());
}

#[bench]
fn char_iterator_for(b: &mut Bencher) {
let s = "ศไทย中华Việt Nam; Mary had a little lamb, Little lamb";

b.iter(|| {
for ch in s.chars() { black_box(ch) }
});
}

#[bench]
Expand All @@ -2260,17 +2270,24 @@ mod bench {
Mary had a little lamb, Little lamb
Mary had a little lamb, Little lamb
Mary had a little lamb, Little lamb";
let len = s.char_len();

b.iter(|| assert_eq!(s.chars().count(), len));
b.iter(|| s.chars().count());
}

#[bench]
fn char_iterator_rev(b: &mut Bencher) {
let s = "ศไทย中华Việt Nam; Mary had a little lamb, Little lamb";
let len = s.char_len();

b.iter(|| assert_eq!(s.chars().rev().count(), len));
b.iter(|| s.chars().rev().count());
}

#[bench]
fn char_iterator_rev_for(b: &mut Bencher) {
let s = "ศไทย中华Việt Nam; Mary had a little lamb, Little lamb";

b.iter(|| {
for ch in s.chars().rev() { black_box(ch) }
});
}

#[bench]
Expand Down
158 changes: 114 additions & 44 deletions src/libcore/str.rs
Original file line number Diff line number Diff line change
Expand Up @@ -97,47 +97,121 @@ impl<'a> CharEq for &'a [char] {
Section: Iterators
*/

/// External iterator for a string's characters.
/// Use with the `std::iter` module.
/// Iterator for the char (representing *Unicode Scalar Values*) of a string
///
/// Created with the method `.chars()`.
#[deriving(Clone)]
pub struct Chars<'a> {
/// The slice remaining to be iterated
string: &'a str,
iter: slice::Items<'a, u8>
}

// Return the initial codepoint accumulator for the first byte.
// The first byte is special, only want bottom 5 bits for width 2, 4 bits
// for width 3, and 3 bits for width 4
macro_rules! utf8_first_byte(
($byte:expr, $width:expr) => (($byte & (0x7F >> $width)) as u32)
)

// return the value of $ch updated with continuation byte $byte
macro_rules! utf8_acc_cont_byte(
($ch:expr, $byte:expr) => (($ch << 6) | ($byte & 63u8) as u32)
)

macro_rules! utf8_is_cont_byte(
($byte:expr) => (($byte & 192u8) == 128)
)

#[inline]
fn unwrap_or_0(opt: Option<&u8>) -> u8 {
match opt {
Some(&byte) => byte,
None => 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I idly wonder if this is a place where we could use an (hypothetical) unreachable intrinsic and possibly even gain a little more speed, since the str invariant guarantees this code will never be hit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thought. I didn't want to put fail!() or similar in there, it would add failure to an otherwise failure less loop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot I did try

    unsafe {
        let ptr: *const u8 = mem::transmute(opt);
        *ptr
    }

as well, but it didn't improve anything much. @thestinger is a wizard and would know why, but I don't.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the assembly/IR for that is certainly really nice: http://is.gd/MLxXeK

; Function Attrs: nounwind readonly uwtable
define i8 @deref(i8* nocapture readonly) unnamed_addr #0 {
entry-block:
  %1 = load i8* %0, align 1
  ret i8 %1
}

This branch will be really easy to predict, since the None arm is never taken, so maybe that is making it essentially costless?

}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this more performant than foo.unwrap_or(0) (the same width of characters basically)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comments! I didn't like the look of *foo.unwrap_or(&0), I think this is better


impl<'a> Iterator<char> for Chars<'a> {
#[inline]
fn next(&mut self) -> Option<char> {
// Decode the next codepoint, then update
// the slice to be just the remaining part
if self.string.len() != 0 {
let CharRange {ch, next} = self.string.char_range_at(0);
// Decode UTF-8, using the valid UTF-8 invariant
#[inline]
fn decode_multibyte<'a>(x: u8, it: &mut slice::Items<'a, u8>) -> char {
// NOTE: Performance is very sensitive to the exact formulation here
// Decode from a byte combination out of: [[[x y] z] w]
let cont_mask = 0x3F; // continuation byte mask
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a static?

let init = utf8_first_byte!(x, 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You say above that with 3 only needs the bottom 5 bits, but I don't see where that happens. It seems that it's used directly as is, with all 5 bits below (init << 12), and then used with only 3 bits in the 4 case (init & 7).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will have to be commented. Initial byte in range 0xE0 .. 0xEF is the three byte case, and in that range the 5th bit is clear.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's cool.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should add my development testcase that tests the decoding of every valid char.

let y = unwrap_or_0(it.next());
let mut ch = utf8_acc_cont_byte!(init, y);
if x >= 0xE0 {
/* [[x y z] w] case */
let z = unwrap_or_0(it.next());

let y_z = (((y & cont_mask) as u32) << 6) | (z & cont_mask) as u32;
ch = init << 12 | y_z;
if x >= 0xF0 {
/* [x y z w] case */
let w = unwrap_or_0(it.next());
ch = (init & 7) << 18 | y_z << 6 | (w & cont_mask) as u32;
}
}
unsafe {
self.string = raw::slice_unchecked(self.string, next, self.string.len());
mem::transmute(ch)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a small comment saying why this transmute is OK (we already know it's a valid codepoint)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be formulated as:

fn next() {
    let next_byte = match self.iter.next() {
        None => return None,
        Some(&next_byte) if next_byte < 128 => return Some(next_byte as char),
        Some(&next_byte) => next_byte,
    };
    // decode_multibyte
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done.


match self.iter.next() {
None => None,
Some(&next_byte) => {
if next_byte < 128 {
Some(next_byte as char)
} else {
Some(decode_multibyte(next_byte, &mut self.iter))
}
}
Some(ch)
} else {
None
}
}

#[inline]
fn size_hint(&self) -> (uint, Option<uint>) {
(self.string.len().saturating_add(3)/4, Some(self.string.len()))
let (len, _) = self.iter.size_hint();
(len.saturating_add(3) / 4, Some(len))
}
}

impl<'a> DoubleEndedIterator<char> for Chars<'a> {
#[inline]
fn next_back(&mut self) -> Option<char> {
if self.string.len() != 0 {
let CharRange {ch, next} = self.string.char_range_at_reverse(self.string.len());
#[inline]
fn decode_multibyte_back<'a>(w: u8, it: &mut slice::Items<'a, u8>) -> char {
// Decode from a byte combination out of: [x [y [z w]]]
let mut ch;
let z = unwrap_or_0(it.next_back());
ch = utf8_first_byte!(z, 2);
if utf8_is_cont_byte!(z) {
let y = unwrap_or_0(it.next_back());
ch = utf8_first_byte!(y, 3);
if utf8_is_cont_byte!(y) {
let x = unwrap_or_0(it.next_back());
ch = utf8_first_byte!(x, 4);
ch = utf8_acc_cont_byte!(ch, y);
}
ch = utf8_acc_cont_byte!(ch, z);
}
ch = utf8_acc_cont_byte!(ch, w);

unsafe {
self.string = raw::slice_unchecked(self.string, 0, next);
mem::transmute(ch)
}
}

match self.iter.next_back() {
None => None,
Some(&back_byte) => {
if back_byte < 128 {
Some(back_byte as char)
} else {
Some(decode_multibyte_back(back_byte, &mut self.iter))
}
}
Some(ch)
} else {
None
}
}
}
Expand All @@ -146,18 +220,23 @@ impl<'a> DoubleEndedIterator<char> for Chars<'a> {
/// Use with the `std::iter` module.
#[deriving(Clone)]
pub struct CharOffsets<'a> {
/// The original string to be iterated
string: &'a str,
front: uint,
back: uint,
iter: Chars<'a>,
}

impl<'a> Iterator<(uint, char)> for CharOffsets<'a> {
#[inline]
fn next(&mut self) -> Option<(uint, char)> {
// Compute the byte offset by using the pointer offset between
// the original string slice and the iterator's remaining part
let offset = self.iter.string.as_ptr() as uint - self.string.as_ptr() as uint;
self.iter.next().map(|ch| (offset, ch))
match self.iter.next() {
None => None,
Some(ch) => {
let index = self.front;
let (len, _) = self.iter.iter.size_hint();
self.front += self.back - self.front - len;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems... more complicated than I would've thought necessary?

Can it be Some((self.iter.string.as_ptr() as uint - self.front, ch))? (If front is initialised to self.as_ptr() as uint below.)

(I think this iterator is a reasonably important one as a building block, so it's worth keeping it reasonably optimised.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Char Offsets no longer has access into the internals of the Chars iterator, because those are slice::Items and private (from another module).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could move the implementation of Chars into CharOffsets, and Chars would be the simpler Map-like adaptor of CharOffsets.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... I guess Items could also provide .start_ptr() .end_ptr(), or this could be rewritten to just self.front += ch.len_utf8_bytes(). Don't know if it's any faster, but it is certainly simpiler.

(Looking at the implementation of that method: almost certainly not, since it's fail!ing and also not using the char invariant.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update pushed, it can indeed be simpler with the information we have. This also means the CharOffsets iterator is one uint smaller than before the PR (doesn't really matter).

Some((index, ch))
}
}
}

#[inline]
Expand All @@ -169,11 +248,14 @@ impl<'a> Iterator<(uint, char)> for CharOffsets<'a> {
impl<'a> DoubleEndedIterator<(uint, char)> for CharOffsets<'a> {
#[inline]
fn next_back(&mut self) -> Option<(uint, char)> {
self.iter.next_back().map(|ch| {
let offset = self.iter.string.len() +
self.iter.string.as_ptr() as uint - self.string.as_ptr() as uint;
(offset, ch)
})
match self.iter.next_back() {
None => None,
Some(ch) => {
let (len, _) = self.iter.iter.size_hint();
self.back -= self.back - self.front - len;
Some((self.back, ch))
}
}
}
}

Expand Down Expand Up @@ -880,18 +962,6 @@ pub struct CharRange {
pub next: uint,
}

// Return the initial codepoint accumulator for the first byte.
// The first byte is special, only want bottom 5 bits for width 2, 4 bits
// for width 3, and 3 bits for width 4
macro_rules! utf8_first_byte(
($byte:expr, $width:expr) => (($byte & (0x7F >> $width)) as u32)
)

// return the value of $ch updated with continuation byte $byte
macro_rules! utf8_acc_cont_byte(
($ch:expr, $byte:expr) => (($ch << 6) | ($byte & 63u8) as u32)
)

static TAG_CONT_U8: u8 = 128u8;

/// Unsafe operations
Expand Down Expand Up @@ -1608,7 +1678,7 @@ impl<'a> StrSlice<'a> for &'a str {

#[inline]
fn chars(&self) -> Chars<'a> {
Chars{string: *self}
Chars{iter: self.as_bytes().iter()}
}

#[inline]
Expand All @@ -1618,7 +1688,7 @@ impl<'a> StrSlice<'a> for &'a str {

#[inline]
fn char_indices(&self) -> CharOffsets<'a> {
CharOffsets{string: *self, iter: self.chars()}
CharOffsets{front: 0, back: self.len(), iter: self.chars()}
}

#[inline]
Expand Down