Improve PartialEq for slices #26884

dotdash · 2015-07-08T12:54:44Z

Exploiting the fact that the length of the slices is known, we
can use a counted loop instead of iterators, which means that we only
need a single counter, instead of having to increment and check one
pointer for each iterator.
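
For readers who want to see the two shapes side by side, here is a minimal sketch. It is not the code from the patch: the function names `eq_zipped` and `eq_counted` are made up for illustration, and whether the two still compile to different machine code depends on the compiler version.

```rust
/// Iterator-based comparison: `zip` walks two slice iterators in lockstep,
/// so conceptually each step advances and checks two cursors.
fn eq_zipped<T: PartialEq>(a: &[T], b: &[T]) -> bool {
    a.len() == b.len() && a.iter().zip(b.iter()).all(|(x, y)| x == y)
}

/// Counted-loop comparison: once the lengths are known to match,
/// a single index drives both slices.
fn eq_counted<T: PartialEq>(a: &[T], b: &[T]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    for i in 0..a.len() {
        if a[i] != b[i] {
            return false;
        }
    }
    true
}
```

Both functions are meant to return the same answer for every input; only the loop structure differs.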

Benchmarks comparing vectors with 100,000 elements:

Before:

running 8 tests
test eq1_u8  ... bench:      66,757 ns/iter (+/- 113)
test eq2_u16 ... bench:     111,267 ns/iter (+/- 149)
test eq3_u32 ... bench:     126,282 ns/iter (+/- 111)
test eq4_u64 ... bench:     126,418 ns/iter (+/- 155)
test ne1_u8  ... bench:      88,990 ns/iter (+/- 161)
test ne2_u16 ... bench:      89,126 ns/iter (+/- 265)
test ne3_u32 ... bench:      96,901 ns/iter (+/- 92)
test ne4_u64 ... bench:      96,750 ns/iter (+/- 137)

After:

running 8 tests
test eq1_u8  ... bench:      46,413 ns/iter (+/- 521)
test eq2_u16 ... bench:      46,500 ns/iter (+/- 74)
test eq3_u32 ... bench:      50,059 ns/iter (+/- 92)
test eq4_u64 ... bench:      54,001 ns/iter (+/- 92)
test ne1_u8  ... bench:      47,595 ns/iter (+/- 53)
test ne2_u16 ... bench:      47,521 ns/iter (+/- 59)
test ne3_u32 ... bench:      44,889 ns/iter (+/- 74)
test ne4_u64 ... bench:      47,775 ns/iter (+/- 68)
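
The benchmark source is not part of this thread. The following is a rough sketch of a comparable measurement on stable Rust, assuming the `eq*` cases compare two equal vectors with `==`. The name `time_eq` and the use of `Instant` rather than the `#[bench]` harness are choices made here for illustration, so the numbers it produces are not comparable to the ones above.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Average wall-clock time of one `==` between two equal `Vec<T>`s of
/// length `len`, measured over `iters` repetitions (must be non-zero).
fn time_eq<T: PartialEq + Clone>(value: T, len: usize, iters: u32) -> Duration {
    assert!(iters > 0, "iters must be non-zero");
    let a = vec![value; len];
    let b = a.clone();
    let start = Instant::now();
    for _ in 0..iters {
        // black_box discourages the optimizer from hoisting or
        // deleting the comparison.
        black_box(black_box(&a) == black_box(&b));
    }
    start.elapsed() / iters
}
```

A call such as `time_eq(0u8, 100_000, 1_000)` would correspond loosely to the `eq1_u8` case, under the assumptions stated above.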


rust-highfive · 2015-07-08T12:54:47Z

r? @pcwalton

(rust_highfive has picked a reviewer for you, use r? to override)

alexcrichton · 2015-07-08T17:13:21Z

@bors: r+ 9f4d5b4

Nice wins!


bors · 2015-07-09T05:15:49Z

⌛ Testing commit 9f4d5b4 with merge 66b9277...

bors · 2015-07-09T07:13:12Z

☀️ Test successful - auto-linux-32-nopt-t, auto-linux-32-opt, auto-linux-64-nopt-t, auto-linux-64-opt, auto-linux-64-x-android-t, auto-mac-32-opt, auto-mac-64-nopt-t, auto-mac-64-opt, auto-win-gnu-32-nopt-t, auto-win-gnu-32-opt, auto-win-gnu-64-nopt-t, auto-win-gnu-64-opt

lilianmoraru · 2015-07-14T13:42:52Z

That's interesting. Now I am just a little bit confused.
I thought that using iterators was faster in Rust than using for loops because of bounds checking.
A long time ago I saw a benchmark-style test that initially used a for loop and performed pretty badly compared to C and C++; then somebody else switched it to iterators and the performance became comparable...

dotdash · 2015-07-14T14:09:46Z

@lilianmoraru In this case, the bounds checks get eliminated by the optimizer, because it can easily prove that they would never trigger, so there's no advantage for iterators here.

But using the zipped iterators, you get two pointers that both need to be incremented and checked, while the counted loop needs to increment and check just a single integer. Different problem ;-)
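
One common way to hand the optimizer that proof is to cut both slices to the same length before the loop, so that `i < n` implies both indexing operations are in range. The sketch below shows the pattern; `first_mismatch` is a hypothetical helper, not code from this PR, and whether the checks are actually removed depends on the compiler version and optimization level.

```rust
/// Index of the first position where `a` and `b` differ within their
/// common prefix, or `None` if the common prefix is identical.
fn first_mismatch<T: PartialEq>(a: &[T], b: &[T]) -> Option<usize> {
    // Re-slice both inputs to the shared length `n`. After this, both
    // slices have exactly `n` elements, which is the fact the optimizer
    // needs to see that `a[i]` and `b[i]` cannot go out of bounds.
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);
    for i in 0..n {
        if a[i] != b[i] {
            return Some(i);
        }
    }
    None
}
```

The loop uses one integer counter for both slices, which is the "increment and check just a single integer" case described above.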

Reusing the same idea as in rust-lang#26884, we can exploit the fact that the length of slices is known, hence we can use a counted loop instead of iterators, which means that we only need a single counter, instead of having to increment and check one pointer for each iterator.

Using the generic implementation of the boolean comparison operators (`lt`, `le`, `gt`, `ge`) provides further speedup for simple types. This happens because the loop scans elements checking for equality and dispatches to element comparison or length comparison depending on the result of the prefix comparison.

```
test u8_cmp          ... bench:      14,043 ns/iter (+/- 1,732)
test u8_lt           ... bench:      16,156 ns/iter (+/- 1,864)
test u8_partial_cmp  ... bench:      16,250 ns/iter (+/- 2,608)
test u16_cmp         ... bench:      15,764 ns/iter (+/- 1,420)
test u16_lt          ... bench:      19,833 ns/iter (+/- 2,826)
test u16_partial_cmp ... bench:      19,811 ns/iter (+/- 2,240)
test u32_cmp         ... bench:      15,792 ns/iter (+/- 3,409)
test u32_lt          ... bench:      18,577 ns/iter (+/- 2,075)
test u32_partial_cmp ... bench:      18,603 ns/iter (+/- 5,666)
test u64_cmp         ... bench:      16,337 ns/iter (+/- 2,511)
test u64_lt          ... bench:      18,074 ns/iter (+/- 7,914)
test u64_partial_cmp ... bench:      17,909 ns/iter (+/- 1,105)
```

```
test u8_cmp          ... bench:       6,511 ns/iter (+/- 982)
test u8_lt           ... bench:       6,671 ns/iter (+/- 919)
test u8_partial_cmp  ... bench:       7,118 ns/iter (+/- 1,623)
test u16_cmp         ... bench:       6,689 ns/iter (+/- 921)
test u16_lt          ... bench:       6,712 ns/iter (+/- 947)
test u16_partial_cmp ... bench:       6,725 ns/iter (+/- 780)
test u32_cmp         ... bench:       7,704 ns/iter (+/- 1,294)
test u32_lt          ... bench:       7,611 ns/iter (+/- 3,062)
test u32_partial_cmp ... bench:       7,640 ns/iter (+/- 1,149)
test u64_cmp         ... bench:       7,517 ns/iter (+/- 2,164)
test u64_lt          ... bench:       7,579 ns/iter (+/- 1,048)
test u64_partial_cmp ... bench:       7,629 ns/iter (+/- 1,195)
```

This branch improves the performance of Ord and PartialOrd methods for slices compared to the iter-based implementation. Based on the approach used in #26884.
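
As a rough illustration of the approach described for #28436, the sketch below compares the common prefix with a counted loop and falls back to comparing lengths. It is an assumption-laden reconstruction, not the merged code: the names `cmp_counted` and `lt_counted` are invented here.

```rust
use std::cmp::Ordering;

/// Lexicographic ordering with a single loop counter.
fn cmp_counted<T: Ord>(a: &[T], b: &[T]) -> Ordering {
    // Only the common prefix can be compared element by element.
    let n = a.len().min(b.len());
    let (lhs, rhs) = (&a[..n], &b[..n]);
    for i in 0..n {
        match lhs[i].cmp(&rhs[i]) {
            Ordering::Equal => {}
            non_eq => return non_eq,
        }
    }
    // Equal prefixes: the shorter slice sorts first.
    a.len().cmp(&b.len())
}

/// `<` written directly: scan for the first unequal pair, then decide by
/// element comparison, or by length if the prefixes are equal.
fn lt_counted<T: PartialOrd>(a: &[T], b: &[T]) -> bool {
    let n = a.len().min(b.len());
    let (lhs, rhs) = (&a[..n], &b[..n]);
    for i in 0..n {
        if lhs[i] != rhs[i] {
            return lhs[i] < rhs[i];
        }
    }
    a.len() < b.len()
}
```

The second function mirrors the "scan for equality, then dispatch to element comparison or length comparison" description in the commit message above.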

jstnlef · 2015-09-17T18:38:19Z

src/libcore/slice.rs

+            }
+        }
+
+        true
    }
    fn ne(&self, other: &[B]) -> bool {


I know this has already been merged and looks fantastic but I, as a Rust n00b, had a question as to why this isn't just

fn ne(&self, other: &[B]) -> bool { !self.eq(other) }

Is it a performance issue to avoid the duplication?

jstnlef · 2015-09-17

Or even just drop this and use the default implementation of ne() from the trait definition.

hatahet · 2015-09-17

I would be interested to know this as well. Could it be because there is a call to ne on line 1480, which may not be the same as !eq?

ranma42 · 2015-09-17

The documentation for PartialEq states: "Any manual implementation of ne must respect the rule that eq is a strict inverse of ne; that is, !(a == b) if and only if a != b."
Under this assumption, ne can always re-use the implementation from the trait definition. I would expect the compiler to do a good job optimising it as it can fold the negation either on the possible outcomes or on the ne usage. Unless some benchmarking suggests otherwise, I would assume that avoiding the duplication is preferred.
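
To make the "re-use the implementation from the trait definition" option concrete, here is a small sketch with a hypothetical wrapper type `Bytes` (invented for illustration, not part of libcore). It implements only `eq`, and `!=` goes through the provided `ne`, which is defined as the negation of `eq`.

```rust
/// Hypothetical wrapper around a byte slice, used only for illustration.
struct Bytes<'a>(&'a [u8]);

impl<'a> PartialEq for Bytes<'a> {
    fn eq(&self, other: &Self) -> bool {
        self.0 == other.0
    }
    // No `ne` here: the trait's provided method returns `!self.eq(other)`,
    // so the "strict inverse" rule holds by construction.
}
```

This says nothing about performance in the slice implementation itself; that question is the one the thread leaves open for benchmarking.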

hatahet · 2015-09-17

I suppose we'll have to wait to hear back from @dotdash then :-)

dotdash · 2015-09-17

I don't remember exactly why I didn't change that, but probably I was just too lazy to come up with enough useful testcases (involving at least various calling- and inlining-scenarios) to make sure that relying on the simple negation implementation doesn't affect performance negatively. Benchmarking welcome!

jstnlef · 2015-09-18

I'll see if I can find some time this weekend to test it out.

rust-highfive assigned pcwalton Jul 8, 2015

bluss added the relnotes label (marks issues that should be documented in the release notes of the next release) Jul 8, 2015

bors merged commit 9f4d5b4 into rust-lang:master Jul 9, 2015

dotdash deleted the fast branch July 27, 2015 08:49

bluss mentioned this pull request Aug 12, 2015

Tracking issue for iter_order stabilization #27737

Closed

ranma42 mentioned this pull request Sep 16, 2015

Slice equality is slow #16913

Closed

ranma42 mentioned this pull request Sep 16, 2015

Faster slice PartialOrd #28436

Merged

bors added a commit that referenced this pull request Sep 16, 2015

Auto merge of #28436 - ranma42:faster-partialord, r=bluss

47d125d


jstnlef reviewed Sep 17, 2015

